aredden / torch-cublas-hgemm
PyTorch half precision gemm lib w/ fused optional bias + optional relu/gelu
☆53 · Updated 2 months ago
Alternatives and similar repositories for torch-cublas-hgemm:
Users who are interested in torch-cublas-hgemm compare it to the libraries listed below.
- ☆59 · Updated last month
- Triton kernels for Flux ☆20 · Updated last month
- Writing FLUX in Triton ☆32 · Updated 5 months ago
- (WIP) Parallel inference for black-forest-labs' FLUX model. ☆17 · Updated 3 months ago
- ☆16 · Updated 11 months ago
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ☆43 · Updated 7 months ago
- Context parallel attention that accelerates DiT model inference with dynamic caching ☆189 · Updated this week
- ☆48 · Updated 11 months ago
- The official implementation of "CAME: Confidence-guided Adaptive Memory Optimization" ☆84 · Updated 7 months ago
- ☆107 · Updated last month
- Faster Pytorch bitsandbytes 4bit fp4 nn.Linear ops ☆26 · Updated 11 months ago
- extensible collectives library in triton ☆83 · Updated 4 months ago
- Focused on fast experimentation and simplicity ☆66 · Updated last month
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆89 · Updated this week
- ☆52 · Updated last year
- Experiment of using Tangent to autodiff triton ☆75 · Updated last year
- QuIP quantization ☆50 · Updated 11 months ago
- Patch convolution to avoid large GPU memory usage of Conv2D ☆85 · Updated 3 weeks ago
- RWKV-7: Surpassing GPT ☆79 · Updated 3 months ago
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ☆23 · Updated this week
- This repository contains the experimental PyTorch native float8 training UX ☆221 · Updated 6 months ago
- ☆21 · Updated 8 months ago
- A parallelism VAE avoids OOM for high resolution image generation ☆53 · Updated 3 weeks ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆77 · Updated this week
- ☆49 · Updated 11 months ago
- FlexAttention w/ FlashAttention3 Support ☆26 · Updated 4 months ago
- Faster generation with text-to-image diffusion models. ☆210 · Updated 4 months ago