ai-compiler-study / triton-kernels
Triton kernels for Flux
☆21 · Updated last month
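For readers unfamiliar with Triton, the sketch below shows the general shape of a Triton kernel: a JIT-compiled function that each program instance runs on one block of a tensor. This is a generic, minimal illustration only, not code from this repository; the names `scale_add_kernel` and `scale_add` are hypothetical, and it assumes `torch` and `triton` are installed with a CUDA device available.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def scale_add_kernel(x_ptr, y_ptr, out_ptr, n_elements, scale, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * scale + y, mask=mask)


def scale_add(x: torch.Tensor, y: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    # Host-side launcher: one program per BLOCK_SIZE-sized chunk of the input.
    out = torch.empty_like(x)
    n_elements = x.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    scale_add_kernel[grid](x, y, out, n_elements, scale, BLOCK_SIZE=1024)
    return out
```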
Alternatives and similar repositories for triton-kernels
Users interested in triton-kernels are comparing it to the libraries listed below.
- Writing FLUX in Triton ☆38 · Updated 10 months ago
- [WIP] Better (FP8) attention for Hopper ☆32 · Updated 5 months ago
- ☆114 · Updated last year
- Simple implementation of muP, based on the Spectral Condition for Feature Learning. The implementation is SGD only; don't use it for Adam. ☆84 · Updated last year
- PyTorch half-precision GEMM lib w/ fused optional bias + optional ReLU/GELU ☆72 · Updated 8 months ago
- Faster PyTorch bitsandbytes 4-bit FP4 nn.Linear ops ☆30 · Updated last year
- ring-attention experiments ☆146 · Updated 9 months ago
- FlexAttention w/ FlashAttention3 support ☆27 · Updated 10 months ago
- Research impl of Native Sparse Attention (arXiv:2502.11089) ☆60 · Updated 5 months ago
- ☆39 · Updated 4 months ago
- 🎬 3.7× faster video generation E2E 🖼️ 1.6× faster image generation E2E ⚡ ColumnSparseAttn 9.3× vs FlashAttn‑3 💨 ColumnSparseGEMM 2.5× … ☆78 · Updated last month
- Papers that provide insightful concepts to broaden your perspective on neural networks and deep learning ☆48 · Updated last year
- DPO, but faster 🚀 ☆44 · Updated 8 months ago
- ☆83 · Updated last year
- Boosting 4-bit inference kernels with 2:4 sparsity ☆80 · Updated 11 months ago
- BFloat16 fused Adam operator for PyTorch ☆15 · Updated 8 months ago
- Implementation of Diffusion Transformers and Rectified Flow in JAX ☆25 · Updated last year
- Make Triton easier ☆47 · Updated last year
- Experiment in using Tangent to autodiff Triton ☆80 · Updated last year
- Learn CUDA with PyTorch ☆33 · Updated 3 weeks ago
- (WIP) Parallel inference for black-forest-labs' FLUX model ☆19 · Updated 8 months ago
- ☆73 · Updated 7 months ago
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8 ☆46 · Updated last year
- ☆17 · Updated 8 months ago
- A bunch of kernels that might make stuff slower 😉 ☆56 · Updated last week
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆152 · Updated last month
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS … ☆61 · Updated 9 months ago
- ☆26 · Updated last year
- Mixture of A Million Experts ☆46 · Updated last year
- Load compute kernels from the Hub ☆220 · Updated last week