ai-compiler-study / triton-kernels

Triton kernels for Flux

☆22

Alternatives and similar repositories for triton-kernels

Users who are interested in triton-kernels are comparing it to the libraries listed below.

timudk / flux_triton
Writing FLUX in Triton
☆41 · Updated last year
WaveSpeedAI / QuantumAttention
[WIP] Better (FP8) attention for Hopper
☆33 · Updated 8 months ago
mgmalek / efficient_cross_entropy
☆121 · Updated last year
aredden / torch-cublas-hgemm
PyTorch half precision gemm lib w/ fused optional bias + optional relu/gelu
☆75 · Updated 10 months ago
sandyresearch / chipmunk
🎬 3.7× faster video generation E2E 🖼️ 1.6× faster image generation E2E ⚡ ColumnSparseAttn 9.3× vs FlashAttn‑3 💨 ColumnSparseGEMM 2.5× …
☆87 · Updated last month
chengzeyi / piflux
(WIP) Parallel inference for black-forest-labs' FLUX model.
☆18 · Updated 11 months ago
gpu-mode / ring-attention
ring-attention experiments
☆155 · Updated last year
aredden / torch-bnb-fp4
Faster PyTorch bitsandbytes 4bit fp4 nn.Linear ops
☆29 · Updated last year
open-lm-engine / flash-model-architectures
A bunch of kernels that might make stuff slower 😉
☆62 · Updated last week
KONAKONA666 / q8_kernels
☆76 · Updated 10 months ago
cloneofsimo / ezmup
Simple implementation of muP, based on Spectral Condition for Feature Learning. The implementation is SGD only; don't use it for Adam.
☆85 · Updated last year
siyan-zhao / prepacking
The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS …
☆60 · Updated last year
cloneofsimo / min-fsdp
☆91 · Updated last year
vedantroy / gpu_kernels
☆27 · Updated last year
UmerHA / triton_util
Make triton easier
☆48 · Updated last year
srush / triton-autodiff
Experiment of using Tangent to autodiff triton
☆80 · Updated last year
PiotrNawrot / sparse-frontier
The evaluation framework for training-free sparse attention in LLMs
☆102 · Updated 2 weeks ago
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆197 · Updated 4 months ago
test-time-training / ttt-tk
☆41 · Updated last week
fal-ai-community / NativeSparseAttention
research impl of Native Sparse Attention (2502.11089)
☆62 · Updated 8 months ago
GindaChen / FlexFlashAttention3
FlexAttention w/ FlashAttention3 Support
☆27 · Updated last year
mayank31398 / ladder-residual-inference
☆14 · Updated 3 months ago
frankxwang / dpo-prefix-sharing
DPO, but faster 🚀
☆45 · Updated 10 months ago
brianfitzgerald / jax-mmdit
Implementation of Diffusion Transformers and Rectified Flow in Jax
☆26 · Updated last year
graphcore-research / out-of-the-box-fp8-training
Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8.
☆45 · Updated last year
chu-tianxiang / QuIP-for-all
QuIP quantization
☆59 · Updated last year
Dao-AILab / grouped-latent-attention
☆130 · Updated 4 months ago
fal-ai-community / nano-mdm
Tiny re-implementation of MDM in style of LLaDA and nano-gpt speedrun
☆56 · Updated 7 months ago
meta-pytorch / superblock
A block oriented training approach for inference time optimization.
☆34 · Updated last year
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆84 · Updated last year