leloykun / flash-attention-minimalLinks

Flash Attention in 300-500 lines of CUDA/C++

☆36

Alternatives and similar repositories for flash-attention-minimal

Users that are interested in flash-attention-minimal are comparing it to the libraries listed below

Sorting:

PiotrNawrot / nano-sparse-attention
The simplest implementation of recent Sparse Attention patterns for efficient LLM inference.
☆91Updated 4 months ago
xiayuqing0622 / flex_head_fa
Fast and memory-efficient exact attention
☆74Updated 9 months ago
PiotrNawrot / sparse-frontier
The evaluation framework for training-free sparse attention in LLMs
☆106Updated last month
FasterDecoding / TEAL
☆154Updated 9 months ago
Dao-AILab / grouped-latent-attention
☆132Updated 6 months ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆86Updated last year
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆253Updated 2 months ago
Infini-AI-Lab / Sirius
Sirius, an efficient correction mechanism, which significantly boosts Contextual Sparsity models on reasoning tasks while maintaining its…
☆21Updated last year
linxihui / dkernel
☆20Updated 7 months ago
AnswerDotAI / cold-compress
Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of…
☆146Updated last year
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆212Updated 5 months ago
JonasGeiping / linear_cross_entropy_loss
A fusion of a linear layer and a cross entropy loss, written for pytorch in triton.
☆73Updated last year
shawntan / stickbreaking-attention
Stick-breaking attention
☆61Updated 5 months ago
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆170Updated last year
SqueezeAILab / SqueezedAttention
[ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference
☆54Updated last year
00ffcc / chunkRWKV6
continous batching and parallel acceleration for RWKV6
☆22Updated last year
thunlp / Ouroboros
Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main)
☆112Updated 8 months ago
epfml / dynamic-sparse-flash-attention
☆150Updated 2 years ago
Aleph-Alpha-Research / NeurIPS-WANT-submission-efficient-parallelization-layouts
☆22Updated last year
Doraemonzzz / Awesome-Triton-Resources
Awesome Triton Resources
☆38Updated 7 months ago
Jingyu6 / speculative_prefill
☆47Updated 6 months ago
sustcsonglin / fla-tilelang
☆22Updated 8 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆131Updated 6 months ago
Edward-Sun / gpt-accelera
Simple and efficient pytorch-native transformer training and inference (batched)
☆78Updated last year
open-lm-engine / accelerated-model-architectures
A bunch of kernels that might make stuff slower 😉
☆65Updated this week
mgmalek / efficient_cross_entropy
☆121Updated last year
linfeng93 / BiTA
An innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification.
☆26Updated 7 months ago
sjelassi / transformers_ssm_copy
☆35Updated last year
shadowpa0327 / Palu
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
☆148Updated 9 months ago
dame-cell / Triformer
Transformers components but in Triton
☆34Updated 6 months ago