shreyansh26 / FlashAttention-PyTorchLinks

Implementation of FlashAttention in PyTorch

☆174

Alternatives and similar repositories for FlashAttention-PyTorch

Users that are interested in FlashAttention-PyTorch are comparing it to the libraries listed below

Sorting:

mdy666 / mdy_triton
☆149Updated 4 months ago
MuLabPKU / TransMLA
TransMLA: Multi-Head Latent Attention Is All You Need (NeurIPS 2025 Spotlight)
☆412Updated 2 months ago
madsys-dev / deepseekv2-profile
☆152Updated 8 months ago
kyegomez / FlashAttention20
Get down and dirty with FlashAttention2.0 in pytorch, plug in and play no complex CUDA kernels
☆112Updated 2 years ago
NVlabs / COAT
[ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training
☆250Updated 3 months ago
dhcode-cpp / NSA-pytorch
DeepSeek Native Sparse Attention pytorch implementation
☆108Updated 3 weeks ago
OpenNLPLab / lightning-attention
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
☆334Updated 9 months ago
flagos-ai / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆283Updated last year
mdy666 / Qwen-Native-Sparse-Attention
qwen-nsa
☆84Updated last month
xlite-dev / ffpa-attn
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
☆231Updated last week
fanshiqing / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆169Updated last month
66RING / tiny-flash-attention
flash attention tutorial written in python, triton, cuda, cutlass
☆453Updated 6 months ago
kyegomez / FlashAttention20Triton
Triton implementation of Flash Attention2.0
☆44Updated 2 years ago
smart-lty / ParallelSpeculativeDecoding
[ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length
☆131Updated last month
feifeibear / long-context-attention
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference
☆602Updated last month
qingkelab / qingketalk
青稞Talk
☆168Updated this week
fkodom / grouped-query-attention-pytorch
(Unofficial) PyTorch implementation of grouped-query attention (GQA) from "GQA: Training Generalized Multi-Query Transformer Models from …
☆184Updated last year
dhcode-cpp / easy-dualpipe
Pipeline-Parallel Lecture: Simplest Dualpipe Implementation.
☆27Updated 2 months ago
alexzhang13 / flashattention2-custom-mask
Triton implementation of FlashAttention2 that adds Custom Masks.
☆151Updated last year
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆97Updated 11 months ago
Dao-AILab / grouped-latent-attention
☆132Updated 6 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆131Updated 6 months ago
AniZpZ / AutoSmoothQuant
An easy-to-use package for implementing SmoothQuant for LLMs
☆109Updated 7 months ago
OpenBMB / infllmv2_cuda_impl
☆73Updated last month
haochengxi / Train_Transformers_with_INT4
☆157Updated 2 years ago
flagos-ai / FlagScale
FlagScale is a large model toolkit based on open-sourced projects.
☆412Updated last week
qhliu26 / Dive-into-Big-Model-Training
📑 Dive into Big Model Training
☆116Updated 2 years ago
yaof20 / Flash-RL
Implementation for FP8/INT8 Rollout for RL training without performence drop.
☆275Updated 3 weeks ago
nbasyl / LLM-FP4
The official implementation of the EMNLP 2023 paper LLM-FP4
☆218Updated last year
pprp / Awesome-Efficient-MoE
Efficient Mixture of Experts for LLM Paper List
☆144Updated 2 months ago