OpenNLPLab / lightning-attentionLinks

Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models

☆323

Alternatives and similar repositories for lightning-attention

Users that are interested in lightning-attention are comparing it to the libraries listed below

Sorting:

fxmeng / TransMLA
TransMLA: Multi-Head Latent Attention Is All You Need
☆335Updated 3 weeks ago
thu-ml / low-bit-optimizers
Low-bit optimizers for PyTorch
☆130Updated last year
OpenNLPLab / TransnormerLLM
Official implementation of TransNormerLLM: A Faster and Better LLM
☆247Updated last year
JT-Ushio / MHA2MLA
Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
☆183Updated last month
NVlabs / COAT
[ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training
☆221Updated last month
fla-org / flame
🔥 A minimal training framework for scaling FLA models
☆220Updated last month
kyegomez / FlashAttention20
Get down and dirty with FlashAttention2.0 in pytorch, plug in and play no complex CUDA kernels
☆106Updated 2 years ago
dhcode-cpp / NSA-pytorch
DeepSeek Native Sparse Attention pytorch implementation
☆83Updated 5 months ago
XunhaoLai / native-sparse-attention-triton
Efficient triton implementation of Native Sparse Attention.
☆186Updated 2 months ago
transformer-vq / transformer_vq
☆196Updated last year
tensorgi / TPA
The official implementation of TPA: Tensor ProducT ATTenTion Transformer (T6) (https://arxiv.org/abs/2501.06425)
☆380Updated last week
fkodom / grouped-query-attention-pytorch
(Unofficial) PyTorch implementation of grouped-query attention (GQA) from "GQA: Training Generalized Multi-Query Transformer Models from …
☆173Updated last year
haochengxi / Train_Transformers_with_INT4
☆153Updated 2 years ago
Dao-AILab / grouped-latent-attention
☆123Updated 2 months ago
nbasyl / LLM-FP4
The official implementation of the EMNLP 2023 paper LLM-FP4
☆211Updated last year
YuchuanTian / DiJiang
[ICML'24 Oral] The official code of "DiJiang: Efficient Large Language Models through Compact Kernelization", a novel DCT-based linear at…
☆102Updated last year
NVlabs / Fast-dLLM
Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding"
☆320Updated this week
apple / ml-sigmoid-attention
☆293Updated 3 months ago
mit-han-lab / x-attention
[ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring
☆213Updated last month
mdy666 / Qwen-Native-Sparse-Attention
qwen-nsa
☆70Updated 3 months ago
ByteDance-Seed / decoupleQ
A quantization algorithm for LLM
☆141Updated last year
NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆385Updated last month
SkyworkAI / Skywork-MoE
Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models
☆136Updated last year
shreyansh26 / FlashAttention-PyTorch
Implementation of FlashAttention in PyTorch
☆159Updated 6 months ago
kyegomez / Mixture-of-Depths
Implementation of the paper: "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"
☆103Updated last week
astramind-ai / Mixture-of-depths
Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"
☆167Updated last year
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆149Updated last month
TsinghuaC3I / Fourier-Position-Embedding
[ICML 2025] Fourier Position Embedding: Enhancing Attention’s Periodic Extension for Length Generalization
☆82Updated 2 months ago
thu-ml / ReMoE
[ICLR2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM.
☆85Updated 7 months ago
alexzhang13 / flashattention2-custom-mask
Triton implementation of FlashAttention2 that adds Custom Masks.
☆128Updated 11 months ago