tilde-research / nsa-implLinks

An efficient implementation of the NSA (Native Sparse Attention) kernel

☆121

Alternatives and similar repositories for nsa-impl

Users that are interested in nsa-impl are comparing it to the libraries listed below

Sorting:

Dao-AILab / grouped-latent-attention
☆130Updated 5 months ago
HanGuo97 / log-linear-attention
☆251Updated 4 months ago
PiotrNawrot / sparse-frontier
The evaluation framework for training-free sparse attention in LLMs
☆102Updated 2 weeks ago
OpenSparseLLMs / Linearization
☆61Updated 3 months ago
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆197Updated 4 months ago
sustcsonglin / linear-attention-and-beyond-slides
☆95Updated 8 months ago
Infini-AI-Lab / Kinetics
Kinetics: Rethinking Test-Time Scaling Laws
☆81Updated 3 months ago
li-plus / flash-preference
Accelerate LLM preference tuning via prefix sharing with a single line of code
☆52Updated 3 months ago
Infini-AI-Lab / Multiverse
☆100Updated last month
XunhaoLai / native-sparse-attention-triton
Efficient triton implementation of Native Sparse Attention.
☆241Updated 5 months ago
hao-ai-lab / Awesome-Video-Attention
A curated list of recent papers on efficient video attention for video diffusion models, including sparsification, quantization, and cach…
☆41Updated last month
xiayuqing0622 / flex_head_fa
Fast and memory-efficient exact attention
☆71Updated 7 months ago
mit-han-lab / x-attention
[ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring
☆242Updated 3 months ago
ByteDance-Seed / FlexPrefill
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆148Updated 2 weeks ago
FasterDecoding / TEAL
☆145Updated 8 months ago
Infini-AI-Lab / gsm_infinite
☆55Updated 4 months ago
z-lab / sparselora
[ICML 2025] SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity
☆60Updated 3 months ago
qiuzh20 / gated_attention
The official implementation for [NeurIPS2025 Oral] Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink…
☆99Updated last month
fla-org / flame
🔥 A minimal training framework for scaling FLA models
☆266Updated last month
thu-nics / MoA
[CoLM'25] The official implementation of the paper <MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression>
☆147Updated 3 months ago
horseee / dKV-Cache
[NeurIPS'25] dKV-Cache: The Cache for Diffusion Language Models
☆110Updated 5 months ago
OpenSparseLLMs / Linear-MoE
☆120Updated 4 months ago
sramshetty / mixture-of-depths
An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"
☆36Updated last year
PiotrNawrot / nano-sparse-attention
The simplest implementation of recent Sparse Attention patterns for efficient LLM inference.
☆92Updated 3 months ago
jxiw / M1
M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
☆44Updated 3 months ago
RLsys-Foundation / TritonForge
🔥 LLM-powered GPU kernel synthesis: Train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation…
☆89Updated 3 weeks ago
metacarbon / shareAtt
Beyond KV Caching: Shared Attention for Efficient LLMs
☆19Updated last year
thu-ml / ReMoE
[ICLR2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM.
☆97Updated 10 months ago
sail-sg / SimLayerKV
The official implementation of paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction.
☆49Updated last year
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆78Updated last year