tilde-research / nsa-impl
An efficient implementation of the NSA (Native Sparse Attention) kernel
☆114 · Updated 2 months ago
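For readers comparing the kernels listed below, a rough sense of what NSA-style sparse attention computes may help. The sketch that follows is not code from nsa-impl (which provides a fused kernel); it is a minimal PyTorch illustration of the block-selection idea, where each query attends only to its top-k key blocks. The mean-pooled block scoring, `block_size`, and `top_k` here are illustrative assumptions, not the repository's actual design.

```python
import torch

def block_sparse_attention(q, k, v, block_size=32, top_k=4):
    # q, k, v: (batch, heads, seq_len, head_dim); seq_len must be a
    # multiple of block_size in this simplified sketch.
    B, H, T, D = q.shape
    n_blocks = T // block_size
    # Coarse block representatives: mean-pool the keys within each block
    # (an assumed scoring scheme, chosen here for simplicity).
    k_blocks = k.view(B, H, n_blocks, block_size, D).mean(dim=3)
    # Score every query against each block and keep the top-k blocks.
    block_scores = torch.einsum("bhtd,bhnd->bhtn", q, k_blocks)
    top_idx = block_scores.topk(top_k, dim=-1).indices        # (B,H,T,top_k)
    # Expand the block choice into a token-level attention mask.
    mask = torch.zeros(B, H, T, n_blocks, dtype=torch.bool, device=q.device)
    mask.scatter_(-1, top_idx, True)
    mask = mask.repeat_interleave(block_size, dim=-1)         # (B,H,T,T)
    # Dense attention, restricted to the selected blocks via the mask.
    scores = torch.einsum("bhtd,bhsd->bhts", q, k) / D ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.einsum("bhts,bhsd->bhtd", scores.softmax(dim=-1), v)

# Toy check: 256 tokens in 8 blocks of 32; each query sees only 4 blocks.
q = torch.randn(1, 2, 256, 32)
out = block_sparse_attention(q, torch.randn_like(q), torch.randn_like(q))
print(out.shape)  # torch.Size([1, 2, 256, 32])
```

Note that this sketch still materializes the full score matrix and only masks it, so it saves no memory or compute; an efficient kernel (e.g., one written in Triton) would compute the selected blocks only, which is where the speedup comes from.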
Alternatives and similar repositories for nsa-impl
Users interested in nsa-impl are comparing it to the libraries listed below.
- ☆123 · Updated 3 months ago
- ☆240 · Updated 2 months ago
- ☆80 · Updated 6 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆91 · Updated 2 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆174 · Updated 2 months ago
- Kinetics: Rethinking Test-Time Scaling Laws ☆79 · Updated last month
- Accelerate LLM preference tuning via prefix sharing with a single line of code ☆43 · Updated last month
- [ICML 2025] SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity ☆51 · Updated last month
- Efficient Triton implementation of Native Sparse Attention ☆209 · Updated 3 months ago
- ☆140 · Updated 6 months ago
- ☆55 · Updated last month
- ☆87 · Updated last month
- Fast and memory-efficient exact attention ☆70 · Updated 5 months ago
- [ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring ☆222 · Updated last month
- Odysseus: Playground of LLM Sequence Parallelism ☆76 · Updated last year
- ☆52 · Updated 2 months ago
- A curated list of recent papers on efficient video attention for video diffusion models, including sparsification, quantization, and cach… ☆36 · Updated last month
- Transformers components but in Triton ☆34 · Updated 3 months ago
- Beyond KV Caching: Shared Attention for Efficient LLMs ☆19 · Updated last year
- Code for the paper "FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference" [ICLR 2025 Oral] ☆136 · Updated 3 months ago
- [CoLM'25] The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression" ☆145 · Updated last month
- ☆92 · Updated 3 months ago
- 🔥 A minimal training framework for scaling FLA models ☆233 · Updated last week
- 16-fold memory access reduction with nearly no loss ☆104 · Updated 5 months ago
- Quantized Attention on GPU ☆44 · Updated 9 months ago
- M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models ☆37 · Updated last month
- LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification ☆63 · Updated last month
- ☆21 · Updated 5 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU Clusters ☆129 · Updated 8 months ago
- The official implementation of the paper "SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction" ☆48 · Updated 10 months ago