dhcode-cpp / NSA-pytorch
PyTorch implementation of DeepSeek's Native Sparse Attention (NSA)
☆61 · Updated last month
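NSA-pytorch reimplements DeepSeek's Native Sparse Attention, which combines a compressed-token branch, a blockwise token-selection branch, and a sliding-window branch behind learned gates. As a rough orientation, the sketch below shows only the selection idea, blockwise top-k sparse attention, in plain PyTorch. The function name, the mean-pooled block keys, and all parameters are illustrative assumptions, not the repo's actual API.

```python
# Hedged sketch of blockwise top-k sparse attention (the "selection" idea in
# NSA). Names and pooling choices are illustrative, not dhcode-cpp/NSA-pytorch's API.
import torch
import torch.nn.functional as F

def topk_block_sparse_attention(q, k, v, block_size=64, topk=4):
    """q: (B, H, Tq, D); k, v: (B, H, Tk, D). Tk must be divisible by block_size."""
    B, H, Tq, D = q.shape
    Tk = k.shape[2]
    n_blocks = Tk // block_size

    # Coarse scores: compare each query against the mean key of every block.
    k_blocks = k.view(B, H, n_blocks, block_size, D).mean(dim=3)       # (B, H, n_blocks, D)
    block_scores = torch.einsum("bhqd,bhnd->bhqn", q, k_blocks)        # (B, H, Tq, n_blocks)

    # Keep only the top-k scoring blocks per query; mask out the rest.
    topk_idx = block_scores.topk(min(topk, n_blocks), dim=-1).indices  # (B, H, Tq, topk)
    block_mask = torch.zeros(B, H, Tq, n_blocks, dtype=torch.bool, device=q.device)
    block_mask.scatter_(-1, topk_idx, True)
    token_mask = block_mask.repeat_interleave(block_size, dim=-1)      # (B, H, Tq, Tk)

    # Dense attention restricted to the selected blocks.
    attn = torch.einsum("bhqd,bhkd->bhqk", q, k) / D ** 0.5
    attn = attn.masked_fill(~token_mask, float("-inf"))
    return torch.einsum("bhqk,bhkd->bhqd", F.softmax(attn, dim=-1), v)

# Example: 1 batch, 2 heads, 128 queries attending over 512 keys.
q = torch.randn(1, 2, 128, 64)
k = torch.randn(1, 2, 512, 64)
v = torch.randn(1, 2, 512, 64)
out = topk_block_sparse_attention(q, k, v)  # (1, 2, 128, 64)
```

Note this dense version still materializes the full Tq×Tk score matrix for readability; a real kernel, such as the Triton implementations listed below, would compute only the selected blocks.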
Alternatives and similar repositories for NSA-pytorch:
Users interested in NSA-pytorch are comparing it to the repositories listed below.
- ☆116 · Updated this week
- qwen-nsa ☆50 · Updated last week
- ☆178 · Updated last week
- TransMLA: Multi-Head Latent Attention Is All You Need ☆236 · Updated last month
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ☆278 · Updated last month
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length ☆75 · Updated last week
- Efficient Triton implementation of Native Sparse Attention ☆136 · Updated last week
- XAttention: Block Sparse Attention with Antidiagonal Scoring ☆140 · Updated 3 weeks ago
- Code for the paper [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆85 · Updated this week
- Official implementation of the ICML 2024 paper "ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking" ☆47 · Updated 9 months ago
- The official implementation of Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference ☆72 · Updated 2 months ago
- A sparse attention kernel supporting mixed sparse patterns ☆192 · Updated 2 months ago
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training ☆182 · Updated last week
- ☆131 · Updated last month
- 🔥 A minimal training framework for scaling FLA models ☆101 · Updated last week
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs ☆97 · Updated this week
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆159 · Updated 9 months ago
- Efficient Mixture of Experts for LLM paper list ☆60 · Updated 4 months ago
- Triton documentation in Simplified Chinese / Triton 中文文档 ☆66 · Updated this week
- ☆75 · Updated this week
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ☆160 · Updated this week
- Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024) ☆54 · Updated 3 weeks ago
- Implementation of FlashAttention in PyTorch ☆141 · Updated 3 months ago
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ☆473 · Updated last week
- Awesome LLM pruning papers: an all-in-one repository integrating useful resources and insights ☆83 · Updated 4 months ago
- DeepSpeed tutorials, annotated examples, and study notes (efficient large-model training) ☆159 · Updated last year
- The official GitHub page for the survey paper "A Survey on Mixture of Experts in Large Language Models" ☆327 · Updated last month
- Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint? ☆95 · Updated 6 months ago
- The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression" ☆123 · Updated 4 months ago
- ☆189 · Updated last year