mdy666 / Qwen-Native-Sparse-Attention
qwen-nsa
☆46, updated this week
Alternatives and similar repositories for Qwen-Native-Sparse-Attention:
Users interested in Qwen-Native-Sparse-Attention are comparing it to the libraries listed below.
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs (☆92, updated 2 weeks ago)
- TokenSkip: Controllable Chain-of-Thought Compression in LLMs (☆107, updated 3 weeks ago)
- More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression (☆11, updated 2 months ago)
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training (☆170, updated last week)
- [EMNLP 2024 Findings🔥] Official implementation of "LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context In…" (☆92, updated 4 months ago)
- LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification (☆42, updated last month)
- The Official Implementation of Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference (☆68, updated 2 months ago)
- XAttention: Block Sparse Attention with Antidiagonal Scoring (☆129, updated last week)
- A sparse attention kernel supporting mixed sparse patterns (☆177, updated last month)
- Efficient Triton implementation of Native Sparse Attention (☆132, updated this week)
- PyTorch implementation of DeepSeek Native Sparse Attention (☆58, updated last month)
- [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference (☆80, updated this week)
- [ICLR 2024 Spotlight] Code for the paper "Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy" (☆77, updated 9 months ago)
- Reproducing R1 for Code with Reliable Rewards (☆152, updated this week)
- CoT-Valve: Length-Compressible Chain-of-Thought Tuning (☆58, updated last month)
- Official repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024) (☆52, updated last week)
- A paper list on efficient Mixture of Experts for LLMs (☆51, updated 3 months ago)
- [ICLR 2025] Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models (☆82, updated last month)
- Code for the paper "Patch-Level Training for Large Language Models" (☆81, updated 4 months ago)
- A generalized framework for subspace tuning methods in parameter-efficient fine-tuning (☆132, updated last month)
- [NeurIPS 2024] Fast Best-of-N Decoding via Speculative Rejection (☆41, updated 5 months ago)
- Official PyTorch implementation of "IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact" (☆43, updated 10 months ago)
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length (☆67, updated 3 weeks ago)