mdy666 / qwen-nsa
☆42 · Updated last week
Alternatives and similar repositories for qwen-nsa:
Users interested in qwen-nsa are comparing it to the libraries listed below.
- ☆70 · Updated 2 weeks ago
- More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression ☆11 · Updated 2 months ago
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs ☆88 · Updated last week
- ☆71 · Updated last week
- ☆102 · Updated last week
- Code for paper: [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆62 · Updated last week
- TokenSkip: Controllable Chain-of-Thought Compression in LLMs ☆98 · Updated 2 weeks ago
- Efficient Triton implementation of Native Sparse Attention ☆116 · Updated this week
- ☆139 · Updated 2 weeks ago
- The Official Implementation of Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference ☆67 · Updated 2 months ago
- XAttention: Block Sparse Attention with Antidiagonal Scoring ☆102 · Updated this week
- [EMNLP 2024 Findings🔥] Official implementation of "LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference" ☆92 · Updated 4 months ago
- LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification ☆42 · Updated 3 weeks ago
- ☆36 · Updated last week
- A sparse attention kernel supporting mixed sparse patterns ☆168 · Updated last month
- ☆39 · Updated 4 months ago
- ☆18 · Updated last week
- ☆9 · Updated 6 months ago
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training ☆168 · Updated last month
- Homepage for ProLong (Princeton long-context language models) and paper "How to Train Long-Context Language Models (Effectively)" ☆170 · Updated 3 weeks ago
- Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024) ☆52 · Updated last month
- Official PyTorch implementation of IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact ☆43 · Updated 10 months ago
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length ☆62 · Updated 2 weeks ago
- SQUEEZED ATTENTION: Accelerating Long Prompt LLM Inference ☆45 · Updated 4 months ago
- Efficient Mixture of Experts for LLM Paper List ☆47 · Updated 3 months ago
- ☆19 · Updated 3 months ago
- [NeurIPS 2024] Fast Best-of-N Decoding via Speculative Rejection ☆40 · Updated 5 months ago
- PyTorch implementation of DeepSeek Native Sparse Attention ☆46 · Updated 3 weeks ago
- 🚀 LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training ☆75 · Updated 3 months ago
- Code for paper "Patch-Level Training for Large Language Models" ☆81 · Updated 4 months ago