itsdaniele / speculative_mamba
☆15 · Updated last year
Alternatives and similar repositories for speculative_mamba
Users interested in speculative_mamba are comparing it to the libraries listed below.
- KV cache compression via sparse coding ☆17 · Updated 2 months ago
- Fast and memory-efficient exact attention ☆74 · Updated 10 months ago
- The official repository of Quamba1 [ICLR 2025] & Quamba2 [ICML 2025] ☆66 · Updated 6 months ago
- QJL: 1-Bit Quantized JL transform for KV Cache Quantization with Zero Overhead ☆31 · Updated 11 months ago
- AdaSplash: Adaptive Sparse Flash Attention (aka Flash Entmax Attention) ☆32 · Updated 3 months ago
- [ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference ☆55 · Updated last year
- Official implementation for "Pruning Large Language Models with Semi-Structural Adaptive Sparse Training" (AAAI 2025) ☆18 · Updated 6 months ago
- [CoLM'25] The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression" ☆153 · Updated last month
- [ICML 2024] Official Implementation of SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks ☆38 · Updated 11 months ago
- An efficient implementation of the NSA (Native Sparse Attention) kernel ☆128 · Updated 6 months ago
- Activation-aware Singular Value Decomposition for Compressing Large Language Models ☆83 · Updated last year
- Code for studying the super weight in LLMs ☆120 · Updated last year
- Kinetics: Rethinking Test-Time Scaling Laws ☆85 · Updated 6 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference ☆90 · Updated 5 months ago
- Sirius, an efficient correction mechanism, which significantly boosts Contextual Sparsity models on reasoning tasks while maintaining its… ☆21 · Updated last year
- HALO: Hadamard-Assisted Low-Precision Optimization and Training method for finetuning LLMs. 🚀 The official implementation of https://arx… ☆29 · Updated 10 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆108 · Updated 2 months ago
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆111 · Updated last year
- An extension to the GaLore paper, to perform Natural Gradient Descent in a low-rank subspace ☆18 · Updated last year
- Official code for the paper "HEXA-MoE: Efficient and Heterogeneous-Aware MoE Acceleration with Zero Computation Redundancy" ☆15 · Updated 10 months ago