itsdaniele / speculative_mamba
☆15 · Updated 9 months ago
Alternatives and similar repositories for speculative_mamba
Users interested in speculative_mamba are comparing it to the libraries listed below.
- ☆15 · Updated 11 months ago
- KV cache compression via sparse coding · ☆14 · Updated 4 months ago
- ☆35 · Updated last month
- Official implementation for "Pruning Large Language Models with Semi-Structural Adaptive Sparse Training" (AAAI 2025) · ☆13 · Updated 2 months ago
- Fast and memory-efficient exact attention · ☆69 · Updated 6 months ago
- The evaluation framework for training-free sparse attention in LLMs · ☆96 · Updated 3 months ago
- QJL: 1-Bit Quantized JL transform for KV Cache Quantization with Zero Overhead · ☆29 · Updated 7 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference · ☆89 · Updated 2 months ago
- Official code for the paper "HEXA-MoE: Efficient and Heterogeneous-Aware MoE Acceleration with Zero Computation Redundancy" · ☆13 · Updated 6 months ago
- An efficient implementation of the NSA (Native Sparse Attention) kernel · ☆115 · Updated 3 months ago
- The official repository of Quamba1 [ICLR 2025] & Quamba2 [ICML 2025] · ☆59 · Updated 3 months ago
- Code for the paper "FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference" (ICLR 2025 Oral) · ☆142 · Updated 4 months ago
- ☆142 · Updated 7 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer · ☆186 · Updated 3 months ago
- ☆59 · Updated 2 months ago
- ☆82 · Updated 8 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters · ☆129 · Updated 9 months ago
- [ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring · ☆231 · Updated 2 months ago
- ☆18 · Updated 6 months ago
- Work in progress · ☆73 · Updated 2 months ago
- SQUEEZED ATTENTION: Accelerating Long Prompt LLM Inference · ☆53 · Updated 10 months ago
- [ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection · ☆140 · Updated 7 months ago
- ☆245 · Updated 3 months ago
- Flash Attention in 300-500 lines of CUDA/C++ · ☆24 · Updated last month
- Stick-breaking attention · ☆60 · Updated 2 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLMs · ☆168 · Updated last year
- Code for studying the super weight in LLMs · ☆119 · Updated 9 months ago
- [CoLM'25] The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression" · ☆146 · Updated 2 months ago
- AdaSplash: Adaptive Sparse Flash Attention (aka Flash Entmax Attention) · ☆21 · Updated 2 months ago
- Awesome Triton Resources · ☆33 · Updated 5 months ago