itsdaniele / speculative_mamba
☆14 · Updated 8 months ago
Alternatives and similar repositories for speculative_mamba
Users interested in speculative_mamba are comparing it to the libraries listed below.
- ☆14 · Updated 10 months ago
- KV cache compression via sparse coding ☆12 · Updated 3 months ago
- The official repository of Quamba1 [ICLR 2025] & Quamba2 [ICML 2025] ☆56 · Updated last month
- QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning ☆62 · Updated 3 weeks ago
- QJL: 1-Bit Quantized JL transform for KV Cache Quantization with Zero Overhead ☆29 · Updated 6 months ago
- Official implementation for "Pruning Large Language Models with Semi-Structural Adaptive Sparse Training" (AAAI 2025) ☆12 · Updated last month
- Work in progress. ☆70 · Updated last month
- [EMNLP 2024] Quantize LLM to extremely low-bit, and finetune the quantized LLMs ☆13 · Updated last year
- The evaluation framework for training-free sparse attention in LLMs ☆88 · Updated last month
- Official code for the paper "HEXA-MoE: Efficient and Heterogeneous-Aware MoE Acceleration with Zero Computation Redundancy" ☆13 · Updated 5 months ago
- An efficient implementation of the NSA (Native Sparse Attention) kernel ☆110 · Updated last month
- Code for studying the super weight in LLM ☆115 · Updated 8 months ago
- Fast and memory-efficient exact attention ☆69 · Updated 5 months ago
- Transformers components but in Triton ☆34 · Updated 3 months ago
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆109 · Updated 9 months ago
- SQUEEZED ATTENTION: Accelerating Long Prompt LLM Inference ☆50 · Updated 8 months ago
- HALO: Hadamard-Assisted Low-Precision Optimization and Training method for finetuning LLMs. 🚀 The official implementation of https://arx… ☆18 · Updated 5 months ago
- [ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs ☆111 · Updated last month
- ☆42 · Updated 9 months ago
- ☆137 · Updated 5 months ago
- LLM Inference with Microscaling Format ☆27 · Updated 9 months ago
- ☆25 · Updated 2 weeks ago
- ☆77 · Updated last month
- [CoLM'25] The official implementation of the paper <MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression> ☆145 · Updated last month
- ☆81 · Updated last year
- [ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection ☆128 · Updated 5 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆152 · Updated last month
- ☆19 · Updated 7 months ago
- Stick-breaking attention ☆59 · Updated last month
- ☆51 · Updated last year