itsdaniele / speculative_mamba
☆15 · Updated 9 months ago
Alternatives and similar repositories for speculative_mamba
Users interested in speculative_mamba are comparing it to the libraries listed below.
- ☆14 · Updated 11 months ago
- Official implementation for "Pruning Large Language Models with Semi-Structural Adaptive Sparse Training" (AAAI 2025) ☆12 · Updated 2 months ago
- KV cache compression via sparse coding ☆12 · Updated 3 months ago
- Code for studying the super weight in LLM ☆117 · Updated 9 months ago
- Experiments on Multi-Head Latent Attention ☆95 · Updated last year
- Fast and memory-efficient exact attention ☆69 · Updated 6 months ago
- 📄 Small Batch Size Training for Language Models ☆57 · Updated last week
- Official implementation for Training LLMs with MXFP4 ☆79 · Updated 4 months ago
- ☆57 · Updated 11 months ago
- Accelerated First Order Parallel Associative Scan ☆187 · Updated last year
- ☆82 · Updated last year
- Normalized Transformer (nGPT) ☆187 · Updated 9 months ago
- A MAD laboratory to improve AI architecture designs 🧪 ☆127 · Updated 8 months ago
- Official repo of dataset-decomposition paper [NeurIPS 2024] ☆19 · Updated 7 months ago
- Work in progress. ☆72 · Updated 2 months ago
- ☆298 · Updated 4 months ago
- [NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models ☆228 · Updated 4 months ago
- DeciMamba: Exploring the Length Extrapolation Potential of Mamba (ICLR 2025) ☆30 · Updated 4 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆176 · Updated 2 months ago
- [ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule ☆209 · Updated 5 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆91 · Updated 2 months ago
- FlashRNN - Fast RNN Kernels with I/O Awareness ☆97 · Updated 2 months ago
- The official repository of Quamba1 [ICLR 2025] & Quamba2 [ICML 2025] ☆59 · Updated 2 months ago
- ☆240 · Updated 2 months ago
- Understand and test language model architectures on synthetic tasks. ☆222 · Updated last month
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆85 · Updated last month
- ☆46 · Updated last year
- Official implementation of Phi-Mamba. A MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Mode… ☆114 · Updated 11 months ago
- Official code for the paper "HEXA-MoE: Efficient and Heterogeneous-Aware MoE Acceleration with Zero Computation Redundancy" ☆13 · Updated 5 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆129 · Updated 9 months ago