itsdaniele / speculative_mamba
☆15 · Updated 10 months ago
Alternatives and similar repositories for speculative_mamba
Users interested in speculative_mamba are comparing it to the repositories listed below.
- ☆16 · Updated last year
- Fast and memory-efficient exact attention ☆70 · Updated 7 months ago
- KV cache compression via sparse coding ☆14 · Updated 5 months ago
- HALO: Hadamard-Assisted Low-Precision Optimization and Training method for finetuning LLMs. 🚀 The official implementation of https://arx… ☆23 · Updated 8 months ago
- Work in progress. ☆74 · Updated 3 months ago
- Explore training for quantized models ☆25 · Updated 3 months ago
- QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning ☆114 · Updated 2 weeks ago
- ☆251 · Updated 4 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆101 · Updated 4 months ago
- Official implementation for "Pruning Large Language Models with Semi-Structural Adaptive Sparse Training" (AAAI 2025) ☆15 · Updated 3 months ago
- Transformers components but in Triton ☆34 · Updated 5 months ago
- ☆56 · Updated last year
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆130 · Updated 10 months ago
- ☆145 · Updated 8 months ago
- Official implementation for Training LLMs with MXFP4 ☆97 · Updated 5 months ago
- ☆82 · Updated 8 months ago
- An efficient implementation of the NSA (Native Sparse Attention) kernel ☆119 · Updated 3 months ago
- [ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference ☆54 · Updated 10 months ago
- The official repository of Quamba1 [ICLR 2025] & Quamba2 [ICML 2025] ☆59 · Updated 4 months ago
- ☆129 · Updated 4 months ago
- A bunch of kernels that might make stuff slower 😉 ☆61 · Updated last week
- ☆102 · Updated this week
- Normalized Transformer (nGPT) ☆191 · Updated 11 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆91 · Updated 3 months ago
- [ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule ☆323 · Updated last month
- Accelerated First Order Parallel Associative Scan ☆189 · Updated last year
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆108 · Updated last year
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆193 · Updated 4 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆83 · Updated last year
- LLM Inference with Microscaling Format ☆31 · Updated 11 months ago