hao-ai-lab / Dynasor
A simple extension on top of vLLM to help you speed up reasoning models without training.
☆137 · Updated 2 weeks ago
Alternatives and similar repositories for Dynasor:
Users interested in Dynasor are comparing it to the libraries listed below.
- Unofficial implementation of the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆154 · Updated 9 months ago
- KV cache compression for high-throughput LLM inference ☆117 · Updated last month
- ☆232 · Updated 10 months ago
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆110 · Updated 3 months ago
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆116 · Updated 9 months ago
- The official implementation of the paper "Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques" (TMLR) ☆62 · Updated last week
- ☆118 · Updated last month
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆157 · Updated 8 months ago
- ☆47 · Updated 3 months ago
- ☆76 · Updated 2 months ago
- Official Implementation of SAM-Decoding: Speculative Decoding via Suffix Automaton ☆23 · Updated last month
- EE-LLM is a framework for large-scale training and inference of early-exit (EE) large language models (LLMs). ☆56 · Updated 9 months ago
- Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks ☆141 · Updated 6 months ago
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆438 · Updated last month
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ☆281 · Updated 2 months ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆90 · Updated last week
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning ☆148 · Updated last week
- ☆86 · Updated 5 months ago
- REST: Retrieval-Based Speculative Decoding (NAACL 2024) ☆198 · Updated 3 months ago
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind (see the sketch after this list) ☆91 · Updated last year
- Code associated with the paper "Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding" ☆176 · Updated last month
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆260 · Updated 4 months ago
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ☆145 · Updated this week
- Modular and structured prompt caching for low-latency LLM inference ☆89 · Updated 4 months ago
- [NeurIPS 24 Spotlight] MaskLLM: Learnable Semi-structured Sparsity for Large Language Models ☆158 · Updated 2 months ago
- [ICLR 2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation ☆196 · Updated 3 months ago
- Systematic evaluation framework that automatically rates overthinking behavior in large language models. ☆80 · Updated last month
- Reward-guided Speculative Decoding (RSD) for efficiency and effectiveness. ☆20 · Updated last week
- Explorations into some recent techniques surrounding speculative decoding ☆250 · Updated 3 months ago
- ☆36 · Updated 7 months ago
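
Several of the entries above (Ouroboros, SAM-Decoding, REST, Draft & Verify, RSD) build on the same accept/reject rule introduced in the DeepMind speculative sampling paper. Below is a minimal sketch of that rule for intuition. The model functions (`draft_probs`, `target_probs`), the vocabulary size, and `speculative_step` are illustrative stand-ins, not the API of Dynasor or of any repository listed here.

```python
# Toy sketch of the accept/reject rule from "Accelerating Large Language
# Model Decoding with Speculative Sampling" (Chen et al., 2023, DeepMind).
# All names here are illustrative stand-ins, not any repo's real API.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size


def draft_probs(prefix):
    """Stand-in for a cheap draft model: any distribution over VOCAB tokens."""
    logits = np.cos(np.arange(VOCAB) + len(prefix))
    e = np.exp(logits - logits.max())
    return e / e.sum()


def target_probs(prefix):
    """Stand-in for the expensive target model."""
    logits = np.sin(np.arange(VOCAB) * 0.7 + len(prefix))
    e = np.exp(logits - logits.max())
    return e / e.sum()


def speculative_step(prefix, gamma=4):
    """Draft gamma tokens, then accept/reject so outputs match the target."""
    # 1. Draft model proposes gamma tokens autoregressively.
    drafted, q = [], []
    ctx = list(prefix)
    for _ in range(gamma):
        p = draft_probs(ctx)
        tok = rng.choice(VOCAB, p=p)
        drafted.append(tok)
        q.append(p)
        ctx.append(tok)
    # 2. Target model scores every drafted position
    #    (a single batched forward pass in practice).
    p_list = [target_probs(list(prefix) + drafted[:i]) for i in range(gamma + 1)]
    # 3. Accept token t with probability min(1, p(t)/q(t)); on rejection,
    #    resample from the normalized residual max(0, p - q) and stop.
    out = []
    for i, tok in enumerate(drafted):
        p, qd = p_list[i], q[i]
        if rng.random() < min(1.0, p[tok] / qd[tok]):
            out.append(tok)
        else:
            residual = np.maximum(p - qd, 0.0)
            out.append(rng.choice(VOCAB, p=residual / residual.sum()))
            return out
    # 4. All drafts accepted: sample one bonus token from the target.
    out.append(rng.choice(VOCAB, p=p_list[gamma]))
    return out


print(speculative_step([1, 2, 3]))
```

The point of the rule is that the accepted-plus-resampled tokens are distributed exactly as if sampled from the target model, which is why the speculative decoding repositories above can advertise speedups without changing output quality.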