BaohaoLiao / RSD
Reward-guided Speculative Decoding (RSD) for efficiency and effectiveness.
☆22 · Updated last week
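For orientation, the sketch below illustrates the idea named in the tagline: a cheap draft model proposes each generation step, a process reward model scores it, and low-reward steps are regenerated by the larger target model, so the expensive model only runs where the draft falls short. Everything here (function names, the threshold-based acceptance rule, step-level granularity) is an illustrative assumption, not this repository's actual API.

```python
"""Minimal sketch of reward-guided speculative decoding (RSD).

All names (draft_step, target_step, reward) and the threshold rule
are hypothetical placeholders, not this repo's interface.
"""
from typing import Callable


def rsd_generate(
    prompt: str,
    draft_step: Callable[[str], str],     # cheap draft model: propose next step
    target_step: Callable[[str], str],    # expensive target model: regenerate step
    reward: Callable[[str, str], float],  # process reward model: score a step in context
    threshold: float = 0.7,               # acceptance threshold (assumed)
    max_steps: int = 32,
    stop: str = "<eos>",
) -> str:
    """Generate step by step, keeping draft steps whose reward clears
    `threshold` and falling back to the target model otherwise."""
    context = prompt
    for _ in range(max_steps):
        step = draft_step(context)        # propose cheaply
        if reward(context, step) < threshold:
            step = target_step(context)   # low reward: pay for the big model
        context += step
        if stop in step:
            break
    return context


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    out = rsd_generate(
        "Q: What is 2 + 2?\n",
        draft_step=lambda ctx: "A: 4 <eos>",
        target_step=lambda ctx: "A: 4 <eos>",
        reward=lambda ctx, step: 1.0,
    )
    print(out)
```

Note the contrast with vanilla speculative decoding: acceptance here is decided by a reward signal rather than by matching the target model's token distribution, which is where the "efficiency and effectiveness" trade-off in the tagline comes from.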
Alternatives and similar repositories for RSD:
Users interested in RSD are comparing it to the repositories listed below.
- Simple extension on vLLM to help you speed up reasoning models without training. ☆139 · Updated 3 weeks ago
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆116 · Updated 9 months ago
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆111 · Updated 3 months ago
- KV cache compression for high-throughput LLM inference ☆119 · Updated last month
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆166 · Updated 3 weeks ago
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆154 · Updated 9 months ago
- ☆122 · Updated last month
- ☆50 · Updated 5 months ago
- Benchmark and research code for the paper "SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks" ☆140 · Updated this week
- Code repo for "CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs". ☆13 · Updated 6 months ago
- ☆194 · Updated 3 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆157 · Updated 8 months ago
- ☆112 · Updated this week
- ☆76 · Updated 2 months ago
- Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton. ☆127 · Updated this week
- Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance… ☆148 · Updated 2 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆125 · Updated 3 months ago
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning ☆162 · Updated last week
- ☆125 · Updated last year
- ☆111 · Updated last month
- The official implementation of the paper "Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques (TMLR)". ☆62 · Updated last week
- Explorations into some recent techniques surrounding speculative decoding ☆250 · Updated 3 months ago
- The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression" ☆122 · Updated 3 months ago
- ☆37 · Updated 5 months ago
- [ACL 2024] Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models ☆86 · Updated 10 months ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆90 · Updated last week
- ☆36 · Updated 7 months ago
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ☆281 · Updated 2 months ago
- EvaByte: Efficient Byte-level Language Models at Scale ☆85 · Updated last week
- [ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation ☆196 · Updated 3 months ago