BaohaoLiao / RSD
Reward-guided Speculative Decoding (RSD) for efficiency and effectiveness.
☆22 · Updated last week
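For orientation, the sketch below illustrates the idea named in the tagline: a cheap draft model proposes each generation step, a process reward model scores it, and low-reward steps are regenerated by the larger target model, so the expensive model only runs where the draft falls short. Everything here (function names, the threshold-based acceptance rule, step-level granularity) is an illustrative assumption, not this repository's actual API.

```python
"""Minimal sketch of reward-guided speculative decoding (RSD).

All names (draft_step, target_step, reward) and the threshold rule
are hypothetical placeholders, not this repo's interface.
"""
from typing import Callable


def rsd_generate(
    prompt: str,
    draft_step: Callable[[str], str],     # cheap draft model: propose next step
    target_step: Callable[[str], str],    # expensive target model: regenerate step
    reward: Callable[[str, str], float],  # process reward model: score a step in context
    threshold: float = 0.7,               # acceptance threshold (assumed)
    max_steps: int = 32,
    stop: str = "<eos>",
) -> str:
    """Generate step by step, keeping draft steps whose reward clears
    `threshold` and falling back to the target model otherwise."""
    context = prompt
    for _ in range(max_steps):
        step = draft_step(context)        # propose cheaply
        if reward(context, step) < threshold:
            step = target_step(context)   # low reward: pay for the big model
        context += step
        if stop in step:
            break
    return context


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    out = rsd_generate(
        "Q: What is 2 + 2?\n",
        draft_step=lambda ctx: "A: 4 <eos>",
        target_step=lambda ctx: "A: 4 <eos>",
        reward=lambda ctx, step: 1.0,
    )
    print(out)
```

Note the contrast with vanilla speculative decoding: acceptance here is decided by a reward signal rather than by matching the target model's token distribution, which is where the "efficiency and effectiveness" trade-off in the tagline comes from.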
Alternatives and similar repositories for RSD:
Users interested in RSD are comparing it to the repositories listed below.
- Simple extension on vLLM to help you speed up reasoning models without training. ☆139 · Updated 3 weeks ago
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆116 · Updated 9 months ago
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆111 · Updated 3 months ago
- KV cache compression for high-throughput LLM inference ☆119 · Updated last month
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆166 · Updated 3 weeks ago
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆154 · Updated 9 months ago
- ☆122 · Updated last month
- ☆50 · Updated 5 months ago
- Benchmark and research code for the paper "SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks" ☆140 · Updated this week
- Code repo for "CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs". ☆13 · Updated 6 months ago
- ☆194 · Updated 3 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆157 · Updated 8 months ago
- ☆112 · Updated this week
- ☆76 · Updated 2 months ago
- Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton. ☆127 · Updated this week
- Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance… ☆148 · Updated 2 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆125 · Updated 3 months ago
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning ☆162 · Updated last week
- ☆125 · Updated last year
- ☆111 · Updated last month
- The official implementation of the paper "Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques (TMLR)". ☆62 · Updated last week
- Explorations into some recent techniques surrounding speculative decoding ☆250 · Updated 3 months ago
- The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression" ☆122 · Updated 3 months ago
- ☆37 · Updated 5 months ago
- [ACL 2024] Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models ☆86 · Updated 10 months ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆90 · Updated last week
- ☆36 · Updated 7 months ago
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ☆281 · Updated 2 months ago
- EvaByte: Efficient Byte-level Language Models at Scale ☆85 · Updated last week
- [ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation ☆196 · Updated 3 months ago