Infini-AI-Lab / TriForce
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
★266 · Updated last year
Alternatives and similar repositories for TriForce
Users interested in TriForce are comparing it to the libraries listed below.
- [ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference ★262 · Updated 5 months ago
- [ICLR 2025🔥] SVD-LLM & [NAACL 2025🔥] SVD-LLM V2 ★253 · Updated last month
- MIXQ: Taming Dynamic Outliers in Mixed-Precision Quantization by Online Prediction ★93 · Updated 11 months ago
- APOLLO: SGD-like Memory, AdamW-level Performance; MLSys'25 Outstanding Paper Honorable Mention ★256 · Updated 5 months ago
- An acceleration library that supports arbitrary bit-width combinatorial quantization operations ★234 · Updated last year
- ★282 · Updated 2 months ago
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ★128 · Updated 10 months ago
- 16-fold memory access reduction with nearly no loss ★105 · Updated 6 months ago
- Code for paper: [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ★142 · Updated 4 months ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ★339 · Updated 2 months ago
- The Official Implementation of Ada-KV [NeurIPS 2025] ★105 · Updated last week
- [NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin…" ★60 · Updated last year
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ★168 · Updated last year
- [ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection ★140 · Updated 7 months ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ★110 · Updated 6 months ago
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** ★202 · Updated 7 months ago
- ★143 · Updated 7 months ago
- [ACL 2025 main] FR-Spec: Frequency-Ranked Speculative Sampling ★44 · Updated 2 months ago
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs ★156 · Updated 2 weeks ago
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length ★117 · Updated 5 months ago
- The official implementation of the paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction ★49 · Updated 11 months ago
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ★326 · Updated last week
- [CoLM'25] The official implementation of the paper <MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression> ★146 · Updated 2 months ago
- ★43 · Updated 4 months ago
- [NeurIPS 2025] Simple extension on vLLM to help you speed up reasoning models without training ★196 · Updated 4 months ago
- ★119 · Updated 4 months ago
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 ★210 · Updated 3 weeks ago
- Official implementation for Yuan & Liu & Zhong et al., KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark o… ★85 · Updated 7 months ago
- PoC for "SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning" [NeurIPS '25] ★52 · Updated this week
- ★56 · Updated last year