Infini-AI-Lab / TriForce
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
★274 · Updated last year
Alternatives and similar repositories for TriForce
Users interested in TriForce are comparing it to the libraries listed below.
- [ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference · ★278 · Updated 8 months ago
- [ICLR 2025 🔥] SVD-LLM & [NAACL 2025 🔥] SVD-LLM V2 · ★270 · Updated 4 months ago
- MIXQ: Taming Dynamic Outliers in Mixed-Precision Quantization by Online Prediction · ★94 · Updated last year
- ★298 · Updated 5 months ago
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding · ★135 · Updated last year
- An acceleration library that supports arbitrary bit-width combinatorial quantization operations · ★238 · Updated last year
- 16-fold memory access reduction with nearly no loss · ★109 · Updated 9 months ago
- The Official Implementation of Ada-KV [NeurIPS 2025] · ★122 · Updated last month
- Code for the paper [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference · ★160 · Updated 2 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM · ★175 · Updated last year
- APOLLO: SGD-like Memory, AdamW-level Performance; MLSys'25 Outstanding Paper Honorable Mention · ★267 · Updated last month
- [NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting" · ★64 · Updated last year
- The official implementation of the paper SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction · ★52 · Updated last year
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) · ★113 · Updated 9 months ago
- ★157 · Updated 10 months ago
- Code associated with the paper "Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding" · ★214 · Updated 10 months ago
- [NeurIPS 2025] Simple extension to vLLM to help you speed up reasoning models without training · ★216 · Updated 7 months ago
- [ACL 2025 main] FR-Spec: Frequency-Ranked Speculative Sampling · ★49 · Updated 5 months ago
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs · ★182 · Updated 3 months ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference · ★362 · Updated 5 months ago
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length · ★144 · Updated 2 weeks ago
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 · ★213 · Updated 3 months ago
- Official Implementation of SAM-Decoding: Speculative Decoding via Suffix Automaton · ★39 · Updated 10 months ago
- ★84 · Updated last year
- [ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection · ★151 · Updated 10 months ago
- Official implementation for Yuan & Liu & Zhong et al., KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark o… · ★88 · Updated 10 months ago
- ★64 · Updated last year
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads · ★515 · Updated 10 months ago
- ★49 · Updated last year
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache · ★346 · Updated last month