Infini-AI-Lab / TriForce
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
☆273 · Updated last year
Alternatives and similar repositories for TriForce
Users interested in TriForce are comparing it to the libraries listed below.
- [ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference ☆275 · Updated 7 months ago
- [ICLR 2025 🔥] SVD-LLM & [NAACL 2025 🔥] SVD-LLM V2 ☆266 · Updated 3 months ago
- MIXQ: Taming Dynamic Outliers in Mixed-Precision Quantization by Online Prediction ☆94 · Updated last year
- ☆293 · Updated 5 months ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆112 · Updated 8 months ago
- The Official Implementation of Ada-KV [NeurIPS 2025] ☆118 · Updated 2 weeks ago
- An acceleration library that supports arbitrary bit-width combinatorial quantization operations ☆238 · Updated last year
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆133 · Updated last year
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆170 · Updated last year
- [ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection ☆148 · Updated 9 months ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆356 · Updated 5 months ago
- APOLLO: SGD-like Memory, AdamW-level Performance; MLSys'25 Outstanding Paper Honorable Mention ☆264 · Updated 2 weeks ago
- ☆155 · Updated 9 months ago
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs ☆176 · Updated 2 months ago
- Code for paper: [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆156 · Updated 2 months ago
- [NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin… ☆63 · Updated last year
- 16-fold memory access reduction with nearly no loss ☆109 · Updated 8 months ago
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ☆341 · Updated 3 weeks ago
- Official Implementation of SAM-Decoding: Speculative Decoding via Suffix Automaton ☆38 · Updated 10 months ago
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length ☆135 · Updated last month
- Official implementation for Yuan & Liu & Zhong et al., KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark o… ☆86 · Updated 9 months ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆331 · Updated last year
- Implementation of FP8/INT8 Rollout for RL training without performance drop ☆279 · Updated last month
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 ☆212 · Updated 3 months ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆394 · Updated last year
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆200 · Updated last year
- ☆58 · Updated last year
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** ☆213 · Updated 10 months ago
- [ICLR 2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation ☆243 · Updated 11 months ago
- [CoLM'25] The official implementation of the paper <MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression> ☆154 · Updated 2 weeks ago