Infini-AI-Lab / TriForce
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
☆268 · Updated last year
Alternatives and similar repositories for TriForce
Users that are interested in TriForce are comparing it to the libraries listed below
- [ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference ☆264 · Updated 6 months ago
- [ICLR 2025🔥] SVD-LLM & [NAACL 2025🔥] SVD-LLM V2 ☆257 · Updated 2 months ago
- MIXQ: Taming Dynamic Outliers in Mixed-Precision Quantization by Online Prediction ☆93 · Updated last year
- ☆285 · Updated 3 months ago
- An acceleration library that supports arbitrary bit-width combinatorial quantization operations ☆236 · Updated last year
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆169 · Updated last year
- The Official Implementation of Ada-KV [NeurIPS 2025] ☆107 · Updated last month
- 16-fold memory access reduction with nearly no loss ☆104 · Updated 7 months ago
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆130 · Updated 10 months ago
- Code for paper: [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆148 · Updated 2 weeks ago
- [NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting…" ☆60 · Updated last year
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆341 · Updated 3 months ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆111 · Updated 7 months ago
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ☆329 · Updated last month
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs ☆165 · Updated last month
- ☆145 · Updated 8 months ago
- APOLLO: SGD-like Memory, AdamW-level Performance; MLSys'25 Outstanding Paper Honorable Mention ☆258 · Updated 6 months ago
- ☆120 · Updated 4 months ago
- [ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection ☆143 · Updated 8 months ago
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** ☆206 · Updated 8 months ago
- ☆60 · Updated 10 months ago
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length ☆121 · Updated this week
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆496 · Updated 8 months ago
- Official implementation for Yuan & Liu & Zhong et al., KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark o… ☆86 · Updated 8 months ago
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. ☆483 · Updated last year
- ☆58 · Updated last year
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 ☆210 · Updated last month
- Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training ☆216 · Updated last year
- ☆80 · Updated last year
- [ICLR 2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation ☆238 · Updated 10 months ago