Infini-AI-Lab / TriForce
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
⭐ 276 · Updated last year
Alternatives and similar repositories for TriForce
Users interested in TriForce are comparing it to the libraries listed below.
- [ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference · ⭐ 283 · Updated 9 months ago
- [ICLR 2025 🔥] SVD-LLM & [NAACL 2025 🔥] SVD-LLM V2 · ⭐ 280 · Updated 5 months ago
- MIXQ: Taming Dynamic Outliers in Mixed-Precision Quantization by Online Prediction · ⭐ 94 · Updated last year
- ⭐ 302 · Updated 6 months ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference · ⭐ 370 · Updated 6 months ago
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding · ⭐ 141 · Updated last year
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs · ⭐ 188 · Updated 4 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM · ⭐ 176 · Updated last year
- Code for paper: [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference · ⭐ 160 · Updated 3 months ago
- 16-fold memory access reduction with nearly no loss · ⭐ 110 · Updated 10 months ago
- APOLLO: SGD-like Memory, AdamW-level Performance; MLSys'25 Outstanding Paper Honorable Mention · ⭐ 270 · Updated 2 months ago
- ⭐ 158 · Updated 11 months ago
- The Official Implementation of Ada-KV [NeurIPS 2025] · ⭐ 125 · Updated 2 months ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) · ⭐ 113 · Updated 10 months ago
- An acceleration library that supports arbitrary bit-width combinatorial quantization operations · ⭐ 240 · Updated last year
- [ICLR 2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation · ⭐ 248 · Updated last year
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length · ⭐ 147 · Updated last month
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache · ⭐ 356 · Updated 2 months ago
- KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches. EMNLP Findings 2024 · ⭐ 88 · Updated 11 months ago
- [NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting" · ⭐ 65 · Updated last year
- KV cache compression for high-throughput LLM inference · ⭐ 151 · Updated last year
- [ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection · ⭐ 154 · Updated 11 months ago
- ⭐ 49 · Updated last year
- Implementation for FP8/INT8 Rollout for RL training without performance drop · ⭐ 289 · Updated 3 months ago
- [ACL 2025 main] FR-Spec: Frequency-Ranked Speculative Sampling · ⭐ 49 · Updated 6 months ago
- The official implementation of the paper "SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction" · ⭐ 52 · Updated last year
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** · ⭐ 214 · Updated 11 months ago
- ⭐ 129 · Updated 8 months ago
- ⭐ 64 · Updated last year
- [NeurIPS 2025] Simple extension on vLLM to help you speed up reasoning models without training · ⭐ 218 · Updated 8 months ago