NJUNLP / MCSD
Multi-Candidate Speculative Decoding
Related projects:
- Awesome-LLM-KV-Cache: A curated list of awesome LLM KV cache papers with code.
- QAQ: Quality Adaptive Quantization for LLM KV Cache
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding**
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long)
- [ICLR 2024] Jaiswal, A., Gan, Z., Du, X., Zhang, B., Wang, Z., & Yang, Y. Compressing LLMs: The truth is rarely pure and never simple.
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings)
- OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure
- Must-read papers on KV Cache Compression (constantly updating).
- An innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification.
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind
- Official PyTorch implementation of IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact
- PyTorch implementation of the paper "Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline".
- Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models
- Official implementation for the paper *DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*
- Unofficial implementations of block/layer-wise pruning methods for LLMs.
- Code for our paper "Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation" (EMNLP 2023 Findings)
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
- 16-fold memory access reduction with nearly no loss
- Implementation of Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
- Code for Palu: Compressing KV-Cache with Low-Rank Projection
- Repository of the LV-Eval Benchmark
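Several of the projects above implement variants of speculative sampling, the technique MCSD builds on. As background, here is a minimal sketch of its accept/reject step, following the algorithm described in the DeepMind paper listed above: each draft token is accepted with probability min(1, p_target/p_draft); on rejection, a replacement token is sampled from the residual distribution. The function name and the toy dense-list probability representation are illustrative, not taken from any of these repositories.

```python
import random


def speculative_accept(draft_tokens, p_draft, p_target, rng=None):
    """Accept/reject step of speculative sampling (illustrative sketch).

    draft_tokens: tokens proposed by the small draft model.
    p_draft[i][t], p_target[i][t]: probability each model assigned to
    token t at draft position i (toy dense lists over a small vocabulary).
    """
    rng = rng or random.Random(0)
    accepted = []
    for i, tok in enumerate(draft_tokens):
        # Accept the draft token with probability min(1, p_target / p_draft).
        if rng.random() < min(1.0, p_target[i][tok] / p_draft[i][tok]):
            accepted.append(tok)
        else:
            # On rejection, resample from the residual distribution
            # max(0, p_target - p_draft), renormalized, and stop:
            # all remaining draft tokens are discarded.
            residual = [max(0.0, pt - pd)
                        for pt, pd in zip(p_target[i], p_draft[i])]
            z = sum(residual)
            probs = [r / z for r in residual] if z > 0 else p_target[i]
            accepted.append(rng.choices(range(len(probs)), weights=probs)[0])
            break
    return accepted
```

When draft and target distributions agree exactly, every draft token is accepted, which is what makes a well-matched draft model pay off; the multi-candidate and tree-based projects above generalize this loop to several drafts per step.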