hmarkc / parallel-prompt-decodingLinks

Efficient LLM Inference Acceleration using Prompting

☆51

Alternatives and similar repositories for parallel-prompt-decoding

Users that are interested in parallel-prompt-decoding are comparing it to the libraries listed below

Sorting:

yxli2123 / LoSparse
☆61Updated 2 years ago
raymin0223 / fast_robust_early_exit
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long)
☆64Updated last year
hdong920 / GRIFFIN
☆39Updated last year
kssteven418 / BigLittleDecoder
[NeurIPS'23] Speculative Decoding with Big Little Decoder
☆95Updated last year
ScalingIntelligence / CATS
☆31Updated last year
henryzhongsc / longctx_bench
Official implementation for Yuan & Liu & Zhong et al., KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark o…
☆86Updated 8 months ago
fmfi-compbio / admm-pruning
☆30Updated last year
SqueezeAILab / SqueezedAttention
[ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference
☆54Updated last year
CASE-Lab-UMD / Unified-MoE-Compression
The official implementation of the paper "Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques (TMLR)".
☆83Updated 8 months ago
DerrickYLJ / TidalDecode
[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
☆49Updated 3 months ago
hahnyuan / ASVD4LLM
Activation-aware Singular Value Decomposition for Compressing Large Language Models
☆80Updated last year
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆170Updated last year
mengxiayu / LLMSuperWeight
Code for studying the super weight in LLM
☆120Updated 11 months ago
uw-mad-dash / decoding-speculative-decoding
☆14Updated last year
falcon-xu / early-exit-papers
A curated list of early exiting (LLM, CV, NLP, etc)
☆68Updated last year
luuyin / OWL
Official Pytorch Implementation of "Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity"
☆74Updated 4 months ago
SNU-ARC / any-precision-llm
[ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
☆120Updated 4 months ago
linfeng93 / BiTA
An innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification.
☆26Updated 7 months ago
Jingyu6 / speculative_prefill
☆46Updated 6 months ago
IntelLabs / Hardware-Aware-Automated-Machine-Learning
☆71Updated 3 months ago
FasterDecoding / TEAL
☆148Updated 9 months ago
zyxxmu / cam
Pytorch implementation of our paper accepted by ICML 2024 -- CaM: Cache Merging for Memory-efficient LLMs Inference
☆47Updated last year
linxihui / dkernel
☆20Updated 7 months ago
machilusZ / FastGen
This repo contains the source code for: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
☆41Updated last year
sail-sg / SimLayerKV
The official implementation of paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction.
☆50Updated last year
HazyResearch / fly
☆220Updated 2 years ago
thunlp / Ouroboros
Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main)
☆112Updated 8 months ago
BaiTheBest / SparseLLM
Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024)
☆67Updated 7 months ago
Equationliu / Kangaroo
[NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin…
☆63Updated last year
Qualcomm-AI-research / llm-surgeon
☆34Updated last year