rayleizhu / vllm-ra

[ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts

☆40

Alternatives and similar repositories for vllm-ra

Users who are interested in vllm-ra are comparing it to the repositories listed below.

feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆78 Updated last year
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
☆169 Updated last year
pprp / Pruner-Zero
[ICML24] Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs
☆94 Updated 10 months ago
li-plus / flash-preference
Accelerate LLM preference tuning via prefix sharing with a single line of code
☆46 Updated 3 months ago
Dao-AILab / grouped-latent-attention
☆130 Updated 4 months ago
thu-nics / MoA
[CoLM'25] The official implementation of the paper <MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression>
☆147 Updated 3 months ago
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆130 Updated 10 months ago
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆105 Updated 6 months ago
OpenNLPLab / LASP
Linear Attention Sequence Parallelism (LASP)
☆87 Updated last year
IST-DASLab / SparseFinetuning
Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry
☆42 Updated last year
sail-sg / SimLayerKV
The official implementation of paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction.
☆49 Updated last year
feifeibear / ChituAttention
Quantized Attention on GPU
☆44 Updated 11 months ago
metacarbon / shareAtt
Beyond KV Caching: Shared Attention for Efficient LLMs
☆19 Updated last year
thunlp / Ouroboros
Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main)
☆110 Updated 7 months ago
antgroup / OmniKV
Dynamic Context Selection for Efficient Long-Context LLMs
☆40 Updated 5 months ago
Infini-AI-Lab / gsm_infinite
☆55 Updated 4 months ago
Qualcomm-AI-research / lr-qat
☆47 Updated 11 months ago
microsoft / AttentionEngine
☆101 Updated 5 months ago
TianjinYellow / StableSPAM
☆25 Updated 6 months ago
machilusZ / FastGen
This repo contains the source code for: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
☆40 Updated last year
Equationliu / Kangaroo
[NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin…
☆60 Updated last year
VITA-Group / llm-kick
[ICLR 2024] Jaiswal, A., Gan, Z., Du, X., Zhang, B., Wang, Z., & Yang, Y. Compressing llms: The truth is rarely pure and never simple.
☆24 Updated 6 months ago
tridao / flash-attention-wheels
☆57 Updated last year
Infini-AI-Lab / Kinetics
Kinetics: Rethinking Test-Time Scaling Laws
☆81 Updated 3 months ago
FasterDecoding / TEAL
☆145 Updated 8 months ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆83 Updated last year
tilde-research / nsa-impl
An efficient implementation of the NSA (Native Sparse Attention) kernel
☆119 Updated 4 months ago
hahnyuan / ASVD4LLM
Activation-aware Singular Value Decomposition for Compressing Large Language Models
☆80 Updated last year
MayDomine / Burst-Attention
Distributed IO-aware Attention algorithm
☆21 Updated last month
ByteDance-Seed / FlexPrefill
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆147 Updated last week