eth-easl / deltazipLinks

Compression for Foundation Models

☆35

Alternatives and similar repositories for deltazip

Users that are interested in deltazip are comparing it to the libraries listed below

Sorting:

tyler-griggs / melange-release
☆48Updated last year
metacarbon / shareAtt
Beyond KV Caching: Shared Attention for Efficient LLMs
☆20Updated last year
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆131Updated 11 months ago
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆170Updated last year
rayleizhu / vllm-ra
[ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
☆40Updated last year
project-etalon / etalon
LLM Serving Performance Evaluation Harness
☆80Updated 8 months ago
amazon-science / piperag
PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design (KDD 2025)
☆27Updated last year
DerrickYLJ / TidalDecode
[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
☆48Updated 3 months ago
microsoft / AttentionEngine
☆109Updated 6 months ago
ScalingIntelligence / hydragen
Hydragen: High-Throughput LLM Inference with Shared Prefixes
☆44Updated last year
IST-DASLab / SparseFinetuning
Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry
☆42Updated last year
efeslab / fiddler
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
☆243Updated last year
Infini-AI-Lab / APE
☆34Updated 9 months ago
antgroup / OmniKV
Dynamic Context Selection for Efficient Long-Context LLMs
☆42Updated 6 months ago
zenrran4nlp / Awesome-LLM-Inference-Serving
☆44Updated 6 months ago
RLsys-Foundation / APRIL
APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation. A system-level optimization for scalable LLM tra…
☆43Updated last month
PipeFusion / PipeFusion
A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters
☆52Updated last year
Ying1123 / llm-caching-multiplexing
☆20Updated 2 years ago
tilde-research / nsa-impl
An efficient implementation of the NSA (Native Sparse Attention) kernel
☆124Updated 4 months ago
li-plus / flash-preference
Accelerate LLM preference tuning via prefix sharing with a single line of code
☆51Updated 4 months ago
NoakLiu / PiKV
PiKV: KV Cache Management System for Mixture of Experts [Efficient ML System]
☆42Updated last month
AutonomicPerfectionist / PipeInfer
PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
☆30Updated last year
lfsszd / CS-Drafting
Cascade Speculative Drafting
☆32Updated last year
GATECH-EIC / ShiftAddLLM
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
☆110Updated last year
sgl-project / tensorrt-demo
TensorRT LLM Benchmark Configuration
☆13Updated last year
feifeibear / ChituAttention
Quantized Attention on GPU
☆44Updated last year
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆85Updated last year
vedantroy / gpu_kernels
☆27Updated last year
tile-ai / AttentionEngine
☆50Updated 6 months ago
ScalingIntelligence / CATS
☆30Updated last year