NVIDIA / kvpressLinks

LLM KV cache compression made easy

☆566

Alternatives and similar repositories for kvpress

Users that are interested in kvpress are comparing it to the libraries listed below

Sorting:

NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆385Updated last month
pytorch-labs / applied-ai
Applied AI experiments and examples for PyTorch
☆289Updated 2 months ago
ppl-ai / pplx-kernels
Perplexity GPU Kernels
☆418Updated 3 weeks ago
ScalingIntelligence / KernelBench
KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems
☆505Updated last week
mobiusml / gemlite
Fast low-bit matmul kernels in Triton
☆338Updated last week
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆730Updated 5 months ago
HazyResearch / Megakernels
kernels, of the mega variety
☆466Updated 2 months ago
efeslab / Nanoflow
A throughput-oriented high-performance serving framework for LLMs
☆856Updated 3 weeks ago
foundation-model-stack / fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash…
☆258Updated last week
SqueezeAILab / KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆365Updated 11 months ago
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆206Updated last week
mit-han-lab / duo-attention
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
☆480Updated 5 months ago
gpu-mode / triton-index
Cataloging released Triton kernels.
☆247Updated 6 months ago
NVIDIA-NeMo / RL
Scalable toolkit for efficient model reinforcement
☆578Updated this week
Deep-Learning-Profiling-Tools / triton-viz
☆227Updated last week
Dao-AILab / quack
A Quirky Assortment of CuTe Kernels
☆388Updated this week
jy-yuan / KIVI
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
☆312Updated 6 months ago
microsoft / vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
☆405Updated 2 months ago
zhuzilin / ring-flash-attention
Ring attention implementation with flash attention
☆828Updated last week
sail-sg / zero-bubble-pipeline-parallelism
Zero Bubble Pipeline Parallelism
☆411Updated 3 months ago
mit-han-lab / Quest
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆311Updated 3 weeks ago
BobMcDear / attorch
A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
☆565Updated last week
October2001 / Awesome-KV-Cache-Compression
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
☆500Updated last week
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆318Updated last year
snowflakedb / ArcticInference
ArcticInference: vLLM plugin for high-throughput, low-latency inference
☆203Updated this week
facebookresearch / LayerSkip
Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024
☆324Updated 3 months ago
MekkCyber / CutlassAcademy
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
☆205Updated 3 months ago
Infini-AI-Lab / MagicPIG
[ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation
☆229Updated 7 months ago
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆142Updated 4 months ago
microsoft / sarathi-serve
A low-latency & high-throughput serving engine for LLMs
☆400Updated 2 months ago