NVIDIA / kvpress
LLM KV cache compression made easy
☆586 · Updated this week
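kvpress's tagline is "LLM KV cache compression made easy": at generation time, the per-token key/value tensors cached by attention layers are pruned or compressed so long contexts fit in less memory. The sketch below is not kvpress's actual API; it is a minimal NumPy illustration of the general idea, using key L2 norm as one simple (assumed, illustrative) scoring heuristic for which cached positions to keep.

```python
import numpy as np

def compress_kv_cache(keys, values, keep_ratio=0.5):
    """Prune a toy KV cache, keeping a fraction of positions.

    Scores each cached position by its key's L2 norm (one simple
    heuristic among many used in the literature) and keeps the
    highest-scoring entries, preserving their original order.
    """
    seq_len = keys.shape[0]
    n_keep = max(1, int(seq_len * keep_ratio))
    scores = np.linalg.norm(keys, axis=-1)        # (seq_len,)
    keep = np.sort(np.argsort(scores)[-n_keep:])  # kept indices, in order
    return keys[keep], values[keep]

# Toy cache: 8 cached positions, head dimension 4
rng = np.random.default_rng(0)
k = rng.normal(size=(8, 4))
v = rng.normal(size=(8, 4))
k2, v2 = compress_kv_cache(k, v, keep_ratio=0.5)
print(k2.shape, v2.shape)  # half of the positions survive
```

Real implementations score positions with richer signals (attention weights, learned policies) and operate per layer and per head inside the model's cache, but the memory saving comes from the same move: dropping or shrinking cached K/V entries.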
Alternatives and similar repositories for kvpress
Users interested in kvpress are comparing it to the libraries listed below.
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆735 · Updated 5 months ago
- Efficient LLM Inference over Long Sequences ☆389 · Updated 2 months ago
- Perplexity GPU Kernels ☆444 · Updated 3 weeks ago
- KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems ☆527 · Updated last week
- Fast low-bit matmul kernels in Triton ☆353 · Updated last week
- A throughput-oriented high-performance serving framework for LLMs ☆876 · Updated 2 weeks ago
- kernels, of the mega variety ☆476 · Updated 2 months ago
- Applied AI experiments and examples for PyTorch ☆291 · Updated this week
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆487 · Updated 6 months ago