NVIDIA / kvpress
LLM KV cache compression made easy
☆356 · Updated this week
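kvpress plugs KV cache compression into Hugging Face transformers models through "presses" applied during prefill. The snippet below is a minimal usage sketch following the pattern documented in the kvpress README; the `ExpectedAttentionPress` class, the `kv-press-text-generation` pipeline task, and the model checkpoint are assumptions that may vary across versions.

```python
# Minimal sketch of kvpress usage (assumed API, per the kvpress README;
# press class, pipeline task name, and checkpoint may differ by version).
from transformers import pipeline
from kvpress import ExpectedAttentionPress  # one of several available presses

# kvpress registers a custom text-generation pipeline that accepts a press.
pipe = pipeline(
    "kv-press-text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # hypothetical checkpoint choice
    device="cuda",
    torch_dtype="auto",
)

context = "A long document whose KV cache we want to compress ..."
question = "What is this document about?"

# compression_ratio=0.5 evicts roughly half of the KV cache during prefill.
press = ExpectedAttentionPress(compression_ratio=0.5)
answer = pipe(context, question=question, press=press)["answer"]
print(answer)
```

Other presses shipped with the library follow the same interface, so compression methods can typically be swapped by changing the press class on a single line.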
Alternatives and similar repositories for kvpress:
Users interested in kvpress are comparing it to the libraries listed below.
- Efficient LLM Inference over Long Sequences ☆349 · Updated last month
- Applied AI experiments and examples for PyTorch ☆215 · Updated last week
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆221 · Updated last week
- This repository contains the experimental PyTorch native float8 training UX ☆219 · Updated 5 months ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆327 · Updated 5 months ago
- Fast low-bit matmul kernels in Triton ☆199 · Updated last week
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆186 · Updated last week
- Cataloging released Triton kernels. ☆157 · Updated 2 weeks ago
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆216 · Updated this week
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving ☆492 · Updated this week
- Collection of kernels written in Triton language ☆91 · Updated 3 months ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆235 · Updated 2 months ago
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in PyTorch ☆498 · Updated 3 months ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆291 · Updated 6 months ago
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ☆269 · Updated last week
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… ☆112 · Updated 5 months ago
- ring-attention experiments ☆119 · Updated 3 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆257 · Updated 3 months ago
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆180 · Updated 2 months ago
- scalable and robust tree-based speculative decoding algorithm ☆331 · Updated this week
- Minimalistic 4D-parallelism distributed training framework for educational purposes ☆670 · Updated this week
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens ☆689 · Updated 4 months ago
- DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆420 · Updated last week
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆511 · Updated this week
- Triton-based implementation of Sparse Mixture of Experts. ☆194 · Updated 2 months ago
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆273 · Updated last month
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024☆261Updated 2 weeks ago
- Code for the NeurIPS 2024 paper QuaRot: end-to-end 4-bit inference of large language models. ☆321 · Updated 2 months ago