tspeterkim / paged-attention-minimal
A minimal cache manager for PagedAttention, built on top of llama3.
☆71 · Updated 6 months ago
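The repo's core idea, per the description above, is PagedAttention-style KV cache management: the cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks, so memory is allocated on demand instead of reserved up front. A minimal Python sketch of that scheme (class and method names here are illustrative, not this repo's actual API):

```python
# Minimal sketch of a PagedAttention-style KV cache manager.
# Hypothetical names (BlockAllocator, SequenceBlockTable); not this repo's API.

class BlockAllocator:
    """Hands out fixed-size KV cache blocks from a free list."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size                 # tokens per block
        self.free_blocks = list(range(num_blocks))   # physical block ids

    def allocate(self) -> int:
        if not self.free_blocks:
            raise RuntimeError("KV cache is full")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class SequenceBlockTable:
    """Maps one sequence's logical token positions to physical blocks."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.blocks: list[int] = []   # block table: logical index -> physical id
        self.num_tokens = 0

    def append_token(self) -> tuple[int, int]:
        """Reserve a cache slot for one new token; return (block_id, offset)."""
        offset = self.num_tokens % self.allocator.block_size
        if offset == 0:               # current block is full (or none exists yet)
            self.blocks.append(self.allocator.allocate())
        self.num_tokens += 1
        return self.blocks[-1], offset
```

The (block_id, offset) pair is where the attention kernel reads and writes that token's key/value vectors; when a sequence finishes, its blocks return to the free list and can be reused by other sequences.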
Alternatives and similar repositories for paged-attention-minimal:
Users interested in paged-attention-minimal are comparing it to the libraries listed below.
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆103 · Updated this week
- A minimal implementation of vLLM. ☆35 · Updated 7 months ago
- Boosting 4-bit inference kernels with 2:4 sparsity; the 2:4 pattern is sketched after this list. ☆68 · Updated 6 months ago
- High-speed GEMV kernels, achieving up to 2.7x speedup over the PyTorch baseline. ☆100 · Updated 8 months ago
- Cataloging released Triton kernels. ☆191 · Updated 2 months ago
- Fast low-bit matmul kernels in Triton. ☆257 · Updated last week
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity. ☆201 · Updated last year
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆188 · Updated this week
- Extensible collectives library in Triton. ☆83 · Updated 5 months ago
- Applied AI experiments and examples for PyTorch. ☆243 · Updated this week
- Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores. ☆57 · Updated 6 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆70 · Updated 4 months ago
- Dynamic Memory Management for Serving LLMs without PagedAttention. ☆308 · Updated 3 weeks ago
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference. ☆116 · Updated last year
- Benchmark code for the "Online normalizer calculation for softmax" paper; the paper's recurrence is sketched after this list. ☆85 · Updated 6 years ago
- Implements Flash Attention using CuTe. ☆71 · Updated 2 months ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving. ☆298 · Updated 8 months ago
- Odysseus: Playground of LLM Sequence Parallelism. ☆66 · Updated 8 months ago
- PyTorch library for cost-effective, fast, and easy serving of MoE models. ☆145 · Updated last week
- Ring-attention experiments. ☆127 · Updated 4 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5). ☆238 · Updated 4 months ago
- An easy-to-understand TensorOp matmul tutorial. ☆326 · Updated 5 months ago
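The "2:4 sparsity" entry above refers to the semi-structured pattern that NVIDIA's sparse tensor cores accelerate: every contiguous group of four weights keeps at most two nonzeros. A small PyTorch sketch of magnitude-based 2:4 pruning (the helper name `prune_2_4` is made up for illustration; it assumes the element count is divisible by 4):

```python
import torch

def prune_2_4(w: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude values in every contiguous group of 4."""
    g = w.reshape(-1, 4)                      # groups of 4 elements
    keep = g.abs().topk(2, dim=-1).indices    # indices of the 2 largest per group
    mask = torch.zeros_like(g, dtype=torch.bool).scatter_(1, keep, True)
    return (g * mask).reshape(w.shape)
```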
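The "Online normalizer calculation for softmax" entry refers to Milakov and Gimelshein's single-pass algorithm: instead of one pass for the max and a second for the sum of exponentials, it updates a running maximum m and a running normalizer d together, rescaling d whenever m grows. A minimal Python sketch of that recurrence:

```python
import math

def online_softmax(xs: list[float]) -> list[float]:
    """Single-pass softmax normalizer, per Milakov & Gimelshein (2018)."""
    m = float("-inf")   # running maximum
    d = 0.0             # running normalizer: sum of exp(x_j - m) so far
    for x in xs:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)  # rescale, then add
        m = m_new
    return [math.exp(x - m) / d for x in xs]

print(online_softmax([1.0, 2.0, 3.0]))  # ≈ [0.090, 0.245, 0.665]
```

This same running-max trick is what lets FlashAttention-style kernels compute softmax over tiles without materializing the full score row.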