tspeterkim / paged-attention-minimalLinks

a minimal cache manager for PagedAttention, on top of llama3.

☆126

Alternatives and similar repositories for paged-attention-minimal

Users that are interested in paged-attention-minimal are comparing it to the libraries listed below

Sorting:

gpu-mode / triton-index
Cataloging released Triton kernels.
☆274Updated 2 months ago
dropbox / gemlite
Fast low-bit matmul kernels in Triton
☆401Updated last week
meta-pytorch / applied-ai
Applied AI experiments and examples for PyTorch
☆307Updated 3 months ago
Deep-Learning-Profiling-Tools / triton-viz
☆256Updated this week
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆169Updated 7 months ago
meta-pytorch / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆294Updated this week
MekkCyber / CutlassAcademy
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
☆244Updated 6 months ago
gpu-mode / ring-attention
ring-attention experiments
☆160Updated last year
MDK8888 / vllmini
A minimal implementation of vllm.
☆61Updated last year
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆217Updated last week
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆331Updated last year
ColfaxResearch / cutlass-kernels
☆246Updated last year
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆131Updated 6 months ago
efeslab / fiddler
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
☆243Updated last year
usyd-fsalab / fp6_llm
An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6,FP5).
☆272Updated 4 months ago
pranjalssh / fast.cu
Fastest kernels written from scratch
☆400Updated 2 months ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline.
☆123Updated last year
flagos-ai / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆283Updated last year
open-lm-engine / accelerated-model-architectures
A bunch of kernels that might make stuff slower 😉
☆65Updated this week
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆86Updated last year
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆143Updated 3 weeks ago
triton-lang / kernels
☆94Updated last year
stanford-futuredata / stk
☆113Updated last year
sgl-project / sglang-jax
JAX backend for SGL
☆185Updated this week
SqueezeAILab / KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆391Updated last year
microsoft / vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
☆444Updated 6 months ago
Dao-AILab / fast-hadamard-transform
Fast Hadamard transform in CUDA, with a PyTorch interface
☆261Updated last month
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆253Updated 2 months ago
HazyResearch / Megakernels
kernels, of the mega variety
☆614Updated 2 months ago
cchan / tccl
extensible collectives library in triton
☆91Updated 8 months ago