MDK8888 / vllmini

A minimal implementation of vllm.

☆50

Alternatives and similar repositories for vllmini

Users who are interested in vllmini are comparing it to the libraries listed below.

tspeterkim / paged-attention-minimal
a minimal cache manager for PagedAttention, on top of llama3.
☆98Updated 11 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆106Updated 2 months ago
pytorch-labs / applied-ai
Applied AI experiments and examples for PyTorch
☆289Updated 2 months ago
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆230Updated 8 months ago
gpu-mode / triton-index
Cataloging released Triton kernels.
☆247Updated 6 months ago
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
☆165Updated last year
Deep-Learning-Profiling-Tools / triton-viz
☆227Updated last week
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆216Updated last year
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆318Updated last year
EfficientMoE / MoE-Infinity
PyTorch library for cost-effective, fast and easy serving of MoE models.
☆215Updated 3 weeks ago
stanford-futuredata / stk
☆107Updated 11 months ago
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆206Updated last week
gpu-mode / ring-attention
ring-attention experiments
☆146Updated 9 months ago
microsoft / chunk-attention
☆78Updated 3 months ago
triton-lang / kernels
☆85Updated 8 months ago
shadowpa0327 / Palu
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
☆128Updated 5 months ago
RulinShao / LightSeq
Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training
☆212Updated 11 months ago
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆139Updated 4 months ago
sgl-project / genai-bench
Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serv…
☆180Updated this week
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆102Updated 4 months ago
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆123Updated 8 months ago
InternLM / turbomind
☆92Updated 4 months ago
pytorch-labs / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆193Updated this week
InternLM / Awesome-LLM-Training-System
☆42Updated 11 months ago
FasterDecoding / SnapKV
☆268Updated 3 weeks ago
microsoft / nnscaler
nnScaler: Compiling DNN models for Parallel Training
☆114Updated this week
Dao-AILab / grouped-latent-attention
☆123Updated 2 months ago
microsoft / sarathi-serve
A low-latency & high-throughput serving engine for LLMs
☆400Updated 2 months ago
thu-pacman / FasterMoE
☆86Updated 3 years ago
SqueezeAILab / KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆365Updated 11 months ago