mlsys-io / kv.run

A model serving framework for various research and production scenarios, built seamlessly on the PyTorch and HuggingFace ecosystem.

☆23

Alternatives and similar repositories for kv.run:

Users who are interested in kv.run are comparing it to the libraries listed below.

flashinfer-ai / cutlass-viz
☆55 Updated 2 weeks ago
flexflow / flexflow-serve
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
☆34 Updated last week
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆67 Updated 3 weeks ago
WukLab / preble
Stateful LLM Serving
☆63 Updated last month
LLMServe / SwiftTransformer
High performance Transformer implementation in C++.
☆119 Updated 3 months ago
uccl-project / uccl
Ultra | Ultimate | Unified CCL
☆59 Updated 2 months ago
InternLM / turbomind
☆82 Updated last month
microsoft / ParrotServe
[OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable
☆154 Updated 7 months ago
tyler-griggs / melange-release
☆45 Updated 10 months ago
PipeFusion / PipeFusion
A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters
☆45 Updated 9 months ago
xlite-dev / hgemm-tensorcores-mma
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆73 Updated 3 weeks ago
hao-ai-lab / vllm-ltr
[NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank
☆46 Updated 5 months ago
DD-DuDa / BitDecoding
A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.
☆33 Updated last month
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆108 Updated 7 months ago
feifeibear / ChituAttention
Quantized Attention on GPU
☆45 Updated 5 months ago
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.
☆36 Updated 3 weeks ago
CalvinXKY / mfu_calculation
A simple calculation for LLM MFU.
☆36 Updated last month
hao-ai-lab / MuxServe
☆59 Updated 10 months ago
LoongServe / LoongServe
☆95 Updated 5 months ago
interestingLSY / swiftLLM
A tiny yet powerful LLM inference system tailored for researching purpose. vLLM-equivalent performance with only 2k lines of code (2% of …
☆162 Updated 9 months ago
chenhongyu2048 / LLM-inference-optimization-paper
Summary of some awesome work for optimizing LLM inference
☆69 Updated 2 weeks ago
feifeibear / LLMRoofline
Compare different hardware platforms via the Roofline Model for LLM inference tasks.
☆97 Updated last year
project-etalon / etalon
LLM Serving Performance Evaluation Harness
☆77 Updated 2 months ago
BBuf / tensorrt-llm-moe
☆28 Updated 2 months ago
alibaba / easydist
Automated Parallelization System and Infrastructure for Multiple Ecosystems
☆78 Updated 5 months ago
FlagOpen / FlagCX
☆49 Updated this week
mlc-ai / mlc-python
☆31 Updated this week
microsoft / AttentionEngine
☆67 Updated this week
Chtholly-Boss / swizzle
A practical way of learning Swizzle
☆18 Updated 2 months ago
sgl-project / tensorrt-demo
TensorRT LLM Benchmark Configuration
☆13 Updated 9 months ago