tyler-griggs / melange-release

☆47

Alternatives and similar repositories for melange-release

Users who are interested in melange-release are comparing it to the libraries listed below.

project-etalon / etalon
LLM Serving Performance Evaluation Harness
☆79 Updated 8 months ago
hao-ai-lab / MuxServe
☆74 Updated last week
Hsword / SpotServe
SpotServe: Serving Generative Large Language Models on Preemptible Instances
☆129 Updated last year
WukLab / preble
Stateful LLM Serving
☆87 Updated 7 months ago
microsoft / ParrotServe
[OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable
☆186 Updated last year
UChi-JCL / CacheGen
☆136 Updated last year
eth-easl / deltazip
Compression for Foundation Models
☆35 Updated 3 months ago
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆130 Updated 10 months ago
hao-ai-lab / vllm-ltr
[NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank
☆60 Updated 11 months ago
yale-sys / prompt-cache
Modular and structured prompt caching for low-latency LLM inference
☆100 Updated 11 months ago
zhengzangw / Sequence-Scheduling
PyTorch implementation of paper "Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline".
☆92 Updated 2 years ago
SymbioticLab / Oobleck
A resilient distributed training framework
☆95 Updated last year
AutonomicPerfectionist / PipeInfer
PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
☆30 Updated 11 months ago
efeslab / fiddler
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
☆238 Updated 11 months ago
microsoft / tokenweave
Efficient Compute-Communication Overlap for Distributed LLM Inference
☆59 Updated 3 weeks ago
microsoft / chunk-attention
☆78 Updated 6 months ago
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆142 Updated 8 months ago
LoongServe / LoongServe
☆124 Updated 11 months ago
eddiegaoo / Apt-Serve
☆18 Updated 4 months ago
kvcache-ai / TrEnv-X
☆63 Updated last month
flexflow / flexflow-serve
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
☆63 Updated last month
zenrran4nlp / Awesome-LLM-Inference-Serving
☆43 Updated 5 months ago
microsoft / RetrievalAttention
Scalable long-context LLM decoding that leverages sparsity—by treating the KV cache as a vector storage system.
☆88 Updated last month
alibaba / ServeGen
A framework for generating realistic LLM serving workloads
☆71 Updated 2 weeks ago
PKU-SEC-Lab / HybriMoE
[DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference"
☆75 Updated 4 months ago
EfficientMoE / MoE-Infinity
PyTorch library for cost-effective, fast and easy serving of MoE models.
☆252 Updated last week
ovg-project / kvcached
Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond
☆104 Updated this week
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
☆169 Updated last year
tonyzhao-jt / LLM-PQ
Official Repo for "SplitQuant / LLM-PQ: Resource-Efficient LLM Offline Serving on Heterogeneous GPUs via Phase-Aware Model Partition and …
☆34 Updated last month
ruipeterpan / marconi
Artifact for "Marconi: Prefix Caching for the Era of Hybrid LLMs" [MLSys '25 Outstanding Paper Award, Honorable Mention]
☆39 Updated 7 months ago