cornserve-ai / cornserve
Easy, Fast, and Scalable Multimodal AI
☆97 · Updated last week
Alternatives and similar repositories for cornserve
Users interested in cornserve are comparing it to the libraries listed below.
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆137 · Updated last year
- Block Diffusion for Ultra-Fast Speculative Decoding ☆432 · Updated this week
- Efficient Long-context Language Model Training by Core Attention Disaggregation ☆80 · Updated last month
- KV cache compression for high-throughput LLM inference ☆150 · Updated 11 months ago
- ☆48 · Updated last year
- 🔥 LLM-powered GPU kernel synthesis: Train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation… ☆115 · Updated 2 months ago
- [NeurIPS 2025] Scaling Speculative Decoding with Lookahead Reasoning ☆62 · Updated 2 months ago
- ☆117 · Updated 8 months ago
- High-performance distributed data shuffling (all-to-all) library for MoE training and inference ☆109 · Updated 3 weeks ago
- An early research stage expert-parallel load balancer for MoE models based on linear programming. ☆491 · Updated 2 months ago
- dInfer: An Efficient Inference Framework for Diffusion Language Models ☆403 · Updated 3 weeks ago
- Official implementation for Training LLMs with MXFP4 ☆118 · Updated 9 months ago
- A unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference in vLLM ☆205 · Updated this week
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆258 · Updated last year
- Vortex: A Flexible and Efficient Sparse Attention Framework ☆45 · Updated last week
- Accelerating MoE with IO and Tile-aware Optimizations ☆563 · Updated last week
- Memory optimized Mixture of Experts ☆72 · Updated 6 months ago
- ☆117 · Updated 3 weeks ago
- Distributed MoE in a Single Kernel [NeurIPS '25] ☆188 · Updated this week
- LLM Serving Performance Evaluation Harness ☆83 · Updated 11 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆93 · Updated last year
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆176 · Updated last year
- [NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3) ☆193 · Updated last week
- ☆64 · Updated 8 months ago
- Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton. ☆148 · Updated 2 months ago
- Code for data-aware compression of DeepSeek models ☆69 · Updated last month
- AI-Driven Research Systems (ADRS) ☆117 · Updated last month
- Efficient LLM Inference over Long Sequences ☆394 · Updated 7 months ago
- Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆161 · Updated 3 months ago
- 16-fold memory access reduction with nearly no loss ☆109 · Updated 10 months ago