bentoml / BentoLMDeployLinks

Self-host LLMs with LMDeploy and BentoML

☆21

Alternatives and similar repositories for BentoLMDeploy

Users that are interested in BentoLMDeploy are comparing it to the libraries listed below

Sorting:

jeffreysijuntan / lloco
The official repo for "LLoCo: Learning Long Contexts Offline"
☆117Updated last year
wdlctc / headinfer
☆58Updated 4 months ago
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆130Updated 10 months ago
hao-ai-lab / Dynasor
[NeurIPS 2025] Simple extension on vLLM to help you speed up reasoning model without training.
☆196Updated 4 months ago
chu-tianxiang / QuIP-for-all
QuIP quantization
☆59Updated last year
samchaineau / llm_slerp_generation
Repo hosting codes and materials related to speeding LLMs' inference using token merging.
☆36Updated this week
nyunAI / PruneGPT
☆51Updated last year
IST-DASLab / QIGen
Repository for CPU Kernel Generation for LLM Inference
☆26Updated 2 years ago
Zoeyyao27 / SirLLM
This repository contains the code for the paper: SirLLM: Streaming Infinite Retentive LLM
☆60Updated last year
wdlctc / mini-s
☆52Updated 11 months ago
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆141Updated 8 months ago
hahnyuan / PB-LLM
PB-LLM: Partially Binarized Large Language Models
☆155Updated last year
frankxwang / dpo-prefix-sharing
DPO, but faster 🚀
☆45Updated 10 months ago
lfsszd / CS-Drafting
Cascade Speculative Drafting
☆31Updated last year
golololologol / LLM-Distillery
A pipeline for LLM knowledge distillation
☆109Updated 6 months ago
IST-DASLab / SparseFinetuning
Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry
☆42Updated last year
VITA-Group / WeLore
From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients. Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu,…
☆49Updated 5 months ago
vllm-project / speculators
A unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference in vLLM
☆53Updated this week
zenrran4nlp / Awesome-LLM-Inference-Serving
☆43Updated 5 months ago
itsnamgyu / block-transformer
Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
☆161Updated 5 months ago
LLM360 / crystalcoder-data-prep
Data preparation code for CrystalCoder 7B LLM
☆45Updated last year
fw-ai / benchmark
Benchmark suite for LLMs from Fireworks.ai
☆83Updated last week
snu-mllab / KVzip
[NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3)
☆114Updated last week
18907305772 / FuseAI
FuseAI Project
☆87Updated 8 months ago
UbiquitousLearning / SLM_Survey
☆97Updated last year
whyNLP / LCKV
Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance…
☆155Updated 6 months ago
kyegomez / Infini-attention
Implementation of the paper: "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention" from Google in pyTO…
☆56Updated this week
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆168Updated last year
SalesforceAIResearch / CodeTree
Code for the paper: CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models
☆28Updated 6 months ago
GreenBitAI / low_bit_llama
Advanced Ultra-Low Bitrate Compression Techniques for the LLaMA Family of LLMs
☆110Updated last year