bentoml / BentoLMDeploy
Self-host LLMs with LMDeploy and BentoML
☆22 · Updated last month
Alternatives and similar repositories for BentoLMDeploy
Users interested in BentoLMDeploy are comparing it to the libraries listed below.
- ☆55 · Updated 3 months ago
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆118 · Updated last year
- From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients. Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu, … ☆47 · Updated 4 months ago
- Simple extension on vLLM to speed up reasoning models without training ☆175 · Updated 2 months ago
- Data preparation code for the CrystalCoder 7B LLM ☆45 · Updated last year
- ☆38 · Updated 10 months ago
- Repo hosting code and materials related to speeding up LLM inference using token merging ☆36 · Updated last month
- The official implementation of the paper "Agentic-R1: Distilled Dual-Strategy Reasoning" ☆92 · Updated last month
- ☆51 · Updated last year
- ☆95 · Updated 10 months ago
- FuseAI Project ☆87 · Updated 6 months ago
- Code for the paper "SirLLM: Streaming Infinite Retentive LLM" ☆59 · Updated last year
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆124 · Updated 8 months ago
- Cascade Speculative Drafting ☆29 · Updated last year
- ☆51 · Updated 9 months ago
- A pipeline for LLM knowledge distillation ☆108 · Updated 4 months ago
- Spherical merging of PyTorch/HF-format language models with minimal feature loss ☆135 · Updated last year
- Data preparation code for the Amber 7B LLM ☆91 · Updated last year
- QuIP quantization ☆57 · Updated last year
- ☆85 · Updated 7 months ago
- [ICLR 2024] Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation ☆174 · Updated last year
- ☆110 · Updated 3 months ago
- KV cache compression for high-throughput LLM inference ☆134 · Updated 6 months ago
- Source code for "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS … ☆61 · Updated 10 months ago
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models ☆136 · Updated last year
- ☆94 · Updated 8 months ago
- ☆41 · Updated 3 months ago
- ☆127 · Updated last year
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ☆160 · Updated 4 months ago
- Repository for CPU kernel generation for LLM inference ☆26 · Updated 2 years ago