LLM-inference-router / vllm-router
vLLM Router
☆28 · Updated last year
Alternatives and similar repositories for vllm-router:
Users interested in vllm-router are comparing it to the libraries listed below.
- Benchmark suite for LLMs from Fireworks.ai ☆72 · Updated this week
- ☆84 · Updated last month
- A simple service that integrates vLLM with Ray Serve for fast and scalable LLM serving (see the Ray Serve sketch after this list). ☆65 · Updated last year
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆116 · Updated 5 months ago
- vLLM performance dashboard ☆27 · Updated last year
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models ☆132 · Updated 11 months ago
- Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang ☆51 · Updated 6 months ago
- ☆50 · Updated 5 months ago
- ☆188 · Updated this week
- ☆253 · Updated this week
- Pretrain, finetune and serve LLMs on Intel platforms with Ray ☆126 · Updated last week
- ☆54 · Updated this week
- Easy and Efficient Quantization for Transformers ☆197 · Updated 3 months ago
- KV cache compression for high-throughput LLM inference (see the int8 sketch after this list). ☆126 · Updated 3 months ago
- ☆53 · Updated 11 months ago
- Data preparation code for CrystalCoder 7B LLM ☆44 · Updated last year
- A toolkit for fine-tuning, inference, and evaluation of GreenBitAI's LLMs. ☆83 · Updated last month
- ☆118 · Updated last year
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆263 · Updated 7 months ago
- Modular and structured prompt caching for low-latency LLM inference ☆92 · Updated 6 months ago
- ☆73 · Updated 5 months ago
- Self-host LLMs with LMDeploy and BentoML ☆18 · Updated last month
- Inferflow is an efficient and highly configurable inference engine for large language models (LLMs). ☆243 · Updated last year
- vLLM adapter for a TGIS-compatible gRPC server. ☆27 · Updated this week
- ☆45 · Updated 2 weeks ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆109 · Updated this week
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs) ☆86 · Updated this week
- Fast LLM training codebase with dynamic strategy selection (DeepSpeed + Megatron + FlashAttention + fused CUDA kernels + compiler) ☆37 · Updated last year
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ and easy export to ONNX / ONNX Runtime ☆168 · Updated last month
- A simple extension on top of vLLM to speed up reasoning models without additional training ☆149 · Updated last week
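
The vLLM + Ray Serve entry above points at a pattern worth spelling out: wrap a vLLM engine in a Ray Serve deployment so Serve's HTTP proxy load-balances requests across GPU-pinned replicas. Below is a minimal sketch of that pattern, not the linked project's actual code; the class name, model id, and request shape are illustrative assumptions.

```python
# Minimal sketch of the vLLM + Ray Serve pattern (not the linked project's code).
# Each Serve replica owns one vLLM engine pinned to one GPU; Serve's HTTP proxy
# load-balances incoming requests across replicas.
from ray import serve
from starlette.requests import Request
from vllm import LLM, SamplingParams


@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
class VLLMServer:
    def __init__(self):
        # Model id is an illustrative assumption; any HF model id works here.
        self.llm = LLM(model="facebook/opt-125m")

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        params = SamplingParams(max_tokens=int(body.get("max_tokens", 64)))
        # Note: LLM.generate is blocking; a production service would likely
        # use vllm.AsyncLLMEngine instead.
        outputs = self.llm.generate([body["prompt"]], params)
        return {"text": outputs[0].outputs[0].text}


app = VLLMServer.bind()
# serve.run(app)  # then: curl -d '{"prompt": "hello"}' http://localhost:8000/
```

Scaling out is then a matter of raising `num_replicas`; each replica holds its own engine and KV cache.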
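The KV-cache-compression entry covers a family of techniques; the simplest to picture is per-token int8 quantization of the cached key/value tensors, trading a little accuracy for a 4x smaller fp32 cache. The sketch below illustrates that general idea only and is not the linked repository's algorithm.

```python
# Generic illustration of KV cache compression via per-token int8 quantization;
# this shows the general idea, not the linked repository's method.
import torch


def quantize_kv(kv: torch.Tensor):
    """Quantize a [tokens, heads, head_dim] float tensor to int8, per token."""
    # One scale per token: max-abs over the head and channel dimensions.
    scale = kv.abs().amax(dim=(-2, -1), keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(kv / scale).to(torch.int8)
    return q, scale  # store q (1 byte/elem) plus a small fp scale per token


def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale


kv = torch.randn(16, 8, 64)  # 16 cached tokens, 8 heads, head_dim 64
q, scale = quantize_kv(kv)
err = (dequantize_kv(q, scale) - kv).abs().max()
print(f"4x smaller cache, max abs error {err:.4f}")
```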