bentoml / llm-bench
☆45 · Updated 3 months ago
Alternatives and similar repositories for llm-bench:
Users interested in llm-bench are comparing it to the libraries listed below.
- ☆52 · Updated 5 months ago
- ☆172 · Updated 4 months ago
- Benchmark suite for LLMs from Fireworks.ai ☆66 · Updated last week
- The driver for LMCache core to run in vLLM ☆29 · Updated 2 weeks ago
- ☆117 · Updated 11 months ago
- LLM Serving Performance Evaluation Harness ☆68 · Updated this week
- Pretrain, finetune and serve LLMs on Intel platforms with Ray ☆112 · Updated last week
- ☆67 · Updated 2 months ago
- A simple service that integrates vLLM with Ray Serve for fast and scalable LLM serving. ☆62 · Updated 10 months ago
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆93 · Updated 11 months ago
- A low-latency & high-throughput serving engine for LLMs ☆312 · Updated 3 weeks ago
- ☆224 · Updated this week
- Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang ☆35 · Updated 3 months ago
- Materials for learning SGLang ☆265 · Updated 2 weeks ago
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆107 · Updated 2 months ago
- Modular and structured prompt caching for low-latency LLM inference ☆87 · Updated 3 months ago
- Manages the vllm-nccl dependency ☆17 · Updated 8 months ago
- Comparison of Language Model Inference Engines ☆204 · Updated 2 months ago
- Easy and Efficient Quantization for Transformers ☆193 · Updated 2 weeks ago
- Dynamic batching library for Deep Learning inference. Tutorials for LLM, GPT scenarios. ☆92 · Updated 6 months ago
- ☆59 · Updated 2 weeks ago
- ☆117 · Updated 9 months ago
- ☆43 · Updated 7 months ago
- Summary of system papers/frameworks/code/tools for training or serving large models ☆56 · Updated last year
- Boosting 4-bit inference kernels with 2:4 sparsity ☆64 · Updated 5 months ago
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind ☆87 · Updated 11 months ago
- Efficient and easy multi-instance LLM serving ☆295 · Updated this week
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆290 · Updated this week
- KV cache compression for high-throughput LLM inference ☆115 · Updated 2 weeks ago
- PyTorch library for cost-effective, fast and easy serving of MoE models ☆132 · Updated this week