bentoml / llm-bench
☆34 · Updated 3 months ago
Related projects
Alternatives and complementary repositories for llm-bench
- Benchmark suite for LLMs from Fireworks.ai ☆58 · Updated 2 weeks ago
- Materials for learning SGLang ☆96 · Updated this week
- Pretrain, finetune and serve LLMs on Intel platforms with Ray ☆103 · Updated last week
- LLM Serving Performance Evaluation Harness ☆56 · Updated 2 months ago
- Ultra-Fast and Cheaper Long-Context LLM Inference ☆233 · Updated this week
- A low-latency & high-throughput serving engine for LLMs ☆245 · Updated 2 months ago
- Comparison of Language Model Inference Engines ☆190 · Updated 2 months ago
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆238 · Updated last week
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆51 · Updated 2 months ago
- Easy and Efficient Quantization for Transformers ☆180 · Updated 4 months ago
- Modular and structured prompt caching for low-latency LLM inference ☆66 · Updated last week
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆50 · Updated this week
- Breaking the Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆78 · Updated this week
- Compare different hardware platforms via the Roofline Model for LLM inference tasks ☆75 · Updated 8 months ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆305 · Updated 3 months ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆278 · Updated 4 months ago
- Efficient and easy multi-instance LLM serving ☆213 · Updated this week
- A large-scale simulation framework for LLM inference ☆277 · Updated last month
- KV cache compression for high-throughput LLM inference ☆87 · Updated this week
- IBM development fork of https://github.com/huggingface/text-generation-inference ☆57 · Updated last month
- OpenAI-compatible API for the TensorRT-LLM Triton backend ☆177 · Updated 3 months ago