openshift-psap / auto-tuning-vllm
Auto-tuning for vLLM: getting the best performance out of your LLM deployment (vLLM + GuideLLM + Optuna).
☆32 · Updated this week
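The tagline suggests the overall pattern: sweep vLLM server settings with Optuna and score each candidate with a load-generation run (e.g. GuideLLM). Below is a minimal sketch of that loop. The Optuna calls and the vLLM parameter names are real, but `run_benchmark` is a hypothetical stand-in with a placeholder score, and the search ranges are illustrative assumptions, not the repository's actual API or recommended values.

```python
import optuna


def run_benchmark(config: dict) -> float:
    # Placeholder score so the sketch runs end-to-end; a real harness would
    # launch a vLLM server with `config`, drive load against it (e.g. with
    # GuideLLM), and return a metric such as output tokens per second.
    return config["gpu_memory_utilization"] * config["max_num_seqs"]


def objective(trial: optuna.Trial) -> float:
    # The keys correspond to real vLLM engine arguments; the ranges here
    # are illustrative assumptions for the sketch.
    config = {
        "gpu_memory_utilization": trial.suggest_float("gpu_memory_utilization", 0.70, 0.95),
        "max_num_seqs": trial.suggest_int("max_num_seqs", 64, 512, step=64),
        "max_num_batched_tokens": trial.suggest_categorical(
            "max_num_batched_tokens", [2048, 4096, 8192, 16384]
        ),
    }
    return run_benchmark(config)


study = optuna.create_study(direction="maximize")  # maximize throughput
study.optimize(objective, n_trials=25)
print("Best configuration found:", study.best_params)
```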
Alternatives and similar repositories for auto-tuning-vllm
Users interested in auto-tuning-vllm are comparing it to the repositories listed below.
- ☆51 · Updated 5 months ago
- 🚀 Collection of tuning recipes with HuggingFace SFTTrainer and PyTorch FSDP. ☆56 · Updated last month
- Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs ☆819 · Updated this week
- IBM development fork of https://github.com/huggingface/text-generation-inference ☆63 · Updated 4 months ago
- Taxonomy tree that will allow you to create models tuned with your data ☆290 · Updated 4 months ago
- InstructLab Training Library - Efficient Fine-Tuning with Message-Format Data ☆47 · Updated this week
- Python library for Synthetic Data Generation ☆52 · Updated last month
- ☆278 · Updated last week
- vLLM adapter for a TGIS-compatible gRPC server. ☆50 · Updated this week
- GitHub bot to assist with the taxonomy contribution workflow ☆17 · Updated last year
- Examples for building and running LLM services and applications locally with Podman ☆190 · Updated 5 months ago
- Self-host LLMs with vLLM and BentoML ☆167 · Updated last week
- Synthetic Data Generation Toolkit for LLMs ☆88 · Updated last week
- Route LLM requests to the best model for the task at hand. ☆171 · Updated 2 weeks ago
- Inference server benchmarking tool ☆141 · Updated 3 months ago
- Python library for Evaluation ☆16 · Updated this week
- llm-d benchmark scripts and tooling ☆42 · Updated this week
- A collection of all available inference solutions for LLMs ☆94 · Updated 10 months ago
- A simple service that integrates vLLM with Ray Serve for fast and scalable LLM serving. ☆78 · Updated last year
- Kubernetes enhancements for Network Topology Aware Gang Scheduling & Autoscaling ☆155 · Updated last week
- A unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference in vLLM ☆205 · Updated last week
- Distributed Model Serving Framework ☆182 · Updated 4 months ago
- ClearML Fractional GPU - Run multiple containers on the same GPU with driver-level memory limitation ✨ and compute time-slicing ☆88 · Updated 2 months ago
- Accelerating your LLM training to full speed! Made with ❤️ by ServiceNow Research ☆282 · Updated this week
- Open Model Engine (OME) — Kubernetes operator for LLM serving, GPU scheduling, and model lifecycle management. Works with SGLang, vLLM, T… ☆365 · Updated this week
- GenAI Studio is a low code platform to enable users to construct, evaluate, and benchmark GenAI applications. The platform also provide c… ☆58 · Updated 2 weeks ago
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆379 · Updated last week
- ☆56 · Updated last year
- GenAI components at micro-service level; GenAI service composer to create mega-service ☆193 · Updated last week
- ☆23 · Updated 10 months ago