run-ai / runai-model-streamer
☆ 254 · Updated 2 weeks ago
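For context, runai-model-streamer is a library for streaming model weights (e.g. safetensors files) from storage into memory concurrently to reduce model load times. A minimal usage sketch, assuming the `SafetensorsStreamer` API described in the project's README (the file path is a placeholder):

```python
# Minimal sketch: stream tensors from a .safetensors file with runai-model-streamer.
# Assumes the SafetensorsStreamer interface from the project's README; adjust the
# path and tensor handling to your own setup.
from runai_model_streamer import SafetensorsStreamer

file_path = "/path/to/model.safetensors"  # placeholder path

with SafetensorsStreamer() as streamer:
    streamer.stream_file(file_path)            # start streaming tensors from storage
    for name, tensor in streamer.get_tensors():  # tensors are yielded as they arrive
        print(name, tensor.shape)
```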
Alternatives and similar repositories for runai-model-streamer
Users interested in runai-model-streamer are comparing it to the libraries listed below.
- Module, Model, and Tensor Serialization/Deserialization ☆ 268 · Updated last month
- CUDA checkpoint and restore utility ☆ 371 · Updated 2 weeks ago
- Inference server benchmarking tool ☆ 112 · Updated 5 months ago
- OME is a Kubernetes operator for enterprise-grade management and serving of Large Language Models (LLMs) ☆ 279 · Updated this week
- ☆ 314 · Updated last year
- Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs ☆ 596 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆ 266 · Updated 11 months ago
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ☆ 410 · Updated this week
- A Lossless Compression Library for AI pipelines ☆ 283 · Updated 3 months ago
- Benchmark suite for LLMs from Fireworks.ai ☆ 83 · Updated this week
- OpenAI compatible API for TensorRT LLM triton backend ☆ 215 · Updated last year
- GPUd automates monitoring, diagnostics, and issue identification for GPUs ☆ 436 · Updated this week
- Pretrain, finetune and serve LLMs on Intel platforms with Ray ☆ 132 · Updated last week
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆ 267 · Updated this week
- High-performance safetensors model loader ☆ 62 · Updated 2 months ago
- ☆ 40 · Updated this week
- A tool to configure, launch and manage your machine learning experiments. ☆ 193 · Updated last week
- NVIDIA NCCL Tests for Distributed Training ☆ 111 · Updated last week
- ☆ 298 · Updated this week
- Common recipes to run vLLM ☆ 146 · Updated this week
- ☆ 56 · Updated 10 months ago
- Google TPU optimizations for transformers models ☆ 120 · Updated 8 months ago
- A collection of all available inference solutions for the LLMs ☆ 91 · Updated 7 months ago
- JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs wel… ☆ 380 · Updated 3 months ago
- GPU environment and cluster management with LLM support ☆ 641 · Updated last year
- An innovative library for efficient LLM inference via low-bit quantization ☆ 348 · Updated last year
- A unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference in vLLM ☆ 50 · Updated this week
- ClearML Fractional GPU - Run multiple containers on the same GPU with driver level memory limitation ✨ and compute time-slicing ☆ 80 · Updated last year
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ☆ 315 · Updated last week
- A simple service that integrates vLLM with Ray Serve for fast and scalable LLM serving. ☆ 72 · Updated last year