mosecorg / mosec
A high-performance ML model serving framework that offers dynamic batching and CPU/GPU pipelines to fully exploit your compute resources
☆808 · Updated this week
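Mosec's dynamic batching is easiest to see in code. The sketch below follows the `Server`/`Worker` pattern from mosec's README; the body of `forward` is a hypothetical placeholder, and exact parameter defaults may vary between versions.

```python
from mosec import Server, Worker


class Inference(Worker):
    # With max_batch_size > 1, mosec groups concurrent requests,
    # so forward receives a list and must return a list of equal length.
    def forward(self, data: list) -> list:
        # Hypothetical placeholder: swap in a real batched model call.
        return [{"echo_length": len(str(item))} for item in data]


if __name__ == "__main__":
    server = Server()
    # Each append_worker call adds a pipeline stage; num sets process parallelism.
    server.append_worker(Inference, num=2, max_batch_size=16)
    server.run()
```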
Alternatives and similar repositories for mosec:
Users interested in mosec are comparing it to the libraries listed below.
- PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments (see the sketch after this list). ☆763 · Updated last month
- Triton backend that enables pre-processing, post-processing, and other logic to be implemented in Python. ☆576 · Updated this week
- Model Deployment at Scale on Kubernetes 🦄️ ☆791 · Updated 8 months ago
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. ☆1,942 · Updated last month
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters ☆1,777 · Updated 11 months ago
- This repository contains tutorials and examples for Triton Inference Server. ☆623 · Updated this week
- LLMPerf is a library for validating and benchmarking LLMs. ☆703 · Updated last month
- Serving multiple LoRA fine-tuned LLMs as one. ☆1,012 · Updated 8 months ago
- Triton Model Analyzer is a CLI tool to help with better understanding of the compute and memory requirements of the Triton Inference Server models. ☆446 · Updated this week
- The Triton TensorRT-LLM Backend ☆745 · Updated last week
- ggml implementation of BERT ☆474 · Updated 10 months ago
- Common source, scripts and utilities for creating Triton backends. ☆305 · Updated this week
- A blazing-fast inference solution for text embedding models ☆3,043 · Updated last week
- Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀 ☆1,671 · Updated 2 months ago
- RayLLM - LLMs on Ray ☆1,247 · Updated 7 months ago
- LLM inference benchmark ☆377 · Updated 5 months ago
- ⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel platforms. ☆2,152 · Updated 3 months ago
- Infinity is a high-throughput, low-latency serving engine for text embeddings, reranking models, CLIP, CLAP, and ColPali. ☆1,714 · Updated this week
- Fast Inference Solutions for BLOOM ☆563 · Updated 3 months ago
- A throughput-oriented high-performance serving framework for LLMs ☆692 · Updated 3 months ago
- Efficient, Flexible and Portable Structured Generation ☆570 · Updated this week
- Triton Python, C++, and Java client libraries, and gRPC-generated client examples for Go, Java, and Scala. ☆585 · Updated this week
- A library for building and serving multi-node distributed faiss indices. ☆260 · Updated last year
- Large-scale model inference. ☆628 · Updated last year
- Autoscale LLM inference (vLLM, SGLang, LMDeploy) on Kubernetes (and others) ☆247 · Updated last year
- OpenAI-compatible API for the TensorRT-LLM Triton backend ☆186 · Updated 5 months ago
- 🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy-to-use hardware optimization tools. ☆2,667 · Updated this week
- Automatically split your PyTorch models on multiple GPUs for training & inference ☆643 · Updated last year
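For comparison with mosec's worker model, here is a minimal PyTriton-style sketch, referenced from the PyTriton entry above. It follows the `Triton`/`bind` quickstart pattern from PyTriton's documentation; the doubling model is a hypothetical stand-in, and the tensor names and shapes are illustrative assumptions.

```python
import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


@batch  # PyTriton collects concurrent requests into numpy batches
def infer_fn(input_1):
    # Hypothetical model: doubles the input tensor.
    return {"output_1": input_1 * 2.0}


with Triton() as triton:
    triton.bind(
        model_name="doubler",  # illustrative name
        infer_func=infer_fn,
        inputs=[Tensor(name="input_1", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="output_1", dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=128),
    )
    triton.serve()  # blocks, serving requests over Triton's HTTP/gRPC endpoints
```

The design difference the two sketches illustrate: mosec composes batching and pipelining from `Worker` stages in its own server, while PyTriton wraps a plain Python function and hands the batching and transport to a Triton Inference Server instance running underneath.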