mosecorg / mosec
A high-performance ML model serving framework that offers dynamic batching and CPU/GPU pipelines to fully exploit your compute hardware
☆840 · Updated last week
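For a sense of the interface, here is a minimal sketch of a mosec service with dynamic batching, based on mosec's documented Python API; the `Inference` worker and its echo logic are illustrative placeholders, not a real model:

```python
from mosec import Server, Worker


class Inference(Worker):
    """Placeholder worker; swap the echo logic for a real model call."""

    def forward(self, data: list) -> list:
        # With max_batch_size > 1, mosec delivers a list of requests that
        # were dynamically batched within the wait window and expects a
        # list of responses in the same order.
        return [{"echo": item} for item in data]


if __name__ == "__main__":
    server = Server()
    # Each append_worker call adds one pipeline stage: `num` controls the
    # number of worker processes, `max_batch_size` enables dynamic batching.
    server.append_worker(Inference, num=2, max_batch_size=16)
    server.run()
```

Stages appended one after another form the CPU/GPU pipeline, e.g. a CPU-bound preprocessing worker followed by a GPU-bound inference worker, each scaled independently.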
Alternatives and similar repositories for mosec:
Users interested in mosec are comparing it to the libraries listed below.
- PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments (see the sketch after this list). ☆790 · Updated 2 months ago
- RayLLM - LLMs on Ray (Archived). Read the README for more info. ☆1,261 · Updated last month
- ☆411 · Updated last year
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. ☆2,009 · Updated last month
- The Triton TensorRT-LLM Backend ☆832 · Updated this week
- Fast Inference Solutions for BLOOM ☆561 · Updated 7 months ago
- Large-scale model inference. ☆629 · Updated last year
- Serving multiple LoRA fine-tuned LLMs as one ☆1,058 · Updated last year
- Triton backend that enables pre-processing, post-processing, and other logic to be implemented in Python. ☆608 · Updated this week
- A high-performance inference system for large language models, designed for production environments. ☆437 · Updated this week
- Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀 ☆1,683 · Updated 6 months ago
- AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. ☆2,145 · Updated last week
- Ongoing research training transformer language models at scale, including: BERT & GPT-2 ☆1,386 · Updated last year
- LLMPerf is a library for validating and benchmarking LLMs ☆900 · Updated 5 months ago
- This repository contains tutorials and examples for Triton Inference Server ☆695 · Updated 3 weeks ago
- ☆462 · Updated last month
- Efficient AI Inference & Serving ☆470 · Updated last year
- Bagua Speeds up PyTorch ☆883 · Updated 9 months ago
- Infinity is a high-throughput, low-latency serving engine for text embeddings, reranking models, CLIP, CLAP, and ColPali ☆2,138 · Updated 3 weeks ago
- ☆1,026 · Updated last year
- Model Deployment at Scale on Kubernetes 🦄️ ☆811 · Updated last year
- Triton Model Analyzer is a CLI tool for understanding the compute and memory requirements of Triton Inference Server models. ☆474 · Updated 2 weeks ago
- Finetuning Large Language Models on One Consumer GPU in 2 Bits ☆723 · Updated 11 months ago
- INT4/INT5/INT8 and FP16 inference on CPU for the RWKV language model ☆1,514 · Updated last month
- A Python vector database you just need - no more, no less. ☆610 · Updated last year
- OpenAI-compatible API for the TensorRT-LLM Triton backend ☆205 · Updated 9 months ago
- [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration ☆2,991 · Updated this week
- Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training ☆1,792 · Updated this week
- Open Academic Research on Improving LLaMA to SOTA LLM ☆1,620 · Updated last year
- LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions ☆820 · Updated 2 years ago
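For the PyTriton entry above, a minimal sketch of its Flask/FastAPI-like binding style, based on PyTriton's documented API; the model name, tensor names, and the doubling function are illustrative placeholders:

```python
import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


@batch
def infer_fn(input_1):
    # Placeholder "model": doubles the batched input array.
    return {"output_1": input_1 * 2.0}


with Triton() as triton:
    # bind() registers a Python callable as a Triton model, much like
    # registering a route in Flask/FastAPI.
    triton.bind(
        model_name="Doubler",
        infer_func=infer_fn,
        inputs=[Tensor(name="input_1", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="output_1", dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=16),
    )
    triton.serve()
```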