mosecorg / mosec
A high-performance ML model serving framework that offers dynamic batching and CPU/GPU pipelines to fully exploit your compute resources.
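The dynamic batching mentioned above can be sketched in pure Python: individual requests are queued, then flushed to the model as one batch once the batch is full or a short wait deadline expires. This is a simplified illustration of the idea, not mosec's actual API — `DynamicBatcher`, `submit`, and `run_once` are hypothetical names chosen for this sketch.

```python
import queue
import threading
import time

class DynamicBatcher:
    """Toy dynamic batcher: groups single requests into batches,
    flushing when the batch is full or the wait deadline expires.
    (Illustrative only; mosec implements this natively in Rust.)"""

    def __init__(self, handler, max_batch_size=8, max_wait_ms=10):
        self.handler = handler                  # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait = max_wait_ms / 1000.0
        self._queue = queue.Queue()

    def submit(self, item):
        """Enqueue one request; returns (done_event, result_holder)."""
        done, result = threading.Event(), {}
        self._queue.put((item, done, result))
        return done, result

    def run_once(self):
        """Drain up to max_batch_size requests within the wait window,
        run the handler on the whole batch, and fan results back out."""
        batch = [self._queue.get()]             # block for the first request
        deadline = time.monotonic() + self.max_wait
        while len(batch) < self.max_batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(self._queue.get(timeout=remaining))
            except queue.Empty:
                break
        outputs = self.handler([item for item, _, _ in batch])
        for (_, done, result), out in zip(batch, outputs):
            result["value"] = out
            done.set()

# Usage: three requests arrive, are batched, and each gets its own result back.
batcher = DynamicBatcher(lambda xs: [x * 2 for x in xs], max_batch_size=4)
futures = [batcher.submit(i) for i in range(3)]
batcher.run_once()
```

Batching trades a small per-request wait (bounded by `max_wait_ms`) for much higher throughput, since the model processes many inputs per forward pass.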
Related projects:
- Serving multiple LoRA finetuned LLMs as one
- Triton backend that enables pre-processing, post-processing and other logic to be implemented in Python.
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
- PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments.
- FlashInfer: Kernel Library for LLM Serving
- Fast Inference Solutions for BLOOM
- A Data Streaming Library for Efficient Neural Network Training
- LLMPerf is a library for validating and benchmarking LLMs
- Large-scale model inference.
- The Triton TensorRT-LLM Backend
- 🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy-to-use hardware optimization tools
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters
- RayLLM - LLMs on Ray
- A throughput-oriented high-performance serving framework for LLMs
- A high-performance inference system for large language models, designed for production environments.
- Ongoing research training transformer language models at scale, including: BERT & GPT-2
- This repository contains tutorials and examples for Triton Inference Server
- LLM Inference benchmark
- Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀
- Official repository for LongChat and LongEval
- Triton Model Analyzer is a CLI tool to help with better understanding of the compute and memory requirements of the Triton Inference Serv…
- A lightweight version of Milvus
- Triton Python, C++ and Java client libraries, and GRPC-generated client examples for Go, Java and Scala.
- A Survey of AI startups
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
- A blazing fast inference solution for text embeddings models