mosecorg / mosec
A high-performance ML model serving framework that offers dynamic batching and CPU/GPU pipelines to fully exploit your compute resources
☆843 · Updated last week
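To make the batching claim concrete, here is a minimal sketch using mosec's documented `Server`/`Worker` API; the worker class, batch size, and wait time are illustrative choices, not mosec defaults.

```python
from mosec import Server, Worker


class Inference(Worker):
    # With max_batch_size > 1, mosec delivers a list of decoded
    # requests and expects a list of responses of the same length.
    def forward(self, data: list) -> list:
        return [{"length": len(str(item))} for item in data]


if __name__ == "__main__":
    server = Server()
    # Dynamic batching: requests arriving within the wait window
    # (milliseconds) are grouped into one forward() call.
    server.append_worker(Inference, max_batch_size=8, max_wait_time=10)
    server.run()
```

Running the script starts an HTTP endpoint; concurrent requests that land within the wait window are served in a single batched `forward()` call.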
Alternatives and similar repositories for mosec
Users interested in mosec are comparing it to the libraries listed below.
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. ☆2,020 · Updated 2 months ago
- Serving multiple LoRA fine-tuned LLMs as one ☆1,066 · Updated last year
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters ☆1,835 · Updated last year
- PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments. ☆801 · Updated 4 months ago
- ☆411 · Updated last year
- ⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Pl… ☆2,169 · Updated 8 months ago
- 🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy to use hardware optimization… ☆2,942 · Updated this week
- Autoscale LLM inference (vLLM, SGLang, LMDeploy) on Kubernetes (and other platforms) ☆269 · Updated last year
- A fast llama2 decoder in pure Rust. ☆1,051 · Updated last year
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ☆1,258 · Updated 3 months ago
- RayLLM - LLMs on Ray (Archived). Read README for more info. ☆1,261 · Updated 3 months ago
- Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀 ☆1,688 · Updated 8 months ago
- Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training ☆1,803 · Updated this week
- A high-performance inference system for large language models, designed for production environments. ☆448 · Updated this week
- Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs ☆3,024 · Updated last month
- Automatically split your PyTorch models on multiple GPUs for training & inference ☆655 · Updated last year
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ☆304 · Updated 3 weeks ago
- Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackab… ☆1,571 · Updated last year
- LLMPerf is a library for validating and benchmarking LLMs ☆940 · Updated 6 months ago
- The Triton TensorRT-LLM Backend ☆851 · Updated this week
- Minimalistic large language model 3D-parallelism training ☆1,926 · Updated last week
- ggml implementation of BERT ☆493 · Updated last year
- Triton backend that enables pre-processing, post-processing and other logic to be implemented in Python. ☆619 · Updated last week
- A throughput-oriented high-performance serving framework for LLMs ☆825 · Updated 2 weeks ago
- ☆543 · Updated 6 months ago
- AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference (see the quantization sketch after this list). Documentation: ☆2,193 · Updated last month
- A fast cross-platform AI inference engine 🤖 using Rust 🦀 and WebGPU 🎮 ☆451 · Updated 5 months ago
- Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads ☆2,549 · Updated 11 months ago
- Efficient AI Inference & Serving ☆471 · Updated last year
- A Python-level JIT compiler designed to make unmodified PyTorch programs faster. ☆1,051 · Updated last year
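As referenced in the AutoAWQ entry above, the following sketch follows the quantize-then-save flow from AutoAWQ's examples; the checkpoint path and output directory are placeholders.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"  # placeholder checkpoint
quant_dir = "mistral-7b-awq"              # placeholder output dir

# 4-bit AWQ settings as shown in AutoAWQ's examples.
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibrate and quantize the weights, then persist both artifacts.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_dir)
tokenizer.save_pretrained(quant_dir)
```

The saved directory can then be loaded for inference with the quantized weights, which is where the advertised speedup over FP16 applies.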