triton-inference-server / vllm_backend
☆191 · Updated this week
Related projects
Alternatives and complementary repositories for vllm_backend
- OpenAI-compatible API for the TensorRT-LLM Triton backend · ☆177 · Updated 3 months ago
- Easy and Efficient Quantization for Transformers · ☆180 · Updated 4 months ago
- The Triton TensorRT-LLM Backend · ☆706 · Updated this week
- LLMPerf is a library for validating and benchmarking LLMs · ☆645 · Updated 3 months ago
- FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups at medium batch sizes of 16-32 tokens · ☆624 · Updated 2 months ago
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… · ☆257 · Updated this week
- A throughput-oriented high-performance serving framework for LLMs · ☆636 · Updated 2 months ago
- Comparison of Language Model Inference Engines · ☆190 · Updated 2 months ago
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ and easy export to ONNX/ONNX Runtime · ☆149 · Updated last month
- The Triton backend for the ONNX Runtime · ☆132 · Updated this week
- 🕹️ Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models · ☆134 · Updated 3 months ago
- Inferflow is an efficient and highly configurable inference engine for large language models (LLMs) · ☆236 · Updated 8 months ago
- Experiments with inference on llama · ☆105 · Updated 5 months ago
- Pretrain, fine-tune, and serve LLMs on Intel platforms with Ray · ☆103 · Updated last week
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving · ☆443 · Updated last week
- Common source, scripts and utilities for creating Triton backends · ☆295 · Updated this week
- Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs · ☆165 · Updated 2 weeks ago
- A high-throughput and memory-efficient inference and serving engine for LLMs · ☆253 · Updated last month
- A high-performance inference system for large language models, designed for production environments · ☆392 · Updated 2 weeks ago
- Dynamic batching library for deep learning inference, with tutorials for LLM and GPT scenarios · ☆86 · Updated 3 months ago
- Latency and Memory Analysis of Transformer Models for Training and Inference · ☆355 · Updated last week
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization · ☆305 · Updated 3 months ago
- Ultra-Fast and Cheaper Long-Context LLM Inference · ☆233 · Updated this week
- Benchmark suite for LLMs from Fireworks.ai · ☆58 · Updated 2 weeks ago
- A collection of all available inference solutions for LLMs · ☆73 · Updated 2 months ago
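
For context on the host repository itself: the vLLM backend lets Triton serve models through vLLM's engine over Triton's standard HTTP/gRPC endpoints. Below is a minimal sketch of calling such a deployment via the HTTP generate endpoint; the model name `vllm_model`, port 8000, and the `text_input`/`text_output` field names follow the example configuration in the backend's documentation and may differ in your deployment.

```python
# Minimal sketch: query a Triton server running the vLLM backend through the
# HTTP generate endpoint. Assumes a model named "vllm_model" is deployed and
# the server listens on port 8000 (documented defaults; adjust as needed).
import requests

resp = requests.post(
    "http://localhost:8000/v2/models/vllm_model/generate",
    json={
        "text_input": "What is the Triton Inference Server?",
        "parameters": {"stream": False, "temperature": 0},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["text_output"])  # the generated continuation of the prompt
```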