OpenAI-compatible API for the TensorRT-LLM Triton backend
☆219 · Aug 1, 2024 · Updated last year
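Because openai_trtllm exposes an OpenAI-compatible API, any OpenAI client can talk to a Triton-hosted TensorRT-LLM model through it. The following is a minimal sketch of the kind of chat-completion request such a server accepts; the host, port, endpoint path, and model name `ensemble` are illustrative assumptions, not details taken from this page.

```python
import json

# Hypothetical endpoint of a locally running openai_trtllm server (assumption).
BASE_URL = "http://localhost:3000/v1/chat/completions"

# The payload follows the OpenAI chat-completions schema; "ensemble" is the
# conventional name of the Triton ensemble model (assumption).
payload = {
    "model": "ensemble",
    "messages": [{"role": "user", "content": "Hello, Triton!"}],
    "max_tokens": 64,
    "stream": False,
}

body = json.dumps(payload)
# In practice you would POST `body` to BASE_URL with an OpenAI SDK client or
# an HTTP library; here we only show the request shape.
print(BASE_URL)
print(body)
```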
Alternatives and similar repositories for openai_trtllm
Users that are interested in openai_trtllm are comparing it to the libraries listed below.
- The Triton TensorRT-LLM Backend ☆929 · Mar 17, 2026 · Updated last week
- ☆334 · Mar 17, 2026 · Updated last week
- High-level API for tar-based dataset ☆12 · Feb 3, 2024 · Updated 2 years ago
- LLM deployment in practice: TensorRT-LLM, Triton Inference Server, vLLM ☆27 · Feb 26, 2024 · Updated 2 years ago
- AI Router ☆14 · Aug 1, 2024 · Updated last year
- This reference can be used with any existing OpenAI integrated apps to run with TRT-LLM inference locally on GeForce GPU on Windows inste… ☆128 · Feb 29, 2024 · Updated 2 years ago
- ☆28 · Nov 6, 2024 · Updated last year
- TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizat… ☆13,169 · Mar 23, 2026 · Updated last week
- The driver for LMCache core to run in vLLM ☆63 · Feb 4, 2025 · Updated last year
- TensorRT-LLM server with Structured Outputs (JSON) built with Rust ☆69 · Apr 25, 2025 · Updated 11 months ago
- Triton CLI is an open source command line interface that enables users to create, deploy, and profile models served by the Triton Inferen… ☆74 · Mar 10, 2026 · Updated 2 weeks ago
- MERA (Multimodal Evaluation for Russian-language Architectures) is a new open benchmark for the Russian language for evaluating SOTA mode… ☆41 · Mar 10, 2026 · Updated 2 weeks ago
- FlashInfer: Kernel Library for LLM Serving ☆5,231 · Updated this week
- Experiment with NVIDIA Triton and Whisper ☆15 · Apr 29, 2024 · Updated last year
- A unified library of SOTA model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresse… ☆2,258 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆17 · Jun 3, 2024 · Updated last year
- LMDeploy is a toolkit for compressing, deploying, and serving LLMs. ☆7,711 · Mar 22, 2026 · Updated last week
- The DL Streamer Pipeline Zoo is a catalog of optimized media and media analytics pipelines. It includes tools for downloading pipelines a… ☆16 · Aug 20, 2024 · Updated last year
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆267 · Dec 4, 2025 · Updated 3 months ago
- JAX bindings for the flash-attention3 kernels ☆22 · Jan 2, 2026 · Updated 2 months ago
- This repository contains tutorials and examples for Triton Inference Server ☆826 · Mar 10, 2026 · Updated 2 weeks ago
- LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalabili… ☆3,977 · Updated this week
- The Triton Inference Server provides an optimized cloud and edge inferencing solution. ☆10,472 · Updated this week
- ☆621 · Jul 31, 2024 · Updated last year
- RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs ☆20 · Feb 8, 2026 · Updated last month
- A higher-performance OpenAI LLM service than vLLM serve: a pure C++ high-performance OpenAI LLM service implemented with GPRS+TensorRT-LLM+… ☆160 · Dec 8, 2025 · Updated 3 months ago
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. ☆2,105 · Jun 30, 2025 · Updated 8 months ago
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆2,928 · Updated this week
- Open Source Text Embedding Models with OpenAI Compatible API ☆166 · Jul 13, 2024 · Updated last year
- A throughput-oriented high-performance serving framework for LLMs ☆950 · Oct 29, 2025 · Updated 5 months ago
- Triton Model Navigator is an inference toolkit designed for optimizing and deploying Deep Learning models with a focus on NVIDIA GPUs. ☆220 · Feb 3, 2026 · Updated last month
- TensorDock CLI Client ☆10 · Oct 14, 2022 · Updated 3 years ago
- ☆290 · Mar 19, 2026 · Updated last week
- fast-embeddings-api ☆16 · Nov 23, 2023 · Updated 2 years ago
- An easy-to-use package for implementing SmoothQuant for LLMs ☆111 · Apr 7, 2025 · Updated 11 months ago
- This is a fork of SGLang for hip-attention integration. Please refer to hip-attention for detail. ☆18 · Dec 23, 2025 · Updated 3 months ago
- Effective LLM Alignment Toolkit ☆152 · Jun 25, 2025 · Updated 9 months ago
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆119 · Mar 13, 2024 · Updated 2 years ago
- ggml study notes; ggml is an inference framework for machine learning ☆18 · Mar 24, 2024 · Updated 2 years ago