npuichigo / openai_trtllm
OpenAI-compatible API for the TensorRT-LLM Triton backend
☆220 · Aug 1, 2024 · Updated last year
Alternatives and similar repositories for openai_trtllm
Users interested in openai_trtllm are comparing it to the libraries listed below.
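Since openai_trtllm exposes an OpenAI-compatible API, clients talk to it with standard chat-completion requests. A minimal sketch of building such a request body follows; the base URL and model name (`ensemble`) are hypothetical assumptions for illustration, not taken from the repository.

```python
import json

# Hypothetical local endpoint for an openai_trtllm instance fronting Triton;
# host, port, and model name are assumptions, not confirmed by the repo.
BASE_URL = "http://localhost:3000/v1"

def chat_completion_request(model: str, prompt: str) -> dict:
    """Build the JSON body for a POST {BASE_URL}/chat/completions call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

body = chat_completion_request("ensemble", "Hello!")
print(json.dumps(body))
```

Because the server mimics the OpenAI wire format, any OpenAI-compatible client (e.g. the official `openai` Python SDK pointed at a custom `base_url`) should be able to send this payload unchanged.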
- The Triton TensorRT-LLM Backend ☆918 · Updated this week
- High-level API for tar-based dataset ☆12 · Feb 3, 2024 · Updated 2 years ago
- ☆329 · Feb 9, 2026 · Updated last week
- ☆28 · Nov 6, 2024 · Updated last year
- TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizat… ☆12,867 · Updated this week
- TensorRT-LLM server with Structured Outputs (JSON) built with Rust ☆67 · Apr 25, 2025 · Updated 9 months ago
- JAX bindings for the flash-attention3 kernels ☆20 · Jan 2, 2026 · Updated last month
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆267 · Dec 4, 2025 · Updated 2 months ago
- The driver for LMCache core to run in vLLM ☆60 · Feb 4, 2025 · Updated last year
- A unified library of SOTA model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresse… ☆1,964 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆17 · Jun 3, 2024 · Updated last year
- FlashInfer: Kernel Library for LLM Serving ☆4,935 · Updated this week
- Triton CLI is an open source command line interface that enables users to create, deploy, and profile models served by the Triton Inferen… ☆73 · Feb 9, 2026 · Updated last week
- ☆281 · Feb 4, 2026 · Updated last week
- LMDeploy is a toolkit for compressing, deploying, and serving LLMs. ☆7,606 · Updated this week
- Scripts for BGE inference optimization ☆29 · Jan 23, 2024 · Updated 2 years ago
- RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs ☆19 · Feb 8, 2026 · Updated last week
- This repository contains tutorials and examples for Triton Inference Server ☆822 · Feb 9, 2026 · Updated last week
- LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalabili… ☆3,888 · Feb 9, 2026 · Updated last week
- An easy-to-use package for implementing SmoothQuant for LLMs ☆110 · Apr 7, 2025 · Updated 10 months ago
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆2,737 · Updated this week
- MERA (Multimodal Evaluation for Russian-language Architectures) is a new open benchmark for the Russian language for evaluating SOTA mode… ☆39 · Feb 3, 2026 · Updated last week
- The Triton Inference Server provides an optimized cloud and edge inferencing solution. ☆10,361 · Updated this week
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. ☆2,093 · Jun 30, 2025 · Updated 7 months ago
- Proxy server for the Triton gRPC server that runs inference on an embedding model, in Rust ☆21 · Aug 10, 2024 · Updated last year
- A throughput-oriented high-performance serving framework for LLMs ☆945 · Oct 29, 2025 · Updated 3 months ago
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLMs' inference, uses approximate and dynamic sparse calculation of the attention… ☆1,183 · Sep 30, 2025 · Updated 4 months ago
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆1,011 · Sep 4, 2024 · Updated last year
- ☆21 · Feb 27, 2024 · Updated last year
- Official implementation of Half-Quadratic Quantization (HQQ) ☆913 · Dec 18, 2025 · Updated last month
- Triton Model Navigator is an inference toolkit designed for optimizing and deploying Deep Learning models with a focus on NVIDIA GPUs. ☆218 · Feb 3, 2026 · Updated last week
- Deployment of a light and full OpenAI API for production with vLLM, supporting /v1/embeddings with all embedding models. ☆44 · Jul 16, 2024 · Updated last year
- LLMPerf is a library for validating and benchmarking LLMs ☆1,084 · Dec 9, 2024 · Updated last year
- This repository measures the quality of YandexGPT, GigaChat, T-Pro, Saiga, Vikhr, and Ruadapt on popular English-language benchmarks: MGSM, MATH, HumanE… ☆23 · Apr 16, 2025 · Updated 10 months ago
- Open Source Text Embedding Models with OpenAI Compatible API ☆167 · Jul 13, 2024 · Updated last year
- Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training ☆222 · Aug 19, 2024 · Updated last year
- Easy and Efficient Quantization for Transformers ☆205 · Jan 28, 2026 · Updated 2 weeks ago
- SGLang is a high-performance serving framework for large language models and multimodal models. ☆23,439 · Feb 9, 2026 · Updated last week
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆40 · Feb 29, 2024 · Updated last year