npuichigo / openai_trtllm
OpenAI-compatible API for the TensorRT-LLM Triton backend
☆177 · Updated 3 months ago
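Because the server speaks the OpenAI API, any standard OpenAI client should be able to talk to it. Below is a minimal sketch using the official `openai` Python package; the base URL, API key, and model name (`ensemble`) are deployment-specific assumptions, not values fixed by openai_trtllm itself.

```python
# Minimal sketch: querying an OpenAI-compatible endpoint with the official
# `openai` Python client (v1+). Address, key, and model name are assumptions;
# substitute whatever your openai_trtllm / Triton deployment actually exposes.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/v1",  # assumed server address
    api_key="not-needed",                 # client requires a value even if the server ignores it
)

response = client.chat.completions.create(
    model="ensemble",  # hypothetical model name from the Triton model repository
    messages=[{"role": "user", "content": "Hello, TensorRT-LLM!"}],
)
print(response.choices[0].message.content)
```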
Related projects
Alternatives and complementary repositories for openai_trtllm
- ☆193 · Updated this week
- ☆158 · Updated last month
- Comparison of Language Model Inference Engines · ☆190 · Updated 2 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs · ☆253 · Updated last month
- Easy and Efficient Quantization for Transformers · ☆180 · Updated 4 months ago
- A throughput-oriented high-performance serving framework for LLMs · ☆640 · Updated 2 months ago
- A general 2-8 bit quantization toolbox supporting GPTQ/AWQ/HQQ, with easy export to ONNX/ONNX Runtime · ☆150 · Updated 2 months ago
- Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs · ☆167 · Updated 2 weeks ago
- The Triton TensorRT-LLM Backend · ☆710 · Updated this week
- 🕹️ Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models · ☆134 · Updated 3 months ago
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens · ☆627 · Updated 2 months ago
- Production-ready LLM compression/quantization toolkit with accelerated CPU/GPU inference support via HF, vLLM, and SGLang · ☆126 · Updated this week
- LLMPerf is a library for validating and benchmarking LLMs · ☆648 · Updated 3 months ago
- ☆115 · Updated 7 months ago
- A high-performance inference system for large language models, designed for production environments · ☆394 · Updated this week
- Materials for learning SGLang · ☆110 · Updated this week
- ☆111 · Updated 8 months ago
- Scalable and robust tree-based speculative decoding algorithm · ☆318 · Updated 3 months ago
- Inferflow is an efficient and highly configurable inference engine for large language models (LLMs) · ☆236 · Updated 8 months ago
- [NeurIPS'24 Spotlight] To speed up long-context LLM inference, compute attention with approximate, dynamic sparsity, which reduces in… · ☆796 · Updated this week
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM · ☆691 · Updated this week
- Experiments with inference on llama · ☆105 · Updated 5 months ago
- Making Long-Context LLM Inference 10x Faster and 10x Cheaper · ☆240 · Updated this week
- An innovative library for efficient LLM inference via low-bit quantization · ☆348 · Updated 2 months ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization · ☆305 · Updated 3 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs · ☆130 · Updated 4 months ago
- An Open Source Toolkit For LLM Distillation · ☆359 · Updated 2 months ago
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… · ☆194 · Updated this week
- A simple service that integrates vLLM with Ray Serve for fast and scalable LLM serving · ☆54 · Updated 7 months ago
- A bagel, with everything · ☆312 · Updated 7 months ago