triton-inference-server / perf_analyzer
☆43 Updated this week
Alternatives and similar repositories for perf_analyzer:
Users who are interested in perf_analyzer are comparing it to the libraries listed below.
- Common source, scripts and utilities for creating Triton backends. ☆307 Updated last week
- ☆172 Updated 4 months ago
- ☆224 Updated this week
- ☆140 Updated 9 months ago
- The Triton backend for the ONNX Runtime. ☆138 Updated this week
- Common source, scripts and utilities shared across all Triton repositories. ☆68 Updated last week
- ☆67 Updated 2 months ago
- A low-latency & high-throughput serving engine for LLMs ☆312 Updated 3 weeks ago
- Triton Model Analyzer is a CLI tool to help with better understanding of the compute and memory requirements of the Triton Inference Serv… ☆454 Updated this week
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios. ☆35 Updated 5 months ago
- Triton Model Navigator is an inference toolkit designed for optimizing and deploying Deep Learning models with a focus on NVIDIA GPUs. ☆196 Updated last month
- Easy and Efficient Quantization for Transformers ☆193 Updated 2 weeks ago
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆290 Updated this week
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆101 Updated this week
- ☆81 Updated 5 months ago
- ☆127 Updated last month
- Efficient and easy multi-instance LLM serving ☆295 Updated this week
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆93 Updated 11 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆56 Updated this week
- ☆142 Updated last month
- ☆117 Updated 11 months ago
- Standalone Flash Attention v2 kernel without libtorch dependency ☆104 Updated 5 months ago
- ☆33 Updated last year
- ☆52 Updated 5 months ago
- The core library and APIs implementing the Triton Inference Server. ☆115 Updated last week
- The Triton backend for TensorRT. ☆69 Updated this week
- PyTorch distributed training acceleration framework ☆39 Updated last week
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆297 Updated this week
- Materials for learning SGLang ☆265 Updated 2 weeks ago