triton-inference-server / perf_analyzer
☆24 · Updated this week
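perf_analyzer is Triton's command-line load generator: it sweeps request concurrency against a served model and reports latency and throughput. Below is a minimal sketch of driving it from Python; the model name `my_model`, the gRPC endpoint, and the CSV column names are assumptions to adapt to your own deployment.

```python
# Minimal sketch (assumptions noted): sweep concurrency with perf_analyzer
# and read back the CSV report it writes. Assumes perf_analyzer is on PATH
# and a Triton server is serving a model named "my_model" at localhost:8001.
import csv
import subprocess

subprocess.run(
    [
        "perf_analyzer",
        "-m", "my_model",                # model name on the server (assumption)
        "-u", "localhost:8001",          # server endpoint (assumption)
        "-i", "grpc",                    # use the gRPC protocol
        "--concurrency-range", "1:8:2",  # sweep concurrency levels 1, 3, 5, 7
        "-f", "results.csv",             # write the latency report as CSV
    ],
    check=True,  # raise if perf_analyzer exits non-zero
)

# Column names below match recent perf_analyzer releases; verify them
# against the header row of your own results.csv.
with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["Concurrency"], row["Inferences/Second"])
```

Writing the report with `-f` keeps the sweep easy to post-process or plot alongside results from the serving engines listed below.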
Related projects
Alternatives and complementary repositories for perf_analyzer
- Efficient and easy multi-instance LLM serving (☆213, updated this week)
- A low-latency & high-throughput serving engine for LLMs (☆245, updated 2 months ago)
- Dynamic Memory Management for Serving LLMs without PagedAttention (☆238, updated last week)
- Transformer-related optimizations, including BERT and GPT (☆60, updated last year)
- AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver… (☆203, updated last week)
- Disaggregated serving system for Large Language Models (LLMs) (☆359, updated 3 months ago)
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios (☆29, updated 2 months ago)
- A tool for bandwidth measurements on NVIDIA GPUs (☆321, updated last month)
- A high-throughput and memory-efficient inference and serving engine for LLMs (☆45, updated this week)
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components (☆165, updated this week)
- The Triton TensorRT-LLM Backend (☆706, updated this week)
- Easy and Efficient Quantization for Transformers (☆180, updated 4 months ago)
- Standalone Flash Attention v2 kernel without libtorch dependency (☆98, updated 2 months ago)
- Experimental projects related to TensorRT (☆81, updated this week)
- OpenAI-compatible API for the TensorRT-LLM Triton backend (☆177, updated 3 months ago)
- MSCCL++: A GPU-driven communication stack for scalable AI applications (☆250, updated this week)
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs (☆87, updated last month)
- A high-throughput and memory-efficient inference and serving engine for LLMs (☆42, updated this week)