bytedance / ByteMLPerf
AI Accelerator Benchmark focuses on evaluating AI accelerators from a practical production perspective, including the ease of use and versatility of software and hardware.
☆229 · Updated this week
Alternatives and similar repositories for ByteMLPerf:
Users interested in ByteMLPerf are comparing it to the libraries listed below.
- A model compilation solution for various hardware ☆415 · Updated last week
- A low-latency & high-throughput serving engine for LLMs ☆327 · Updated last month
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆317 · Updated this week
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆311 · Updated this week
- PyTorch distributed training acceleration framework ☆46 · Updated last month
- A collection of memory-efficient attention operators implemented in the Triton language. ☆253 · Updated 9 months ago
- Yinghan's Code Sample ☆313 · Updated 2 years ago
- Automated Parallelization System and Infrastructure for Multiple Ecosystems ☆78 · Updated 4 months ago
- FlagGems is an operator library for large language models, implemented in the Triton language. ☆457 · Updated this week
- Compare different hardware platforms via the roofline model for LLM inference tasks (see the roofline sketch after this list). ☆93 · Updated last year
- GLake: optimizing GPU memory management and I/O transmission. ☆445 · Updated 3 months ago
- A benchmark suite designed especially for deep learning operators ☆42 · Updated 2 years ago
- QQQ is a hardware-optimized W4A8 quantization solution for LLMs (a generic W4A8 sketch follows this list). ☆108 · Updated last week
- Efficient and easy multi-instance LLM serving ☆339 · Updated this week
- Performance of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios. ☆35 · Updated 3 weeks ago
- Development repository for the Triton-Linalg conversion ☆180 · Updated last month
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5). ☆240 · Updated 4 months ago
- Disaggregated serving system for Large Language Models (LLMs). ☆507 · Updated 7 months ago
- A baseline repository for auto-parallelism in training neural networks ☆143 · Updated 2 years ago
- An easy-to-understand TensorOp matmul tutorial ☆331 · Updated 6 months ago
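The roofline comparison referenced in the list above reduces to one formula: attainable throughput = min(peak compute, memory bandwidth × arithmetic intensity). Here is a minimal sketch of that comparison in Python; the chip names, hardware numbers, and intensities are illustrative assumptions, not figures from any repository listed here.

```python
# Minimal roofline-model sketch (all numbers are made-up assumptions).

def attainable_tflops(peak_tflops: float, bandwidth_tb_s: float,
                      flops_per_byte: float) -> float:
    # Roofline: throughput is capped either by compute (flat roof)
    # or by memory traffic (slanted roof), whichever is lower.
    return min(peak_tflops, bandwidth_tb_s * flops_per_byte)

# Hypothetical accelerators: (peak TFLOPS, memory bandwidth in TB/s).
chips = {"chip_a": (312.0, 2.0), "chip_b": (120.0, 3.3)}

# LLM decode is GEMV-like with very low arithmetic intensity, so it is
# usually bandwidth-bound; large-batch prefill GEMMs are compute-bound.
intensities = {"decode": 1.0, "prefill": 300.0}  # FLOPs per byte

for chip, (peak, bw) in chips.items():
    for phase, ai in intensities.items():
        print(f"{chip:7s} {phase:7s} {attainable_tflops(peak, bw, ai):7.1f} TFLOPS")
```

Real arithmetic intensity depends on batch size, KV-cache layout, and dtype; the point is only which roof binds in each phase.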
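W4A8 in the QQQ entry means 4-bit weights with 8-bit activations. Below is a generic per-channel symmetric quantization sketch of that idea in NumPy; it is not QQQ's actual algorithm or kernels, and every name and shape is invented for illustration.

```python
import numpy as np

def quantize_symmetric(x, bits, axis):
    # Symmetric quantization: q = clip(round(x / scale)), with a
    # per-channel scale derived from the channel's max magnitude.
    qmax = 2 ** (bits - 1) - 1                 # 7 for int4, 127 for int8
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 256)).astype(np.float32)  # weights [out, in]
a = rng.standard_normal((4, 256)).astype(np.float32)    # activations [batch, in]

wq, w_scale = quantize_symmetric(w, bits=4, axis=1)  # W4: per output channel
aq, a_scale = quantize_symmetric(a, bits=8, axis=1)  # A8: per token

# Integer matmul, then dequantize with the product of the two scales.
y = (aq.astype(np.int32) @ wq.astype(np.int32).T) * (a_scale * w_scale.T)

print(np.abs(y - a @ w.T).max())  # error vs. the fp32 reference
```

The int4 weight path loses far more precision than the int8 activation path, which is why practical W4A8 systems typically add finer-grained (e.g., group-wise) scales on the weight side.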