bytedance / ByteMLPerf
AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and versatility of software and hardware.
☆226 · Updated this week
Alternatives and similar repositories for ByteMLPerf:
Users interested in ByteMLPerf are comparing it to the libraries listed below.
- ☆144 · Updated 2 months ago
- ☆139 · Updated 10 months ago
- PyTorch distributed training acceleration framework ☆44 · Updated last month
- A model compilation solution for various hardware ☆409 · Updated last week
- Development repository for the Triton-Linalg conversion ☆176 · Updated last month
- ☆58 · Updated 3 months ago
- FlagGems is an operator library for large language models implemented in Triton Language. ☆447 · Updated this week
- ☆45 · Updated this week
- ☆127 · Updated 2 months ago
- A benchmark suite especially for deep learning operators ☆42 · Updated 2 years ago
- Shared Middle-Layer for Triton Compilation ☆230 · Updated this week
- Examples of CUDA implementations by Cutlass CuTe ☆143 · Updated last month
- Yinghan's Code Sample ☆313 · Updated 2 years ago
- ☆105 · Updated 3 months ago
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆308 · Updated 3 weeks ago
- Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores with the WMMA API and MMA PTX instruct… (a minimal WMMA sketch follows this list) ☆361 · Updated 6 months ago
- ☆132 · Updated 2 months ago
- ☆87 · Updated 6 months ago
- A collection of memory efficient attention operators implemented in the Triton language. ☆250 · Updated 9 months ago
- Compare different hardware platforms via the Roofline Model for LLM inference tasks (a small roofline sketch also follows this list). ☆93 · Updated last year
- TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models. ☆91 · Updated last year
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios. ☆35 · Updated 2 weeks ago
- heterogeneity-aware-lowering-and-optimization ☆254 · Updated last year
- ☆21 · Updated 3 weeks ago
- An unofficial cuda assembler, for all generations of SASS, hopefully :) ☆82 · Updated last year
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel … ☆176 · Updated last month
- ☆194 · Updated last year
- Automated Parallelization System and Infrastructure for Multiple Ecosystems ☆78 · Updated 3 months ago
- A low-latency & high-throughput serving engine for LLMs ☆319 · Updated last month
- Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052 ☆469 · Updated 11 months ago
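
The HGEMM entry above centers on NVIDIA's WMMA API. As a rough illustration of what that API looks like, here is a minimal sketch of a single 16×16×16 tensor-core tile multiply. It is not code from that repository: real HGEMM kernels add shared-memory staging, swizzling, multi-stage pipelining, and often drop down to MMA PTX directly.

```cuda
// Minimal single-tile HGEMM sketch using the WMMA API (requires sm_70+).
// Illustrative only, not code from the repository listed above:
// one warp computes C (16x16, fp32) = A (16x16, fp16) * B (16x16, fp16).
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void hgemm_16x16_tile(const half* A, const half* B, float* C) {
    // Per-warp fragments for one 16x16x16 MMA tile.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);          // zero the accumulator
    wmma::load_matrix_sync(a_frag, A, 16);      // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}

// Launch with a single warp, e.g.: hgemm_16x16_tile<<<1, 32>>>(dA, dB, dC);
```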
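
The roofline-model entry compares platforms by bounding attainable throughput as min(peak compute, memory bandwidth × arithmetic intensity). A tiny host-side sketch of that arithmetic follows; the peak compute and bandwidth figures are hypothetical placeholders, not measurements of any accelerator.

```cuda
// Roofline-model arithmetic (host-side only). The peak and bandwidth numbers
// below are hypothetical placeholders, not measurements of any device.
#include <algorithm>
#include <cstdio>

// Attainable TFLOP/s = min(peak compute, memory bandwidth * arithmetic intensity).
static double attainable_tflops(double intensity_flop_per_byte,
                                double peak_tflops,
                                double mem_bw_tb_per_s) {
    return std::min(peak_tflops, mem_bw_tb_per_s * intensity_flop_per_byte);
}

int main() {
    const double peak_tflops = 300.0;  // hypothetical FP16 peak
    const double mem_bw      = 2.0;    // hypothetical HBM bandwidth in TB/s
    // LLM decode (GEMV-like) has low arithmetic intensity and is bandwidth-bound;
    // prefill GEMMs have high intensity and can approach the compute roof.
    std::printf("decode  (AI ~1   FLOP/B): %.1f TFLOP/s\n",
                attainable_tflops(1.0, peak_tflops, mem_bw));
    std::printf("prefill (AI ~300 FLOP/B): %.1f TFLOP/s\n",
                attainable_tflops(300.0, peak_tflops, mem_bw));
    return 0;
}
```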