bytedance / ByteMLPerf
The AI Accelerator Benchmark focuses on evaluating AI accelerators from a practical production perspective, including the ease of use and versatility of software and hardware.
☆238 · Updated 2 weeks ago
Alternatives and similar repositories for ByteMLPerf:
Users interested in ByteMLPerf are comparing it to the libraries listed below.
- A model compilation solution for various hardware ☆429 · Updated this week
- FlagGems is an operator library for large language models implemented in the Triton language (see the Triton kernel sketch after this list). ☆510 · Updated this week
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆360 · Updated 2 weeks ago
- Development repository for the Triton-Linalg conversion ☆185 · Updated 2 months ago
- GLake: optimizing GPU memory management and IO transmission. ☆456 · Updated last month
- Yinghan's Code Sample ☆323 · Updated 2 years ago
- An easy-to-understand TensorOp Matmul tutorial ☆346 · Updated 7 months ago
- DeepSeek-V3/R1 inference performance simulator ☆115 · Updated last month
- A collection of memory-efficient attention operators implemented in the Triton language. ☆266 · Updated 11 months ago
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆345 · Updated this week
- PyTorch distributed training acceleration framework ☆48 · Updated 2 months ago
- A low-latency & high-throughput serving engine for LLMs ☆351 · Updated 2 weeks ago
- Distributed Triton for Parallel Systems ☆618 · Updated last week
- A benchmark suite especially for deep learning operators ☆42 · Updated 2 years ago
- heterogeneity-aware-lowering-and-optimization ☆254 · Updated last year
- Optimized BERT transformer inference on NVIDIA GPU. https://arxiv.org/abs/2210.03052 ☆473 · Updated last year
- TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models. ☆94 · Updated 2 years ago
- Efficient and easy multi-instance LLM serving ☆398 · Updated this week
- Fast and memory-efficient exact attention ☆68 · Updated last week
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆64 · Updated this week
- Shared Middle-Layer for Triton Compilation ☆246 · Updated 2 weeks ago
- Examples of CUDA implementations using CUTLASS CuTe ☆170 · Updated 3 months ago
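
Several of the repositories above (FlagGems, Triton-Linalg, the Triton attention operators, the distributed Triton project, and the shared Triton middle layer) are built on the Triton language. As a rough illustration of what a Triton operator looks like, below is a minimal element-wise kernel sketch; it assumes `triton` and `torch` are installed with a CUDA device available, and the names `add_kernel` and `add` are illustrative only, not taken from any of the libraries listed.

```python
# Minimal Triton kernel sketch (illustrative, not from any listed library).
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance processes one BLOCK_SIZE-wide slice of the inputs.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Launch a 1D grid with one program per BLOCK_SIZE elements.
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out


if __name__ == "__main__":
    a = torch.randn(4096, device="cuda")
    b = torch.randn(4096, device="cuda")
    assert torch.allclose(add(a, b), a + b)
```

Operator libraries such as FlagGems follow the same pattern at larger scale: each LLM operator is a Triton kernel plus a thin PyTorch-facing wrapper that computes the launch grid and allocates outputs.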