bytedance / ByteMLPerf
The AI Accelerator Benchmark focuses on evaluating AI accelerators from a practical production perspective, including the ease of use and versatility of software and hardware.
☆238 · Updated 2 weeks ago
Alternatives and similar repositories for ByteMLPerf:
Users interested in ByteMLPerf are comparing it to the libraries listed below.
- A model compilation solution for various hardware ☆429 · Updated this week
- FlagGems is an operator library for large language models implemented in the Triton language (see the Triton kernel sketch after this list). ☆510 · Updated this week
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆360 · Updated 2 weeks ago
- Development repository for the Triton-Linalg conversion ☆185 · Updated 2 months ago
- GLake: optimizing GPU memory management and IO transmission. ☆456 · Updated last month
- Yinghan's Code Sample ☆323 · Updated 2 years ago
- An easy-to-understand TensorOp Matmul tutorial ☆346 · Updated 7 months ago
- DeepSeek-V3/R1 inference performance simulator ☆115 · Updated last month
- A collection of memory-efficient attention operators implemented in the Triton language. ☆266 · Updated 11 months ago
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆345 · Updated this week
- PyTorch distributed training acceleration framework ☆48 · Updated 2 months ago
- A low-latency & high-throughput serving engine for LLMs ☆351 · Updated 2 weeks ago
- Distributed Triton for Parallel Systems ☆618 · Updated last week
- A benchmark suite especially for deep learning operators ☆42 · Updated 2 years ago
- heterogeneity-aware-lowering-and-optimization ☆254 · Updated last year
- Optimized BERT transformer inference on NVIDIA GPU. https://arxiv.org/abs/2210.03052 ☆473 · Updated last year
- TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models. ☆94 · Updated 2 years ago
- Efficient and easy multi-instance LLM serving ☆398 · Updated this week
- Fast and memory-efficient exact attention ☆68 · Updated last week
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆64 · Updated this week
- Shared Middle-Layer for Triton Compilation ☆246 · Updated 2 weeks ago
- Examples of CUDA implementations using CUTLASS CuTe ☆170 · Updated 3 months ago
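
Several of the repositories above (FlagGems, Triton-Linalg, the Triton attention operators, the distributed Triton project, and the shared Triton middle layer) are built on the Triton language. As a rough illustration of what a Triton operator looks like, below is a minimal element-wise kernel sketch; it assumes `triton` and `torch` are installed with a CUDA device available, and the names `add_kernel` and `add` are illustrative only, not taken from any of the libraries listed.

```python
# Minimal Triton kernel sketch (illustrative, not from any listed library).
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance processes one BLOCK_SIZE-wide slice of the inputs.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Launch a 1D grid with one program per BLOCK_SIZE elements.
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out


if __name__ == "__main__":
    a = torch.randn(4096, device="cuda")
    b = torch.randn(4096, device="cuda")
    assert torch.allclose(add(a, b), a + b)
```

Operator libraries such as FlagGems follow the same pattern at larger scale: each LLM operator is a Triton kernel plus a thin PyTorch-facing wrapper that computes the launch grid and allocates outputs.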