bytedance / xpu-perf
View external linksLinks

AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and versatility of software and hardware.

☆298

Alternatives and similar repositories for xpu-perf

Users that are interested in xpu-perf are comparing it to the libraries listed below

Sorting:

bytedance / byteir
View on GitHub
A model compilation solution for various hardware
☆464Aug 20, 2025Updated 5 months ago
bytedance / ByteTransformer
View on GitHub
optimized BERT transformer inference on NVIDIA GPU. https://arxiv.org/abs/2210.03052
☆477Mar 15, 2024Updated last year
OAID / TengineInferPipe
View on GitHub
☆23Dec 8, 2022Updated 3 years ago
perplexityai / pplx-kernels
View on GitHub
Perplexity GPU Kernels
☆560Nov 7, 2025Updated 3 months ago
yester31 / Cutlass_EX
View on GitHub
study of cutlass
☆22Nov 10, 2024Updated last year
mit-han-lab / parallel-computing-tutorial
View on GitHub
☆177Aug 9, 2023Updated 2 years ago
quiver-team / quiver-feature
View on GitHub
High performance RDMA-based distributed feature collection component for training GNN model on EXTREMELY large graph
☆56Jul 3, 2022Updated 3 years ago
bytedance / effective_transformer
View on GitHub
Running BERT without Padding
☆480Mar 18, 2022Updated 3 years ago
bytedance / flux
View on GitHub
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
☆1,247Aug 28, 2025Updated 5 months ago
sgl-project / DeepGEMM
View on GitHub
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
☆22Updated this week
TiledTensor / TiledBench
View on GitHub
Benchmark tests supporting the TiledCUDA library.
☆18Nov 19, 2024Updated last year
xlite-dev / HGEMM
View on GitHub
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆148May 10, 2025Updated 9 months ago
KnowingNothing / MatmulTutorial
View on GitHub
A Easy-to-understand TensorOp Matmul Tutorial
☆410Updated this week
volcengine / veScale
View on GitHub
Byted PyTorch Distributed for Hyperscale Training of LLMs and RLs
☆929Nov 27, 2025Updated 2 months ago
tlc-pack / cutlass_fpA_intB_gemm
View on GitHub
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆96Sep 13, 2025Updated 5 months ago
UCLA-VAST / heterohalide
View on GitHub
HeteroHalide: From Image Processing DSL to Efficient FPGA Acceleration
☆15Sep 14, 2020Updated 5 years ago
zhuzilin / pytorch-malloc
View on GitHub
An external memory allocator example for PyTorch.
☆16Aug 10, 2025Updated 6 months ago
microsoft / nnfusion
View on GitHub
A flexible and efficient deep neural network (DNN) compiler that generates high-performance executable from a DNN model description.
☆1,006Sep 19, 2024Updated last year
neuralmagic / AutoFP8
View on GitHub
☆206May 5, 2025Updated 9 months ago
cornell-zhang / heterocl
View on GitHub
HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Heterogeneous Computing (FPGA'19 Best Paper)
☆341Apr 20, 2024Updated last year
InternLM / turbomind
View on GitHub
☆97Mar 26, 2025Updated 10 months ago
OpenPPL / ppl.nn
View on GitHub
A primitive library for neural network
☆1,368Nov 24, 2024Updated last year
flagos-ai / FlagGems
View on GitHub
FlagGems is an operator library for large language models implemented in the Triton Language.
☆898Updated this week
microsoft / TileFusion
View on GitHub
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆106Jun 28, 2025Updated 7 months ago
OpenPPL / ppl.llm.kernel.cuda
View on GitHub
☆152Jan 9, 2025Updated last year
Bruce-Lee-LY / cutlass_gemm
View on GitHub
Multiple GEMM operators are constructed with cutlass to support LLM inference.
☆20Aug 3, 2025Updated 6 months ago
NVIDIA / TransformerEngine
View on GitHub
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on H…
☆3,152Feb 7, 2026Updated last week
NVIDIA / nvbench
View on GitHub
CUDA Kernel Benchmarking Library
☆813Updated this week
deepspeedai / DeepSpeed-Kernels
View on GitHub
☆71Mar 26, 2025Updated 10 months ago
MegEngine / mgeconvert
View on GitHub
MegEngine到其他框架的转换器
☆69Apr 27, 2023Updated 2 years ago
ByteDance-Seed / Triton-distributed
View on GitHub
Distributed Compiler based on Triton for Parallel Systems
☆1,350Updated this week
MegEngine / MegCC
View on GitHub
MegCC是一个运行时超轻量，高效，移植简单的深度学习模型编译器
☆488Oct 23, 2024Updated last year
flashinfer-ai / cutlass-viz
View on GitHub
☆65Apr 26, 2025Updated 9 months ago
flashinfer-ai / flashinfer
View on GitHub
FlashInfer: Kernel Library for LLM Serving
☆4,935Updated this week
merrymercy / awesome-tensor-compilers
View on GitHub
A list of awesome compiler projects and papers for tensor computation and deep learning.
☆2,728Oct 19, 2024Updated last year
hanchenye / polyaie
View on GitHub
An MLIR-based compiler from C/C++ to AMD-Xilinx Versal AIE
☆18Aug 5, 2022Updated 3 years ago
chhzh123 / ptc-tutorial
View on GitHub
PyTorch compilation tutorial covering TorchScript, torch.fx, and Slapo
☆17Mar 13, 2023Updated 2 years ago
cloudcores / CuAssembler
View on GitHub
An unofficial cuda assembler, for all generations of SASS, hopefully ：）
☆567Apr 20, 2023Updated 2 years ago
flexflow / flexflow-train
View on GitHub
Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training
☆1,859Feb 7, 2026Updated last week

bytedance / xpu-perfView external linksLinks

Alternatives and similar repositories for xpu-perf

bytedance / xpu-perf
View external linksLinks