chsasank / device-benchmarks

Benchmarks of different devices I have come across

☆24

Alternatives and similar repositories for device-benchmarks:

Users that are interested in device-benchmarks are comparing it to the libraries listed below

salykova / sgemm.cu
High-Performance SGEMM on CUDA devices
☆90Updated 3 months ago
pytorch-labs / triton-cpu
An experimental CPU backend for Triton (https//github.com/openai/triton)
☆40Updated last month
cchan / tccl
extensible collectives library in triton
☆85Updated 3 weeks ago
openxla / shardy
MLIR-based partitioning system
☆80Updated this week
triton-lang / triton-cpu
An experimental CPU backend for Triton
☆105Updated 2 weeks ago
gevtushenko / llm.c
LLM training in simple, raw C/CUDA
☆92Updated 11 months ago
triton-lang / kernels
☆78Updated 5 months ago
ROCm / aotriton
Ahead of Time (AOT) Triton Math Library
☆57Updated last week
mobiusml / gemlite
Fast low-bit matmul kernels in Triton
☆291Updated this week
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆120Updated 3 weeks ago
gau-nernst / quantized-training
Explore training for quantized models
☆17Updated 3 months ago
albanD / subclass_zoo
☆163Updated 10 months ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆82Updated last week
MekkCyber / CutlassAcademy
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
☆165Updated last month
NVIDIA / Fuser
A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")
☆318Updated this week
google / aqt
☆297Updated this week
Deep-Learning-Profiling-Tools / triton-viz
☆200Updated this week
ROCm / aiter
AI Tensor Engine for ROCm
☆180Updated this week
bertmaher / simplegemm
☆98Updated last month
pytorch-labs / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆223Updated 8 months ago
pytorch-labs / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆116Updated this week
ROCm / TransformerEngine
☆29Updated this week
intel / intel-xpu-backend-for-triton
OpenAI Triton backend for Intel® GPUs
☆182Updated this week
gpu-mode / triton-index
Cataloging released Triton kernels.
☆217Updated 3 months ago
siboehm / ShallowSpeed
Small scale distributed training of sequential deep learning models, built on Numpy and MPI.
☆130Updated last year
vdesai2014 / inference-optimization-blog-post
☆87Updated last year
openxla / community
Stores documents and resources used by the OpenXLA developer community
☆120Updated 8 months ago
jax-ml / ml_dtypes
A stand-alone implementation of several NumPy dtype extensions used in machine learning.
☆258Updated this week
wangsiping97 / FastGEMV
High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline.
☆106Updated 9 months ago
daniel-geon-park / triton_bwd
Automatic differentiation for Triton Kernels
☆11Updated last month