chsasank / device-benchmarks
Benchmarks of different devices I have come across
☆22Updated 3 months ago
Alternatives and similar repositories for device-benchmarks:
Users who are interested in device-benchmarks are comparing it to the libraries listed below
- High-Performance SGEMM on CUDA devices☆88Updated 2 months ago
- An experimental CPU backend for Triton (https://github.com/openai/triton)☆40Updated 2 weeks ago
- LLM training in simple, raw C/CUDA☆92Updated 11 months ago
- Extensible collectives library in Triton☆84Updated this week
- Collection of kernels written in Triton language☆117Updated last month
- ☆76Updated 4 months ago
- Explore training for quantized models☆17Updated 2 months ago
- ☆192Updated last week
- Fast low-bit matmul kernels in Triton☆275Updated this week
- A stand-alone implementation of several NumPy dtype extensions used in machine learning.☆255Updated this week
- ☆21Updated last month
- ☆37Updated this week
- This repository contains the experimental PyTorch native float8 training UX☆222Updated 8 months ago
- An experimental CPU backend for Triton☆103Updated this week
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI.☆127Updated last year
- MLIR-based partitioning system☆76Updated this week
- ☆30Updated 2 months ago
- Ahead of Time (AOT) Triton Math Library☆56Updated last week
- ☆27Updated 2 months ago
- Inference Vision Transformer (ViT) in plain C/C++ with ggml☆265Updated 11 months ago
- A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")☆313Updated this week
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.☆107Updated this week
- Test suite for probing the numerical behavior of NVIDIA tensor cores☆37Updated 8 months ago
- OpenAI Triton backend for Intel® GPUs☆172Updated this week
- High-speed GEMV kernels, up to 2.7x speedup compared to the PyTorch baseline.☆103Updated 8 months ago
- Fastest kernels written from scratch☆205Updated 3 weeks ago
- Cataloging released Triton kernels.☆213Updated 2 months ago
- ☆87Updated last year
- PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7)☆62Updated last week
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs☆236Updated last month