LaurieWired / BenchmarkCustomPTX
Custom PTX Instruction Benchmark
☆122 Updated last month
Alternatives and similar repositories for BenchmarkCustomPTX:
Users who are interested in BenchmarkCustomPTX compare it to the repositories listed below.
- Learning about CUDA by writing PTX code. ☆128 Updated last year
- High-Performance SGEMM on CUDA devices ☆90 Updated 3 months ago
- Attention in SRAM on Tenstorrent Grayskull ☆33 Updated 9 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆40 Updated this week
- Nvidia Instruction Set Specification Generator ☆256 Updated 9 months ago
- An experimental CPU backend for Triton ☆105 Updated 2 weeks ago
- GPU documentation for humans ☆44 Updated this week
- Tenstorrent MLIR compiler ☆120 Updated this week
- pytorch from scratch in pure C/CUDA and python ☆40 Updated 6 months ago
- RDNA3 emulator ☆54 Updated last week
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆40 Updated last month
- General Matrix Multiplication using NVIDIA Tensor Cores ☆13 Updated 3 months ago
- PTX-Tutorial Written Purely By AIs (Deep Research of Openai and Claude 3.7) ☆65 Updated last month
- GPUOcelot: A dynamic compilation framework for PTX ☆187 Updated 2 months ago
- Tenstorrent's MLIR Based Compiler. We aim to enable developers to run AI on all configurations of Tenstorrent hardware, through an open-s… ☆43 Updated this week
- LLM training in simple, raw C/CUDA ☆92 Updated 11 months ago
- Reference Kernels for the Leaderboard ☆33 Updated last week
- The HIP Environment and ROCm Kit - A lightweight open source build system for HIP and ROCm ☆49 Updated this week
- Fast low-bit matmul kernels in Triton ☆291 Updated this week
- AI Tensor Engine for ROCm ☆180 Updated this week
- CUDA Matrix Multiplication Optimization ☆181 Updated 9 months ago
- hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditiona… ☆91 Updated this week
- Multi-Threaded FP32 Matrix Multiplication on x86 CPUs ☆347 Updated this week
- ctypes wrappers for HIP, CUDA, and OpenCL ☆129 Updated 9 months ago
- Visualization of cache-optimized matrix multiplication ☆120 Updated last month
- Fastest kernels written from scratch ☆236 Updated 3 weeks ago
- ☆27 Updated last month
- ☆78 Updated 5 months ago
- Unofficial description of the CUDA assembly (SASS) instruction sets. ☆88 Updated last month
- A comprehensive tool for visualizing and analyzing model execution, offering interactive graphs, memory plots, tensor details, buffer ove… ☆31 Updated this week