LaurieWired / BenchmarkCustomPTX
Custom PTX Instruction Benchmark
☆126Updated 3 months ago
Alternatives and similar repositories for BenchmarkCustomPTX
Users who are interested in BenchmarkCustomPTX compare it to the libraries listed below.
- Learning about CUDA by writing PTX code.☆131Updated last year
- High-Performance SGEMM on CUDA devices☆94Updated 4 months ago
- An experimental CPU backend for Triton☆119Updated this week
- Tenstorrent's MLIR Based Compiler. We aim to enable developers to run AI on all configurations of Tenstorrent hardware, through an open-s…☆59Updated this week
- Attention in SRAM on Tenstorrent Grayskull☆35Updated 10 months ago
- Reference Kernels for the Leaderboard☆49Updated last week
- Write a fast kernel and run it on Discord. See how you compare against the best!☆44Updated this week
- ☆54Updated this week
- Nvidia Instruction Set Specification Generator☆271Updated 10 months ago
- Tenstorrent MLIR compiler☆132Updated this week
- PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7)☆67Updated 2 months ago
- An experimental CPU backend for Triton (https://github.com/openai/triton)☆43Updated 2 months ago
- LLM training in simple, raw C/CUDA☆99Updated last year
- ☆80Updated 6 months ago
- extensible collectives library in triton☆87Updated 2 months ago
- GPU documentation for humans☆65Updated 3 weeks ago
- Fastest kernels written from scratch☆269Updated 2 months ago
- ☆105Updated 2 months ago
- pytorch from scratch in pure C/CUDA and python☆40Updated 7 months ago
- ☆72Updated last year
- ☆215Updated this week
- NVIDIA tools guide☆133Updated 4 months ago
- AI Tensor Engine for ROCm☆201Updated this week
- LLM training in simple, raw C/HIP for AMD GPUs☆50Updated 8 months ago
- Fast low-bit matmul kernels in Triton☆311Updated this week
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆88Updated last week
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS☆181Updated 3 weeks ago
- GPUOcelot: A dynamic compilation framework for PTX☆192Updated 3 months ago
- Super fast FP32 matrix multiplication on RDNA3☆61Updated 2 months ago
- Unofficial description of the CUDA assembly (SASS) instruction sets.☆94Updated 2 months ago