LaurieWired / BenchmarkCustomPTX
Custom PTX Instruction Benchmark
☆126 · Updated 6 months ago
Alternatives and similar repositories for BenchmarkCustomPTX
Users who are interested in BenchmarkCustomPTX compare it to the repositories listed below.
- High-Performance SGEMM on CUDA devices ☆97 · Updated 7 months ago
- Learning about CUDA by writing PTX code. ☆134 · Updated last year
- Nvidia Instruction Set Specification Generator ☆290 · Updated last year
- Tenstorrent's MLIR Based Compiler. We aim to enable developers to run AI on all configurations of Tenstorrent hardware, through an open-s… ☆102 · Updated this week
- Attention in SRAM on Tenstorrent Grayskull ☆38 · Updated last year
- Super fast FP32 matrix multiplication on RDNA3 ☆71 · Updated 4 months ago
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆74 · Updated last week
- PTX-Tutorial Written Purely By AIs (Deep Research of Openai and Claude 3.7) ☆66 · Updated 5 months ago
- Tenstorrent MLIR compiler ☆174 · Updated last week
- My submission for the GPUMODE/AMD fp8 mm challenge ☆27 · Updated 2 months ago
- TritonParse: A Compiler Tracer, Visualizer, and mini-Reproducer(WIP) for Triton Kernels ☆144 · Updated last week
- ☆49 · Updated 7 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆52 · Updated this week
- LLM training in simple, raw C/CUDA ☆104 · Updated last year
- AI Tensor Engine for ROCm ☆254 · Updated this week
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆44 · Updated last week
- GPUOcelot: A dynamic compilation framework for PTX ☆207 · Updated 6 months ago
- ☆33 · Updated last month
- GPU documentation for humans ☆119 · Updated last week
- Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X ☆64 · Updated 3 weeks ago
- ☆58 · Updated this week
- ctypes wrappers for HIP, CUDA, and OpenCL ☆130 · Updated last year
- Tilus is a tile-level kernel programming language with explicit control over shared memory and registers. ☆299 · Updated this week
- An interactive web-based tool for exploring intermediate representations of PyTorch and Triton models ☆48 · Updated last week
- RDNA3 emulator ☆54 · Updated 4 months ago
- An experimental CPU backend for Triton ☆145 · Updated 2 months ago
- pytorch from scratch in pure C/CUDA and python ☆40 · Updated 10 months ago
- Evaluating Large Language Models for CUDA Code Generation ComputeEval is a framework designed to generate and evaluate CUDA code from Lar… ☆60 · Updated 2 months ago
- General Matrix Multiplication using NVIDIA Tensor Cores ☆20 · Updated 7 months ago
- [DEPRECATED] Moved to ROCm/rocm-libraries repo ☆111 · Updated this week