LaurieWired / BenchmarkCustomPTX
Custom PTX Instruction Benchmark
☆120 Updated last month
Alternatives and similar repositories for BenchmarkCustomPTX:
Users who are interested in BenchmarkCustomPTX are comparing it to the libraries listed below.
- Learning about CUDA by writing PTX code. ☆125 Updated last year
- High-Performance SGEMM on CUDA devices ☆87 Updated 2 months ago
- pytorch from scratch in pure C/CUDA and python ☆40 Updated 5 months ago
- Attention in SRAM on Tenstorrent Grayskull ☆32 Updated 8 months ago
- Nvidia Instruction Set Specification Generator ☆253 Updated 8 months ago
- LLM training in simple, raw C/CUDA ☆92 Updated 10 months ago
- Fast low-bit matmul kernels in Triton ☆272 Updated this week
- PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7) ☆60 Updated last week
- An experimental CPU backend for Triton ☆101 Updated this week
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆40 Updated 2 weeks ago
- GPUOcelot: A dynamic compilation framework for PTX ☆182 Updated last month
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆35 Updated this week
- General Matrix Multiplication using NVIDIA Tensor Cores ☆13 Updated 2 months ago
- Tenstorrent MLIR compiler ☆107 Updated this week
- Visualization of cache-optimized matrix multiplication ☆105 Updated 2 weeks ago
- ☆41 Updated 3 weeks ago
- Fastest kernels written from scratch ☆202 Updated 3 weeks ago
- extensible collectives library in triton ☆84 Updated 6 months ago
- High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline. ☆102 Updated 8 months ago
- Cataloging released Triton kernels. ☆212 Updated 2 months ago
- ☆73 Updated 4 months ago
- Experimental GPU language with meta-programming ☆22 Updated 6 months ago
- ☆13 Updated 3 weeks ago
- Fast Hadamard transform in CUDA, with a PyTorch interface ☆154 Updated 10 months ago
- KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems ☆239 Updated this week
- Applied AI experiments and examples for PyTorch ☆250 Updated last week
- An implementation of the transformer architecture onto an Nvidia CUDA kernel ☆174 Updated last year
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆154 Updated last week
- hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditiona… ☆84 Updated this week
- Unofficial description of the CUDA assembly (SASS) instruction sets. ☆76 Updated 3 weeks ago