tgautam03 / tGeMM
General Matrix Multiplication using NVIDIA Tensor Cores
☆27 Updated 10 months ago
Alternatives and similar repositories for tGeMM
Users who are interested in tGeMM are comparing it to the libraries listed below.
- High-Performance SGEMM on CUDA devices ☆113 Updated 10 months ago
- Custom PTX Instruction Benchmark ☆136 Updated 9 months ago
- Attention in SRAM on Tenstorrent Grayskull ☆39 Updated last year
- ☆81 Updated 2 weeks ago
- Step by step implementation of a fast softmax kernel in CUDA ☆59 Updated 11 months ago
- Learning about CUDA by writing PTX code. ☆150 Updated last year
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆47 Updated 4 months ago
- Automatic differentiation for Triton Kernels ☆29 Updated 4 months ago
- ☆14 Updated last month
- ☆22 Updated 5 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆64 Updated this week
- ☆52 Updated 7 months ago
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels ☆178 Updated last week
- LLM training in simple, raw C/CUDA ☆108 Updated last year
- ☆14 Updated 9 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆105 Updated 5 months ago
- My submission for the GPUMODE/AMD fp8 mm challenge ☆29 Updated 6 months ago
- Quantized LLM training in pure CUDA/C++. ☆221 Updated this week
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆174 Updated 2 weeks ago
- coding CUDA everyday! ☆71 Updated last week
- Super fast FP32 matrix multiplication on RDNA3 ☆81 Updated 8 months ago
- ☆97 Updated last year
- ☆86 Updated last month
- This repository contains companion software for the Colfax Research paper "Categorical Foundations for CuTe Layouts". ☆83 Updated 2 months ago
- Test suite for probing the numerical behavior of NVIDIA tensor cores ☆41 Updated last year
- Personal solutions to the Triton Puzzles ☆20 Updated last year
- AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming ☆133 Updated this week
- A Data-Centric Compiler for Machine Learning ☆85 Updated this week
- Row-wise block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge. ☆17 Updated 3 months ago
- A bunch of kernels that might make stuff slower 😉 ☆65 Updated 2 weeks ago