tgautam03 / tGeMM
General Matrix Multiplication using NVIDIA Tensor Cores
☆18Updated 6 months ago
Alternatives and similar repositories for tGeMM
Users who are interested in tGeMM are comparing it to the repositories listed below.
- High-Performance SGEMM on CUDA devices☆98Updated 6 months ago
- Attention in SRAM on Tenstorrent Grayskull☆37Updated last year
- Write a fast kernel and run it on Discord. See how you compare against the best!☆48Updated last week
- ☆47Updated 7 months ago
- Custom PTX Instruction Benchmark☆126Updated 5 months ago
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!☆69Updated 3 weeks ago
- My submission for the GPUMODE/AMD fp8 mm challenge☆27Updated 2 months ago
- Personal solutions to the Triton Puzzles☆19Updated last year
- Block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge. Additionally, this repo includes codes for …☆15Updated this week
- Learning about CUDA by writing PTX code.☆133Updated last year
- An experimental CPU backend for Triton (https://github.com/openai/triton)☆43Updated 4 months ago
- LLM training in simple, raw C/CUDA☆103Updated last year
- TritonParse: A Compiler Tracer, Visualizer, and mini-Reproducer(WIP) for Triton Kernels☆139Updated this week
- ☆66Updated this week
- PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7)☆66Updated 4 months ago
- A bunch of kernels that might make stuff slower 😉☆56Updated last week
- Tenstorrent's MLIR Based Compiler. We aim to enable developers to run AI on all configurations of Tenstorrent hardware, through an open-s…☆96Updated this week
- making the official triton tutorials actually comprehensible☆53Updated 2 weeks ago
- Memory Optimizations for Deep Learning (ICML 2023)☆102Updated last year
- ☆33Updated 3 weeks ago
- Automatic differentiation for Triton Kernels☆11Updated this week
- Super fast FP32 matrix multiplication on RDNA3☆70Updated 4 months ago
- Samples of good AI generated CUDA kernels☆86Updated 2 months ago
- ☆110Updated 4 months ago
- ☆41Updated 3 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆93Updated last month
- coding CUDA everyday!☆53Updated 3 months ago
- Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X☆60Updated last week
- Machine Learning Agility (MLAgility) benchmark and benchmarking tools☆39Updated last week
- CUDA Matrix Multiplication Optimization☆213Updated last year