tgautam03 / tGeMM

General Matrix Multiplication using NVIDIA Tensor Cores

☆18

Alternatives and similar repositories for tGeMM

Users who are interested in tGeMM are comparing it to the libraries listed below.

salykova / sgemm.cu
High-Performance SGEMM on CUDA devices
☆98 Updated 6 months ago
moritztng / grayskull-attention
Attention in SRAM on Tenstorrent Grayskull
☆37 Updated last year
gpu-mode / discord-cluster-manager
Write a fast kernel and run it on Discord. See how you compare against the best!
☆48 Updated last week
SzymonOzog / FastSoftmax
☆47 Updated 7 months ago
LaurieWired / BenchmarkCustomPTX
Custom PTX Instruction Benchmark
☆126 Updated 5 months ago
gpu-mode / reference-kernels
Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!
☆69 Updated 3 weeks ago
Snektron / gpumode-amd-fp8-mm
My submission for the GPUMODE/AMD fp8 mm challenge
☆27 Updated 2 months ago
alexzhang13 / Triton-Puzzles-Solutions
Personal solutions to the Triton Puzzles
☆19 Updated last year
luongthecong123 / fp8-quant-matmul
Block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge. Additionally, this repo includes codes for …
☆15 Updated this week
unixpickle / learn-ptx
Learning about CUDA by writing PTX code.
☆133 Updated last year
pytorch-labs / triton-cpu
An experimental CPU backend for Triton (https://github.com/openai/triton)
☆43 Updated 4 months ago
gevtushenko / llm.c
LLM training in simple, raw C/CUDA
☆103 Updated last year
meta-pytorch / tritonparse
TritonParse: A Compiler Tracer, Visualizer, and mini-Reproducer(WIP) for Triton Kernels
☆139 Updated this week
SzymonOzog / GPU_Programming
☆66 Updated this week
cloneofsimo / ptx-tutorial-by-aislop
PTX-Tutorial Written Purely By AIs (Deep Research of Openai and Claude 3.7)
☆66 Updated 4 months ago
open-lm-engine / flash-model-architectures
A bunch of kernels that might make stuff slower 😉
☆56 Updated last week
tenstorrent / tt-forge
Tenstorrent's MLIR Based Compiler. We aim to enable developers to run AI on all configurations of Tenstorrent hardware, through an open-s…
☆96 Updated this week
evintunador / triton_docs_tutorials
making the official triton tutorials actually comprehensible
☆53 Updated 2 weeks ago
facebookresearch / MODel_opt
Memory Optimizations for Deep Learning (ICML 2023)
☆102 Updated last year
gpu-mode / popcorn-cli
☆33 Updated 3 weeks ago
daniel-geon-park / triton_bwd
Automatic differentiation for Triton Kernels
☆11 Updated this week
seb-v / fp32_sgemm_amd
Super fast FP32 matrix multiplication on RDNA3
☆70 Updated 4 months ago
ScalingIntelligence / good-kernels
Samples of good AI generated CUDA kernels
☆86 Updated 2 months ago
bertmaher / simplegemm
☆110 Updated 4 months ago
ademeure / cuda-side-boost
☆41 Updated 3 months ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆93 Updated last month
aryagxr / cuda
coding CUDA everyday!
☆53 Updated 3 months ago
RadeonFlow / RadeonFlow_Kernels
Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X
☆60 Updated last week
groq / mlagility
Machine Learning Agility (MLAgility) benchmark and benchmarking tools
☆39 Updated last week
leimao / CUDA-GEMM-Optimization
CUDA Matrix Multiplication Optimization
☆213 Updated last year