Snektron / gpumode-amd-fp8-mm
My submission for the GPUMODE/AMD fp8 mm challenge
☆29Updated 6 months ago
Alternatives and similar repositories for gpumode-amd-fp8-mm
Users that are interested in gpumode-amd-fp8-mm are comparing it to the libraries listed below
- Row-wise block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge.☆17Updated 3 months ago
- coding CUDA everyday!☆72Updated 3 weeks ago
- PTX-Tutorial Written Purely By AIs (Deep Research of Openai and Claude 3.7)☆66Updated 9 months ago
- ☆114Updated last month
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels☆179Updated this week
- ☆84Updated 2 weeks ago
- Learning about CUDA by writing PTX code.☆150Updated last year
- Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X☆74Updated last month
- High-Performance SGEMM on CUDA devices☆114Updated 11 months ago
- ☆84Updated 3 weeks ago
- ☆23Updated 5 months ago
- Quantized LLM training in pure CUDA/C++.☆226Updated this week
- CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning☆252Updated 2 weeks ago
- Experimental GPU language with meta-programming☆24Updated last year
- Custom PTX Instruction Benchmark☆137Updated 10 months ago
- Super fast FP32 matrix multiplication on RDNA3☆81Updated 9 months ago
- Nvidia Instruction Set Specification Generator☆306Updated last year
- Samples of good AI generated CUDA kernels☆96Updated 7 months ago
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!☆182Updated last week
- It is an LLM-based AI agent, which can write correct and efficient gpu kernels automatically.☆49Updated last week
- pytorch from scratch in pure C/CUDA and python☆41Updated last year
- Fast and Furious AMD Kernels☆327Updated this week
- General Matrix Multiplication using NVIDIA Tensor Cores☆27Updated 11 months ago
- Learn CUDA with PyTorch☆154Updated last week
- ☆86Updated last month
- Unofficial description of the CUDA assembly (SASS) instruction sets.☆188Updated 5 months ago
- Low overhead tracing library and trace visualizer for pipelined CUDA kernels☆127Updated last month
- ☆32Updated 5 months ago
- Step by step implementation of a fast softmax kernel in CUDA☆58Updated 11 months ago
- QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning☆153Updated last month