luongthecong123 / fp8-quant-matmul
Block scaling for fp8-quantized matrix multiplication. Solution to the GPU MODE AMD challenge. Additionally, this repo includes code for quantizing PyTorch bf16 matmul with fp8.
☆15 · Updated this week
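The core idea, blockwise scaling, can be sketched in plain PyTorch: each block of K columns gets its own scale factor, so fp8's narrow dynamic range tracks local magnitudes instead of the whole row's. The sketch below is illustrative, not the repo's actual API: the helper names, the 128-wide block size, and the e4m3 format are all assumptions, and the dequantize-then-accumulate loop is a slow reference rather than a tuned kernel.

```python
import torch

def quantize_blockwise_fp8(x: torch.Tensor, block_size: int = 128):
    """Quantize a (M, K) matrix to fp8 (e4m3) with one scale per K-block.
    Hypothetical helper illustrating block scaling; not the repo's API."""
    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    m, k = x.shape
    assert k % block_size == 0, "K must be divisible by the block size"
    # Reshape to (M, K // block_size, block_size): one scale per block.
    blocks = x.float().reshape(m, k // block_size, block_size)
    scale = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    q = (blocks / scale).to(torch.float8_e4m3fn)
    return q.reshape(m, k), scale.squeeze(-1)  # fp8 data + per-block scales

def matmul_fp8_blockwise(a_bf16: torch.Tensor, b_bf16: torch.Tensor,
                         block_size: int = 128) -> torch.Tensor:
    """Reference fp8 matmul: quantize both operands blockwise along K,
    then dequantize block by block and accumulate in fp32."""
    a_q, a_s = quantize_blockwise_fp8(a_bf16, block_size)
    b_q, b_s = quantize_blockwise_fp8(b_bf16.t().contiguous(), block_size)
    m, k = a_q.shape
    n = b_q.shape[0]
    out = torch.zeros(m, n, dtype=torch.float32, device=a_bf16.device)
    for i in range(k // block_size):
        sl = slice(i * block_size, (i + 1) * block_size)
        a_blk = a_q[:, sl].float() * a_s[:, i:i + 1]  # dequantize A block
        b_blk = b_q[:, sl].float() * b_s[:, i:i + 1]  # dequantize B block
        out += a_blk @ b_blk.t()                      # fp32 accumulation
    return out.to(torch.bfloat16)

# Quick check against the unquantized bf16 matmul (error grows with K):
# a = torch.randn(256, 512, dtype=torch.bfloat16)
# b = torch.randn(512, 128, dtype=torch.bfloat16)
# ref = (a.float() @ b.float()).to(torch.bfloat16)
# out = matmul_fp8_blockwise(a, b)
```

A real kernel fuses the per-block scale multiply into the GEMM epilogue instead of dequantizing in a Python loop; the sketch only makes the numerics of block scaling explicit.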
Alternatives and similar repositories for fp8-quant-matmul
Users interested in fp8-quant-matmul are comparing it to the libraries listed below.
- My submission for the GPUMODE/AMD fp8 mm challenge ☆27 · Updated 2 months ago
- Samples of good AI-generated CUDA kernels ☆86 · Updated 2 months ago
- General Matrix Multiplication using NVIDIA Tensor Cores ☆18 · Updated 6 months ago
- TritonParse: A Compiler Tracer, Visualizer, and mini-Reproducer (WIP) for Triton Kernels ☆139 · Updated this week
- ☆75 · Updated last month
- Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X ☆60 · Updated last week
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆69 · Updated 3 weeks ago
- High-Performance SGEMM on CUDA devices ☆98 · Updated 6 months ago
- Coding CUDA every day! ☆53 · Updated 3 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆48 · Updated last week
- PTX-Tutorial Written Purely By AIs (Deep Research by OpenAI and Claude 3.7) ☆66 · Updated 4 months ago
- Learning about CUDA by writing PTX code. ☆133 · Updated last year
- LLM Inference on consumer devices ☆123 · Updated 4 months ago
- ☆33 · Updated 3 weeks ago
- Custom PTX Instruction Benchmark ☆126 · Updated 5 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆93 · Updated last month
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning ☆131 · Updated this week
- Attention in SRAM on Tenstorrent Grayskull ☆37 · Updated last year
- ☆47 · Updated 7 months ago
- ☆44 · Updated last month
- [WIP] Better (FP8) attention for Hopper ☆32 · Updated 5 months ago
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆109 · Updated 9 months ago
- PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation ☆30 · Updated 8 months ago
- ☆66 · Updated this week
- ☆60 · Updated 3 months ago
- ☆145 · Updated last month
- Framework to reduce autotune overhead to zero for well-known deployments. ☆79 · Updated 2 weeks ago
- FlashRNN - Fast RNN Kernels with I/O Awareness ☆93 · Updated last month
- ☆41 · Updated 3 months ago
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆43 · Updated 4 months ago