luongthecong123 / fp8-quant-matmul
Block scaling for fp8-quantized matrix multiplication. Solution to the GPU MODE AMD challenge. Additionally, this repo includes code for quantizing PyTorch bf16 matmul with fp8.
☆15 · Updated this week
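The core idea, blockwise scaling, can be sketched in plain PyTorch: each block of K columns gets its own scale factor, so fp8's narrow dynamic range tracks local magnitudes instead of the whole row's. The sketch below is illustrative, not the repo's actual API: the helper names, the 128-wide block size, and the e4m3 format are all assumptions, and the dequantize-then-accumulate loop is a slow reference rather than a tuned kernel.

```python
import torch

def quantize_blockwise_fp8(x: torch.Tensor, block_size: int = 128):
    """Quantize a (M, K) matrix to fp8 (e4m3) with one scale per K-block.
    Hypothetical helper illustrating block scaling; not the repo's API."""
    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    m, k = x.shape
    assert k % block_size == 0, "K must be divisible by the block size"
    # Reshape to (M, K // block_size, block_size): one scale per block.
    blocks = x.float().reshape(m, k // block_size, block_size)
    scale = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    q = (blocks / scale).to(torch.float8_e4m3fn)
    return q.reshape(m, k), scale.squeeze(-1)  # fp8 data + per-block scales

def matmul_fp8_blockwise(a_bf16: torch.Tensor, b_bf16: torch.Tensor,
                         block_size: int = 128) -> torch.Tensor:
    """Reference fp8 matmul: quantize both operands blockwise along K,
    then dequantize block by block and accumulate in fp32."""
    a_q, a_s = quantize_blockwise_fp8(a_bf16, block_size)
    b_q, b_s = quantize_blockwise_fp8(b_bf16.t().contiguous(), block_size)
    m, k = a_q.shape
    n = b_q.shape[0]
    out = torch.zeros(m, n, dtype=torch.float32, device=a_bf16.device)
    for i in range(k // block_size):
        sl = slice(i * block_size, (i + 1) * block_size)
        a_blk = a_q[:, sl].float() * a_s[:, i:i + 1]  # dequantize A block
        b_blk = b_q[:, sl].float() * b_s[:, i:i + 1]  # dequantize B block
        out += a_blk @ b_blk.t()                      # fp32 accumulation
    return out.to(torch.bfloat16)

# Quick check against the unquantized bf16 matmul (error grows with K):
# a = torch.randn(256, 512, dtype=torch.bfloat16)
# b = torch.randn(512, 128, dtype=torch.bfloat16)
# ref = (a.float() @ b.float()).to(torch.bfloat16)
# out = matmul_fp8_blockwise(a, b)
```

A real kernel fuses the per-block scale multiply into the GEMM epilogue instead of dequantizing in a Python loop; the sketch only makes the numerics of block scaling explicit.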
Alternatives and similar repositories for fp8-quant-matmul
Users interested in fp8-quant-matmul are comparing it to the libraries listed below.
- My submission for the GPUMODE/AMD fp8 mm challenge ☆27 · Updated 2 months ago
- Samples of good AI-generated CUDA kernels ☆86 · Updated 2 months ago
- General Matrix Multiplication using NVIDIA Tensor Cores ☆18 · Updated 6 months ago
- TritonParse: A Compiler Tracer, Visualizer, and mini-Reproducer (WIP) for Triton Kernels ☆139 · Updated this week
- ☆75 · Updated last month
- Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X ☆60 · Updated last week
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆69 · Updated 3 weeks ago
- High-Performance SGEMM on CUDA devices ☆98 · Updated 6 months ago
- Coding CUDA every day! ☆53 · Updated 3 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆48 · Updated last week
- PTX-Tutorial Written Purely By AIs (Deep Research by OpenAI and Claude 3.7) ☆66 · Updated 4 months ago
- Learning about CUDA by writing PTX code. ☆133 · Updated last year
- LLM Inference on consumer devices ☆123 · Updated 4 months ago
- ☆33 · Updated 3 weeks ago
- Custom PTX Instruction Benchmark ☆126 · Updated 5 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆93 · Updated last month
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning ☆131 · Updated this week
- Attention in SRAM on Tenstorrent Grayskull ☆37 · Updated last year
- ☆47 · Updated 7 months ago
- ☆44 · Updated last month
- [WIP] Better (FP8) attention for Hopper ☆32 · Updated 5 months ago
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆109 · Updated 9 months ago
- PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation ☆30 · Updated 8 months ago
- ☆66 · Updated this week
- ☆60 · Updated 3 months ago
- ☆145 · Updated last month
- Framework to reduce autotune overhead to zero for well-known deployments. ☆79 · Updated 2 weeks ago
- FlashRNN - Fast RNN Kernels with I/O Awareness ☆93 · Updated last month
- ☆41 · Updated 3 months ago
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆43 · Updated 4 months ago