RadeonFlow / RadeonFlow_Kernels
Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X
☆75 Updated 2 months ago
Alternatives and similar repositories for RadeonFlow_Kernels
Users who are interested in RadeonFlow_Kernels compare it to the repositories listed below.
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels ☆186 Updated last week
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆319 Updated this week
- ☆277 Updated this week
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆194 Updated this week
- QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning ☆163 Updated 2 months ago
- Helpful kernel tutorials and examples for tile-based GPU programming ☆592 Updated last week
- Autonomous GPU Kernel Generation via Deep Agents ☆223 Updated this week
- ☆101 Updated last year
- AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming ☆164 Updated this week
- Fastest kernels written from scratch ☆528 Updated 4 months ago
- ☆128 Updated 3 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆105 Updated 7 months ago
- NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com… ☆459 Updated 3 weeks ago
- kernels, of the mega variety ☆657 Updated 4 months ago
- Evaluating Large Language Models for CUDA Code Generation. ComputeEval is a framework designed to generate and evaluate CUDA code from Lar… ☆92 Updated 3 weeks ago
- ☆258 Updated last year
- Tilus is a tile-level kernel programming language with explicit control over shared memory and registers. ☆440 Updated last month
- Low overhead tracing library and trace visualizer for pipelined CUDA kernels ☆129 Updated 2 months ago
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆250 Updated 8 months ago
- ☆128 Updated 5 months ago
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance. ☆145 Updated 8 months ago
- Cataloging released Triton kernels. ☆289 Updated 4 months ago
- NVIDIA cuTile learn ☆150 Updated last month
- ☆117 Updated 8 months ago
- Our first fully AI generated deep learning system ☆247 Updated last week
- CUTLASS and CuTe Examples ☆117 Updated 2 months ago
- This repository contains companion software for the Colfax Research paper "Categorical Foundations for CuTe Layouts". ☆95 Updated 4 months ago
- Accelerating MoE with IO and Tile-aware Optimizations ☆563 Updated last week
- An experimental CPU backend for Triton ☆173 Updated 2 months ago
- incubator repo for CUDA-TileIR backend ☆86 Updated 2 weeks ago