Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X
☆75 · Feb 11, 2026 · Updated 2 weeks ago
Alternatives and similar repositories for RadeonFlow_Kernels
Users who are interested in RadeonFlow_Kernels are comparing it to the libraries listed below:
- My submission for the GPUMODE/AMD fp8 mm challenge ☆29 · Jun 4, 2025 · Updated 8 months ago
- ☆32 · Jul 2, 2025 · Updated 8 months ago
- Wave: Python Domain-Specific Language for High Performance Machine Learning ☆45 · Updated this week
- ☆18 · Jun 6, 2025 · Updated 8 months ago
- ☆15 · Updated this week
- ☆18 · Nov 11, 2025 · Updated 3 months ago
- Row-wise block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge. ☆17 · Feb 9, 2026 · Updated 3 weeks ago
- ☆18 · Dec 2, 2024 · Updated last year
- ☆44 · Updated this week
- Ahead of Time (AOT) Triton Math Library ☆92 · Updated this week
- [DEPRECATED] Moved to ROCm/rocm-libraries repo ☆139 · Updated this week
- ☆118 · May 19, 2025 · Updated 9 months ago
- Hosting a tutorial documentation for running Isaac ROS Visual SLAM on Jetson device. ☆26 · Feb 28, 2024 · Updated 2 years ago
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance. ☆147 · May 10, 2025 · Updated 9 months ago
- A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators ☆127 · Nov 14, 2025 · Updated 3 months ago
- AI Tensor Engine for ROCm ☆360 · Updated this week
- Samples of good AI generated CUDA kernels ☆100 · May 30, 2025 · Updated 9 months ago
- Prepare for DeepSeek R1 inference: Benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code. ☆73 · Feb 2, 2025 · Updated last year
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels ☆195 · Updated this week
- ☆28 · Dec 3, 2025 · Updated 2 months ago
- Detailed, bilingually annotated word2vec source code (well-annotated word2vec) ☆10 · Oct 3, 2021 · Updated 4 years ago
- ☆261 · Jul 11, 2024 · Updated last year
- Official implementation of REArtGS (NeurIPS 2025) ☆19 · Oct 24, 2025 · Updated 4 months ago
- ☆10 · Nov 1, 2021 · Updated 4 years ago
- Slimebound character mod for Slay the Spire ☆14 · Jun 30, 2020 · Updated 5 years ago
- [ICML 2024] AutoOS: Make Your OS More Powerful by Exploiting Large Language Models ☆14 · Dec 10, 2025 · Updated 2 months ago
- Simply drag and drop your PDF files into Preve to get started. Ask Preve questions about your document. Get Summaries, key points, specif… ☆11 · Apr 5, 2025 · Updated 10 months ago
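- UCAS (University of Chinese Academy of Sciences) graduate course: Advanced Operating Systems, 2023 discussion questions ☆12 · Dec 24, 2023 · Updated 2 years ago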
- The simplest but fast implementation of matrix multiplication in CUDA. ☆40 · Jul 26, 2024 · Updated last year
- Fastest kernels written from scratch ☆548 · Sep 18, 2025 · Updated 5 months ago
- AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming ☆177 · Updated this week
- Stackfish is an open-source LLM-powered pipeline designed to automatically solve competitive programming problems. ☆53 · Dec 14, 2024 · Updated last year
- Learning materials for Stanford Compiler course : CS143 ☆18 · Oct 19, 2021 · Updated 4 years ago
- Restores the "Run with graphics processor" option to the context menu ☆16 · Nov 16, 2024 · Updated last year
- ☆39 · Oct 29, 2025 · Updated 4 months ago
- Scriptable interface to a powerful, multi-lingual language server ☆32 · Feb 21, 2026 · Updated last week
- OpenGL Projects ☆10 · Jan 7, 2023 · Updated 3 years ago
- Brax + Pufferlib + CARBS for gpu-accelerated robotics RL ☆12 · Jun 12, 2025 · Updated 8 months ago
- See https://github.com/cuda-mode/triton-index/ instead! ☆11 · May 8, 2024 · Updated last year