Fast and Furious AMD Kernels
☆372 · Feb 26, 2026 · Updated last week
Alternatives and similar repositories for HipKittens
Users who are interested in HipKittens are comparing it to the libraries listed below.
- ☆23 · Jul 11, 2025 · Updated 7 months ago
- Persistent dense gemm for Hopper in `CuTeDSL` ☆15 · Aug 9, 2025 · Updated 6 months ago
- AI Tensor Engine for ROCm ☆367 · Updated this week
- Super fast FP32 matrix multiplication on RDNA3 ☆87 · Mar 30, 2025 · Updated 11 months ago
- DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling ☆21 · Feb 9, 2026 · Updated last month
- High Performance FP8 GEMM Kernels for SM89 and later GPUs. ☆20 · Jan 24, 2025 · Updated last year
- ☆52 · May 19, 2025 · Updated 9 months ago
- Official Repository of Native Parallel Reasoner ☆102 · Feb 5, 2026 · Updated last month
- Ahead of Time (AOT) Triton Math Library ☆93 · Updated this week
- ☆53 · Feb 24, 2026 · Updated last week
- ☆65 · Apr 26, 2025 · Updated 10 months ago
- [DEPRECATED] Moved to ROCm/rocm-systems repo ☆165 · Feb 16, 2026 · Updated 2 weeks ago
- ☆262 · Jul 11, 2024 · Updated last year
- A practical way of learning Swizzle ☆37 · Feb 3, 2025 · Updated last year
- Sample Codes using NVSHMEM on Multi-GPU ☆30 · Jan 22, 2023 · Updated 3 years ago
- amdgpu example code in hip/asm ☆56 · Updated this week
- Accelerate LLM preference tuning via prefix sharing with a single line of code ☆51 · Jul 4, 2025 · Updated 8 months ago
- Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core. ☆73 · Sep 8, 2024 · Updated last year
- Tile primitives for speedy kernels ☆3,202 · Feb 24, 2026 · Updated last week
- CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning ☆445 · Jan 8, 2026 · Updated 2 months ago
- ☆104 · Sep 9, 2024 · Updated last year
- Ship correct and fast LLM kernels to PyTorch ☆144 · Jan 14, 2026 · Updated last month
- AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming ☆179 · Updated this week
- Read-only mirror. Please submit merge requests / issues to https://gitlab.com/libvirt/libvirt-sandbox ☆13 · Aug 22, 2023 · Updated 2 years ago
- A toolkit for developers to simplify the transformation of nn.Module instances. It's now corresponding to Pytorch.fx. ☆13 · Apr 7, 2023 · Updated 2 years ago
- ☆13 · Feb 10, 2026 · Updated 3 weeks ago
- ☆25 · Sep 19, 2025 · Updated 5 months ago
- MSLK (Meta Superintelligence Labs Kernels) is a collection of PyTorch GPU operator libraries that are designed and optimized for GenAI tr… ☆55 · Mar 1, 2026 · Updated last week
- NVIDIA cuTile learn ☆165 · Dec 9, 2025 · Updated 2 months ago
- [DEPRECATED] Moved to ROCm/rocm-systems repo ☆145 · Feb 23, 2026 · Updated last week
- ☆63 · Updated this week
- Row-wise block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge. ☆17 · Feb 9, 2026 · Updated 3 weeks ago
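- 🍏 A repo prepared specifically for the 2024 InternLM (书生·浦语) Large Model Challenge (Spring Season) 🍎 Includes fine-tuning source code related to Holo (赫萝) ☆12 · Sep 20, 2024 · Updated last year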
- Mirage Persistent Kernel: Compiling LLMs into a MegaKernel ☆2,148 · Feb 23, 2026 · Updated last week
- PeRL: Parameter-Efficient Reinforcement Learning ☆71 · Feb 23, 2026 · Updated last week
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆215 · Updated this week
- DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit ☆92 · Jan 26, 2026 · Updated last month
- TensaLang is a Tensor-first programming language, compiler, and runtime that let you write the Model’s inference engine (e.g. LLMs) and s… ☆71 · Feb 20, 2026 · Updated 2 weeks ago
- A Quirky Assortment of CuTe Kernels ☆838 · Updated this week