A simple Flash Attention v2 implementation for ROCm (RDNA3 GPUs, rocWMMA), mainly used for Stable Diffusion (ComfyUI) in Windows ZLUDA environments.
☆51 · Aug 25, 2024 · Updated last year
Alternatives and similar repositories for flash-attention-v2-RDNA3-minimal
Users interested in flash-attention-v2-RDNA3-minimal are comparing it to the libraries listed below.
- AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (N… ☆12 · Jun 24, 2024 · Updated last year
- LLM training in simple, raw C/HIP for AMD GPUs ☆58 · Sep 23, 2024 · Updated last year
- Official repository for the paper Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regressi… ☆23 · Oct 1, 2025 · Updated 5 months ago
- A tiny implementation of in-place FFT. The performance is comparable to FFTW3 for lengths 2^17 to 2^20. ☆15 · Jul 24, 2018 · Updated 7 years ago
- Fast and memory-efficient exact attention ported to ROCm ☆13 · Dec 1, 2023 · Updated 2 years ago
- Installation script for AI applications using ROCm on Linux. ☆39 · Updated this week
- Flash Attention in raw CUDA C beating PyTorch ☆37 · May 14, 2024 · Updated last year
- A forked version of flux-fast that makes flux-fast even faster with cache-dit, 3.3x speedup on NVIDIA L20. ☆24 · Jul 18, 2025 · Updated 7 months ago
- Image processing tool for ComfyUI ☆12 · Aug 6, 2025 · Updated 6 months ago
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios. ☆16 · Aug 31, 2023 · Updated 2 years ago
- YOLOX with NCNN/MNN/TNN/ONNXRuntime C++. ☆13 · Dec 18, 2021 · Updated 4 years ago
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA cores for the decoding stage of LLM inference. ☆46 · Jun 11, 2025 · Updated 8 months ago
- Everything you need to set up on your AMD system for Machine Learning Stuff ☆19 · Jul 31, 2025 · Updated 7 months ago
- Fast and memory-efficient exact attention ☆221 · Updated this week
- NES emulator written in pure FreeBASIC with love by Blyss Sarania and Gavin Schulte (Nobbs66). ☆21 · Oct 29, 2025 · Updated 4 months ago
- ☆160 · Sep 15, 2023 · Updated 2 years ago
- [DEPRECATED] Moved to ROCm/rocm-libraries repo ☆113 · Updated this week
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance. ☆147 · May 10, 2025 · Updated 9 months ago
- Standalone Flash Attention v2 kernel without libtorch dependency ☆114 · Sep 10, 2024 · Updated last year
- Writing a CUDA software ray tracing renderer with Analysis-Driven Optimization from scratch: a python-importable, distributed parallel re… ☆37 · Oct 5, 2025 · Updated 4 months ago
- Replaces the head in the YOLOv5-Lite code with the YOLOX head ☆22 · Mar 22, 2022 · Updated 3 years ago
- ☆67 · Oct 25, 2025 · Updated 4 months ago
- AI Tensor Engine for ROCm ☆360 · Updated this week
- 8-bit CUDA functions for PyTorch ☆70 · Sep 24, 2025 · Updated 5 months ago
- ComfyUI custom nodes for RVC related inference and image generation ☆36 · Oct 15, 2025 · Updated 4 months ago
- ☆10 · Dec 25, 2022 · Updated 3 years ago
- Running ComfyUI with AMD + ZLUDA (Windows) ☆37 · Nov 2, 2024 · Updated last year
- Hackable and optimized Transformers building blocks, supporting a composable construction. ☆34 · Feb 24, 2026 · Updated last week
- [DEPRECATED] Moved to ROCm/rocm-libraries repo ☆139 · Updated this week
- A complete package that provides you with all the components needed to get started or dive deeper into Machine Learning Workloads on Cons… ☆50 · Updated this week
- FP8 flash attention implemented on the Ada architecture using the cutlass repository ☆79 · Aug 12, 2024 · Updated last year
- Development repository for the Triton language and compiler ☆141 · Updated this week
- ☆85 · Jan 23, 2025 · Updated last year
- The HIP Environment and ROCm Kit - A lightweight open source build system for HIP and ROCm ☆804 · Updated this week
- ☆12 · Apr 17, 2025 · Updated 10 months ago
- A powerful ComfyUI node for rendering text with advanced styling options, including full support for Persian/Farsi and Arabic scripts. ☆28 · May 23, 2025 · Updated 9 months ago
- Correct installation of ComfyUI ☆12 · Aug 29, 2024 · Updated last year
- 🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA. ☆251 · Feb 13, 2026 · Updated 2 weeks ago
- This is the repository for 4D mmWave Radar for Sensing Enhancement in Adverse Environments: Advances and Challenges ☆17 · Jan 20, 2026 · Updated last month