A simple Flash Attention v2 implementation with ROCm (RDNA3 GPU, rocWMMA), mainly used for Stable Diffusion (ComfyUI) in Windows ZLUDA environments.
☆52Aug 25, 2024Updated last year
Alternatives and similar repositories for flash-attention-v2-RDNA3-minimal
Users that are interested in flash-attention-v2-RDNA3-minimal are comparing it to the libraries listed below.
- Simple monkeypatch to boost AMD Navi 3 GPUs☆51Apr 21, 2025Updated last year
- AITemplate is a Python framework which renders neural network into high performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (N…☆12Jun 24, 2024Updated last year
- Fast and memory-efficient exact attention ported to ROCm☆14Dec 1, 2023Updated 2 years ago
- A tiny implementation of in-place FFT. The performance is comparable to FFTW3 for length 2^17 to 2^20.☆15Jul 24, 2018Updated 7 years ago
- Official repository for Flash Local Linear Attention☆36May 28, 2026Updated 2 weeks ago
- ComfyUI custom nodes for DeepSeek, Qwen, GPT, and other OpenAI-compatible LLM APIs, with tools for chat, translation, vision, and JSON wo…☆26Apr 23, 2026Updated last month
- The small, fast game engine for Compose Multiplatform☆10Feb 1, 2025Updated last year
- Flash Attention in raw CUDA C beating PyTorch☆39May 14, 2024Updated 2 years ago
- Implement FlashAttention v2 with minimal code to learn.☆16Jun 12, 2024Updated 2 years ago
- Image processing tool for ComfyUI☆13Aug 6, 2025Updated 10 months ago
- Optimized FP16/BF16 x FP4 GPU kernels for AMD GPUs☆54May 29, 2026Updated 2 weeks ago
- Installation script for AI applications using ROCm on Linux.☆48May 31, 2026Updated 2 weeks ago
- 8-bit CUDA functions for PyTorch, ROCm compatible☆42Mar 26, 2024Updated 2 years ago
- ComfyUI custom nodes for RVC related inference and image generation☆39Oct 15, 2025Updated 7 months ago
- A forked version of flux-fast that makes flux-fast even faster with cache-dit, 3.3x speedup on NVIDIA L20.☆24Jul 18, 2025Updated 10 months ago
- Everything you need to set up on your AMD system for machine learning☆19Jul 31, 2025Updated 10 months ago
- AI Tensor Engine for ROCm☆460Updated this week
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.☆47Jun 11, 2025Updated last year
- Fast and memory-efficient exact attention☆232Updated this week
- ☆165Sep 15, 2023Updated 2 years ago
- Lightweight Python Wrapper for OpenVINO, enabling LLM inference on NPUs☆29Dec 17, 2024Updated last year
- Standalone Flash Attention v2 kernel without libtorch dependency☆113Sep 10, 2024Updated last year
- [DEPRECATED] Moved to ROCm/rocm-libraries repo☆113Updated this week
- RWKV, in easy to read code☆73Mar 25, 2025Updated last year
- Guides to hopefully simplify the process of using ROCm.☆12Sep 26, 2024Updated last year
- A convenient, fast text-to-speech (WhisperSpeech by Collabora); you can train a voice on the fly in ComfyUI☆44Mar 9, 2025Updated last year
- AutoHotKey script to translate Joystick movement to keypresses.☆12Jun 9, 2014Updated 12 years ago
- ☆14Feb 23, 2025Updated last year
- Development repository for the Triton language and compiler☆144Jun 5, 2026Updated last week
- ☆15Apr 14, 2026Updated 2 months ago
- YOLOX with NCNN/MNN/TNN/ONNXRuntime C++.☆13Dec 18, 2021Updated 4 years ago
- A low-cost, high-performance deep learning training framework that enables efficient 100B-scale model fine-tuning on a commodity server w…☆23Mar 21, 2025Updated last year
- ☆25May 23, 2026Updated 3 weeks ago
- CUDA Embedding Lookup Kernel Library☆47Feb 9, 2026Updated 4 months ago
- The HIP Environment and ROCm Kit - A lightweight open source build system for HIP and ROCm☆1,077Updated this week
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs☆95Updated this week
- CUDA on AMD GPUs☆611Feb 11, 2026Updated 4 months ago
- hipDF - GPU DataFrame Library☆16Mar 16, 2026Updated 2 months ago
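- The samples in this repository target the Orangepi ai pro Ascend 310B platform; they deploy and optimize the official USB camera yolov5 object detection sample in a ROS2 environment.☆14Mar 14, 2024Updated 2 years ago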