A simple Flash Attention v2 implementation for ROCm (RDNA3 GPUs, rocWMMA), mainly used for Stable Diffusion (ComfyUI) in Windows ZLUDA environments.
☆51 · Aug 25, 2024 · Updated last year
Alternatives and similar repositories for flash-attention-v2-RDNA3-minimal
Users that are interested in flash-attention-v2-RDNA3-minimal are comparing it to the libraries listed below.
- Simple monkeypatch to boost AMD Navi 3 GPUs ☆48 · Apr 21, 2025 · Updated 11 months ago
- AITemplate is a Python framework which renders neural network into high performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (N… ☆12 · Jun 24, 2024 · Updated last year
- Fast and memory-efficient exact attention ported to ROCm ☆13 · Dec 1, 2023 · Updated 2 years ago
- A tiny implementation of in-place FFT. The performance is comparable to FFTW3 for length 2^17 to 2^20. ☆15 · Jul 24, 2018 · Updated 7 years ago
- Official repository for the paper Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regressi… ☆23 · Oct 1, 2025 · Updated 6 months ago
- ComfyUI custom nodes for DeepSeek, Qwen, GPT, and other OpenAI-compatible LLM APIs, with tools for chat, translation, vision, and JSON wo… ☆19 · Updated this week
- Optimized FP16/BF16 x FP4 GPU kernels for AMD GPUs ☆46 · Feb 21, 2026 · Updated last month
- Implement FlashAttention v2 with minimal code to learn. ☆16 · Jun 12, 2024 · Updated last year
- Image processing tool for ComfyUI ☆13 · Aug 6, 2025 · Updated 8 months ago
- 8-bit CUDA functions for PyTorch, ROCm compatible ☆41 · Mar 26, 2024 · Updated 2 years ago
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios. ☆16 · Aug 31, 2023 · Updated 2 years ago
- ComfyUI custom nodes for RVC related inference and image generation ☆37 · Oct 15, 2025 · Updated 5 months ago
- A forked version of flux-fast that makes flux-fast even faster with cache-dit, 3.3x speedup on NVIDIA L20. ☆24 · Jul 18, 2025 · Updated 8 months ago
- AI Tensor Engine for ROCm ☆402 · Updated this week
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference. ☆46 · Jun 11, 2025 · Updated 10 months ago
- NES emulator written in pure FreeBASIC with love by Blyss Sarania and Gavin Schulte (Nobbs66). ☆21 · Oct 29, 2025 · Updated 5 months ago
- Fast and memory-efficient exact attention ☆227 · Updated this week
- ☆162 · Sep 15, 2023 · Updated 2 years ago
- Standalone Flash Attention v2 kernel without libtorch dependency ☆113 · Sep 10, 2024 · Updated last year
- [DEPRECATED] Moved to ROCm/rocm-libraries repo ☆112 · Apr 7, 2026 · Updated last week
- Guides to hopefully simplify the process of using ROCm. ☆12 · Sep 26, 2024 · Updated last year
- A convenient fast Text to Speech Whisper Speech by Collabora you can train a voice on the fly on ComfyUI ☆43 · Mar 9, 2025 · Updated last year
- ☆24 · Jul 16, 2025 · Updated 8 months ago
- Quick and easy Diffusers CLI ☆15 · Apr 2, 2026 · Updated last week
- Development repository for the Triton language and compiler ☆144 · Updated this week
- The HIP Environment and ROCm Kit - A lightweight open source build system for HIP and ROCm ☆917 · Updated this week
- 8-bit CUDA functions for PyTorch ☆72 · Sep 24, 2025 · Updated 6 months ago
- A low-cost, high-performance deep learning training framework that enables efficient 100B-scale model fine-tuning on a commodity server w… ☆24 · Mar 21, 2025 · Updated last year
- ☆17 · Apr 30, 2025 · Updated 11 months ago
- ☆23 · May 22, 2024 · Updated last year
- ☆87 · Jan 23, 2025 · Updated last year
- hipDF - GPU DataFrame Library ☆16 · Mar 16, 2026 · Updated 3 weeks ago
- This project is about convolution operator optimization on GPU, including GEMM-based (implicit GEMM) convolution. ☆43 · Sep 29, 2025 · Updated 6 months ago
- 🤖 Telegram chatbot frontend for Searx. ☆15 · Nov 25, 2018 · Updated 7 years ago
- Repo for source files of Avnet MicroZed carrier boards ☆12 · Jan 9, 2025 · Updated last year
- ☆66 · Updated this week
- 🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA. ☆257 · Feb 13, 2026 · Updated 2 months ago
- Running PyTorch on Windows with AMD GPUs using alpha ROCm wheels. It's fast, it's fragile, and it hates you back. ☆24 · Oct 31, 2025 · Updated 5 months ago
- Expert Specialization MoE Solution based on CUTLASS ☆26 · Jan 19, 2026 · Updated 2 months ago