Repeerc / flash-attention-v2-RDNA3-minimal
A simple Flash Attention v2 implementation with ROCm (RDNA3 GPUs, rocWMMA), mainly used for Stable Diffusion (ComfyUI) in Windows ZLUDA environments.
☆34 · Updated 4 months ago
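For context, a kernel like the one described above builds on RDNA3's WMMA instructions through the rocWMMA C++ API. The sketch below is not taken from this repository; the kernel name, tile shape, and leading dimensions are illustrative assumptions. It shows the kind of 16x16x16 fp16 tile multiply that one Q·Kᵀ score block in a Flash Attention forward pass reduces to.

```cpp
// Minimal sketch (assumed names/shapes): one wavefront computes a single
// 16x16 tile of the attention scores S = Q * K^T with rocWMMA on RDNA3.
// A full Flash Attention v2 kernel loops this over K/V tiles while keeping
// running row-wise max/sum statistics for the online softmax.
#include <hip/hip_runtime.h>
#include <rocwmma/rocwmma.hpp>

using rocwmma::float16_t;
using rocwmma::float32_t;

__global__ void qk_tile_sketch(const float16_t* Q,  // 16 x 16, row-major
                               const float16_t* K,  // 16 x 16, row-major (keys as rows)
                               float32_t* S,        // 16 x 16, row-major output tile
                               uint32_t ldq, uint32_t ldk, uint32_t lds)
{
    // Fragments for one 16x16x16 fp16 WMMA tile; loading row-major K as a
    // col_major matrix_b yields the K^T operand without an explicit transpose.
    rocwmma::fragment<rocwmma::matrix_a, 16, 16, 16, float16_t, rocwmma::row_major> fragQ;
    rocwmma::fragment<rocwmma::matrix_b, 16, 16, 16, float16_t, rocwmma::col_major> fragKt;
    rocwmma::fragment<rocwmma::accumulator, 16, 16, 16, float32_t> fragS;

    rocwmma::fill_fragment(fragS, 0.0f);
    rocwmma::load_matrix_sync(fragQ, Q, ldq);
    rocwmma::load_matrix_sync(fragKt, K, ldk);
    rocwmma::mma_sync(fragS, fragQ, fragKt, fragS);   // S += Q * K^T
    rocwmma::store_matrix_sync(S, fragS, lds, rocwmma::mem_row_major);
}
```

Such a tile would be compiled with hipcc for a gfx11xx target and driven by one wavefront; the 1/sqrt(d) scaling, online softmax, and the P·V accumulation would follow in the surrounding loop of an actual Flash Attention v2 kernel.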
Alternatives and similar repositories for flash-attention-v2-RDNA3-minimal:
Users interested in flash-attention-v2-RDNA3-minimal are comparing it to the libraries listed below.
- Fast and memory-efficient exact attention ☆151 · Updated this week
- Context parallel attention that accelerates DiT model inference with dynamic caching ☆147 · Updated this week
- ☆99 · Updated 3 weeks ago
- AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (N… ☆11 · Updated 6 months ago
- hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditiona… ☆73 · Updated this week
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆302 · Updated 3 weeks ago
- Standalone Flash Attention v2 kernel without libtorch dependency ☆99 · Updated 4 months ago
- ☆54 · Updated 3 weeks ago
- (WIP) Parallel inference for black-forest-labs' FLUX model. ☆17 · Updated 2 months ago
- llama.cpp fork with additional SOTA quants and improved performance ☆126 · Updated this week
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5). ☆230 · Updated 2 months ago
- 8-bit CUDA functions for PyTorch ☆42 · Updated this week
- Development repository for the Triton language and compiler ☆102 · Updated this week
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆237 · Updated 3 months ago
- High-speed GEMV kernels, up to 2.7x speedup over the PyTorch baseline. ☆93 · Updated 6 months ago
- PyTorch half-precision GEMM library with fused optional bias and optional ReLU/GELU ☆47 · Updated last month
- ☆56 · Updated 3 months ago
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, achieve peak⚡️ performance ☆43 · Updated this week
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆94 · Updated last month
- FP8 flash attention implemented on the Ada architecture using the cutlass library ☆52 · Updated 5 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆219 · Updated this week
- A CUDA kernel for NHWC GroupNorm for PyTorch ☆16 · Updated 2 months ago
- rocWMMA ☆97 · Updated this week
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆64 · Updated this week
- ☆54 · Updated last month
- ☆62 · Updated last month
- ☆154 · Updated 3 weeks ago
- 📚[WIP] FFPA: Yet another Faster Flash Prefill Attention with O(1)⚡️GPU SRAM complexity for headdim > 256, 1.8x~3x↑🎉 faster vs SDPA EA. ☆49 · Updated this week
- Multiple GEMM operators are constructed with cutlass to support LLM inference. ☆16 · Updated 3 months ago
- A parallel VAE that avoids OOM for high-resolution image generation ☆50 · Updated last week