Repeerc / flash-attention-v2-RDNA3-minimal
A simple Flash Attention v2 implementation for ROCm (RDNA3 GPUs, rocWMMA), mainly used for Stable Diffusion (ComfyUI) in Windows ZLUDA environments.
☆22 · Updated 2 months ago
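For orientation, the core of FlashAttention-2 is a tiled forward pass that streams over blocks of K/V while keeping a running row maximum and softmax denominator, so the full attention matrix is never materialized. The sketch below is an illustrative NumPy reference of that recurrence, not code from this repository (the function name and block size are made up for the example); the actual kernel here is a fused HIP/rocWMMA implementation.

```python
import numpy as np

def flash_attention_v2_reference(Q, K, V, block_k=64):
    """Illustrative (non-fused) reference of the FlashAttention-2 forward pass:
    iterate over key/value blocks, maintaining a running row max, a running
    softmax denominator, and an unnormalized output accumulator."""
    n_q, d = Q.shape
    n_k = K.shape[0]
    scale = 1.0 / np.sqrt(d)

    O = np.zeros((n_q, d), dtype=np.float32)            # unnormalized output
    row_max = np.full(n_q, -np.inf, dtype=np.float32)   # running max per query row
    row_sum = np.zeros(n_q, dtype=np.float32)           # running softmax denominator

    for start in range(0, n_k, block_k):
        Kb = K[start:start + block_k]
        Vb = V[start:start + block_k]

        S = (Q @ Kb.T) * scale                  # scores for this K block only
        new_max = np.maximum(row_max, S.max(axis=1))

        P = np.exp(S - new_max[:, None])        # probabilities re-centered at new max
        correction = np.exp(row_max - new_max)  # rescale previously accumulated results

        O = O * correction[:, None] + P @ Vb
        row_sum = row_sum * correction + P.sum(axis=1)
        row_max = new_max

    return O / row_sum[:, None]

# Sanity check against naive softmax attention
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)).astype(np.float32) for _ in range(3))
S = (Q @ K.T) / np.sqrt(64)
ref = np.exp(S - S.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
out = flash_attention_v2_reference(Q, K, V)
print("max abs error vs naive attention:", np.abs(out - ref).max())
```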
Related projects
Alternatives and complementary repositories for flash-attention-v2-RDNA3-minimal
- Fast and memory-efficient exact attention ☆138 · Updated this week
- Development repository for the Triton language and compiler ☆93 · Updated this week
- AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (N… ☆11 · Updated 4 months ago
- SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models ☆246 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆44 · Updated this week
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆188 · Updated this week
- Standalone Flash Attention v2 kernel without libtorch dependency ☆98 · Updated 2 months ago
- [ECCV24] MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization ☆30 · Updated 2 months ago
- 📖A small curated list of Awesome SD/DiT/ViT/Diffusion Inference with Distributed/Caching/Sampling: DistriFusion, PipeFusion, AsyncDiff, … ☆89 · Updated 2 months ago
- High-speed GEMV kernels, up to 2.7x speedup over the PyTorch baseline ☆87 · Updated 4 months ago
- ☆82 · Updated last year
- A parallel VAE that avoids OOM for high-resolution image generation ☆40 · Updated last month
- rocWMMA ☆91 · Updated this week
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆222 · Updated last month
- hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditiona… ☆61 · Updated this week
- 8-bit CUDA functions for PyTorch ☆38 · Updated this week
- Fast Inference of MoE Models with CPU-GPU Orchestration ☆170 · Updated 2 weeks ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5) ☆196 · Updated 2 weeks ago
- (WIP) Parallel inference for black-forest-labs' FLUX model. ☆4 · Updated last week
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆76 · Updated last month
- (ICML 2024) BiLLM: Pushing the Limit of Post-Training Quantization for LLMs ☆193 · Updated 5 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆184 · Updated last month
- An algorithm for static activation quantization of LLMs ☆68 · Updated this week
- Efficient 3bit/4bit quantization of LLaMA models ☆19 · Updated last year
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆89 · Updated this week
- ☆156 · Updated last year
- Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" adapted for Llama models ☆36 · Updated last year
- FP8 flash attention for the Ada architecture, implemented with the cutlass library ☆52 · Updated 3 months ago
- Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators ☆309 · Updated this week
- ☆46 · Updated last month