ROCm / xformers
Hackable and optimized Transformers building blocks, supporting a composable construction.
☆20 · Updated this week
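A minimal sketch of how the xformers building blocks are typically called, assuming the upstream `xformers.ops.memory_efficient_attention` API is exposed by this ROCm build (exact op coverage on ROCm is an assumption here, not something this listing confirms):

```python
# Minimal sketch, assuming the upstream xformers API; ROCm op coverage may differ.
import torch
import xformers.ops as xops

# Upstream layout convention: (batch, seq_len, num_heads, head_dim).
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)

# Memory-efficient attention; attn_bias is optional (here a causal mask).
out = xops.memory_efficient_attention(
    q, k, v,
    attn_bias=xops.LowerTriangularMask(),
)
print(out.shape)  # torch.Size([2, 1024, 8, 64])
```

On a ROCm install of PyTorch the HIP device still appears as `"cuda"`, so the same snippet is intended to run unchanged on AMD GPUs.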
Related projects
Alternatives and complementary repositories for xformers
- 8-bit CUDA functions for PyTorch (☆39 · Updated 2 weeks ago)
- Fast and memory-efficient exact attention (☆140 · Updated this week; see the attention usage sketch after this list)
- Development repository for the Triton language and compiler (☆96 · Updated this week)
- AMD-related optimizations for transformer models (☆57 · Updated 3 weeks ago)
- hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditiona… (☆63 · Updated this week)
- A simple Flash Attention v2 implementation with ROCm (RDNA3 GPU, roc wmma), mainly used for Stable Diffusion (ComfyUI) in Windows ZLUDA en… (☆26 · Updated 3 months ago)
- Ahead of Time (AOT) Triton Math Library (☆41 · Updated this week)
- Simple and fast low-bit matmul kernels in CUDA / Triton (☆147 · Updated this week)
- [WIP] Context parallel attention that works with torch.compile (☆52 · Updated this week)
- A parallel VAE that avoids OOM for high-resolution image generation (☆40 · Updated 2 months ago)
- PyTorch half-precision GEMM lib with fused optional bias + optional ReLU/GELU (☆39 · Updated 2 months ago)
- A high-throughput and memory-efficient inference and serving engine for LLMs (☆45 · Updated this week)
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5) (☆212 · Updated 3 weeks ago)
- (WIP) Parallel inference for black-forest-labs' FLUX model (☆11 · Updated last week)
- AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (N… (☆11 · Updated 5 months ago)
- Odysseus: Playground of LLM Sequence Parallelism (☆57 · Updated 5 months ago)
- KV cache compression for high-throughput LLM inference (☆89 · Updated last week)
- Fast and memory-efficient exact attention (☆30 · Updated last month)
- LLaMA INT4 CUDA inference with AWQ (☆48 · Updated 4 months ago)
- Flash Attention implemented using CuTe (☆39 · Updated this week)
- Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding (☆82 · Updated last week)
- Faster PyTorch bitsandbytes 4-bit FP4 nn.Linear ops (☆23 · Updated 8 months ago)
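Several of the attention projects above (the flash-attention forks and the CuTe/Triton reimplementations) are most easily exercised through PyTorch's built-in scaled dot-product attention front end, which routes to a fused flash or memory-efficient kernel when one is available for the installed backend. A minimal sketch, assuming PyTorch 2.x; which fused backend actually gets picked on a given CUDA or ROCm build is an assumption, not a guarantee:

```python
# Minimal sketch: PyTorch 2.x SDPA, which dispatches to a flash-attention or
# memory-efficient backend when the build provides one (backend availability
# on a particular CUDA/ROCm install is an assumption here).
import torch
import torch.nn.functional as F

# SDPA expects (batch, num_heads, seq_len, head_dim).
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

# is_causal=True applies a lower-triangular mask without materializing it.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```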