WaveSpeedAI / QuantumAttention
[WIP] Better (FP8) attention for Hopper
☆26 · Updated last month
Alternatives and similar repositories for QuantumAttention:
Users interested in QuantumAttention are comparing it to the libraries listed below.
- ☆65 · Updated 3 months ago
- (WIP) Parallel inference for black-forest-labs' FLUX model. ☆18 · Updated 4 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆104 · Updated this week
- FlexAttention w/ FlashAttention3 support ☆26 · Updated 5 months ago
- Triton kernels for Flux ☆20 · Updated 3 months ago
- PyTorch half-precision GEMM lib w/ fused optional bias + optional ReLU/GELU ☆57 · Updated 3 months ago
- ☆67 · Updated 2 months ago
- An auxiliary project analyzing the characteristics of KV in DiT attention. ☆28 · Updated 4 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆68 · Updated 9 months ago
- Writing FLUX in Triton ☆32 · Updated 6 months ago
- A CUDA kernel for NHWC GroupNorm in PyTorch ☆18 · Updated 4 months ago
- QuIP quantization ☆52 · Updated last year
- Quantized Attention on GPU ☆45 · Updated 4 months ago
- ☆23 · Updated last month
- Patch convolution to avoid large GPU memory usage of Conv2D ☆84 · Updated 2 months ago
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ☆23 · Updated last month
- A suite for parallel inference of Diffusion Transformers (DiTs) on multi-GPU clusters ☆44 · Updated 8 months ago
- Make Triton easier ☆47 · Updated 9 months ago
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference. ☆35 · Updated 3 weeks ago
- ☆46 · Updated last year
- PyTorch bindings for CUTLASS grouped GEMM. ☆77 · Updated 5 months ago
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ☆45 · Updated 8 months ago
- Triton implementation of bi-directional (non-causal) linear attention ☆44 · Updated last month
- ☆25 · Updated last year
- A parallelized VAE that avoids OOM for high-resolution image generation ☆57 · Updated 2 months ago
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆111 · Updated 3 months ago
- Repository for sparse fine-tuning of LLMs via a modified version of the MosaicML llmfoundry ☆40 · Updated last year
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆92 · Updated this week
- ☆70 · Updated 4 months ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆39 · Updated last year