chengzeyi / piflux
(WIP) Parallel inference for black-forest-labs' FLUX model.
☆18 · Updated 4 months ago
Alternatives and similar repositories for piflux:
Users interested in piflux are comparing it to the repositories listed below.
- An auxiliary project analyzing the characteristics of KV in DiT attention. ☆28 · Updated 3 months ago
- ☆64 · Updated 2 months ago
- [WIP] Better (FP8) attention for Hopper ☆26 · Updated last month
- A parallel VAE that avoids OOM for high-resolution image generation ☆57 · Updated 2 months ago
- Writing FLUX in Triton ☆32 · Updated 6 months ago
- PyTorch half-precision GEMM library with fused optional bias + optional ReLU/GELU ☆55 · Updated 3 months ago
- A suite for parallel inference of Diffusion Transformers (DiTs) on multi-GPU clusters ☆43 · Updated 8 months ago
- A CUDA kernel for NHWC GroupNorm for PyTorch ☆18 · Updated 4 months ago
- Context-parallel attention that accelerates DiT model inference with dynamic caching ☆228 · Updated this week
- ☆153 · Updated 2 months ago
- Faster parallel inference of the mochi-1 video generation model ☆112 · Updated last month
- Implementation of SmoothCache, a project aimed at speeding up Diffusion Transformer (DiT) based GenAI models with error-guided caching. ☆40 · Updated this week
- Patch convolution to avoid the large GPU memory usage of Conv2D ☆84 · Updated 2 months ago
- Quantized attention on GPU ☆45 · Updated 4 months ago
- [ECCV24] MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization ☆33 · Updated 3 months ago
- Official repository for the paper "VQDM: Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization" ☆33 · Updated 6 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆104 · Updated this week
- ☆130 · Updated this week
- ☆46 · Updated last year
- A lightweight and highly efficient training framework for accelerating diffusion tasks. ☆46 · Updated 6 months ago
- Implementation of the proposed MaskBit from Bytedance AI ☆75 · Updated 4 months ago
- [NeurIPS 2024] AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising ☆192 · Updated last month
- ☆53 · Updated 2 years ago
- LoRA fine-tuning directly on quantized models. ☆27 · Updated 4 months ago
- DeeperGEMM: crazy optimized version ☆61 · Updated last week
- ☆49 · Updated last year
- XAttention: Block Sparse Attention with Antidiagonal Scoring ☆102 · Updated this week
- SealAI's Stable Diffusion implementation ☆70 · Updated 3 months ago
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆110 · Updated 3 months ago
- [NeurIPS 2024] Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching ☆98 · Updated 8 months ago