xdit-project / DiTCacheAnalysis
An auxiliary project that analyzes the characteristics of KV (key/value) activations in DiT attention.
☆25 · Updated 2 months ago
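The snippet below is not the project's code; it is a minimal, hypothetical sketch of the kind of measurement the description suggests: how much the key/value activations of an attention layer change between adjacent diffusion steps. All names (`wk`, `wv`, `hidden`, `step_similarity`) are invented for illustration, and random tensors stand in for real DiT hidden states.

```python
# Hypothetical sketch (not DiTCacheAnalysis code): estimate how similar the K/V
# activations of one attention layer are across adjacent diffusion steps.
# High similarity would suggest the KV tensors are good caching candidates.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_steps, tokens, dim = 30, 256, 512
wk = torch.nn.Linear(dim, dim, bias=False)  # stand-in K projection
wv = torch.nn.Linear(dim, dim, bias=False)  # stand-in V projection

# Stand-in hidden states per diffusion step; a real run would hook a DiT block.
hidden = [torch.randn(tokens, dim) for _ in range(num_steps)]

def step_similarity(proj):
    """Mean cosine similarity of a projection's output between adjacent steps."""
    sims = []
    with torch.no_grad():
        prev = proj(hidden[0])
        for h in hidden[1:]:
            cur = proj(h)
            sims.append(F.cosine_similarity(prev.flatten(), cur.flatten(), dim=0).item())
            prev = cur
    return sum(sims) / len(sims)

print(f"K similarity across steps: {step_similarity(wk):.3f}")
print(f"V similarity across steps: {step_similarity(wv):.3f}")
```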
Alternatives and similar repositories for DiTCacheAnalysis:
Users interested in DiTCacheAnalysis are comparing it to the libraries listed below.
- Quantized Attention on GPU ☆34 · Updated 2 months ago
- A parallelized VAE that avoids OOM in high-resolution image generation ☆53 · Updated 3 weeks ago
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters ☆39 · Updated 6 months ago
- Patch convolution to avoid large GPU memory usage of Conv2D ☆84 · Updated 3 weeks ago
- ☆144 · Updated last month
- ☆68 · Updated this week
- A WebUI for Side-by-Side Comparison of Media (Images/Videos) Across Multiple Folders ☆19 · Updated 3 weeks ago
- (WIP) Parallel inference for black-forest-labs' FLUX model. ☆17 · Updated 3 months ago
- 16-fold memory access reduction with nearly no loss ☆76 · Updated 3 months ago
- PyTorch code for Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers ☆37 · Updated 5 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆64 · Updated 8 months ago
- ⚡️ Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs to achieve peak performance ⚡️ ☆52 · Updated 2 weeks ago
- ☆61 · Updated 3 weeks ago
- Transformer components, but in Triton ☆31 · Updated 3 months ago
- [ICLR'25] ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation ☆55 · Updated last week
- Accelerating Diffusion Transformers with Token-wise Feature Caching ☆62 · Updated 2 weeks ago
- IntLLaMA: A fast and light quantization solution for LLaMA ☆18 · Updated last year
- A sparse attention kernel supporting mixed sparse patterns ☆133 · Updated last week
- [NeurIPS 2024] Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching (see the caching sketch after this list) ☆92 · Updated 7 months ago
- ☆36 · Updated last month
- Triton implementation of bi-directional (non-causal) linear attention ☆41 · Updated 2 weeks ago
- Debug print operator for cudagraph debugging ☆10 · Updated 6 months ago
- torch_quantizer is an out-of-the-box quantization tool for PyTorch models on the CUDA backend, specially optimized for diffusion models. ☆21 · Updated 10 months ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆38 · Updated 11 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆89 · Updated this week
- ☆19 · Updated 4 months ago
- A CUDA kernel for NHWC GroupNorm for PyTorch ☆16 · Updated 3 months ago
- Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores in the decoding stage of LLM inference. ☆29 · Updated 3 months ago
- Benchmark tests supporting the TiledCUDA library. ☆15 · Updated 3 months ago
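Several entries above (Learning-to-Cache, token-wise feature caching) revolve around reusing features across diffusion timesteps. The sketch below is not taken from any of the listed repositories; it is a minimal, hypothetical illustration of the basic idea, using an invented `CachedBlock` wrapper that reuses a block's previous output when its input has barely changed.

```python
# Hypothetical sketch (not from any listed repository): reuse a block's output
# from the previous diffusion step when its input changed very little, the core
# idea behind layer/feature caching for DiTs. Intended for inference only.
import torch
import torch.nn.functional as F

class CachedBlock(torch.nn.Module):
    def __init__(self, block: torch.nn.Module, threshold: float = 0.999):
        super().__init__()
        self.block = block
        self.threshold = threshold  # cosine-similarity threshold for reuse
        self._last_input = None
        self._last_output = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self._last_input is not None:
            sim = F.cosine_similarity(x.flatten(), self._last_input.flatten(), dim=0)
            if sim > self.threshold:
                return self._last_output  # input barely changed: reuse cached output
        out = self.block(x)
        self._last_input, self._last_output = x.detach(), out.detach()
        return out

# Toy usage: wrap a stand-in block and feed slowly drifting inputs.
block = CachedBlock(torch.nn.Linear(64, 64))
x = torch.randn(16, 64)
for step in range(5):
    y = block(x + 1e-4 * step * torch.randn(16, 64))
```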