xdit-project / DiTCacheAnalysis
An auxiliary project analyzing the characteristics of KV (key/value) tensors in DiT attention.
☆15 · Updated this week
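The repository is described only at this level of detail here, but a minimal sketch of the kind of measurement such an analysis might involve is shown below: checking how much the key/value tensors of a DiT attention layer drift between adjacent diffusion timesteps, which is the redundancy a KV/feature cache would exploit. This is an illustrative assumption, not code from the repository; all names and shapes are made up for the example.

```python
import torch
import torch.nn.functional as F

def kv_similarity_across_steps(kv_per_step):
    """kv_per_step: list of (K, V) tensor pairs, one per diffusion timestep,
    each tensor shaped (batch, heads, tokens, head_dim). Returns the mean
    cosine similarity between consecutive timesteps for K and for V."""
    k_sims, v_sims = [], []
    for (k_prev, v_prev), (k_curr, v_curr) in zip(kv_per_step[:-1], kv_per_step[1:]):
        # Flatten (batch, heads, tokens) into one axis and compare along head_dim.
        k_sims.append(F.cosine_similarity(k_prev.flatten(0, 2), k_curr.flatten(0, 2), dim=-1).mean())
        v_sims.append(F.cosine_similarity(v_prev.flatten(0, 2), v_curr.flatten(0, 2), dim=-1).mean())
    return torch.stack(k_sims), torch.stack(v_sims)

# Stand-in activations (random tensors) just to show the call pattern.
steps = [(torch.randn(1, 8, 256, 64), torch.randn(1, 8, 256, 64)) for _ in range(4)]
k_sim, v_sim = kv_similarity_across_steps(steps)
print(k_sim)  # consistently high values would suggest KV tensors are cache-friendly across steps
print(v_sim)
```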
Related projects
Alternatives and complementary repositories for DiTCacheAnalysis
- Quantized Attention on GPU ☆30 · Updated 2 weeks ago
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters ☆32 · Updated 3 months ago
- Patch convolution to avoid large GPU memory usage of Conv2D ☆79 · Updated 5 months ago
- A parallel VAE that avoids OOM for high-resolution image generation ☆40 · Updated last month
- (WIP) Parallel inference for black-forest-labs' FLUX model. ☆11 · Updated this week
- Odysseus: Playground of LLM Sequence Parallelism ☆57 · Updated 5 months ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆34 · Updated 8 months ago
- Beyond KV Caching: Shared Attention for Efficient LLMs ☆13 · Updated 4 months ago
- Transformers components but in Triton ☆27 · Updated this week
- IntLLaMA: A fast and light quantization solution for LLaMA ☆18 · Updated last year
- PyTorch implementation of the ICML 2024 paper CaM: Cache Merging for Memory-efficient LLMs Inference ☆26 · Updated 5 months ago
- A lightweight and highly efficient training framework for accelerating diffusion tasks. ☆41 · Updated 2 months ago
- GPTQ inference TVM kernel ☆36 · Updated 6 months ago
- FlexAttention w/ FlashAttention3 Support ☆27 · Updated last month
- TensorRT LLM Benchmark Configuration ☆11 · Updated 3 months ago
- PyTorch code for Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers ☆34 · Updated 2 months ago
- Debug print operator for cudagraph debugging ☆10 · Updated 3 months ago
- Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024) ☆38 · Updated this week
- The official implementation of the paper SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction. ☆38 · Updated last month
- [NeurIPS 2024] Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching ☆75 · Updated 4 months ago
- [WIP] Context parallel attention that works with torch.compile ☆49 · Updated this week
- Activation-aware Singular Value Decomposition for Compressing Large Language Models ☆49 · Updated last month
- Elixir: Train a Large Language Model on a Small GPU Cluster ☆13 · Updated last year
- Official implementation of the ICML 2024 paper "ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking". ☆41 · Updated 4 months ago
- A sparse attention kernel supporting mixed sparse patterns ☆58 · Updated last month