DefTruth / Awesome-SD-Inference
A small curated list of Awesome SD/DiT/ViT/Diffusion Inference resources covering distributed inference, caching, and sampling: DistriFusion, PipeFusion, AsyncDiff, DeepCache, Block Caching, etc. (a minimal feature-caching sketch follows below).
★64 · Updated 2 weeks ago
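As a taste of the caching techniques this list covers (DeepCache, Block Caching), here is a minimal, hypothetical sketch of the core idea: deep U-Net features change slowly across adjacent denoising steps, so they can be cached and reused while only the shallow blocks are recomputed. The `shallow_blocks`, `deep_blocks`, and `head` callables are stand-ins for illustration, not any real library's API.

```python
# Hypothetical sketch of DeepCache-style feature caching: reuse the deep,
# slow-changing U-Net features for several consecutive denoising steps and
# recompute only the cheap shallow blocks. All three callables are stand-ins.
def cached_denoise(shallow_blocks, deep_blocks, head, latents, timesteps,
                   cache_interval=3):
    deep_cache = None
    for i, t in enumerate(timesteps):
        h = shallow_blocks(latents, t)             # always recomputed (cheap)
        if deep_cache is None or i % cache_interval == 0:
            deep_cache = deep_blocks(h, t)         # full forward; refresh cache
        latents = head(h, deep_cache, t)           # fuse shallow + cached deep features
    return latents
```

With `cache_interval=3`, roughly two thirds of the deep-block compute is skipped, at a small quality cost.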
Related projects:
- Patch convolution to avoid large GPU memory usage of Conv2D (★73, updated 3 months ago)
- A parallel VAE that avoids OOM for high-resolution image generation (★34, updated 2 months ago)
- xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters (★488, updated this week)
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5) (★173, updated 3 months ago)
- Since the emergence of ChatGPT in 2022, accelerating large language models has become increasingly important; here is a list of papers… (★153, updated last week)
- mllm-npu: training multimodal large language models on Ascend NPUs (★77, updated 3 weeks ago)
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference (★161, updated 2 months ago)
- An easy-to-use package for implementing SmoothQuant for LLMs (★78, updated 4 months ago); see the SmoothQuant sketch after this list
- Flash attention tutorials written in Python, Triton, CUDA, and CUTLASS (★159, updated 3 months ago)
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLMs (★134, updated 2 months ago); see the KV-cache sketch after this list
- A summary of awesome work on optimizing LLM inference (★26, updated this week)
- A collection of memory-efficient attention operators implemented in the Triton language (★205, updated 3 months ago)
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving (★258, updated 2 months ago)
- Compare hardware platforms via the Roofline Model for LLM inference tasks (★71, updated 6 months ago); see the roofline sketch after this list
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs (★156, updated this week)
- This repository contains integer operators on GPUs for PyTorch (★172, updated 11 months ago)
- [CVPR 2024 Highlight] DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models (★554, updated last month)
- The official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit…" (★227, updated this week)
- A PyTorch library for cost-effective, fast, and easy serving of MoE models (★90, updated last month)
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference (★106, updated 6 months ago)
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters (★25, updated last month)
- FP8 flash attention implemented with the CUTLASS library on the Ada architecture (★46, updated last month)
- Benchmarks of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios (★20, updated 2 weeks ago)
- Tutorials for writing high-performance GPU operators in AI frameworks (★118, updated last year)
- Odysseus: Playground of LLM Sequence Parallelism (★50, updated 3 months ago)
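The SmoothQuant item above refers to a scale-migration trick; the following is a minimal sketch of the idea under stated assumptions, not the package's actual API. A per-input-channel scale `s` moves quantization difficulty from activations to weights: since `(X / s) @ (s * W)` equals `X @ W`, dividing activation outliers down while folding `s` into the weights makes both tensors easier to quantize. `act_absmax` is assumed to come from a calibration pass.

```python
import torch
import torch.nn as nn

def smooth_linear(linear: nn.Linear, act_absmax: torch.Tensor, alpha: float = 0.5):
    """Fold SmoothQuant-style scales into a Linear layer (illustrative only)."""
    w_absmax = linear.weight.abs().amax(dim=0)           # per-input-channel weight range
    s = act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)  # alpha balances the migration
    s = s.clamp(min=1e-5)                                # avoid degenerate scales
    linear.weight.data *= s                              # weight := W * diag(s)
    return s                                             # caller divides activations by s
```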
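The GEAR entry combines quantization with a low-rank correction of the quantization error; below is a hedged sketch of that recipe on a single 2-D cache tensor, with hypothetical function names rather than the repo's actual interfaces (the real recipe also handles outliers, omitted here).

```python
import torch

def gear_compress(kv: torch.Tensor, bits: int = 4, rank: int = 2):
    """Quantize a KV-cache matrix and keep a rank-r SVD of the residual."""
    qmax = 2 ** bits - 1
    lo, hi = kv.min(), kv.max()
    scale = (hi - lo) / qmax
    q = ((kv - lo) / scale).round().clamp(0, qmax).to(torch.uint8)
    residual = kv - (q.float() * scale + lo)             # error quantization introduced
    U, S, Vh = torch.linalg.svd(residual, full_matrices=False)
    L, R = U[:, :rank] * S[:rank], Vh[:rank]             # rank-r residual factors
    return q, scale, lo, L, R

def gear_decompress(q, scale, lo, L, R):
    return q.float() * scale + lo + L @ R                # dequantize + low-rank correction
```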
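For the roofline-comparison entry, a worked example makes the model concrete. The roofline bounds attainable throughput by `min(peak_flops, peak_bandwidth * arithmetic_intensity)`; the decode numbers below (about 2 FLOPs and 2 bytes per parameter per token for a 7B fp16 model, A100-class peaks) are back-of-the-envelope assumptions, not measurements from that repo.

```python
def roofline(flops: float, bytes_moved: float, peak_flops: float, peak_bw: float):
    """Roofline bound: attainable FLOP/s and a runtime lower bound."""
    intensity = flops / bytes_moved                      # FLOPs per byte moved
    attainable = min(peak_flops, peak_bw * intensity)    # memory- vs compute-bound
    return intensity, attainable, flops / attainable

# Batch-1 decode of a 7B fp16 model: ~2 FLOPs and ~2 bytes per parameter per token.
i, perf, t = roofline(flops=2 * 7e9, bytes_moved=2 * 7e9,
                      peak_flops=312e12, peak_bw=2.0e12)
print(f"intensity={i:.1f} FLOP/B, bound={perf/1e12:.1f} TFLOP/s, per-token >= {t*1e3:.2f} ms")
```

At 1 FLOP/B, decode sits deep in the memory-bound regime, which is why weight quantization and KV-cache compression (several entries above) translate directly into latency wins.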