DefTruth / Awesome-SD-Inference
A small curated list of Awesome SD/DiT/ViT/Diffusion Inference resources covering distributed inference, caching, and sampling: DistriFusion, PipeFusion, AsyncDiff, DeepCache, Block Caching, etc. (a minimal feature-caching sketch follows below).
★64 · Updated 2 weeks ago
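As a taste of the caching techniques this list covers (DeepCache, Block Caching), here is a minimal, hypothetical sketch of the core idea: deep U-Net features change slowly across adjacent denoising steps, so they can be cached and reused while only the shallow blocks are recomputed. The `shallow_blocks`, `deep_blocks`, and `head` callables are stand-ins for illustration, not any real library's API.

```python
# Hypothetical sketch of DeepCache-style feature caching: reuse the deep,
# slow-changing U-Net features for several consecutive denoising steps and
# recompute only the cheap shallow blocks. All three callables are stand-ins.
def cached_denoise(shallow_blocks, deep_blocks, head, latents, timesteps,
                   cache_interval=3):
    deep_cache = None
    for i, t in enumerate(timesteps):
        h = shallow_blocks(latents, t)             # always recomputed (cheap)
        if deep_cache is None or i % cache_interval == 0:
            deep_cache = deep_blocks(h, t)         # full forward; refresh cache
        latents = head(h, deep_cache, t)           # fuse shallow + cached deep features
    return latents
```

With `cache_interval=3`, roughly two thirds of the deep-block compute is skipped, at a small quality cost.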
Related projects:
- Patch convolution to avoid large GPU memory usage of Conv2D (★73, updated 3 months ago)
- A parallel VAE that avoids OOM for high-resolution image generation (★34, updated 2 months ago)
- xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters (★488, updated this week)
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5) (★173, updated 3 months ago)
- Since the emergence of ChatGPT in 2022, accelerating large language models has become increasingly important; here is a list of papers… (★153, updated last week)
- mllm-npu: training multimodal large language models on Ascend NPUs (★77, updated 3 weeks ago)
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference (★161, updated 2 months ago)
- An easy-to-use package for implementing SmoothQuant for LLMs (★78, updated 4 months ago); see the SmoothQuant sketch after this list
- Flash attention tutorials written in Python, Triton, CUDA, and CUTLASS (★159, updated 3 months ago)
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLMs (★134, updated 2 months ago); see the KV-cache sketch after this list
- A summary of awesome work on optimizing LLM inference (★26, updated this week)
- A collection of memory-efficient attention operators implemented in the Triton language (★205, updated 3 months ago)
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving (★258, updated 2 months ago)
- Compare hardware platforms via the Roofline Model for LLM inference tasks (★71, updated 6 months ago); see the roofline sketch after this list
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs (★156, updated this week)
- This repository contains integer operators on GPUs for PyTorch (★172, updated 11 months ago)
- [CVPR 2024 Highlight] DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models (★554, updated last month)
- The official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit…" (★227, updated this week)
- A PyTorch library for cost-effective, fast, and easy serving of MoE models (★90, updated last month)
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference (★106, updated 6 months ago)
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters (★25, updated last month)
- FP8 flash attention implemented with the CUTLASS library on the Ada architecture (★46, updated last month)
- Benchmarks of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios (★20, updated 2 weeks ago)
- Tutorials for writing high-performance GPU operators in AI frameworks (★118, updated last year)
- Odysseus: Playground of LLM Sequence Parallelism (★50, updated 3 months ago)
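The SmoothQuant item above refers to a scale-migration trick; the following is a minimal sketch of the idea under stated assumptions, not the package's actual API. A per-input-channel scale `s` moves quantization difficulty from activations to weights: since `(X / s) @ (s * W)` equals `X @ W`, dividing activation outliers down while folding `s` into the weights makes both tensors easier to quantize. `act_absmax` is assumed to come from a calibration pass.

```python
import torch
import torch.nn as nn

def smooth_linear(linear: nn.Linear, act_absmax: torch.Tensor, alpha: float = 0.5):
    """Fold SmoothQuant-style scales into a Linear layer (illustrative only)."""
    w_absmax = linear.weight.abs().amax(dim=0)           # per-input-channel weight range
    s = act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)  # alpha balances the migration
    s = s.clamp(min=1e-5)                                # avoid degenerate scales
    linear.weight.data *= s                              # weight := W * diag(s)
    return s                                             # caller divides activations by s
```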
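The GEAR entry combines quantization with a low-rank correction of the quantization error; below is a hedged sketch of that recipe on a single 2-D cache tensor, with hypothetical function names rather than the repo's actual interfaces (the real recipe also handles outliers, omitted here).

```python
import torch

def gear_compress(kv: torch.Tensor, bits: int = 4, rank: int = 2):
    """Quantize a KV-cache matrix and keep a rank-r SVD of the residual."""
    qmax = 2 ** bits - 1
    lo, hi = kv.min(), kv.max()
    scale = (hi - lo) / qmax
    q = ((kv - lo) / scale).round().clamp(0, qmax).to(torch.uint8)
    residual = kv - (q.float() * scale + lo)             # error quantization introduced
    U, S, Vh = torch.linalg.svd(residual, full_matrices=False)
    L, R = U[:, :rank] * S[:rank], Vh[:rank]             # rank-r residual factors
    return q, scale, lo, L, R

def gear_decompress(q, scale, lo, L, R):
    return q.float() * scale + lo + L @ R                # dequantize + low-rank correction
```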
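For the roofline-comparison entry, a worked example makes the model concrete. The roofline bounds attainable throughput by `min(peak_flops, peak_bandwidth * arithmetic_intensity)`; the decode numbers below (about 2 FLOPs and 2 bytes per parameter per token for a 7B fp16 model, A100-class peaks) are back-of-the-envelope assumptions, not measurements from that repo.

```python
def roofline(flops: float, bytes_moved: float, peak_flops: float, peak_bw: float):
    """Roofline bound: attainable FLOP/s and a runtime lower bound."""
    intensity = flops / bytes_moved                      # FLOPs per byte moved
    attainable = min(peak_flops, peak_bw * intensity)    # memory- vs compute-bound
    return intensity, attainable, flops / attainable

# Batch-1 decode of a 7B fp16 model: ~2 FLOPs and ~2 bytes per parameter per token.
i, perf, t = roofline(flops=2 * 7e9, bytes_moved=2 * 7e9,
                      peak_flops=312e12, peak_bw=2.0e12)
print(f"intensity={i:.1f} FLOP/B, bound={perf/1e12:.1f} TFLOP/s, per-token >= {t*1e3:.2f} ms")
```

At 1 FLOP/B, decode sits deep in the memory-bound regime, which is why weight quantization and KV-cache compression (several entries above) translate directly into latency wins.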