jt-zhang / Sparse_SageAttention_API
☆20 · Updated this week
Alternatives and similar repositories for Sparse_SageAttention_API
Users interested in Sparse_SageAttention_API are comparing it to the libraries listed below.
- 🤗CacheDiT: A training-free and easy-to-use cache acceleration toolbox for Diffusion Transformers 🔥 ☆61 · Updated this week
- Code for Draft Attention ☆72 · Updated last month
- TVMScript kernel for deformable attention ☆25 · Updated 3 years ago
- CVFusion is an open-source deep learning compiler that fuses OpenCV operators. ☆29 · Updated 2 years ago
- Combining TeaCache with xDiT to accelerate visual generation models ☆25 · Updated 2 months ago
- [ECCV24] MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization ☆42 · Updated 6 months ago
- FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation [Efficient ML Model] ☆24 · Updated 3 weeks ago
- Quantized attention on GPU ☆44 · Updated 7 months ago
- A study of CUTLASS ☆21 · Updated 7 months ago
- ⚡️ HGEMM written from scratch with Tensor Cores using the WMMA, MMA, and CuTe APIs, achieving peak performance ☆80 · Updated last month
- Patch convolution to avoid the large GPU memory usage of Conv2D ☆88 · Updated 5 months ago
- ☆167 · Updated 5 months ago
- IntLLaMA: A fast and lightweight quantization solution for LLaMA ☆18 · Updated last year
- ☆48 · Updated last month
- A parallel VAE that avoids OOM for high-resolution image generation ☆64 · Updated 5 months ago
- ☆29 · Updated 4 months ago
- ☆21 · Updated 2 weeks ago
- ☆60 · Updated 2 months ago
- Implementation of SmoothCache, a project aimed at speeding up Diffusion Transformer (DiT) based GenAI models with error-guided caching ☆44 · Updated 3 months ago
- Step-by-step SGEMM optimization with CUDA ☆19 · Updated last year
- ☆96 · Updated 9 months ago
- Pioneering the training of long-context multi-modal transformer models ☆40 · Updated last week
- Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend ☆54 · Updated this week
- FP8 flash attention for the Ada architecture, implemented with the cutlass repository ☆71 · Updated 10 months ago
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆92 · Updated 3 weeks ago
- DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling ☆16 · Updated 2 weeks ago
- Awesome code, projects, books, etc. related to CUDA ☆17 · Updated last week
- An auxiliary project analyzing the characteristics of KV in DiT attention ☆31 · Updated 6 months ago
- ☆77 · Updated last month
- ☆75 · Updated 5 months ago