aredden / torch-cublas-hgemmLinks

PyTorch half precision gemm lib w/ fused optional bias + optional relu/gelu

☆72

Alternatives and similar repositories for torch-cublas-hgemm

Users that are interested in torch-cublas-hgemm are comparing it to the libraries listed below

Sorting:

WaveSpeedAI / QuantumAttention
[WIP] Better (FP8) attention for Hopper
☆32Updated 5 months ago
sekstini / gpupoor
☆17Updated 8 months ago
chengzeyi / piflux
(WIP) Parallel inference for black-forest-labs' FLUX model.
☆19Updated 8 months ago
KONAKONA666 / q8_kernels
☆73Updated 7 months ago
sandyresearch / chipmunk
🎬 3.7× faster video generation E2E 🖼️ 1.6× faster image generation E2E ⚡ ColumnSparseAttn 9.3× vs FlashAttn‑3 💨 ColumnSparseGEMM 2.5× …
☆78Updated last month
chengzeyi / ParaAttention
https://wavespeed.ai/ Context parallel attention that accelerates DiT model inference with dynamic caching
☆350Updated last month
huggingface / flux-fast
Making Flux go brrr on GPUs.
☆124Updated 3 weeks ago
timudk / flux_triton
Writing FLUX in Triton
☆38Updated 10 months ago
sayakpaul / diffusers-torchao
End-to-end recipes for optimizing diffusion models with torchao and diffusers (inference and FP8 training).
☆372Updated 2 months ago
aredden / flux-fp8-api
Flux diffusion model implementation using quantized fp8 matmul & remaining layers use faster half precision accumulate, which is ~2x fast…
☆273Updated 9 months ago
nunchaku-tech / deepcompressor
Model Compression Toolbox for Large Language Models and Diffusion Models
☆578Updated 4 months ago
ethansmith2000 / ImprovedTokenMerge
☆49Updated last year
huggingface / diffusion-fast
Faster generation with text-to-image diffusion models.
☆224Updated last month
yangluo7 / CAME
[ACL 2023] The official implementation of "CAME: Confidence-guided Adaptive Memory Optimization"
☆92Updated 4 months ago
jt-zhang / Sparse_SageAttention_API
☆43Updated 3 weeks ago
sekstini / basedxl
☆18Updated last year
Stability-AI / sd3-ref
☆130Updated last year
NovelAI / t5
Model code for inferencing T5
☆65Updated 4 months ago
ai-compiler-study / triton-kernels
Triton kernels for Flux
☆21Updated last month
xdit-project / mochi-xdit
faster parallel inference of mochi-1 video generation model
☆124Updated 5 months ago
snap-research / BitsFusion
☆126Updated 5 months ago
neuralmagic / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆142Updated this week
Repeerc / flash-attention-v2-RDNA3-minimal
a simple Flash Attention v2 implementation with ROCM (RDNA3 GPU, roc wmma), mainly used for stable diffusion(ComfyUI) in Windows ZLUDA en…
☆44Updated 11 months ago
SwayStar123 / microdiffusion
☆47Updated 5 months ago
mag- / gpu_benchmark
Gpu benchmark
☆65Updated 6 months ago
xdit-project / DistVAE
A parallelism VAE avoids OOM for high resolution image generation
☆70Updated this week
czg1225 / AsyncDiff
[NeurIPS 2024] AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
☆202Updated 5 months ago
chu-tianxiang / QuIP-for-all
QuIP quantization
☆55Updated last year
mit-han-lab / radial-attention
Radial Attention Official Implementation
☆470Updated this week
latentCall145 / channels-last-groupnorm
A CUDA kernel for NHWC GroupNorm for PyTorch
☆19Updated 8 months ago