Karbo123 / pytorch_grouped_gemmLinks

High Performance Grouped GEMM in PyTorch

☆31

Alternatives and similar repositories for pytorch_grouped_gemm

Users that are interested in pytorch_grouped_gemm are comparing it to the libraries listed below

Sorting:

ColfaxResearch / cutlass-kernels
☆241Updated last year
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆96Updated last month
AlibabaPAI / FLASHNN
☆100Updated last year
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆125Updated 4 months ago
thunlp / Seq1F1B
Sequence-level 1F1B schedule for LLMs.
☆32Updated 2 months ago
yifuwang / symm-mem-recipes
☆141Updated 10 months ago
nox-410 / tvm.tl
An extention of TVMScript to write simple and high performance GPU kernels with tensorcore.
☆51Updated last year
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆186Updated 9 months ago
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆181Updated 2 weeks ago
NVIDIA / online-softmax
Benchmark code for the "Online normalizer calculation for softmax" paper
☆102Updated 7 years ago
Victarry / PP-Schedule-Visualization
Pipeline Parallelism Emulation and Visualization
☆69Updated 4 months ago
CalebDu / Awesome-Cute
☆107Updated 5 months ago
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆139Updated last month
microsoft / nnscaler
nnScaler: Compiling DNN models for Parallel Training
☆117Updated last month
weishengying / cutlass_flash_atten_fp8
使用 cutlass 仓库在 ada 架构上实现 fp8 的 flash attention
☆77Updated last year
flashinfer-ai / cubloaty
a size profiler for cuda binary
☆51Updated 3 weeks ago
zhuohan123 / terapipe
☆75Updated 4 years ago
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆112Updated last year
triton-lang / kernels
☆93Updated 11 months ago
leimao / CUTLASS-Examples
CUTLASS and CuTe Examples
☆93Updated last week
parasailteam / coconet
☆83Updated 2 years ago
ParCIS / Chimera
Chimera: bidirectional pipeline parallelism for efficiently training large-scale models.
☆67Updated 7 months ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆124Updated 5 months ago
alibaba / easydist
Automated Parallelization System and Infrastructure for Multiple Ecosystems
☆80Updated 11 months ago
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆41Updated 8 months ago
ColfaxResearch / cfx-article-src
☆151Updated 5 months ago
usyd-fsalab / fp6_llm
An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6,FP5).
☆266Updated 3 months ago
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆96Updated 10 months ago
facebookexperimental / triton
Github mirror of trition-lang/triton repo.
☆92Updated this week
thu-pacman / PET
PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections
☆122Updated 3 years ago