UDC-GAC / venom

A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores

☆52

Alternatives and similar repositories for venom

Users who are interested in venom are comparing it to the libraries listed below.

microsoft / SparTA
☆150 Updated last year
ParCIS / Magicube
Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) of deep learning on Tensor Cores.
☆89 Updated 2 years ago
uwsampl / SparseTIR
SparseTIR: Sparse Tensor Compiler for Deep Learning
☆137 Updated 2 years ago
humuyan / Korch
ASPLOS'24: Optimal Kernel Orchestration for Tensor Programs with Korch
☆38 Updated 4 months ago
sunlex0717 / DissectingTensorCores
☆106 Updated last year
LeiWang1999 / tvm_gpu_gemm
play gemm with tvm
☆91 Updated 2 years ago
sjfeng1999 / gpu-arch-microbenchmark
Dissecting NVIDIA GPU Architecture
☆103 Updated 3 years ago
HPMLL / DTC-SpMM_ASPLOS24
☆33 Updated last year
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆154 Updated last month
UofT-EcoSystem / DietCode
DietCode Code Release
☆64 Updated 3 years ago
apuaaChen / EVT_AE
Artifacts of EVT ASPLOS'24
☆26 Updated last year
xxyux / SpInfer
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆50 Updated 4 months ago
pku-liang / AMOS
Automatic Mapping Generation, Verification, and Exploration for ISA-based Spatial Accelerators
☆114 Updated 2 years ago
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆183 Updated 6 months ago
nicolaswilde / cuda-tensorcore-hgemm
☆149 Updated 7 months ago
ColfaxResearch / cfx-article-src
☆127 Updated 2 months ago
dgSPARSE / dgSPARSE-Lib
PyTorch-Based Fast and Efficient Processing for Various Machine Learning Applications with Diverse Sparsity
☆114 Updated 2 weeks ago
hgyhungry / ShflBW_Sparse_NN
☆16 Updated 2 years ago
CalebDu / Awesome-Cute
☆89 Updated 2 months ago
DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆214 Updated last month
yifuwang / symm-mem-recipes
☆101 Updated 7 months ago
DD-DuDa / BitDecoding
A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.
☆56 Updated last week
apuaaChen / vectorSparse
☆32 Updated 2 years ago
abhibambhaniya / GenZ-LLM-Analyzer
LLM Inference analyzer for different hardware platforms
☆80 Updated 3 weeks ago
parasailteam / coconet
☆80 Updated 2 years ago
KnowingNothing / MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
☆369 Updated 10 months ago
SJTU-ReArch-Group / Paper-Reading-List
☆117 Updated last week
lenLRX / AmpereSparseMatmul
Study of Ampere's Sparse Matmul
☆18 Updated 4 years ago
microsoft / microxcaling
PyTorch emulation library for Microscaling (MX)-compatible data formats
☆262 Updated last month
wzsh / wmma_tensorcore_sample
Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)
☆138 Updated 4 years ago