sandeepkumar-skb / pytorch_custom_op

End-to-end steps for adding custom ops in PyTorch.

☆23

Alternatives and similar repositories for pytorch_custom_op

Users who are interested in pytorch_custom_op are comparing it to the libraries listed below.

sunlex0717 / DissectingTensorCores
☆108 Updated last year
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆98 Updated 3 months ago
wmmae / wmma_extension
An extension library for the WMMA API (Tensor Core API)
☆106 Updated last year
yalue / cuda_scheduling_examiner_mirror
A tool for examining GPU scheduling behavior.
☆88 Updated last year
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆186 Updated 8 months ago
parasailteam / coconet
☆83 Updated 2 years ago
HPMLL / NVIDIA-Hopper-Benchmark
☆57 Updated 4 months ago
nox-410 / tvm.tl
An extension of TVMScript for writing simple, high-performance GPU kernels with Tensor Cores.
☆51 Updated last year
osayamenja / FlashMoE
Distributed MoE in a Single Kernel [NeurIPS '25]
☆85 Updated 2 weeks ago
xlite-dev / HGEMM
⚡️ Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, and achieve peak ⚡️ performance.
☆120 Updated 5 months ago
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆138 Updated last month
yifuwang / symm-mem-recipes
☆124 Updated 9 months ago
wzsh / wmma_tensorcore_sample
Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)
☆143 Updated 5 years ago
uwsampl / SparseTIR
SparseTIR: Sparse Tensor Compiler for Deep Learning
☆138 Updated 2 years ago
facebookexperimental / triton
GitHub mirror of the triton-lang/triton repo.
☆84 Updated last week
triton-lang / kernels
☆92 Updated 11 months ago
triton-lang / triton-cpu
An experimental CPU backend for Triton
☆153 Updated this week
lixiuhong / batched_gemm
☆39 Updated 5 years ago
toyaix / triton-runner
Multi-Level Triton Runner supporting Python, IR, PTX, and cubin.
☆72 Updated last week
lenLRX / AmpereSparseMatmul
Study of Ampere's Sparse Matmul
☆18 Updated 4 years ago
ColfaxResearch / cfx-article-src
☆148 Updated 5 months ago
eniac / paella
Paella: Low-latency Model Serving with Virtualized GPU Scheduling
☆62 Updated last year
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆72 Updated 5 months ago
ParCIS / Magicube
Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) of deep learning on Tensor Cores.
☆89 Updated 2 years ago
apuaaChen / EVT_AE
Artifacts of EVT ASPLOS'24
☆26 Updated last year
UDC-GAC / venom
A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores
☆53 Updated last year
apache / tvm-ffi
TVM FFI
☆67 Updated last week
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆94 Updated last month
UofT-EcoSystem / hotline
☆32 Updated 2 years ago
ademeure / cuda-side-boost
☆45 Updated 5 months ago