manishucsd / py-codegenLinks

☆16

Alternatives and similar repositories for py-codegen

Users that are interested in py-codegen are comparing it to the libraries listed below

Sorting:

microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆93Updated last month
ademeure / cuda-side-boost
☆41Updated 3 months ago
ROCm / aotriton
Ahead of Time (AOT) Triton Math Library
☆74Updated last week
pytorch-labs / helion
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
☆200Updated this week
iree-org / iree-nvgpu
☆50Updated last year
cchan / tccl
extensible collectives library in triton
☆88Updated 4 months ago
openxla / shardy
MLIR-based partitioning system
☆115Updated this week
makslevental / nelli
A lightweight, Pythonic, frontend for MLIR
☆80Updated last year
Jokeren / triton-samples
☆28Updated 6 months ago
parasj / checkmate
Training neural networks in TensorFlow 2.0 with 5x less memory
☆132Updated 3 years ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline.
☆113Updated last year
google-research / sputnik
A library of GPU kernels for sparse matrix operations.
☆270Updated 4 years ago
triton-lang / kernels
☆85Updated 9 months ago
YulhwaKim / cutlass_tilesparse
CUDA templates for tile-sparse matrix multiplication based on CUTLASS.
☆51Updated 7 years ago
pytorch-labs / triton-cpu
An experimental CPU backend for Triton (https//github.com/openai/triton)
☆43Updated 4 months ago
hpcgarage / cuASR
cuASR: CUDA Algebra for Semirings
☆36Updated 2 years ago
YashasSamaga / ConvolutionBuildingBlocks
GEMM and Winograd based convolutions using CUTLASS
☆26Updated 5 years ago
pytorch-labs / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆199Updated this week
facebookresearch / fairring
Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large …
☆65Updated 3 years ago
uwsampl / SparseTIR
SparseTIR: Sparse Tensor Compiler for Deep Learning
☆137Updated 2 years ago
wmmae / wmma_extension
An extension library of WMMA API (Tensor Core API)
☆99Updated last year
NVIDIA / jaxpp
JaxPP is a library for JAX that enables flexible MPMD pipeline parallelism for large-scale LLM training
☆52Updated 3 weeks ago
roastduck / FreeTensor
A language and compiler for irregular tensor programs.
☆149Updated 8 months ago
pytorch-labs / tritonparse
TritonParse: A Compiler Tracer, Visualizer, and mini-Reproducer(WIP) for Triton Kernels
☆138Updated this week
mmperf / mmperf
MatMul Performance Benchmarks for a Single CPU Core comparing both hand engineered and codegen kernels.
☆134Updated last year
spcl / daceml
A Data-Centric Compiler for Machine Learning
☆84Updated last year
enp1s0 / ozIMMU
FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
☆77Updated 4 months ago
jansel / pytorch-jit-paritybench
☆40Updated 7 months ago
spcl / sten
Sparsity support for PyTorch
☆35Updated 4 months ago
Jokeren / GPA
GPU Performance Advisor
☆65Updated 3 years ago