facebookexperimental / triton

GitHub mirror of the triton-lang/triton repo.

☆48

Alternatives and similar repositories for triton

Users who are interested in triton are comparing it to the libraries listed below.

ColfaxResearch / cfx-article-src
☆127 Updated 2 months ago
yifuwang / symm-mem-recipes
☆102 Updated 7 months ago
ColfaxResearch / cutlass-kernels
☆227 Updated last year
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆154 Updated last month
microsoft / SparTA
☆150 Updated last year
parasailteam / coconet
☆80 Updated 2 years ago
nox-410 / tvm.tl
An extension of TVMScript to write simple and high performance GPU kernels with tensorcore.
☆50 Updated last year
CalebDu / Awesome-Cute
☆89 Updated 2 months ago
KnowingNothing / MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
☆369 Updated 10 months ago
osayamenja / Kleos
Complete GPU residency for ML.
☆37 Updated last week
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆183 Updated 6 months ago
microsoft / triton-shared
Shared Middle-Layer for Triton Compilation
☆260 Updated this week
microsoft / nnscaler
nnScaler: Compiling DNN models for Parallel Training
☆114 Updated this week
ParCIS / Chimera
Chimera: bidirectional pipeline parallelism for efficiently training large-scale models.
☆67 Updated 4 months ago
zhuohan123 / terapipe
☆75 Updated 4 years ago
zhaiyi000 / tlm
☆42 Updated last year
ParCIS / Magicube
Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) of deep learning on Tensor Cores.
☆89 Updated 2 years ago
AlibabaPAI / FLASHNN
☆96 Updated 10 months ago
triton-lang / kernels
☆85 Updated 8 months ago
sunlex0717 / DissectingTensorCores
☆106 Updated last year
thu-pacman / PET
PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections
☆122 Updated 3 years ago
gty111 / GEMM_MMA
Optimize GEMM with tensorcore step by step
☆31 Updated last year
alibaba / easydist
Automated Parallelization System and Infrastructure for Multiple Ecosystems
☆79 Updated 8 months ago
apuaaChen / vectorSparse
☆32 Updated 2 years ago
DD-DuDa / BitDecoding
A GPU-optimized system for efficient long-context LLM decoding with low-bit KV cache.
☆56 Updated last week
apuaaChen / EVT_AE
Artifacts of EVT ASPLOS'24
☆26 Updated last year
DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆214 Updated last month
ConnollyLeon / awesome-Auto-Parallelism
A baseline repository of Auto-Parallelism in Training Neural Networks
☆144 Updated 3 years ago
reed-lau / cute-gemm
☆128 Updated 7 months ago
UofT-EcoSystem / DietCode
DietCode Code Release
☆64 Updated 3 years ago