ptillet / triton-llvm-releases
☆22 · Updated 2 years ago
Alternatives and similar repositories for triton-llvm-releases
Users interested in triton-llvm-releases are comparing it to the libraries listed below.
- Benchmark tests supporting the TiledCUDA library. ☆18 · Updated last year
- FlexAttention w/ FlashAttention3 support ☆27 · Updated last year
- Inference framework for MoE layers based on TensorRT with Python bindings ☆41 · Updated 4 years ago
- ☆50 · Updated last year
- CUDA 12.2 HMM demos ☆20 · Updated last year
- ☆16 · Updated last year
- GPTQ inference TVM kernel ☆41 · Updated last year
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ☆27 · Updated this week
- Framework that reduces autotune overhead to zero for well-known deployments. ☆91 · Updated 3 months ago
- TORCH_LOGS parser for PT2 ☆70 · Updated last month
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆104 · Updated 5 months ago
- Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large … ☆65 · Updated 3 years ago
- Prototype routines for GPU quantization written using PyTorch. ☆21 · Updated 4 months ago
- ☆22 · Updated 2 years ago
- ☆71 · Updated 9 months ago
- torch.compile artifacts for common deep learning models; can be used as a learning resource for torch.compile. ☆18 · Updated 2 years ago
- Standalone Flash Attention v2 kernel without a libtorch dependency ☆112 · Updated last year
- No-GIL Python environment featuring NVIDIA deep learning libraries. ☆69 · Updated 8 months ago
- PyTorch implementation of the Flash Spectral Transform Unit. ☆21 · Updated last year
- Customized matrix multiplication kernels ☆57 · Updated 3 years ago
- Ahead-of-Time (AOT) Triton math library ☆84 · Updated 2 weeks ago
- A tracing JIT for PyTorch ☆17 · Updated 3 years ago
- IntLLaMA: a fast and light quantization solution for LLaMA ☆18 · Updated 2 years ago
- High-speed GEMV kernels, up to 2.7x speedup over the PyTorch baseline ☆123 · Updated last year
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ☆73 · Updated last year
- ☆23 · Updated 7 months ago
- ☆99 · Updated last year
- Extensible collectives library in Triton ☆91 · Updated 8 months ago
- A Python library that transfers PyTorch tensors between CPU and NVMe ☆123 · Updated last year
- Make Triton easier ☆49 · Updated last year