IBM / triton-dejavu

Framework to reduce autotune overhead to zero for well-known deployments.

☆88

Alternatives and similar repositories for triton-dejavu

Users who are interested in triton-dejavu compare it to the libraries listed below.

microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆102 Updated 5 months ago
flashinfer-ai / cutlass-viz
☆65 Updated 7 months ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆73 Updated 7 months ago
triton-lang / kernels
☆94 Updated last year
microsoft / AttentionEngine
☆113 Updated 6 months ago
tile-ai / AttentionEngine
☆51 Updated 6 months ago
feifeibear / ChituAttention
Quantized Attention on GPU
☆44 Updated last year
tile-ai / TileOPs
☆60 Updated 2 weeks ago
meta-pytorch / KernelAgent
Autonomous GPU Kernel Generation via Deep Agents
☆172 Updated this week
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆40 Updated last year
INT-FlashAttention2024 / INT-FlashAttention
☆83 Updated 10 months ago
KuangjuX / AttnLink
An experimental communicating attention kernel based on DeepEP.
☆35 Updated 4 months ago
cchan / tccl
Extensible collectives library in Triton
☆91 Updated 8 months ago
flagos-ai / libtriton_jit
A Triton JIT runtime and ffi provider in C++
☆29 Updated this week
PipeFusion / PipeFusion
A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters
☆52 Updated last year
HanGuo97 / hilt
☆39 Updated last week
wangsiping97 / FastGEMV
High-speed GEMV kernels, with up to 2.7x speedup compared to the PyTorch baseline.
☆123 Updated last year
ByteDance-Seed / cudaLLM
☆125 Updated 3 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆132 Updated 6 months ago
flashinfer-ai / debug-print
Debug print operator for cudagraph debugging
☆14 Updated last year
meta-pytorch / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆297 Updated this week
LeiWang1999 / Stream-k.tvm
☆19 Updated last year
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆86 Updated last year
ademeure / cuda-side-boost
☆52 Updated 7 months ago
toyaix / triton-runner
Multi-Level Triton Runner supporting Python, IR, PTX, and cubin.
☆76 Updated last week
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆190 Updated 10 months ago
thunlp / TritonBench
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators
☆98 Updated 5 months ago
tile-ai / tilescale
Tile-based language built for AI computation across all scales
☆82 Updated this week
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak⚡️ performance.
☆134 Updated 6 months ago
NVIDIA / online-softmax
Benchmark code for the "Online normalizer calculation for softmax" paper
☆102 Updated 7 years ago