mayank31398 / cute-kernels

A bunch of kernels that might make stuff slower 😉

☆40

Alternatives and similar repositories for cute-kernels

Users interested in cute-kernels are comparing it to the libraries listed below.

pytorch-labs / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆124 Updated this week
cchan / tccl
extensible collectives library in triton
☆86 Updated last month
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆122 Updated last month
triton-lang / kernels
☆79 Updated 6 months ago
pytorch-labs / helion
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
☆132 Updated this week
pytorch-labs / applied-ai
Applied AI experiments and examples for PyTorch
☆265 Updated 2 weeks ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline.
☆109 Updated 10 months ago
gpu-mode / ring-attention
ring-attention experiments
☆142 Updated 7 months ago
Deep-Learning-Profiling-Tools / triton-viz
☆204 Updated 3 weeks ago
gpu-mode / triton-index
Cataloging released Triton kernels.
☆221 Updated 4 months ago
simveit / effective_transpose
Effective transpose on Hopper GPU
☆18 Updated 2 weeks ago
stanford-futuredata / stk
☆104 Updated 8 months ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆73 Updated 8 months ago
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well-known deployments.
☆70 Updated this week
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆85 Updated this week
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆89 Updated 2 weeks ago
tspeterkim / paged-attention-minimal
a minimal cache manager for PagedAttention, on top of llama3.
☆87 Updated 8 months ago
mobiusml / gemlite
Fast low-bit matmul kernels in Triton
☆299 Updated this week
INT-FlashAttention2024 / INT-FlashAttention
☆70 Updated 3 months ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆69 Updated last week
ColfaxResearch / cutlass-kernels
☆202 Updated 10 months ago
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆214 Updated 5 months ago
jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆158 Updated last year
yifuwang / symm-mem-recipes
☆76 Updated 4 months ago
pytorch-labs / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆224 Updated 9 months ago
MekkCyber / CutlassAcademy
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
☆173 Updated last week
deepspeedai / DeepSpeed-Kernels
☆69 Updated last month
microsoft / AttentionEngine
☆70 Updated last week
graphcore-research / out-of-the-box-fp8-training
Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8.
☆45 Updated 10 months ago
ROCm / aotriton
Ahead of Time (AOT) Triton Math Library
☆63 Updated this week