open-lm-engine / cute-kernels

A bunch of kernels that might make stuff slower 😉

☆54

Alternatives and similar repositories for cute-kernels

Users who are interested in cute-kernels are comparing it to the libraries listed below.

cchan / tccl
Extensible collectives library in Triton
☆87 Updated 3 months ago
gpu-mode / ring-attention
ring-attention experiments
☆144 Updated 8 months ago
Dao-AILab / quack
A Quirky Assortment of CuTe Kernels
☆281 Updated this week
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆136 Updated 3 months ago
triton-lang / kernels
☆83 Updated 8 months ago
stanford-futuredata / stk
☆106 Updated 10 months ago
pytorch-labs / helion
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
☆187 Updated this week
Deep-Learning-Profiling-Tools / triton-viz
☆225 Updated this week
gpu-mode / triton-index
Cataloging released Triton kernels.
☆242 Updated 6 months ago
mobiusml / gemlite
Fast low-bit matmul kernels in Triton
☆330 Updated this week
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆80 Updated 10 months ago
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆79 Updated last week
pytorch-labs / applied-ai
Applied AI experiments and examples for PyTorch
☆281 Updated last month
pytorch-labs / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆184 Updated this week
dame-cell / Triformer
Transformer components, but in Triton
☆34 Updated 2 months ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline.
☆112 Updated last year
srush / triton-autodiff
Experiment in using Tangent to autodiff Triton
☆79 Updated last year
pytorch-labs / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆224 Updated 11 months ago
andylolu2 / simpleGEMM
The simplest but fast implementation of matrix multiplication in CUDA.
☆37 Updated 11 months ago
Jokeren / triton-samples
☆28 Updated 5 months ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆90 Updated 2 weeks ago
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆224 Updated 7 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆101 Updated last month
MekkCyber / CutlassAcademy
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
☆195 Updated 2 months ago
huggingface / kernels
Load compute kernels from the Hub
☆203 Updated this week
jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆157 Updated last year
tspeterkim / paged-attention-minimal
A minimal cache manager for PagedAttention, built on top of llama3.
☆93 Updated 10 months ago
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆138 Updated last month
neuralmagic / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆135 Updated this week
Dao-AILab / gemm-cublas
☆21 Updated 2 months ago