andylolu2 / simpleGEMM
The simplest, yet still fast, implementation of matrix multiplication in CUDA.
☆34 · Updated 6 months ago
Alternatives and similar repositories for simpleGEMM:
Users who are interested in simpleGEMM are comparing it to the libraries listed below.
- Cataloging released Triton kernels. ☆164 · Updated last month
- extensible collectives library in triton ☆82 · Updated 4 months ago
- Fastest kernels written from scratch ☆139 · Updated 2 months ago
- ☆175 · Updated this week
- Fast low-bit matmul kernels in Triton ☆231 · Updated this week
- This repository contains the experimental PyTorch native float8 training UX ☆221 · Updated 6 months ago
- High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline. ☆97 · Updated 7 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆86 · Updated this week
- ☆67 · Updated 3 months ago
- ☆180 · Updated 7 months ago
- Fast Hadamard transform in CUDA, with a PyTorch interface ☆141 · Updated 8 months ago
- ☆42 · Updated last month
- ☆99 · Updated 5 months ago
- ☆157 · Updated last year
- Accelerated First Order Parallel Associative Scan ☆171 · Updated 5 months ago
- Applied AI experiments and examples for PyTorch ☆223 · Updated this week
- ring-attention experiments ☆123 · Updated 3 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆228 · Updated this week
- Collection of kernels written in Triton language ☆97 · Updated this week
- Triton-based implementation of Sparse Mixture of Experts. ☆196 · Updated 2 months ago
- KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems ☆166 · Updated this week
- ☆159 · Updated 7 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆64 · Updated 5 months ago
- Benchmark code for the "Online normalizer calculation for softmax" paper ☆66 · Updated 6 years ago
- FlashRNN - Fast RNN Kernels with I/O Awareness ☆75 · Updated 2 months ago
- Experiment of using Tangent to autodiff triton ☆75 · Updated last year
- PyTorch bindings for CUTLASS grouped GEMM. ☆64 · Updated 3 months ago
- CUDA Matrix Multiplication Optimization ☆159 · Updated 6 months ago
- Framework to reduce autotune overhead to zero for well known deployments. ☆61 · Updated 2 weeks ago