andylolu2 / simpleGEMM
The simplest but fast implementation of matrix multiplication in CUDA.
☆34 Updated 9 months ago
Alternatives and similar repositories for simpleGEMM:
Users who are interested in simpleGEMM are comparing it to the repositories listed below.
- ☆202 Updated 9 months ago
- This repository contains the experimental PyTorch native float8 training UX ☆224 Updated 9 months ago
- Collection of kernels written in Triton language ☆120 Updated last month
- Cataloging released Triton kernels. ☆220 Updated 3 months ago
- Fast Hadamard transform in CUDA, with a PyTorch interface ☆183 Updated 11 months ago
- A bunch of kernels that might make stuff slower 😉 ☆40 Updated this week
- Fast low-bit matmul kernels in Triton ☆295 Updated this week
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆120 Updated this week
- ☆157 Updated last year
- ☆202 Updated last week
- extensible collectives library in triton ☆85 Updated last month
- High-speed GEMV kernels, up to 2.7x speedup compared to the PyTorch baseline. ☆106 Updated 9 months ago
- Fastest kernels written from scratch ☆252 Updated last month
- ☆78 Updated 5 months ago
- Applied AI experiments and examples for PyTorch ☆262 Updated last week
- ☆70 Updated 4 months ago
- Accelerated First Order Parallel Associative Scan ☆182 Updated 8 months ago
- ☆104 Updated 8 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆87 Updated last week
- ring-attention experiments ☆132 Updated 6 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆84 Updated this week
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆169 Updated last month
- Triton-based implementation of Sparse Mixture of Experts. ☆212 Updated 5 months ago
- ☆165 Updated 10 months ago
- ☆102 Updated last month
- Benchmark code for the "Online normalizer calculation for softmax" paper ☆91 Updated 6 years ago
- CUDA Matrix Multiplication Optimization ☆184 Updated 9 months ago
- A library for unit scaling in PyTorch ☆125 Updated 5 months ago
- High-Performance SGEMM on CUDA devices ☆90 Updated 3 months ago
- DeeperGEMM: crazy optimized version ☆68 Updated this week