SzymonOzog / FastSoftmax
☆28 · Updated 2 months ago
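The repository name suggests it implements optimized softmax GPU kernels. As a reference point for what such kernels compute, here is a minimal numerically stable softmax in NumPy (a baseline sketch for comparison only — the function name and NumPy implementation are illustrative, not taken from the FastSoftmax repo):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax: subtract the row max before
    exponentiating so np.exp never overflows for large inputs."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=axis, keepdims=True)

# Rows sum to 1 even for inputs that would overflow a naive exp(x)/sum(exp(x)).
x = np.array([[1000.0, 1001.0, 1002.0]])
print(softmax(x))
```

The max-subtraction trick is the standard safeguard that fast GPU softmax implementations also apply, typically fused with the exponentiation and sum reductions in a single kernel pass.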
Alternatives and similar repositories for FastSoftmax:
Users interested in FastSoftmax are comparing it to the repositories listed below.
- ☆42 · Updated 2 weeks ago
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆127 · Updated last year
- High-Performance SGEMM on CUDA devices ☆86 · Updated 2 months ago
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆153 · Updated this week
- Collection of kernels written in Triton language ☆114 · Updated last month
- Cataloging released Triton kernels. ☆208 · Updated 2 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆234 · Updated last month
- Fast low-bit matmul kernels in Triton ☆267 · Updated this week
- Extensible collectives library in Triton ☆84 · Updated 6 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆34 · Updated this week
- The simplest but fast implementation of matrix multiplication in CUDA. ☆34 · Updated 7 months ago
- ☆191 · Updated this week
- ☆73 · Updated 4 months ago
- ☆151 · Updated last year
- Ring-attention experiments ☆128 · Updated 5 months ago
- This repository contains the experimental PyTorch native float8 training UX ☆222 · Updated 7 months ago
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆167 · Updated this week
- Experiment of using Tangent to autodiff Triton ☆78 · Updated last year
- Fastest kernels written from scratch ☆199 · Updated 2 weeks ago
- Applied AI experiments and examples for PyTorch ☆249 · Updated this week
- KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems ☆237 · Updated this week
- Learning about CUDA by writing PTX code. ☆124 · Updated last year
- Mixed precision training from scratch with Tensors and CUDA ☆21 · Updated 10 months ago
- ☆101 · Updated 6 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆71 · Updated 6 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆104 · Updated this week
- Experimental GPU language with meta-programming ☆21 · Updated 6 months ago
- Fast Hadamard transform in CUDA, with a PyTorch interface ☆152 · Updated 10 months ago
- ☆86 · Updated last year
- A bunch of kernels that might make stuff slower 😉 ☆28 · Updated this week