SzymonOzog / FastSoftmax

Step by step implementation of a fast softmax kernel in CUDA

☆55

Alternatives and similar repositories for FastSoftmax

Users who are interested in FastSoftmax are comparing it to the repositories listed below.

gpu-mode / triton-index
Cataloging released Triton kernels.
☆272 · Updated 2 months ago
Deep-Learning-Profiling-Tools / triton-viz
☆250 · Updated last week
dropbox / gemlite
Fast low-bit matmul kernels in Triton
☆398 · Updated last week
bertmaher / simplegemm
☆126 · Updated last month
gpu-mode / reference-kernels
Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!
☆160 · Updated 2 weeks ago
MekkCyber / CutlassAcademy
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
☆244 · Updated 6 months ago
meta-pytorch / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆294 · Updated this week
leimao / CUDA-GEMM-Optimization
CUDA Matrix Multiplication Optimization
☆239 · Updated last year
salykova / sgemm.cu
High-Performance SGEMM on CUDA devices
☆112 · Updated 10 months ago
pranjalssh / fast.cu
Fastest kernels written from scratch
☆400 · Updated 2 months ago
siboehm / ShallowSpeed
Small scale distributed training of sequential deep learning models, built on Numpy and MPI.
☆151 · Updated 2 years ago
gau-nernst / learn-cuda
Learn CUDA with PyTorch
☆117 · Updated this week
meta-pytorch / applied-ai
Applied AI experiments and examples for PyTorch
☆307 · Updated 3 months ago
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆169 · Updated 7 months ago
IST-DASLab / llmq
Quantized LLM training in pure CUDA/C++.
☆218 · Updated this week
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆140 · Updated 2 weeks ago
HazyResearch / Megakernels
kernels, of the mega variety
☆614 · Updated 2 months ago
triton-lang / kernels
☆94 · Updated last year
meta-pytorch / tritonparse
TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels
☆171 · Updated last week
unixpickle / learn-ptx
Learning about CUDA by writing PTX code.
☆147 · Updated last year
aryagxr / cuda
coding CUDA everyday!
☆71 · Updated last week
NVIDIA / compute-eval
Evaluating Large Language Models for CUDA Code Generation. ComputeEval is a framework designed to generate and evaluate CUDA code from Lar…
☆75 · Updated last week
wangsiping97 / FastGEMV
High-speed GEMV kernels, with up to 2.7x speedup compared to the PyTorch baseline.
☆122 · Updated last year
SzymonOzog / GPU_Programming
☆85 · Updated 2 weeks ago
Dao-AILab / quack
A Quirky Assortment of CuTe Kernels
☆675 · Updated last week
gpu-mode / profiling-cuda-in-torch
☆177 · Updated last year
tspeterkim / paged-attention-minimal
a minimal cache manager for PagedAttention, on top of llama3.
☆126 · Updated last year
gpu-mode / ring-attention
ring-attention experiments
☆160 · Updated last year
triton-lang / triton-cpu
An experimental CPU backend for Triton
☆164 · Updated 2 weeks ago
ColfaxResearch / cutlass-kernels
☆246 · Updated last year