pranjalssh / fast.cu

Fastest kernels written from scratch

☆533

Alternatives and similar repositories for fast.cu

Users who are interested in fast.cu compare it to the repositories listed below.

MekkCyber / CutlassAcademy
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
☆251 Updated 9 months ago
Dao-AILab / quack
A Quirky Assortment of CuTe Kernels
☆781 Updated last week
wangzyon / NVIDIA_SGEMM_PRACTICE
Step-by-step optimization of CUDA SGEMM
☆428 Updated 3 years ago
ColfaxResearch / cutlass-kernels
☆259 Updated last year
NVIDIA / TileGym
Helpful kernel tutorials and examples for tile-based GPU programming
☆630 Updated this week
Deep-Learning-Profiling-Tools / triton-viz
☆286 Updated last week
gpu-mode / triton-index
Cataloging released Triton kernels.
☆292 Updated 5 months ago
dropbox / gemlite
Fast low-bit matmul kernels in Triton
☆427 Updated last week
bertmaher / simplegemm
☆130 Updated 3 months ago
HazyResearch / Megakernels
kernels, of the mega variety
☆672 Updated 2 weeks ago
ColfaxResearch / cfx-article-src
☆175 Updated 9 months ago
pytorch / helion
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
☆739 Updated last week
perplexityai / pplx-kernels
Perplexity GPU Kernels
☆560 Updated 3 months ago
leimao / CUDA-GEMM-Optimization
CUDA Matrix Multiplication Optimization
☆256 Updated last year
meta-pytorch / applied-ai
Applied AI experiments and examples for PyTorch
☆315 Updated 5 months ago
siboehm / SGEMM_CUDA
Fast CUDA matrix multiplication from scratch
☆1,046 Updated 5 months ago
meta-pytorch / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆326 Updated this week
yifuwang / symm-mem-recipes
☆159 Updated last year
leimao / CUTLASS-Examples
CUTLASS and CuTe Examples
☆127 Updated 2 months ago
KnowingNothing / MatmulTutorial
An easy-to-understand TensorOp Matmul tutorial
☆404 Updated last week
gpu-mode / reference-kernels
Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!
☆201 Updated this week
facebookexperimental / triton
GitHub mirror of the triton-lang/triton repo.
☆128 Updated this week
ROCm / iris
AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming
☆168 Updated this week
gau-nernst / learn-cuda
Learn CUDA with PyTorch
☆200 Updated this week
NVIDIA / nvshmem
NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com…
☆462 Updated last month
66RING / tiny-flash-attention
Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS
☆484 Updated 3 weeks ago
Bruce-Lee-LY / cuda_hgemm
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…
☆522 Updated last year
Dao-AILab / sonic-moe
Accelerating MoE with IO and Tile-aware Optimizations
☆569 Updated 3 weeks ago
triton-lang / triton-cpu
An experimental CPU backend for Triton
☆174 Updated 3 months ago
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆178 Updated 2 weeks ago