SzymonOzog / Penny
Hand-Rolled GPU communications library
☆27 · Updated this week
Alternatives and similar repositories for Penny
Users interested in Penny are comparing it to the libraries listed below.
- How to ensure correctness and ship LLM-generated kernels in PyTorch ☆60 · Updated this week
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆57 · Updated this week
- High-Performance SGEMM on CUDA devices ☆103 · Updated 8 months ago
- Learning about CUDA by writing PTX code. ☆135 · Updated last year
- ☆42 · Updated last week
- train with kittens! ☆62 · Updated 11 months ago
- A parallel framework for training deep neural networks ☆63 · Updated 6 months ago
- Custom PTX Instruction Benchmark ☆127 · Updated 7 months ago
- Tilus is a tile-level kernel programming language with explicit control over shared memory and registers. ☆362 · Updated this week
- This repository contains companion software for the Colfax Research paper "Categorical Foundations for CuTe Layouts". ☆29 · Updated this week
- PCCL (Prime Collective Communications Library) implements fault-tolerant collective communications over IP ☆121 · Updated 2 weeks ago
- Effective transpose on Hopper GPU ☆23 · Updated 3 weeks ago
- TritonParse: A Compiler Tracer, Visualizer, and mini-Reproducer (WIP) for Triton Kernels ☆150 · Updated last week
- A bunch of kernels that might make stuff slower 😉 ☆59 · Updated this week
- AMD RAD's experimental RMA library for Triton. ☆74 · Updated last week
- Extensible collectives library in Triton ☆88 · Updated 5 months ago
- PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7) ☆66 · Updated 6 months ago
- ☆217 · Updated 8 months ago
- NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com… ☆318 · Updated last week
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆96 · Updated last week
- SIMD quantization kernels ☆87 · Updated 2 weeks ago
- NSA Triton Kernels written with GPT-5 and Opus 4.1 ☆65 · Updated last month
- Samples of good AI-generated CUDA kernels ☆90 · Updated 3 months ago
- ☆35 · Updated this week
- LLM training in simple, raw C/CUDA ☆104 · Updated last year
- Make Triton easier ☆47 · Updated last year
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI. ☆142 · Updated last year
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆194 · Updated 3 months ago
- ring-attention experiments ☆152 · Updated 11 months ago
- FlexAttention w/ FlashAttention3 Support ☆27 · Updated 11 months ago