SzymonOzog / Penny

Hand-Rolled GPU communications library

☆72

Alternatives and similar repositories for Penny

Users who are interested in Penny compare it to the libraries listed below.

IST-DASLab / llmq
Quantized LLM training in pure CUDA/C++.
☆220 Updated this week
salykova / sgemm.cu
High-Performance SGEMM on CUDA devices
☆112 Updated 10 months ago
gpu-mode / discord-cluster-manager
Write a fast kernel and run it on Discord. See how you compare against the best!
☆61 Updated this week
unixpickle / learn-ptx
Learning about CUDA by writing PTX code.
☆148 Updated last year
gpu-mode / ring-attention
ring-attention experiments
☆160 Updated last year
meta-pytorch / BackendBench
Ship correct and fast LLM kernels to PyTorch
☆124 Updated 2 weeks ago
huggingface / kernel-builder
👷 Build compute kernels
☆190 Updated this week
dropbox / gemlite
Fast low-bit matmul kernels in Triton
☆401 Updated last week
gpu-mode / reference-kernels
Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!
☆164 Updated this week
MekkCyber / CutlassAcademy
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
☆244 Updated 6 months ago
open-lm-engine / accelerated-model-architectures
A bunch of kernels that might make stuff slower 😉
☆65 Updated this week
NVIDIA / tilus
Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.
☆408 Updated this week
cloneofsimo / ptx-tutorial-by-aislop
PTX-Tutorial Written Purely By AIs (Deep Research of Openai and Claude 3.7)
☆66 Updated 8 months ago
siboehm / ShallowSpeed
Small scale distributed training of sequential deep learning models, built on Numpy and MPI.
☆151 Updated 2 years ago
HazyResearch / HipKittens
Fast and Furious AMD Kernels
☆298 Updated last week
deepseek-ai / LPLB
An early research stage MoE load balancer based on linear programming.
☆415 Updated 2 weeks ago
PrimeIntellect-ai / pccl
PCCL (Prime Collective Communications Library) implements fault tolerant collective communications over IP
☆138 Updated 2 months ago
meta-pytorch / tritonparse
TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels
☆175 Updated last week
MekkCyber / TritonAcademy
A repository to unravel the language of GPUs, making their kernel conversations easy to understand
☆196 Updated 6 months ago
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆169 Updated 7 months ago
cchan / tccl
extensible collectives library in triton
☆91 Updated 8 months ago
Deep-Learning-Profiling-Tools / triton-viz
☆256 Updated last week
gau-nernst / learn-cuda
Learn CUDA with PyTorch
☆117 Updated last week
pytorch / helion
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
☆640 Updated this week
apple / ml-recurrent-drafter
☆219 Updated 10 months ago
axonn-ai / axonn
Parallel framework for training and fine-tuning deep neural networks
☆70 Updated 3 weeks ago
HazyResearch / train-tk
train with kittens!
☆63 Updated last year
NVIDIA / nvshmem
NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com…
☆402 Updated 2 weeks ago
linjames0 / Transformer-CUDA
An implementation of the transformer architecture onto an Nvidia CUDA kernel
☆195 Updated 2 years ago
HazyResearch / Megakernels
kernels, of the mega variety
☆614 Updated 2 months ago