Quentin-Anthony / nanoMPI

Simple MPI implementation for prototyping or learning

☆272

Alternatives and similar repositories for nanoMPI

Users who are interested in nanoMPI are comparing it to the repositories listed below.

unixpickle / learn-ptx
Learning about CUDA by writing PTX code.
☆133 Updated last year
MekkCyber / TritonAcademy
A repository to unravel the language of GPUs, making their kernel conversations easy to understand
☆188 Updated 2 months ago
jax-ml / scaling-book
Home for "How To Scale Your Model", a short blog-style textbook about scaling LLMs on TPUs
☆445 Updated last week
EurekaLabsAI / tensor
The Tensor (or Array)
☆441 Updated 11 months ago
BobMcDear / attorch
A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
☆567 Updated this week
MekkCyber / CutlassAcademy
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
☆205 Updated 3 months ago
pytorch-labs / monarch
PyTorch Single Controller
☆345 Updated this week
rkinas / triton-resources
A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code.
☆383 Updated 4 months ago
clu0 / unet.cu
UNet diffusion model in pure CUDA
☆613 Updated last year
hkproj / triton-flash-attention
☆184 Updated 7 months ago
Maharshi-Pandya / cudacodes
Learnings and programs related to CUDA
☆415 Updated last month
siboehm / ShallowSpeed
Small scale distributed training of sequential deep learning models, built on Numpy and MPI.
☆137 Updated last year
gpu-mode / reference-kernels
Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!
☆69 Updated 3 weeks ago
ulrichstern / cuda-convnet
Alex Krizhevsky's original code from Google Code
☆195 Updated 9 years ago
huggingface / picotron_tutorial
☆208 Updated 5 months ago
lucasdelimanogueira / PyNorch
Recreating PyTorch from scratch (C/C++, CUDA, NCCL and Python, with multi-GPU support and automatic differentiation!)
☆151 Updated last year
gpu-mode / profiling-cuda-in-torch
☆162 Updated last year
ScalingIntelligence / KernelBench
KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems
☆505 Updated last week
salykova / sgemm.cu
High-Performance SGEMM on CUDA devices
☆98 Updated 6 months ago
Deep-Learning-Profiling-Tools / triton-viz
☆227 Updated this week
gpu-mode / triton-index
Cataloging released Triton kernels.
☆247 Updated 6 months ago
pytorch / torchft
Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo)
☆377 Updated this week
cloneofsimo / ptx-tutorial-by-aislop
PTX tutorial written purely by AIs (OpenAI's Deep Research and Claude 3.7)
☆66 Updated 4 months ago
ash-01xor / bpe.c
A simple byte pair encoding (BPE) mechanism for tokenization, written purely in C
☆136 Updated 8 months ago
linjames0 / Transformer-CUDA
An implementation of the transformer architecture as an NVIDIA CUDA kernel
☆189 Updated last year
mobiusml / gemlite
Fast low-bit matmul kernels in Triton
☆339 Updated this week
gpu-mode / ring-attention
Ring-attention experiments
☆146 Updated 9 months ago
rwitten / HighPerfLLMs2024
☆519 Updated last year
bertmaher / simplegemm
☆110 Updated 4 months ago
andrewkchan / yalm
Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
☆396 Updated 2 months ago