danielvegamyhre / ml-perf-reading-group

My annotated papers and meeting recordings for the EleutherAI ML Performance research paper reading group

☆18

Alternatives and similar repositories for ml-perf-reading-group

Users who are interested in ml-perf-reading-group are comparing it to the repositories listed below.

cloneofsimo / min-fsdp
☆78 Updated 11 months ago
ethansmith2000 / fsdp_optimizers
supporting pytorch FSDP for optimizers
☆82 Updated 6 months ago
huggingface / kernels
Load compute kernels from the Hub
☆191 Updated last week
pytorch-labs / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆224 Updated 10 months ago
gpu-mode / profiling-cuda-in-torch
☆159 Updated last year
MekkCyber / TritonAcademy
A repository to unravel the language of GPUs, making their kernel conversations easy to understand
☆185 Updated 3 weeks ago
srush / triton-autodiff
Experiment of using Tangent to autodiff triton
☆79 Updated last year
MekkCyber / CutlassAcademy
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
☆189 Updated last month
gpu-mode / triton-index
Cataloging released Triton kernels.
☆238 Updated 5 months ago
graphcore-research / unit-scaling
A library for unit scaling in PyTorch
☆125 Updated 7 months ago
huggingface / picotron_tutorial
☆194 Updated 4 months ago
siboehm / ShallowSpeed
Small scale distributed training of sequential deep learning models, built on Numpy and MPI.
☆134 Updated last year
mgmalek / efficient_cross_entropy
☆109 Updated last year
dshah3 / GPU-Puzzles
Solve puzzles. Learn CUDA.
☆64 Updated last year
hkproj / triton-flash-attention
☆174 Updated 5 months ago
sandyresearch / chipmunk
🎬 3.7× faster video generation E2E 🖼️ 1.6× faster image generation E2E ⚡ ColumnSparseAttn 9.3× vs FlashAttn‑3 💨 ColumnSparseGEMM 2.5× …
☆75 Updated this week
pytorch-labs / applied-ai
Applied AI experiments and examples for PyTorch
☆277 Updated 3 weeks ago
Dao-AILab / grouped-latent-attention
☆114 Updated 3 weeks ago
gau-nernst / learn-cuda
Learn CUDA with PyTorch
☆27 Updated this week
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆132 Updated 2 months ago
gpu-mode / ring-attention
ring-attention experiments
☆144 Updated 8 months ago
melisa-writer / short-transformers
Prune transformer layers
☆69 Updated last year
fal-ai / diffusion-speedrun
Focused on fast experimentation and simplicity
☆75 Updated 6 months ago
open-lm-engine / cute-kernels
A bunch of kernels that might make stuff slower 😉
☆51 Updated this week
PiotrNawrot / nano-sparse-attention
The simplest implementation of recent Sparse Attention patterns for efficient LLM inference.
☆77 Updated 2 weeks ago
mobiusml / gemlite
Fast low-bit matmul kernels in Triton
☆323 Updated last week
warner-benjamin / optimi
Fast, Modern, and Low Precision PyTorch Optimizers
☆94 Updated this week
BobMcDear / attorch
A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
☆558 Updated last week
cloneofsimo / scaling-guide
WIP
☆93 Updated 10 months ago
cloneofsimo / ptx-tutorial-by-aislop
PTX-Tutorial Written Purely By AIs (Deep Research of Openai and Claude 3.7)
☆66 Updated 3 months ago