danielvegamyhre / ml-perf-reading-group
My annotated papers and meeting recordings for the EleutherAI ML Performance research paper reading group
☆18 · Updated last month
Alternatives and similar repositories for ml-perf-reading-group
Users interested in ml-perf-reading-group are comparing it to the repositories listed below
- ☆78 · Updated 11 months ago
- Supporting PyTorch FSDP for optimizers ☆82 · Updated 6 months ago
- Load compute kernels from the Hub ☆191 · Updated last week
- This repository contains the experimental PyTorch native float8 training UX ☆224 · Updated 10 months ago
- ☆159 · Updated last year
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆185 · Updated 3 weeks ago
- Experiment of using Tangent to autodiff Triton ☆79 · Updated last year
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆189 · Updated last month
- Cataloging released Triton kernels. ☆238 · Updated 5 months ago
- A library for unit scaling in PyTorch ☆125 · Updated 7 months ago
- ☆194 · Updated 4 months ago
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI. ☆134 · Updated last year
- ☆109 · Updated last year
- Solve puzzles. Learn CUDA. ☆64 · Updated last year
- ☆174 · Updated 5 months ago
- 🎬 3.7× faster video generation E2E 🖼️ 1.6× faster image generation E2E ⚡ ColumnSparseAttn 9.3× vs FlashAttn-3 💨 ColumnSparseGEMM 2.5× … ☆75 · Updated this week
- Applied AI experiments and examples for PyTorch ☆277 · Updated 3 weeks ago
- ☆114 · Updated 3 weeks ago
- Learn CUDA with PyTorch ☆27 · Updated this week
- Collection of kernels written in the Triton language ☆132 · Updated 2 months ago
- ring-attention experiments ☆144 · Updated 8 months ago
- Prune transformer layers ☆69 · Updated last year
- Focused on fast experimentation and simplicity ☆75 · Updated 6 months ago
- A bunch of kernels that might make stuff slower 😉 ☆51 · Updated this week
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆77 · Updated 2 weeks ago
- Fast low-bit matmul kernels in Triton ☆323 · Updated last week
- Fast, Modern, and Low Precision PyTorch Optimizers ☆94 · Updated this week
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆558 · Updated last week
- WIP ☆93 · Updated 10 months ago
- PTX tutorial written purely by AIs (OpenAI Deep Research and Claude 3.7) ☆66 · Updated 3 months ago