siboehm / ShallowSpeed
Small scale distributed training of sequential deep learning models, built on Numpy and MPI.
☆134 · Updated last year
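For context on what this repo builds on: the central primitive in a NumPy + MPI training loop of this kind is averaging per-rank gradients with an allreduce. The snippet below is a minimal hypothetical sketch of that pattern using mpi4py; the function name and the random stand-in gradient are illustrative and are not ShallowSpeed's actual API.

```python
# Hypothetical sketch (not ShallowSpeed's API): data-parallel gradient
# averaging over MPI using mpi4py and NumPy buffers.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def allreduce_mean(local_grad: np.ndarray) -> np.ndarray:
    """Sum a gradient shard across all ranks, then divide by the world size."""
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)  # buffer-based collective
    return global_grad / comm.Get_size()

# Each rank computes a gradient on its own data shard (random stand-in here),
# then every rank applies the same averaged update.
local_grad = np.random.default_rng(comm.Get_rank()).standard_normal((4, 4)).astype(np.float32)
avg_grad = allreduce_mean(local_grad)
```

Launched with e.g. `mpiexec -n 4 python demo.py`, every rank ends up holding the same averaged gradient before the optimizer step.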
Alternatives and similar repositories for ShallowSpeed
Users interested in ShallowSpeed are comparing it to the libraries listed below.
- ☆225 · Updated this week
- ☆160 · Updated last year
- Cataloging released Triton kernels. ☆242 · Updated 6 months ago
- Extensible collectives library in Triton ☆87 · Updated 3 months ago
- Fast low-bit matmul kernels in Triton ☆327 · Updated this week
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆195 · Updated 2 months ago
- Ring-attention experiments ☆144 · Updated 8 months ago
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆188 · Updated last month
- An implementation of the transformer architecture as an Nvidia CUDA kernel ☆187 · Updated last year
- Applied AI experiments and examples for PyTorch ☆281 · Updated last month
- Collection of kernels written in the Triton language ☆136 · Updated 3 months ago
- A bunch of kernels that might make stuff slower 😉 ☆54 · Updated this week
- Home for "How To Scale Your Model", a short blog-style textbook about scaling LLMs on TPUs ☆424 · Updated this week
- A Quirky Assortment of CuTe Kernels ☆126 · Updated last week
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆46 · Updated 2 weeks ago
- Custom kernels in the Triton language for accelerating LLMs ☆23 · Updated last year
- seqax = sequence modeling + JAX ☆163 · Updated 3 weeks ago
- KernelBench: Can LLMs Write GPU Kernels? A benchmark with Torch -> CUDA problems ☆468 · Updated this week
- Learn CUDA with PyTorch ☆29 · Updated this week
- High-Performance SGEMM on CUDA devices ☆97 · Updated 5 months ago
- Experiment of using Tangent to autodiff Triton ☆79 · Updated last year
- This repository contains the experimental PyTorch native float8 training UX ☆224 · Updated 11 months ago
- PyTorch Single Controller ☆296 · Updated this week
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆561 · Updated 3 weeks ago
- ☆320 · Updated 2 weeks ago
- PTX-Tutorial written purely by AIs (OpenAI Deep Research and Claude 3.7) ☆66 · Updated 3 months ago
- ☆511 · Updated last year
- Learning about CUDA by writing PTX code. ☆133 · Updated last year
- ☆88 · Updated last year
- ☆28 · Updated 5 months ago