siboehm / ShallowSpeed
Small-scale distributed training of sequential deep learning models, built on NumPy and MPI.
☆126 · Updated last year
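The core pattern the repo is built around is MPI collectives over plain NumPy buffers. Below is a minimal sketch of the data-parallel step under that setup, assuming mpi4py as the MPI binding; `allreduce_mean` is an illustrative helper, not ShallowSpeed's actual API:

```python
# Sketch of data-parallel gradient averaging with NumPy + MPI (mpi4py).
# `allreduce_mean` is a hypothetical helper for illustration only.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD


def allreduce_mean(local_grad: np.ndarray) -> np.ndarray:
    """Sum a gradient buffer across all ranks, then average it."""
    summed = np.empty_like(local_grad)
    comm.Allreduce(local_grad, summed, op=MPI.SUM)
    return summed / comm.Get_size()


# Each rank computes a gradient on its own shard of the batch,
# then every rank applies the same globally averaged gradient,
# keeping parameters in sync without a parameter server.
rank = comm.Get_rank()
local_grad = np.random.default_rng(seed=rank).standard_normal(4)
grad = allreduce_mean(local_grad)

params = np.zeros(4)
params -= 0.01 * grad  # plain SGD step, identical on all ranks
```

Launched the usual MPI way, e.g. `mpirun -n 4 python train_step.py`, each process holds a full model replica and only gradients cross the wire.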
Alternatives and similar repositories for ShallowSpeed:
Users interested in ShallowSpeed are comparing it to the libraries listed below.
- Cataloging released Triton kernels. ☆204 · Updated 2 months ago
- ☆191 · Updated this week
- Extensible collectives library in Triton. ☆84 · Updated 6 months ago
- Ring-attention experiments. ☆127 · Updated 5 months ago
- Applied AI experiments and examples for PyTorch. ☆249 · Updated this week
- KernelBench: Can LLMs Write GPU Kernels? A benchmark of Torch -> CUDA problems. ☆234 · Updated this week
- Fastest kernels written from scratch. ☆199 · Updated 2 weeks ago
- Fast low-bit matmul kernels in Triton. ☆267 · Updated this week
- Collection of kernels written in the Triton language. ☆114 · Updated last month
- ☆73 · Updated 4 months ago
- An implementation of the transformer architecture as an Nvidia CUDA kernel. ☆174 · Updated last year
- An experiment in using Tangent to autodiff Triton. ☆78 · Updated last year
- This repository contains the experimental PyTorch native float8 training UX. ☆222 · Updated 7 months ago
- ☆151 · Updated last year
- seqax = sequence modeling + JAX. ☆150 · Updated this week
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆34 · Updated this week
- ☆162 · Updated 9 months ago
- A simple but fast implementation of matrix multiplication in CUDA. ☆34 · Updated 7 months ago
- ☆290 · Updated this week
- Solve puzzles. Learn CUDA. ☆63 · Updated last year
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆524 · Updated last month
- JAX implementation of the Llama 2 model. ☆216 · Updated last year
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆189 · Updated this week
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs. ☆232 · Updated 3 weeks ago
- ☆67 · Updated last year
- ☆101 · Updated 6 months ago
- ☆86 · Updated last year