HazyResearch / aisys-building-blocks
Building blocks for foundation models.
☆385 · Updated 10 months ago
Related projects
Alternatives and complementary repositories for aisys-building-blocks
- A bibliography and survey of the papers surrounding o1 ☆577 · Updated this week
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆473 · Updated 2 weeks ago
- Puzzles for learning Triton ☆1,060 · Updated last month
- Helpful tools and examples for working with flex-attention ☆460 · Updated 2 weeks ago
- What would you do with 1000 H100s... ☆892 · Updated 9 months ago
- Annotated version of the Mamba paper ☆455 · Updated 8 months ago
- Efficient implementations of state-of-the-art linear attention models in PyTorch and Triton ☆1,320 · Updated this week
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in PyTorch ☆474 · Updated 2 weeks ago
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆190 · Updated 2 weeks ago
- Legible, Scalable, Reproducible Foundation Models with Named Tensors and JAX ☆516 · Updated this week
- Cataloging released Triton kernels. ☆132 · Updated 2 months ago
- Puzzles for exploring transformers ☆321 · Updated last year
- This repository contains the experimental PyTorch native float8 training UX ☆211 · Updated 3 months ago
- For optimization algorithm research and development. ☆408 · Updated this week
- News and material links related to GPU programming ☆1,205 · Updated last month
- An ML Systems Onboarding list ☆539 · Updated 3 months ago
- MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvement… ☆330 · Updated last week
- Transformers with Arbitrarily Large Context ☆637 · Updated 2 months ago
- NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1 GPU + 1 Day ☆251 · Updated last year
- Flash Attention in ~100 lines of CUDA (forward pass only) ☆609 · Updated 7 months ago
- TensorDict is a dedicated tensor container for PyTorch. ☆832 · Updated this week
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI. ☆95 · Updated last year
- Pipeline Parallelism for PyTorch ☆725 · Updated 2 months ago
- Minimalistic large language model 3D-parallelism training ☆1,220 · Updated this week
- Ring attention implementation with flash attention ☆578 · Updated this week