fattorib / ZeRO-transformer
Two implementations of ZeRO-1 optimizer sharding in JAX
☆13 · Updated last year
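For context, here is a minimal sketch of what ZeRO-1 optimizer sharding means in JAX terms: each data-parallel device keeps Adam moments only for its own slice of the flattened parameters, applies the update to that slice, and all-gathers the result. This is an illustrative toy, not code from this repository; the names (`shard_slice`, `train_step`), the flat-parameter linear model, and the use of `pmap` plus Optax are assumptions made for brevity.

```python
import functools

import jax
import jax.numpy as jnp
import optax

AXIS = "data"                                # name of the data-parallel pmap axis
N_DEV = jax.local_device_count()             # data-parallel world size

def shard_slice(x):
    """Return this device's contiguous slice of a flat vector (ZeRO-1 partitioning)."""
    i = jax.lax.axis_index(AXIS)             # position of this device on the axis
    size = x.shape[0] // N_DEV               # assumes the length divides evenly
    return jax.lax.dynamic_slice(x, (i * size,), (size,))

def loss_fn(params, batch):
    x, y = batch
    return jnp.mean((x @ params - y) ** 2)   # toy linear model with a flat parameter vector

opt = optax.adam(1e-3)                       # the Adam moments are what get sharded

# Each device initializes optimizer state only for its own parameter slice.
init_opt_state = jax.pmap(lambda p: opt.init(shard_slice(p)), axis_name=AXIS)

@functools.partial(jax.pmap, axis_name=AXIS)
def train_step(params, opt_state, batch):
    grads = jax.grad(loss_fn)(params, batch)
    grads = jax.lax.pmean(grads, AXIS)       # average gradients across devices
    g_shard = shard_slice(grads)             # this device's gradient slice
    p_shard = shard_slice(params)            # this device's parameter slice
    updates, opt_state = opt.update(g_shard, opt_state, p_shard)
    p_shard = optax.apply_updates(p_shard, updates)
    # All-gather the updated slices so every device holds full parameters again.
    params = jax.lax.all_gather(p_shard, AXIS).reshape(-1)
    return params, opt_state

if __name__ == "__main__":
    dim = 8 * N_DEV
    key = jax.random.PRNGKey(0)
    params = jnp.broadcast_to(jax.random.normal(key, (dim,)), (N_DEV, dim))
    opt_state = init_opt_state(params)
    batch = (jax.random.normal(key, (N_DEV, 4, dim)), jnp.zeros((N_DEV, 4)))
    params, opt_state = train_step(params, opt_state, batch)
    print(params.shape, jax.tree_util.tree_map(jnp.shape, opt_state))
```

The memory saving this sketch illustrates is the same idea the repository implements at scale: optimizer state for a parameter of size D occupies roughly D/N per device instead of D, at the cost of an all-gather of updated parameters each step.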
Related projects
Alternatives and complementary repositories for ZeRO-transformer
- seqax = sequence modeling + JAX ☆133 · Updated 4 months ago
- ☆73 · Updated 4 months ago
- Experiment of using Tangent to autodiff triton ☆72 · Updated 10 months ago
- ☆77 · Updated 5 months ago
- A simple library for scaling up JAX programs ☆127 · Updated 3 weeks ago
- Minimal but scalable implementation of large language models in JAX ☆26 · Updated 3 weeks ago
- Transformer with Mu-Parameterization, implemented in JAX/Flax. Supports FSDP on TPU pods. ☆29 · Updated 2 weeks ago
- Accelerated First Order Parallel Associative Scan ☆164 · Updated 3 months ago
- extensible collectives library in triton ☆72 · Updated 2 months ago
- This repository contains the experimental PyTorch native float8 training UX ☆211 · Updated 3 months ago
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ☆35 · Updated 4 months ago
- Triton Implementation of HyperAttention Algorithm ☆46 · Updated 11 months ago
- A set of Python scripts that makes your experience on TPU better ☆40 · Updated 4 months ago
- Cataloging released Triton kernels. ☆138 · Updated 2 months ago
- ring-attention experiments ☆97 · Updated last month
- A library for unit scaling in PyTorch ☆105 · Updated 2 weeks ago
- ☆177 · Updated last week
- ☆132 · Updated last year
- Machine Learning eXperiment Utilities ☆45 · Updated 5 months ago
- Triton-based implementation of Sparse Mixture of Experts. ☆185 · Updated last month
- Simple and efficient pytorch-native transformer training and inference (batched) ☆61 · Updated 7 months ago
- LoRA for arbitrary JAX models and functions ☆133 · Updated 8 months ago
- ☆36 · Updated 10 months ago
- ☆50 · Updated 6 months ago
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆194 · Updated this week
- JAX bindings for Flash Attention v2 ☆80 · Updated 4 months ago
- An implementation of the Llama architecture, to instruct and delight ☆21 · Updated 3 months ago
- ☆224 · Updated 4 months ago
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆107 · Updated last year
- CUDA implementation of autoregressive linear attention, with all the latest research findings ☆43 · Updated last year