cloneofsimo / ezmup
Simple implementation of muP, based on the Spectral Condition for Feature Learning. The implementation is SGD-only; don't use it with Adam.
☆69 Updated 3 months ago
Related projects
Alternatives and complementary repositories for ezmup
- These papers will provide unique insightful concepts that will broaden your perspective on neural networks and deep learning ☆46 Updated last year
- WIP ☆89 Updated 3 months ago
- ☆50 Updated 10 months ago
- ☆73 Updated 4 months ago
- ☆31 Updated 2 months ago
- Minimal (400 LOC) implementation Maximum (multi-node, FSDP) GPT training ☆113 Updated 7 months ago
- Efficient optimizers ☆79 Updated this week
- Pytorch implementation of the PEER block from the paper, Mixture of A Million Experts, by Xu Owen He at Deepmind ☆112 Updated 2 months ago
- ☆128 Updated this week
- ☆77 Updated 5 months ago
- Some personal experiments around routing tokens to different autoregressive attention, akin to mixture-of-experts ☆108 Updated last month
- Language models scale reliably with over-training and on downstream tasks ☆94 Updated 7 months ago
- An implementation of PSGD Kron second-order optimizer for PyTorch ☆16 Updated this week
- Scalable neural net training via automatic normalization in the modular norm. ☆121 Updated 3 months ago
- ☆26 Updated 6 months ago
- LoRA for arbitrary JAX models and functions ☆132 Updated 8 months ago
- Model Stock: All we need is just a few fine-tuned models ☆92 Updated last month
- The simplest, fastest repository for training/finetuning medium-sized GPTs. ☆84 Updated last week
- Implementation of Infini-Transformer in Pytorch ☆104 Updated last month
- ☆76 Updated 7 months ago
- ☆21 Updated 5 months ago
- Latent Diffusion Language Models ☆67 Updated last year
- A JAX implementation of the continuous time formulation of Consistency Models ☆83 Updated last year
- ☆53 Updated 10 months ago
- Flexibly track outputs and grad-outputs of torch.nn.Module. ☆13 Updated last year
- ☆121 Updated this week
- Normalized Transformer (nGPT) ☆66 Updated this week
- Implementation of the Llama architecture with RLHF + Q-learning ☆157 Updated 10 months ago
- A fusion of a linear layer and a cross entropy loss, written for pytorch in triton. ☆54 Updated 3 months ago
- Pytorch/XLA SPMD Test code in Google TPU ☆21 Updated 7 months ago