Arongil / lipschitz-transformers
Don't just regulate gradients like in Muon, regulate the weights too
☆31 · Updated 5 months ago
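The repo's tagline contrasts Muon-style gradient regulation with regulating the weights themselves, i.e. keeping each weight matrix inside a spectral-norm ball. A minimal NumPy sketch of that projection step, assuming a simple power-iteration estimate of the top singular value (the function names and the `cap` parameter are illustrative, not the repository's actual API):

```python
import numpy as np

def spectral_norm(W, iters=20):
    # Estimate the largest singular value of W by power iteration.
    v = np.random.default_rng(0).standard_normal(W.shape[1])
    for _ in range(iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    # With u, v converged to the top singular vectors, u·Wv ≈ sigma_max.
    return float(u @ W @ v)

def cap_spectral_norm(W, cap=1.0):
    # Project W back into the spectral-norm ball of radius `cap`:
    # leave it alone if already inside, otherwise rescale uniformly.
    sigma = spectral_norm(W)
    return W if sigma <= cap else W * (cap / sigma)
```

Applied after each optimizer step, such a projection bounds the Lipschitz constant each linear layer can contribute, independently of how the gradient update itself was conditioned.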
Alternatives and similar repositories for lipschitz-transformers
Users interested in lipschitz-transformers are comparing it to the repositories listed below.
- Supporting code for the blog post on modular manifolds ☆111 · Updated 3 months ago
- Official code for the paper "Attention as a Hypernetwork" ☆46 · Updated last year
- Minimal (400 LOC) implementation of maximum (multi-node, FSDP) GPT training ☆132 · Updated last year
- WIP ☆93 · Updated last year
- ☆122 · Updated 7 months ago
- [ICML 2025] Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction ☆82 · Updated 7 months ago
- Official PyTorch implementation and models for the paper "Diffusion Beats Autoregressive in Data-Constrained Settings". We find diffusion mod… ☆118 · Updated 2 months ago
- ☆33 · Updated last year
- ☆35 · Updated last year
- ☆62 · Updated last year
- ☆13 · Updated 10 months ago
- ☆53 · Updated last year
- Code for the paper "Function-Space Learning Rates" ☆23 · Updated 7 months ago
- ☆27 · Updated 3 months ago
- 📄 Small Batch Size Training for Language Models ☆79 · Updated 3 months ago
- ☆34 · Updated last year
- Code accompanying the paper "LaProp: a Better Way to Combine Momentum with Adaptive Gradient" ☆29 · Updated 5 years ago
- Supporting PyTorch FSDP for optimizers ☆84 · Updated last year
- ☆35 · Updated last year
- Gemstones: A Model Suite for Multi-Faceted Scaling Laws (NeurIPS 2025) ☆30 · Updated 3 months ago
- Official PyTorch Implementation of the Longhorn Deep State Space Model ☆57 · Updated last year
- Scalable and Stable Parallelization of Nonlinear RNNs ☆28 · Updated 2 months ago
- ☆23 · Updated last year
- ☆43 · Updated 2 months ago
- Universal Neurons in GPT2 Language Models ☆31 · Updated last year
- ☆56 · Updated last year
- ☆55 · Updated last year
- Flash Attention Triton kernel with support for second-order derivatives ☆129 · Updated 3 weeks ago
- Focused on fast experimentation and simplicity ☆79 · Updated last year
- Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" ☆87 · Updated last year