KellerJordan / top-sgd
Optimization algorithm which fits a ResNet to CIFAR-10 5x faster than SGD / Adam (with terrible generalization)
☆14 · Updated 2 years ago
Alternatives and similar repositories for top-sgd
Users interested in top-sgd are comparing it to the libraries listed below.
- Pytorch implementation of preconditioned stochastic gradient descent (Kron and affine preconditioner, low-rank approximation precondition… ☆188 · Updated last month
- Replicating and dissecting the git-re-basin project in one-click-replication Colabs ☆35 · Updated 3 years ago
- Open source code for EigenGame. ☆33 · Updated 2 years ago
- ☆61 · Updated last year
- Parameter-Free Optimizers for Pytorch ☆130 · Updated last year
- Sequence Modeling with Multiresolution Convolutional Memory (ICML 2023) ☆127 · Updated 2 years ago
- Unofficial but Efficient Implementation of "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" in JAX ☆89 · Updated last year
- Euclidean Wasserstein-2 optimal transportation ☆47 · Updated 2 years ago
- nanoGPT-like codebase for LLM training ☆110 · Updated 2 weeks ago
- ☆223 · Updated 11 months ago
- Lightning-like training API for JAX with Flax ☆44 · Updated 11 months ago
- Experiment of using Tangent to autodiff triton ☆80 · Updated last year
- ☆39 · Updated last year
- Implementation of Denoising Diffusion Probabilistic Models (DDPM) in JAX and Flax. ☆20 · Updated 2 years ago
- Implementation of Gradient Agreement Filtering, from Chaubard et al. of Stanford, but for single machine microbatches, in Pytorch ☆25 · Updated 10 months ago
- Deep Networks Grok All the Time and Here is Why ☆37 · Updated last year
- DoG is SGD's Best Friend: A Parameter-Free Dynamic Step Size Schedule ☆63 · Updated 2 years ago
- minGPT in JAX ☆48 · Updated 3 years ago
- LoRA for arbitrary JAX models and functions ☆143 · Updated last year
- A simple library for scaling up JAX programs ☆144 · Updated 2 weeks ago
- ☆91 · Updated last year
- Implementation of GateLoop Transformer in Pytorch and Jax ☆90 · Updated last year
- Official repository for the paper "Can You Learn an Algorithm? Generalizing from Easy to Hard Problems with Recurrent Networks" ☆59 · Updated 3 years ago
- ☆53 · Updated last year
- Maximal Update Parametrization (μP) with Flax & Optax. ☆16 · Updated last year
- Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024] ☆68 · Updated last year
- Multi-framework implementation of Deep Kernel Shaping and Tailored Activation Transformations, which are methods that modify neural netwo… ☆74 · Updated 4 months ago
- supporting pytorch FSDP for optimizers ☆84 · Updated 11 months ago
- ☆72 · Updated 11 months ago
- Explorations into the recently proposed Taylor Series Linear Attention ☆100 · Updated last year