riverstone496 / awesome-second-order-optimization
★28 · Updated last month
Alternatives and similar repositories for awesome-second-order-optimization
Users interested in awesome-second-order-optimization are comparing it to the libraries listed below.
- Small Batch Size Training for Language Models ★63 · Updated last month
- Supporting code for the blog post on modular manifolds. ★102 · Updated last month
- Supporting PyTorch FSDP for optimizers ★84 · Updated 11 months ago
- Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" ★85 · Updated last year
- ★61 · Updated last year
- Code and weights for the paper "Cluster and Predict Latent Patches for Improved Masked Image Modeling" ★123 · Updated 7 months ago
- A basic pure PyTorch implementation of Flash Attention ★16 · Updated last year
- ★119 · Updated 5 months ago
- ★223 · Updated 11 months ago
- ★91 · Updated last year
- The simplest, fastest repository for training/finetuning medium-sized GPTs. ★173 · Updated 4 months ago
- Dion optimizer algorithm ★384 · Updated this week
- WIP ★93 · Updated last year
- Landing repository for the paper "Softpick: No Attention Sink, No Massive Activations with Rectified Softmax" ★85 · Updated 2 months ago
- An implementation of the PSGD Kron second-order optimizer for PyTorch ★97 · Updated 3 months ago
- ★95 · Updated 8 months ago
- 🧱 Modula software package ★303 · Updated 3 months ago
- ★254 · Updated 5 months ago
- Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024] ★68 · Updated last year
- Normalized Transformer (nGPT) ★192 · Updated last year
- Code for the paper "Function-Space Learning Rates" ★23 · Updated 5 months ago
- CIFAR-10 speedruns: 94% in 2.6 seconds and 96% in 27 seconds ★326 · Updated last week
- A comprehensive JAX/NNX library for diffusion and flow matching generative algorithms, featuring DiT (Diffusion Transformer) and its vari… ★116 · Updated last month
- Official PyTorch implementation and models for paper "Diffusion Beats Autoregressive in Data-Constrained Settings". We find diffusion mod… ★108 · Updated 3 weeks ago
- ★47 · Updated last month
- Minimal (400 LOC) implementation of maximum (multi-node, FSDP) GPT training ★132 · Updated last year
- Flash Attention Triton kernel with support for second-order derivatives ★112 · Updated last month
- A simple library for scaling up JAX programs ★144 · Updated 2 weeks ago
- ★38 · Updated last year
- Official JAX implementation of MD4 Masked Diffusion Models ★144 · Updated 8 months ago