riverstone496 / awesome-second-order-optimization
⭐27 · Updated last year
Alternatives and similar repositories for awesome-second-order-optimization
Users interested in awesome-second-order-optimization are comparing it to the repositories listed below.
- Small Batch Size Training for Language Models ⭐62 · Updated 3 weeks ago
- An implementation of the PSGD Kron second-order optimizer for PyTorch ⭐96 · Updated last month
- ⭐57 · Updated 11 months ago
- WIP ⭐94 · Updated last year
- ⭐119 · Updated 3 months ago
- Supporting PyTorch FSDP for optimizers ⭐84 · Updated 9 months ago
- ⭐88 · Updated last year
- Dion optimizer algorithm ⭐343 · Updated 2 weeks ago
- ⭐210 · Updated 9 months ago
- Code and weights for the paper "Cluster and Predict Latent Patches for Improved Masked Image Modeling" ⭐120 · Updated 5 months ago
- 🧱 Modula software package ⭐237 · Updated last month
- Flow-matching algorithms in JAX ⭐104 · Updated last year
- The simplest, fastest repository for training/finetuning medium-sized GPTs ⭐162 · Updated 2 months ago
- CIFAR-10 speedruns: 94% in 2.6 seconds and 96% in 27 seconds ⭐301 · Updated 2 months ago
- Flash Attention Triton kernel with support for second-order derivatives ⭐86 · Updated this week
- Explorations into the recently proposed Taylor Series Linear Attention ⭐100 · Updated last year
- ⭐38 · Updated last year
- Implementation of Diffusion Transformer (DiT) in JAX ⭐291 · Updated last year
- ⭐34 · Updated last year
- Landing repository for the paper "Softpick: No Attention Sink, No Massive Activations with Rectified Softmax" ⭐84 · Updated last week
- [ICML 2025] Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction ⭐68 · Updated 3 months ago
- Code for the paper "Function-Space Learning Rates" ⭐23 · Updated 3 months ago
- A MAD laboratory to improve AI architecture designs 🧪 ⭐129 · Updated 9 months ago
- Experiment of using Tangent to autodiff Triton ⭐81 · Updated last year
- A basic pure PyTorch implementation of Flash Attention ⭐16 · Updated 10 months ago
- ⭐243 · Updated 3 months ago
- Minimal (400 LOC) implementation of Maximum (multi-node, FSDP) GPT training ⭐132 · Updated last year
- Code for the NeurIPS 2024 Spotlight "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" ⭐83 · Updated 10 months ago
- ⭐85 · Updated 6 months ago
- $100K or 100 Days: Trade-offs when Pre-Training with Academic Resources ⭐146 · Updated 4 months ago