riverstone496 / awesome-second-order-optimization
⭐27 · Updated last year

Alternatives and similar repositories for awesome-second-order-optimization
Users interested in awesome-second-order-optimization are comparing it to the libraries listed below.
- Small Batch Size Training for Language Models — ⭐43 · Updated last week
- Supporting PyTorch FSDP for optimizers — ⭐84 · Updated 8 months ago
- An implementation of the PSGD Kron second-order optimizer for PyTorch — ⭐96 · Updated last month
- ⭐207 · Updated 8 months ago
- ⭐87 · Updated last year
- Implementations of attention with the softpick function, naive and FlashAttention-2 — ⭐83 · Updated 4 months ago
- WIP — ⭐94 · Updated last year
- 🧱 Modula software package — ⭐231 · Updated last week
- ⭐80 · Updated 6 months ago
- CIFAR-10 speedruns: 94% in 2.6 seconds and 96% in 27 seconds — ⭐284 · Updated last month
- ⭐240 · Updated 2 months ago
- Code and weights for the paper "Cluster and Predict Latent Patches for Improved Masked Image Modeling" — ⭐116 · Updated 4 months ago
- ⭐56 · Updated 10 months ago
- A MAD laboratory to improve AI architecture designs 🧪 — ⭐127 · Updated 8 months ago
- Explorations into the recently proposed Taylor Series Linear Attention — ⭐100 · Updated last year
- Dion optimizer algorithm — ⭐318 · Updated last week
- Minimal (400 LOC) implementation, maximum (multi-node, FSDP) GPT training — ⭐131 · Updated last year
- Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" — ⭐82 · Updated 10 months ago
- ⭐53 · Updated last year
- Deep Networks Grok All the Time and Here is Why — ⭐37 · Updated last year
- ⭐115 · Updated 2 months ago
- Normalized Transformer (nGPT) — ⭐187 · Updated 9 months ago
- Stick-breaking attention — ⭐59 · Updated 2 months ago
- [ICLR 2025] Official PyTorch implementation of Gated Delta Networks: Improving Mamba2 with the Delta Rule — ⭐204 · Updated 5 months ago
- ⭐52 · Updated last year
- ⭐34 · Updated last year
- The evaluation framework for training-free sparse attention in LLMs — ⭐91 · Updated 2 months ago
- The simplest implementation of recent sparse-attention patterns for efficient LLM inference — ⭐85 · Updated last month
- Code for the paper "Function-Space Learning Rates" — ⭐23 · Updated 2 months ago
- Implementation of the Diffusion Transformer (DiT) in JAX — ⭐292 · Updated last year
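For context, the second-order optimizers collected above (PSGD Kron, Dion, and related methods) share one core idea: precondition the gradient with curvature information instead of scaling it by a fixed learning rate. A minimal, illustrative Newton-step sketch on a one-dimensional quadratic — plain Python, not taken from any of the listed repositories:

```python
# Toy objective: f(x) = (x - 3)^2 + 1, with f'(x) = 2(x - 3) and f''(x) = 2.
def grad(x):
    return 2.0 * (x - 3.0)

def hess(x):
    return 2.0  # constant curvature for a quadratic

x = 0.0
# Newton step: divide the gradient by the curvature (the 1-D analogue of
# multiplying by the inverse Hessian) rather than using a fixed step size.
x = x - grad(x) / hess(x)
print(x)  # → 3.0: on a quadratic, Newton's method converges in one step
```

The listed libraries differ mainly in how they approximate and invert the Hessian at scale (Kronecker-factored preconditioners, low-rank updates, etc.), since forming it exactly is infeasible for large models.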