riverstone496 / awesome-second-order-optimization
☆27 · Updated last year
Alternatives and similar repositories for awesome-second-order-optimization
Users interested in awesome-second-order-optimization are comparing it to the repositories listed below.
- Implementations of attention with the softpick function, naive and FlashAttention-2 ☆80 · Updated 2 months ago
- Explorations into the recently proposed Taylor Series Linear Attention ☆99 · Updated 10 months ago
- Code for the paper "Function-Space Learning Rates" ☆20 · Updated last month
- $100K or 100 Days: Trade-offs when Pre-Training with Academic Resources ☆140 · Updated last month
- Code and weights for the paper "Cluster and Predict Latent Patches for Improved Masked Image Modeling" ☆112 · Updated 3 months ago
- An implementation of PSGD Kron second-order optimizer for PyTorch ☆92 · Updated 3 months ago
- ☆197 · Updated 7 months ago
- Supporting PyTorch FSDP for optimizers ☆82 · Updated 7 months ago
- ☆43 · Updated last month
- ☆55 · Updated 7 months ago
- 🧱 Modula software package ☆204 · Updated 3 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆78 · Updated last month
- ☆53 · Updated 9 months ago
- Deep Networks Grok All the Time and Here is Why ☆37 · Updated last year
- A MAD laboratory to improve AI architecture designs 🧪 ☆123 · Updated 7 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆83 · Updated 3 weeks ago
- Minimal (truly) muP implementation, consistent with the notation of the TP4 and TP5 papers ☆14 · Updated last month
- ☆80 · Updated last year
- Self-contained PyTorch implementation of a Sinkhorn-based router, for mixture of experts or otherwise ☆36 · Updated 10 months ago
- Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024] ☆66 · Updated 9 months ago
- WIP ☆93 · Updated 11 months ago
- The simplest, fastest repository for training/finetuning medium-sized GPTs. ☆147 · Updated 2 weeks ago
- ☆110 · Updated last month
- Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" ☆75 · Updated 8 months ago
- Code accompanying the paper "Generalized Interpolating Discrete Diffusion" ☆94 · Updated last month
- Experiment of using Tangent to autodiff Triton ☆79 · Updated last year
- A basic pure-PyTorch implementation of FlashAttention ☆16 · Updated 8 months ago
- ☆32 · Updated 9 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer (a minimal sketch of the Muon update follows this list) ☆142 · Updated last month
- [ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate" ☆116 · Updated last week
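
Several entries above are matrix-preconditioned or second-order optimizers (PSGD Kron, Flash-Muon). For orientation, below is a minimal PyTorch sketch of the core Muon update: a momentum step followed by approximate orthogonalization of the gradient matrix via a quintic Newton-Schulz iteration. The iteration coefficients follow the widely circulated reference implementation, but `newton_schulz5` and `muon_step` are illustrative names, not the API of any repository listed here.

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D matrix with a quintic
    Newton-Schulz iteration (coefficients from the public Muon
    reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)           # bring the spectral norm below 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One illustrative Muon update for a single 2-D weight matrix:
    Nesterov-style momentum, then an orthogonalized descent step."""
    momentum_buf.mul_(beta).add_(grad)
    update = newton_schulz5(grad.add(momentum_buf, alpha=beta))
    weight.add_(update, alpha=-lr)
```

The orthogonalization step is what distinguishes Muon from plain SGD with momentum: it equalizes the singular values of the update direction, and it is also the part that kernel-level implementations such as Flash-Muon aim to accelerate.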