riverstone496 / awesome-second-order-optimization
☆27 · Updated last year
Alternatives and similar repositories for awesome-second-order-optimization
Users interested in awesome-second-order-optimization are comparing it to the repositories listed below.
- Implementations of attention with the softpick function, naive and FlashAttention-2 ☆80 · Updated 2 months ago
- Explorations into the recently proposed Taylor Series Linear Attention ☆99 · Updated 10 months ago
- Code for the paper "Function-Space Learning Rates" ☆20 · Updated last month
- $100K or 100 Days: Trade-offs when Pre-Training with Academic Resources ☆140 · Updated last month
- Code and weights for the paper "Cluster and Predict Latent Patches for Improved Masked Image Modeling" ☆112 · Updated 3 months ago
- An implementation of PSGD Kron second-order optimizer for PyTorch ☆92 · Updated 3 months ago
- ☆197 · Updated 7 months ago
- Supporting PyTorch FSDP for optimizers ☆82 · Updated 7 months ago
- ☆43 · Updated last month
- ☆55 · Updated 7 months ago
- 🧱 Modula software package ☆204 · Updated 3 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆78 · Updated last month
- ☆53 · Updated 9 months ago
- Deep Networks Grok All the Time and Here is Why ☆37 · Updated last year
- A MAD laboratory to improve AI architecture designs 🧪 ☆123 · Updated 7 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆83 · Updated 3 weeks ago
- Minimal (truly) muP implementation, consistent with the notation of the TP4 and TP5 papers ☆14 · Updated last month
- ☆80 · Updated last year
- Self-contained PyTorch implementation of a Sinkhorn-based router, for mixture of experts or otherwise ☆36 · Updated 10 months ago
- Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024] ☆66 · Updated 9 months ago
- WIP ☆93 · Updated 11 months ago
- The simplest, fastest repository for training/finetuning medium-sized GPTs. ☆147 · Updated 2 weeks ago
- ☆110 · Updated last month
- Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" ☆75 · Updated 8 months ago
- Code accompanying the paper "Generalized Interpolating Discrete Diffusion" ☆94 · Updated last month
- Experiment of using Tangent to autodiff Triton ☆79 · Updated last year
- A basic pure-PyTorch implementation of FlashAttention ☆16 · Updated 8 months ago
- ☆32 · Updated 9 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer (a minimal sketch of the Muon update follows this list) ☆142 · Updated last month
- [ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate" ☆116 · Updated last week
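
Several entries above are matrix-preconditioned or second-order optimizers (PSGD Kron, Flash-Muon). For orientation, below is a minimal PyTorch sketch of the core Muon update: a momentum step followed by approximate orthogonalization of the gradient matrix via a quintic Newton-Schulz iteration. The iteration coefficients follow the widely circulated reference implementation, but `newton_schulz5` and `muon_step` are illustrative names, not the API of any repository listed here.

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D matrix with a quintic
    Newton-Schulz iteration (coefficients from the public Muon
    reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)           # bring the spectral norm below 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One illustrative Muon update for a single 2-D weight matrix:
    Nesterov-style momentum, then an orthogonalized descent step."""
    momentum_buf.mul_(beta).add_(grad)
    update = newton_schulz5(grad.add(momentum_buf, alpha=beta))
    weight.add_(update, alpha=-lr)
```

The orthogonalization step is what distinguishes Muon from plain SGD with momentum: it equalizes the singular values of the update direction, and it is also the part that kernel-level implementations such as Flash-Muon aim to accelerate.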