Arongil / lipschitz-transformers

Don't just regulate gradients like in Muon, regulate the weights too

☆27

Alternatives and similar repositories for lipschitz-transformers

Users who are interested in lipschitz-transformers are comparing it to the repositories listed below

martin-marek / batch-size
📄 Small Batch Size Training for Language Models
☆62 Updated last week
fal-ai-community / minDDPD
☆33 Updated 9 months ago
ChenWu98 / algorithmic-creativity
[ICML 2025] Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction
☆71 Updated 4 months ago
shikaiqiu / compute-better-spent
☆58 Updated last year
dvruette / barrel-rec-pytorch
☆53 Updated last year
fal-ai / diffusion-speedrun
Focused on fast experimentation and simplicity
☆75 Updated 9 months ago
cloneofsimo / scaling-guide
WIP
☆93 Updated last year
thinking-machines-lab / manifolds
Supporting code for the blog post on modular manifolds.
☆71 Updated 2 weeks ago
SHI-Labs / CompactNet
☆32 Updated last year
p-doom / jasmine
A simple, performant and scalable JAX-based world modeling codebase
☆76 Updated last week
smonsays / hypernetwork-attention
Official code for the paper "Attention as a Hypernetwork"
☆43 Updated last year
Cranial-XIX / longhorn
Official PyTorch Implementation of the Longhorn Deep State Space Model
☆55 Updated 10 months ago
corl-team / lime
Official implementation of the paper "You Do Not Fully Utilize Transformer's Representation Capacity"
☆31 Updated 4 months ago
cloneofsimo / zeroshampoo
☆34 Updated last year
fal-ai-community / NativeSparseAttention
Research implementation of Native Sparse Attention (2502.11089)
☆61 Updated 7 months ago
wmn-231314 / diffusion-data-constraint
Official PyTorch implementation and models for paper "Diffusion Beats Autoregressive in Data-Constrained Settings". We find diffusion mod…
☆101 Updated last month
lucidrains / GAF-microbatch-pytorch
Implementation of Gradient Agreement Filtering, from Chaubard et al. of Stanford, but for single-machine microbatches, in PyTorch
☆25 Updated 8 months ago
cloneofsimo / min-max-gpt
Minimal (400 LOC) implementation of maximum (multi-node, FSDP) GPT training
☆132 Updated last year
ethansmith2000 / fsdp_optimizers
Supporting PyTorch FSDP for optimizers
☆83 Updated 10 months ago
edwardmilsom / function-space-learning-rates-paper
Code for the paper "Function-Space Learning Rates"
☆23 Updated 4 months ago
google-deepmind / spectral_ssm
☆34 Updated last year
zhixuan-lin / forgetting-transformer
[ICLR 2025 & COLM 2025] Official PyTorch implementation of the Forgetting Transformer and Adaptive Computation Pruning
☆131 Updated 2 weeks ago
radarFudan / mamba-minimal-jax
☆33 Updated 10 months ago
lucidrains / taylor-series-linear-attention
Explorations into the recently proposed Taylor Series Linear Attention
☆100 Updated last year
gregorbachmann / scaling_mlps
☆52 Updated last year
test-time-training / ttt-tk
☆41 Updated 6 months ago
cloneofsimo / efae
☆23 Updated last year
amorehead / jvp_flash_attention
Flash Attention Triton kernel with support for second-order derivatives
☆98 Updated this week
SamsungSAILMontreal / nino
Code for "Accelerating Training with Neuron Interaction and Nowcasting Networks" [to appear at ICLR 2025]
☆20 Updated last week
dayal-kalra / low-memory-adam
☆13 Updated 7 months ago