tml-epfl / why-weight-decayLinks

Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024]

☆69

Alternatives and similar repositories for why-weight-decay

Users that are interested in why-weight-decay are comparing it to the libraries listed below

Sorting:

thjashin / multires-conv
Sequence Modeling with Multiresolution Convolutional Memory (ICML 2023)
☆127Updated 2 years ago
epfml / schedules-and-scaling
Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations"
☆85Updated last year
lucidrains / taylor-series-linear-attention
Explorations into the recently proposed Taylor Series Linear Attention
☆100Updated last year
stanislavfort / dissect-git-re-basin
Replicating and dissecting the git-re-basin project in one-click-replication Colabs
☆36Updated 3 years ago
lucidrains / infini-transformer-pytorch
Implementation of Infini-Transformer in Pytorch
☆113Updated 11 months ago
srush / mamba-primer
☆38Updated last year
sustcsonglin / mamba-triton
☆50Updated last year
srush / mamba-scans
Blog post
☆17Updated last year
LIONS-EPFL / scion
☆49Updated last month
amirzandieh / HyperAttention
Triton Implementation of HyperAttention Algorithm
☆48Updated last year
gregorbachmann / scaling_mlps
☆52Updated last year
shikaiqiu / compute-better-spent
☆62Updated last year
sustcsonglin / gated_linear_attention_layer
☆32Updated last year
smonsays / hypernetwork-attention
Official code for the paper "Attention as a Hypernetwork"
☆46Updated last year
JeanKaddour / NoTrainNoGain
Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023)
☆81Updated 2 years ago
BlinkDL / LinearAttentionArena
Here we will test various linear attention designs.
☆62Updated last year
lucidrains / pause-transformer
Yet another random morning idea to be quickly tried and architecture shared if it works; to allow the transformer to pause for any amount…
☆53Updated 2 years ago
berlino / seq_icl
☆53Updated last year
JonasGeiping / linear_cross_entropy_loss
A fusion of a linear layer and a cross entropy loss, written for pytorch in triton.
☆73Updated last year
shreyansh26 / Attention-Mask-Patterns
Using FlexAttention to compute attention with different masking patterns
☆47Updated last year
JeanKaddour / LAWA
Latest Weight Averaging (NeurIPS HITY 2022)
☆32Updated 2 years ago
lucidrains / autoregressive-linear-attention-cuda
CUDA implementation of autoregressive linear attention, with all the latest research findings
☆46Updated 2 years ago
lsj2408 / URPE
[NeurIPS 2022] Your Transformer May Not be as Powerful as You Expect (official implementation)
☆34Updated 2 years ago
OpenNLPLab / HGRN2
HGRN2: Gated Linear RNNs with State Expansion
☆55Updated last year
shawntan / stickbreaking-attention
Stick-breaking attention
☆61Updated 5 months ago
lucidrains / gateloop-transformer
Implementation of GateLoop Transformer in Pytorch and Jax
☆91Updated last year
OpenNLPLab / HGRN
[NeurIPS 2023 spotlight] Official implementation of HGRN in our NeurIPS 2023 paper - Hierarchically Gated Recurrent Neural Network for Se…
☆66Updated last year
lucidrains / gated-state-spaces-pytorch
Implementation of Gated State Spaces, from the paper "Long Range Language Modeling via Gated State Spaces", in Pytorch
☆101Updated 2 years ago
zyushun / hessian-spectrum
Code for the paper: Why Transformers Need Adam: A Hessian Perspective
☆63Updated 8 months ago
maximzubkov / fft-scan
Efficient PScan implementation in PyTorch
☆17Updated last year