tml-epfl / why-weight-decay
Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024]
☆58 · Updated 3 months ago
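For context only, weight decay is usually applied through the optimizer; below is a minimal PyTorch sketch of decoupled weight decay via AdamW. This is not the repository's own code, and the model, learning rate, and decay coefficient are placeholder assumptions for illustration.

```python
import torch
import torch.nn as nn

# Placeholder model; any nn.Module is handled the same way.
model = nn.Linear(128, 10)

# Decoupled weight decay (AdamW): the decay term is applied directly to the
# weights at each step rather than folded into the gradient as L2 regularization.
# lr and weight_decay here are illustrative values, not recommendations.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

# One illustrative training step on random data.
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```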
Alternatives and similar repositories for why-weight-decay:
Users interested in why-weight-decay are comparing it to the repositories listed below.
- Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023) ☆80 · Updated last year
- Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" ☆66 · Updated 2 months ago
- ☆50 · Updated 3 months ago
- Sequence Modeling with Multiresolution Convolutional Memory (ICML 2023) ☆121 · Updated last year
- Code for the paper "Why Transformers Need Adam: A Hessian Perspective" ☆47 · Updated 8 months ago
- ☆51 · Updated 7 months ago
- A fusion of a linear layer and a cross-entropy loss, written for PyTorch in Triton ☆61 · Updated 5 months ago
- HGRN2: Gated Linear RNNs with State Expansion ☆52 · Updated 4 months ago
- Stick-breaking attention ☆41 · Updated this week
- Blog post ☆16 · Updated 11 months ago
- Official code for the paper "Attention as a Hypernetwork" ☆23 · Updated 6 months ago
- [NeurIPS 2022] Your Transformer May Not Be as Powerful as You Expect (official implementation) ☆34 · Updated last year
- Yet another random morning idea to be quickly tried and architecture shared if it works; to allow the transformer to pause for any amount… ☆51 · Updated last year
- Explorations into the recently proposed Taylor Series Linear Attention ☆91 · Updated 4 months ago
- ☆15 · Updated last year
- Latest Weight Averaging (NeurIPS HITY 2022) ☆28 · Updated last year
- nanoGPT-like codebase for LLM training ☆83 · Updated this week
- Code for the paper "Data Feedback Loops: Model-driven Amplification of Dataset Biases" ☆15 · Updated 2 years ago
- ☆37 · Updated 9 months ago
- An implementation of the PSGD Kron second-order optimizer for PyTorch ☆21 · Updated 2 weeks ago
- Official repository of the paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval" ☆25 · Updated 9 months ago
- ☆23 · Updated 2 months ago
- PyTorch implementation of "Long Horizon Temperature Scaling" (ICML 2023) ☆20 · Updated last year
- Implementation of the GateLoop Transformer in PyTorch and JAX ☆87 · Updated 6 months ago
- Implementation of the Kalman Filtering Attention proposed in "Kalman Filtering Attention for User Behavior Modeling in CTR Prediction" ☆57 · Updated last year
- [ICML 2024] SINGD: KFAC-like Structured Inverse-Free Natural Gradient Descent (http://arxiv.org/abs/2312.05705) ☆21 · Updated 2 months ago
- ☆32 · Updated last year
- Replicating and dissecting the git-re-basin project in one-click-replication Colabs ☆36 · Updated 2 years ago
- ☆44 · Updated last year
- Triton implementation of the HyperAttention algorithm ☆46 · Updated last year