zyushun / hessian-spectrum
Code for the paper: Why Transformers Need Adam: A Hessian Perspective
☆62 · Updated 5 months ago
Alternatives and similar repositories for hessian-spectrum
Users interested in hessian-spectrum are comparing it to the repositories listed below.
- Stick-breaking attention · ☆59 · Updated last month
- Physics of Language Models, Part 4 · ☆232 · Updated 3 weeks ago
- Code accompanying the paper "Massive Activations in Large Language Models" · ☆176 · Updated last year
- The code for creating the iGSM datasets in the papers "Physics of Language Models Part 2.1, Grade-School Math and the Hidden Reasoning Proces…" · ☆74 · Updated 7 months ago
- ☆70 · Updated 8 months ago
- [ICLR 2025 & COLM 2025] Official PyTorch implementation of the Forgetting Transformer and Adaptive Computation Pruning · ☆124 · Updated last week
- [ICLR 2025] When Attention Sink Emerges in Language Models: An Empirical View (Spotlight) · ☆116 · Updated last month
- SLTrain: a sparse plus low-rank approach for parameter- and memory-efficient pretraining (NeurIPS 2024) · ☆32 · Updated 9 months ago
- ☆80 · Updated 6 months ago
- Code for the ICLR 2025 paper "What is Wrong with Perplexity for Long-context Language Modeling?" · ☆95 · Updated last month
- A fusion of a linear layer and a cross-entropy loss, written for PyTorch in Triton · ☆70 · Updated last year
- Code for "Reasoning to Learn from Latent Thoughts" · ☆116 · Updated 4 months ago
- [ICLR 2025] Code for the paper "Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning" · ☆70 · Updated 6 months ago
- ☆91 · Updated last year
- Revisiting Efficient Training Algorithms for Transformer-based Language Models (NeurIPS 2023) · ☆81 · Updated last year
- Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024] · ☆67 · Updated 11 months ago
- 📄 Small Batch Size Training for Language Models · ☆43 · Updated this week
- [ICLR 2025] Official PyTorch implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule · ☆199 · Updated 5 months ago
- ☆238 · Updated last year
- Code for the NeurIPS 2024 Spotlight "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" · ☆81 · Updated 9 months ago
- Inference speed benchmark for "Learning to (Learn at Test Time): RNNs with Expressive Hidden States" · ☆72 · Updated last year
- 🔥 A minimal training framework for scaling FLA models · ☆233 · Updated last week
- Implementation of CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation · ☆23 · Updated 6 months ago
- ☆53 · Updated last year
- LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning · ☆35 · Updated last year
- [ICML 2024] Official code for the paper "Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark" · ☆110 · Updated last month
- Flash-Muon: An Efficient Implementation of the Muon Optimizer · ☆160 · Updated 2 months ago
- Kinetics: Rethinking Test-Time Scaling Laws · ☆76 · Updated last month
- ☆33 · Updated last year
- ☆148 · Updated 2 years ago