zyushun / hessian-spectrum
Code for the paper: Why Transformers Need Adam: A Hessian Perspective
☆53 · Updated last week
Alternatives and similar repositories for hessian-spectrum:
Users interested in hessian-spectrum are comparing it to the repositories listed below.
- SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining (NeurIPS 2024) ☆30 · Updated 4 months ago
- Welcome to the 'In Context Learning Theory' Reading Group ☆28 · Updated 4 months ago
- ☆65 · Updated 3 months ago
- Compressible Dynamics in Deep Overparameterized Low-Rank Learning & Adaptation (ICML'24 Oral) ☆14 · Updated 8 months ago
- LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning ☆28 · Updated 11 months ago
- Source code of "Task arithmetic in the tangent space: Improved editing of pre-trained models". ☆97 · Updated last year
- [ICML 2024] Official code for the paper "Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark". ☆91 · Updated 8 months ago
- [ICLR 2023] Eva: Practical Second-order Optimization with Kronecker-vectorized Approximation ☆12 · Updated last year
- Preprint: Asymmetry in Low-Rank Adapters of Foundation Models ☆35 · Updated last year
- A fusion of a linear layer and a cross-entropy loss, written for PyTorch in Triton. ☆65 · Updated 7 months ago
- Code accompanying the paper "Massive Activations in Large Language Models" ☆149 · Updated last year
- Neural Tangent Kernel Papers ☆106 · Updated 2 months ago
- ☆51 · Updated 10 months ago
- ☆29 · Updated last year
- ☆101 · Updated last year
- Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024] ☆63 · Updated 5 months ago
- Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023) ☆79 · Updated last year
- PyTorch code for experiments on Linear Transformers ☆20 · Updated last year
- 🔥 A minimal training framework for scaling FLA models ☆82 · Updated this week
- [NeurIPS'24 Spotlight] Observational Scaling Laws ☆53 · Updated 5 months ago
- [ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal… ☆48 · Updated 2 years ago
- Stick-breaking attention ☆48 · Updated last week
- Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" ☆71 · Updated 4 months ago
- [NeurIPS 2023 Spotlight] Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training ☆34 · Updated last year
- Source code of "What can linearized neural networks actually say about generalization?" ☆20 · Updated 3 years ago
- ☆24 · Updated last year
- Official repository for our paper, Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Mode… ☆15 · Updated 4 months ago
- Official JAX Implementation of MD4 Masked Diffusion Models ☆67 · Updated 3 weeks ago
- ☆91 · Updated last year