zyushun / hessian-spectrum
Code for the paper: Why Transformers Need Adam: A Hessian Perspective
☆51 · Updated this week
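The paper studies the eigenvalue spectrum of the training-loss Hessian to explain why transformers favor Adam over SGD. As a hypothetical illustration of the underlying technique (not this repository's actual code), the sketch below estimates the dominant Hessian eigenvalue with Hessian-vector products and power iteration in PyTorch; the function name and toy model are invented for the example.

```python
import torch

def top_hessian_eigenvalue(loss, params, iters=50):
    """Estimate the largest-magnitude Hessian eigenvalue of `loss` w.r.t. `params`
    via power iteration on Hessian-vector products (a minimal sketch)."""
    # First-order gradients with create_graph=True so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Random unit-norm starting direction, spread across the parameter tensors.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x * x).sum() for x in v))
    v = [x / norm for x in v]
    eig = 0.0
    for _ in range(iters):
        # Hessian-vector product: differentiating <grad, v> w.r.t. params gives H v.
        dot = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(dot, params, retain_graph=True)
        # Rayleigh quotient <v, Hv> (v is unit norm) approximates the eigenvalue.
        eig = sum((h * x).sum() for h, x in zip(hv, v)).item()
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / norm for h in hv]
    return eig

# Usage on a toy model (hypothetical example):
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
print(top_hessian_eigenvalue(loss, list(model.parameters())))
```

Lanczos iteration is typically used instead of plain power iteration when an estimate of the full spectrum, rather than just the top eigenvalue, is needed.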
Alternatives and similar repositories for hessian-spectrum:
Users interested in hessian-spectrum are comparing it to the repositories listed below.
- ☆65 · Updated 3 months ago
- SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining (NeurIPS 2024) · ☆30 · Updated 4 months ago
- [ICLR 2023] Eva: Practical Second-order Optimization with Kronecker-vectorized Approximation · ☆12 · Updated last year
- Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023) · ☆79 · Updated last year
- A fusion of a linear layer and a cross-entropy loss, written for PyTorch in Triton. · ☆63 · Updated 7 months ago
- Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024] · ☆61 · Updated 5 months ago
- LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning · ☆28 · Updated 11 months ago
- Stick-breaking attention · ☆48 · Updated this week
- ☆24 · Updated last year
- [ICML 2024] Official code for the paper "Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark". · ☆91 · Updated 8 months ago
- Preprint: Asymmetry in Low-Rank Adapters of Foundation Models · ☆35 · Updated last year
- [NeurIPS 2023 Spotlight] Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training · ☆32 · Updated last year
- Official implementation of Phi-Mamba, a MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Mode…) · ☆98 · Updated 6 months ago
- ☆25 · Updated 2 months ago
- Source code of "Task arithmetic in the tangent space: Improved editing of pre-trained models" · ☆97 · Updated last year
- Code accompanying the paper "Massive Activations in Large Language Models" · ☆148 · Updated last year
- ☆30 · Updated last year
- Welcome to the 'In Context Learning Theory' Reading Group · ☆28 · Updated 4 months ago
- ☆29 · Updated last year
- Code for "Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes" · ☆27 · Updated 11 months ago
- [ICML 2024 Spotlight] Fine-Tuning Pre-trained Large Language Models Sparsely · ☆21 · Updated 8 months ago
- Compressible Dynamics in Deep Overparameterized Low-Rank Learning & Adaptation (ICML'24 Oral) · ☆14 · Updated 7 months ago
- 🔥 A minimal training framework for scaling FLA models · ☆75 · Updated this week
- ☆51 · Updated 9 months ago
- [ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal… · ☆48 · Updated 2 years ago
- ☆90 · Updated last year
- [ICLR 2025] Code for the paper "Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning" · ☆38 · Updated last month
- [ICLR'24] "DeepZero: Scaling up Zeroth-Order Optimization for Deep Model Training" by Aochuan Chen*, Yimeng Zhang*, Jinghan Jia, James Di… · ☆52 · Updated 5 months ago
- [NeurIPS'24 Spotlight] Observational Scaling Laws · ☆53 · Updated 5 months ago