zyushun / hessian-spectrum
Code for the paper: Why Transformers Need Adam: A Hessian Perspective
☆49 · Updated 9 months ago
Alternatives and similar repositories for hessian-spectrum:
Users interested in hessian-spectrum are comparing it to the repositories listed below.
- ☆63 · Updated 2 months ago
- A fusion of a linear layer and a cross-entropy loss, written for PyTorch in Triton. ☆62 · Updated 6 months ago
- Source code of "Task arithmetic in the tangent space: Improved editing of pre-trained models". ☆94 · Updated last year
- Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024] ☆60 · Updated 4 months ago
- Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023) ☆79 · Updated last year
- [ICML 2024] Official code for the paper "Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark". ☆86 · Updated 7 months ago
- ☆51 · Updated 9 months ago
- Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" ☆70 · Updated 3 months ago
- Welcome to the 'In Context Learning Theory' Reading Group ☆28 · Updated 3 months ago
- Code accompanying the paper "Massive Activations in Large Language Models" ☆140 · Updated 11 months ago
- SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining (NeurIPS 2024) ☆30 · Updated 3 months ago
- ☆80 · Updated 11 months ago
- nanoGPT-like codebase for LLM training ☆89 · Updated this week
- ☆23 · Updated last year
- Stick-breaking attention ☆43 · Updated last month
- Deep Learning & Information Bottleneck ☆56 · Updated last year
- [ICLR 2023] Eva: Practical Second-order Optimization with Kronecker-vectorized Approximation ☆12 · Updated last year
- [ICML 2024] Junk DNA Hypothesis: A Task-Centric Angle of LLM Pre-trained Weights through Sparsity; Lu Yin*, Ajay Jaiswal*, Shiwei Liu, So… ☆16 · Updated 8 months ago
- ☆28 · Updated last year
- Towards Understanding Sharpness-Aware Minimization [ICML 2022] ☆35 · Updated 2 years ago
- ☆77 · Updated last year
- [NeurIPS'24 Spotlight] Observational Scaling Laws ☆50 · Updated 4 months ago
- ☆17 · Updated 8 months ago
- Visualization of mean field and neural tangent kernel regime ☆21 · Updated 6 months ago
- Code for "Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes" ☆27 · Updated 10 months ago
- Preprint: Asymmetry in Low-Rank Adapters of Foundation Models ☆34 · Updated 11 months ago
- ☆86 · Updated last year
- [ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal… ☆48 · Updated last year
- ☆52 · Updated 4 months ago
- ☆34 · Updated last year