zyushun / hessian-spectrum
Code for the paper: Why Transformers Need Adam: A Hessian Perspective
☆56 · Updated last month
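For context on what a repository like this computes: the Hessian of a transformer's loss is far too large to materialize, so spectrum estimators are typically built on Hessian-vector products (double backpropagation). Below is a minimal, illustrative PyTorch sketch, not taken from this repository and with all names hypothetical, that uses power iteration on Hessian-vector products to estimate the largest-magnitude Hessian eigenvalue of a loss:

```python
# Illustrative sketch only (not code from zyushun/hessian-spectrum).
import torch

def hvp(loss, params, vec):
    # First backward pass with create_graph=True makes the gradient itself
    # differentiable; the second pass differentiates (grad . vec).
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad(flat @ vec, params, retain_graph=True)
    return torch.cat([h.reshape(-1) for h in hv])

def top_hessian_eigenvalue(loss, params, iters=20):
    # Power iteration: repeatedly apply the Hessian to a random unit vector;
    # converges to the largest-magnitude eigenvalue.
    n = sum(p.numel() for p in params)
    v = torch.randn(n)
    v /= v.norm()
    eig = 0.0
    for _ in range(iters):
        hv = hvp(loss, params, v)
        eig = torch.dot(v, hv).item()  # Rayleigh quotient (v is unit norm)
        v = hv / (hv.norm() + 1e-12)
    return eig

# Toy usage: top curvature of an MSE loss for a tiny linear model.
model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
print(top_hessian_eigenvalue(loss, list(model.parameters())))
```

Full spectral-density estimators (e.g., stochastic Lanczos quadrature, which the paper's setting calls for) replace the power iteration with a Lanczos recursion over the same Hessian-vector-product primitive.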
Alternatives and similar repositories for hessian-spectrum:
Users interested in hessian-spectrum are also comparing it to the repositories listed below.
- ☆67 · Updated 4 months ago
- Stick-breaking attention ☆50 · Updated last month
- SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining (NeurIPS 2024) ☆30 · Updated 5 months ago
- ☆51 · Updated 10 months ago
- Compressible Dynamics in Deep Overparameterized Low-Rank Learning & Adaptation (ICML'24 Oral) ☆14 · Updated 8 months ago
- [ICML 2024] Official code for the paper "Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark". ☆97 · Updated 9 months ago
- Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024] ☆65 · Updated 6 months ago
- [ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate" ☆90 · Updated last week
- ☆31 · Updated last year
- ☆67 · Updated last month
- ☆27 · Updated last month
- Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023) ☆80 · Updated last year
- [ICLR 2023] Eva: Practical Second-order Optimization with Kronecker-vectorized Approximation ☆12 · Updated last year
- ☆18 · Updated last year
- Code for testing DCT plus Sparse (DCTpS) networks ☆14 · Updated 3 years ago
- Welcome to the 'In Context Learning Theory' Reading Group ☆26 · Updated 5 months ago
- Implementation of CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation ☆17 · Updated last month
- ☆31 · Updated last year
- A fusion of a linear layer and a cross-entropy loss, written for PyTorch in Triton. ☆65 · Updated 8 months ago
- LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning ☆30 · Updated last year
- Code accompanying the paper "Massive Activations in Large Language Models" ☆154 · Updated last year
- [ICLR 2025] Code for the paper "Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning" ☆47 · Updated 2 months ago
- [ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal… ☆49 · Updated 2 years ago
- ☆10 · Updated 3 months ago
- Preprint: Asymmetry in Low-Rank Adapters of Foundation Models ☆35 · Updated last year
- Source code for the paper "Riemannian Preconditioned LoRA for Fine-Tuning Foundation Models" ☆24 · Updated 9 months ago
- [ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule ☆152 · Updated 3 weeks ago
- ☆25 · Updated last year
- ☆74 · Updated 3 weeks ago
- Sharpness-Aware Minimization Leads to Low-Rank Features [NeurIPS 2023] ☆28 · Updated last year