zyushun / hessian-spectrum
Code for the paper: Why Transformers Need Adam: A Hessian Perspective
Related projects
Alternatives and complementary repositories for hessian-spectrum
- Stick-breaking attention
- Code accompanying the paper "Massive Activations in Large Language Models"
- Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023)
- Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024]
- SLTrain: a sparse plus low-rank approach for parameter- and memory-efficient pretraining (NeurIPS 2024)
- Code for "Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes"☆28Updated 7 months ago
- A fusion of a linear layer and a cross-entropy loss, written for PyTorch in Triton
- [ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal…☆44Updated last year
- Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding"☆109Updated 8 months ago
- [ATTRIB @ NeurIPS 2024] When Attention Sink Emerges in Language Models: An Empirical View
- Official repository of the paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval"
- [ICLR 2023] Eva: Practical Second-order Optimization with Kronecker-vectorized Approximation
- Sparse Backpropagation for Mixture-of-Expert Training
- Official implementation of Phi-Mamba. A MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Mode…
- [ICML 2024] Official code for the paper "Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark"
- [NeurIPS'24 Spotlight] Observational Scaling Laws
- Code for studying the super weight in LLMs
- Activation-aware Singular Value Decomposition for Compressing Large Language Models