zyushun / hessian-spectrum
Code for the paper: Why Transformers Need Adam: A Hessian Perspective
Related projects
Alternatives and complementary repositories for hessian-spectrum
- Stick-breaking attention
- Code accompanying the paper "Massive Activations in Large Language Models"
- Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023)
- Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024]
- SLTrain: a sparse plus low-rank approach for parameter- and memory-efficient pretraining (NeurIPS 2024)
- Code for "Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes"☆28Updated 7 months ago
- A fusion of a linear layer and a cross-entropy loss, written for PyTorch in Triton
- [ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal…☆44Updated last year
- Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding"☆109Updated 8 months ago
- [ATTRIB @ NeurIPS 2024] When Attention Sink Emerges in Language Models: An Empirical View
- Official repository of the paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval"
- [ICLR 2023] Eva: Practical Second-order Optimization with Kronecker-vectorized Approximation
- Sparse Backpropagation for Mixture-of-Expert Training
- Official implementation of Phi-Mamba. A MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Mode…
- [ICML 2024] Official code for the paper "Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark"
- [NeurIPS'24 Spotlight] Observational Scaling Laws
- Code for studying the super weight in LLMs
- Activation-aware Singular Value Decomposition for Compressing Large Language Models