neelnanda-io / Grokking

A Mechanistic Interpretability Analysis of Grokking

☆23

Alternatives and similar repositories for Grokking

Users interested in Grokking are comparing it to the repositories listed below.

anthropics / toy-models-of-superposition
Notebooks accompanying Anthropic's "Toy Models of Superposition" paper
☆130 · Updated 3 years ago

google-deepmind / mishax
☆143 · Updated 2 months ago

neelnanda-io / Neuroscope
Accompanying codebase for neuroscope.io, a website for displaying max activating dataset examples for language model neurons
☆13 · Updated 2 years ago

ApolloResearch / e2e_sae
Sparse Autoencoder Training Library
☆55 · Updated 6 months ago

google-deepmind / dangerous-capability-evaluations
☆62 · Updated last month

thestephencasper / everything-you-need
we got you bro
☆36 · Updated last year

EleutherAI / elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆28 · Updated last year

mcleish7 / arithmetic
Code to reproduce "Transformers Can Do Arithmetic with the Right Embeddings", McLeish et al. (NeurIPS 2024)
☆193 · Updated last year

apartresearch / interpretability-starter
🧠 Starter templates for doing interpretability research
☆75 · Updated 2 years ago

anthropics / sleeper-agents-paper
Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".
☆122 · Updated last year

METR / RE-Bench
☆117 · Updated last month

callummcdougall / sae_visualizer
☆29 · Updated last year

neelnanda-io / 1L-Sparse-Autoencoder
☆132 · Updated 2 years ago

timaeus-research / devinterp
Tools for studying developmental interpretability in neural networks.
☆114 · Updated 4 months ago

amack315 / unsupervised-steering-vectors
☆36 · Updated last year

aadityasingh / icl-dynamics
☆24 · Updated 7 months ago

bilal-chughtai / rep-theory-mech-interp
☆27 · Updated 2 years ago

ckkissane / crosscoder-model-diff-replication
Open source replication of Anthropic's Crosscoders for Model Diffing
☆60 · Updated last year

callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆227 · Updated 11 months ago

JasonGross / guarantees-based-mechanistic-interpretability
☆17 · Updated last week

goodfire-ai / r1-interpretability
Open source interpretability artefacts for R1.
☆163 · Updated 7 months ago

epfml / llm-baselines
nanoGPT-like codebase for LLM training
☆109 · Updated 2 weeks ago

goodfire-ai / scribe
☆53 · Updated last month

steering-vectors / steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
☆129 · Updated 8 months ago

Silent-Zebra / twisted-smc-lm
☆31 · Updated 7 months ago

jbloomAus / SAEDashboard
☆79 · Updated last month

ApolloResearch / sample
Repository with sample code using Apollo's suggested engineering practices
☆13 · Updated 11 months ago

JoshEngels / MultiDimensionalFeatures
Code for reproducing our paper "Not All Language Model Features Are Linear"
☆84 · Updated 11 months ago

EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆225 · Updated last week

KindXiaoming / Omnigrok
Omnigrok: Grokking Beyond Algorithmic Data
☆62 · Updated 2 years ago