HumanCompatibleAI / leela-interp
Code for "Evidence of Learned Look-Ahead in a Chess-Playing Neural Network"
☆20Updated 10 months ago
Alternatives and similar repositories for leela-interp:
Users who are interested in leela-interp are comparing it to the libraries listed below.
- Code for minimum-entropy coupling.☆31Updated 9 months ago
- This repo is built to facilitate the training and analysis of autoregressive transformers on maze-solving tasks.☆27Updated 7 months ago
- Sparse Autoencoder Training Library☆48Updated 5 months ago
- PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind)☆18Updated 2 months ago
- ☆26Updated last year
- ☆22Updated 2 months ago
- Scaling scaling laws with board games.☆48Updated last year
- Sparse and discrete interpretability tool for neural networks☆62Updated last year
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…☆26Updated 10 months ago
- ☆35Updated last month
- Experiments with representation engineering☆11Updated last year
- gzip Predicts Data-dependent Scaling Laws☆34Updated 10 months ago
- Official Code for our paper: "Language Models Learn to Mislead Humans via RLHF"☆11Updated 6 months ago
- Minimal but scalable implementation of large language models in JAX☆34Updated 5 months ago
- Accompanying codebase for neuroscope.io, a website for displaying max activating dataset examples for language model neurons☆12Updated 2 years ago
- ☆30Updated 11 months ago
- ☆18Updated 11 months ago
- ☆16Updated 4 months ago
- ☆36Updated 5 months ago
- ☆26Updated last year
- ☆13Updated 9 months ago
- Redwood Research's transformer interpretability tools☆14Updated 3 years ago
- 🔬 Interpretability for Leela Chess Zero networks.☆12Updated 2 weeks ago
- Notebooks accompanying Anthropic's "Toy Models of Superposition" paper☆119Updated 2 years ago
- Learn online intrinsic rewards from LLM feedback☆35Updated 4 months ago
- Generative cellular automaton-like learning environments for RL.☆19Updated 2 months ago
- ☆51Updated 10 months ago
- Open source replication of Anthropic's Crosscoders for Model Diffing☆52Updated 5 months ago
- Applying SAEs for fine-grained control☆17Updated 4 months ago
- Mechanistic Interpretability for Transformer Models☆50Updated 2 years ago