goodfire-ai / r1-interpretability

Open source interpretability artefacts for R1.

☆163

Alternatives and similar repositories for r1-interpretability

Users who are interested in r1-interpretability are comparing it to the repositories listed below.

TransluceAI / observatory
A toolkit for describing model features and intervening on those features to steer behavior.
☆216 Updated last year
google-deepmind / mishax
☆144 Updated 2 months ago
LeonGuertler / TextArena
A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning
☆319 Updated last month
METR / RE-Bench
☆119 Updated last month
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆228 Updated this week
ckkissane / crosscoder-model-diff-replication
Open source replication of Anthropic's Crosscoders for Model Diffing
☆62 Updated last year
OSU-NLP-Group / GrokkedTransformer
Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization'
☆234 Updated 4 months ago
ScalingIntelligence / Archon
Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file.
☆189 Updated 8 months ago
PrimeIntellect-ai / prime-environments
Training-Ready RL Environments + Evals
☆182 Updated this week
steering-vectors / steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
☆130 Updated 9 months ago
kanishkg / stream-of-search
Repository for the paper Stream of Search: Learning to Search in Language
☆151 Updated 10 months ago
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆231 Updated 11 months ago
LeonGuertler / UnstableBaselines
☆106 Updated last month
emergent-misalignment / emergent-misalignment
☆229 Updated this week
casper-hansen / OpenCoconut
OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding.
☆173 Updated 10 months ago
da03 / Internalize_CoT_Step_by_Step
☆199 Updated 7 months ago
ekinakyurek / marc
Public repository for "The Surprising Effectiveness of Test-Time Training for Abstract Reasoning"
☆340 Updated 3 weeks ago
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆150 Updated 5 months ago
google-deepmind / latent-multi-hop-reasoning
[ACL 2024] Do Large Language Models Latently Perform Multi-Hop Reasoning?
☆84 Updated 8 months ago
mcleish7 / arithmetic
Code to reproduce "Transformers Can Do Arithmetic with the Right Embeddings", McLeish et al (NeurIPS 2024)
☆194 Updated last year
SalesforceAIResearch / LaTRO
☆124 Updated 9 months ago
interp-reasoning / thought-anchors
⚓️ Repository for the "Thought Anchors: Which LLM Reasoning Steps Matter?" paper.
☆92 Updated last month
neelnanda-io / Crosscoders
☆58 Updated last year
tilde-research / sieve
Applying SAEs for fine-grained control
☆24 Updated 11 months ago
tokenbender / avataRL
rl from zero pretrain, can it be done? yes.
☆281 Updated 2 months ago
jacobdunefsky / transcoder_circuits
☆189 Updated last year
JoshEngels / MultiDimensionalFeatures
Code for reproducing our paper "Not All Language Model Features Are Linear"
☆84 Updated last year
PrimeIntellect-ai / genesys
☆136 Updated 8 months ago
JackCai1206 / arithmetic-self-improve
☆37 Updated 9 months ago
hijohnnylin / neuronpedia
open source interpretability platform 🧠
☆509 Updated last week