anthropics / attribution-graphs-frontend
https://transformer-circuits.pub/2025/attribution-graphs/methods.html
☆90 Updated 7 months ago
Alternatives and similar repositories for attribution-graphs-frontend
Users who are interested in attribution-graphs-frontend are comparing it to the libraries listed below.
- Open source interpretability artefacts for R1. ☆163 Updated 6 months ago
- Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models … ☆225 Updated last week
- ☆543 Updated last year
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction". ☆299 Updated 5 months ago
- A toolkit for describing model features and intervening on those features to steer behavior. ☆214 Updated last year
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models ☆284 Updated 3 months ago
- ☆188 Updated last year
- Sparsify transformers with SAEs and transcoders ☆654 Updated last week
- Performant framework for training, analyzing and visualizing Sparse Autoencoders (SAEs) and their frontier variants. ☆163 Updated this week
- Public repository for "The Surprising Effectiveness of Test-Time Training for Abstract Reasoning" ☆336 Updated last week
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆227 Updated 11 months ago
- Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization' ☆233 Updated 4 months ago
- 🌾 OAT: A research-friendly framework for LLM online alignment, including reinforcement learning, preference learning, etc. ☆568 Updated 2 weeks ago
- Reproducible, flexible LLM evaluations ☆264 Updated 3 weeks ago
- Code and example data for the paper: Rule Based Rewards for Language Model Safety ☆202 Updated last year
- ☆56 Updated last year
- A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning ☆311 Updated 3 weeks ago
- ☆194 Updated last month
- ☆197 Updated 7 months ago
- Steering vectors for transformer language models in Pytorch / Huggingface ☆129 Updated 8 months ago
- Steering Llama 2 with Contrastive Activation Addition ☆193 Updated last year
- Code for the paper: "Learning to Reason without External Rewards" ☆373 Updated 4 months ago
- open source interpretability platform 🧠 ☆486 Updated this week
- A simple unified framework for evaluating LLMs ☆254 Updated 7 months ago
- ☆364 Updated 2 months ago
- Open source replication of Anthropic's Crosscoders for Model Diffing ☆60 Updated last year
- Using sparse coding to find distributed representations used by neural networks. ☆283 Updated 2 years ago
- ☆212 Updated 11 months ago
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers" ☆121 Updated last year
- Sparse Autoencoder for Mechanistic Interpretability ☆284 Updated last year