anthropics / attribution-graphs-frontend
https://transformer-circuits.pub/2025/attribution-graphs/methods.html
☆75 Updated 3 months ago
Alternatives and similar repositories for attribution-graphs-frontend
Users interested in attribution-graphs-frontend are comparing it to the libraries listed below.
- A toolkit for describing model features and intervening on those features to steer behavior. ☆193 Updated 8 months ago
- Open source interpretability artefacts for R1. ☆154 Updated 2 months ago
- Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models … ☆193 Updated this week
- Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization' ☆224 Updated this week
- Public repository for "The Surprising Effectiveness of Test-Time Training for Abstract Reasoning" ☆318 Updated 8 months ago
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction". ☆243 Updated last month
- ☆146 Updated 8 months ago
- Steering vectors for transformer language models in PyTorch / Hugging Face ☆115 Updated 4 months ago
- For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research. ☆136 Updated this week
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆206 Updated 7 months ago
- open source interpretability platform 🧠 ☆293 Updated this week
- Code for the paper: "Learning to Reason without External Rewards" ☆325 Updated last week
- A simple unified framework for evaluating LLMs ☆225 Updated 3 months ago
- A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning ☆210 Updated last week
- ☆182 Updated 3 months ago
- ☆502 Updated last year
- Repository for the paper Stream of Search: Learning to Search in Language ☆149 Updated 5 months ago
- Code and example data for the paper: Rule Based Rewards for Language Model Safety ☆189 Updated last year
- ☆182 Updated 3 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆220 Updated 9 months ago
- Sparsify transformers with SAEs and transcoders ☆584 Updated last week
- Function Vectors in Large Language Models (ICLR 2024) ☆172 Updated 3 months ago
- ☆134 Updated 3 months ago
- A simplified implementation for experimenting with RLVR on GSM8K. This repository provides a starting point for exploring reasoning. ☆113 Updated 5 months ago
- [NeurIPS 2024] Knowledge Circuits in Pretrained Transformers ☆149 Updated 5 months ago
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers" ☆112 Updated last year
- LLM-Merging: Building LLMs Efficiently through Merging ☆201 Updated 9 months ago
- 🌾 OAT: A research-friendly framework for LLM online alignment, including reinforcement learning, preference learning, etc. ☆408 Updated this week
- ☆69 Updated last month
- Reproducible, flexible LLM evaluations ☆222 Updated last week