jettjaniak / chainscope
Repository for the "Chain-of-Thought Reasoning In The Wild Is Not Always Faithful" paper
☆29 Updated 3 months ago
Alternatives and similar repositories for chainscope
Users who are interested in chainscope are comparing it to the libraries listed below.
- Steering vectors for transformer language models in Pytorch / Huggingface ☆127 Updated 8 months ago
- A toolkit for describing model features and intervening on those features to steer behavior. ☆211 Updated 11 months ago
- Sparse Autoencoder for Mechanistic Interpretability ☆283 Updated last year
- Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models … ☆221 Updated last week
- Steering Llama 2 with Contrastive Activation Addition ☆191 Updated last year
- ☆183 Updated 11 months ago
- ⚓️ Repository for the "Thought Anchors: Which LLM Reasoning Steps Matter?" paper. ☆89 Updated last week
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆224 Updated 10 months ago
- ☆252 Updated last year
- Open source interpretability artefacts for R1. ☆163 Updated 6 months ago
- Open source replication of Anthropic's Crosscoders for Model Diffing ☆59 Updated last year
- ☆192 Updated 3 weeks ago
- ☆55 Updated 11 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆240 Updated last year
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces ☆98 Updated 2 years ago
- ☆80 Updated 9 months ago
- Using sparse coding to find distributed representations used by neural networks. ☆281 Updated last year
- Governance of the Commons Simulation (GovSim) ☆59 Updated 9 months ago
- ☆131 Updated 2 years ago
- ☆79 Updated last month
- ☆79 Updated 3 weeks ago
- This repository contains the code and data for the paper "SelfIE: Self-Interpretation of Large Language Model Embeddings" by Haozhe Chen,… ☆53 Updated 11 months ago
- ☆111 Updated 8 months ago
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction". ☆296 Updated 4 months ago
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training". ☆120 Updated last year
- Evaluating the Moral Beliefs Encoded in LLMs ☆31 Updated 10 months ago
- This repository collects all relevant resources about interpretability in LLMs ☆377 Updated last year
- Performant framework for training, analyzing and visualizing Sparse Autoencoders (SAEs). ☆163 Updated this week
- Unified access to Large Language Model modules using NNsight ☆55 Updated this week
- Repository for the paper Stream of Search: Learning to Search in Language ☆151 Updated 9 months ago