☆28 · May 4, 2023 · Updated 2 years ago
Alternatives and similar repositories for rep-theory-mech-interp
Users that are interested in rep-theory-mech-interp are comparing it to the libraries listed below
- DiWA: Diverse Weight Averaging for Out-of-Distribution Generalization ☆31 · Jan 31, 2023 · Updated 3 years ago
- ☆12 · Feb 11, 2026 · Updated 3 weeks ago
- Code for "Automatic Circuit Finding and Faithfulness" ☆17 · Jul 11, 2024 · Updated last year
- Notebooks accompanying Anthropic's "Toy Models of Superposition" paper ☆137 · Sep 14, 2022 · Updated 3 years ago
- PyTorch implementation of MuZero for gym environments. It supports any Discrete, Box and Box2D configuration for the action space and obse… ☆19 · Jan 24, 2023 · Updated 3 years ago
- Official codebase for "Analyzing the Generalization and Reliability of Steering Vectors" ☆19 · Dec 14, 2024 · Updated last year
- gpt completions in vscode ☆35 · Mar 24, 2023 · Updated 2 years ago
- ☆26 · Oct 9, 2024 · Updated last year
- A collection of different ways to implement accessing and modifying internal model activations for LLMs ☆20 · Oct 18, 2024 · Updated last year
- Ember is a hosted API/SDK that lets you shape AI model behavior by directly controlling a model's internal units of computation, or "feat… ☆43 · Jul 14, 2025 · Updated 7 months ago
- This repository contains the implementation of Label-Free XAI, a new framework to adapt explanation methods to unsupervised models. For m… ☆25 · Sep 21, 2022 · Updated 3 years ago
- ☆30 · Jul 17, 2023 · Updated 2 years ago
- Official repo for the paper "Bilinear MLPs enable weight-based mechanistic interpretability". ☆28 · Aug 2, 2025 · Updated 7 months ago
- ☆26 · Oct 6, 2024 · Updated last year
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces ☆102 · Sep 21, 2023 · Updated 2 years ago
- This is the official repository for the "Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP" paper acce… ☆25 · Feb 16, 2026 · Updated 3 weeks ago
- ☆29 · Oct 18, 2022 · Updated 3 years ago
- Sparse probing paper full code. ☆67 · Dec 17, 2023 · Updated 2 years ago
- ☆209 · Oct 14, 2025 · Updated 4 months ago
- Code associated to papers on superposition (in ML interpretability) ☆37 · Sep 13, 2022 · Updated 3 years ago
- Visual Transformer Mechanistic Analysis Tool ☆36 · Jun 3, 2023 · Updated 2 years ago
- Mechanistic Interpretability Visualizations using React ☆328 · Dec 18, 2024 · Updated last year
- Auditing agents for fine-tuning safety ☆20 · Oct 21, 2025 · Updated 4 months ago
- A toolkit that provides a range of model diffing techniques including a UI to visualize them interactively. ☆63 · Feb 26, 2026 · Updated last week
- Official repository of "LiNeS: Post-training Layer Scaling Prevents Forgetting and Enhances Model Merging" ☆32 · Nov 4, 2024 · Updated last year
- Synthetic Hypertext and Homomorphic Catalogue ☆15 · Dec 28, 2024 · Updated last year
- ☆34 · Feb 29, 2024 · Updated 2 years ago
- ☆32 · Mar 27, 2025 · Updated 11 months ago
- Situational Awareness Dataset ☆46 · Dec 14, 2024 · Updated last year
- Open Source Replication of Anthropic's Alignment Faking Paper ☆54 · Apr 4, 2025 · Updated 11 months ago
- ☆38 · Feb 8, 2024 · Updated 2 years ago
- ☆11 · Jun 20, 2023 · Updated 2 years ago
- Primus-SaFE (Stability and Fault Endurance) ☆52 · Updated this week
- A Data-Driven Approach to Predict the Success of Bank Telemarketing ☆10 · Apr 27, 2021 · Updated 4 years ago
- ☆36 · Jul 14, 2022 · Updated 3 years ago
- [CVPR 2020] A generative model with latent factors that are independent and localized. ☆12 · Mar 27, 2025 · Updated 11 months ago
- ☆31 · May 1, 2025 · Updated 10 months ago
- ☆34 · Aug 30, 2021 · Updated 4 years ago
- Preprint: Asymmetry in Low-Rank Adapters of Foundation Models ☆38 · Feb 27, 2024 · Updated 2 years ago