vedantpalit / Towards-Vision-Language-Mechanistic-Interpretability
This is the official repository for the paper "Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP", accepted at the ICCV CLVL Workshop 2023.
☆25 · Apr 18, 2024 · Updated last year
Alternatives and similar repositories for Towards-Vision-Language-Mechanistic-Interpretability
Users who are interested in Towards-Vision-Language-Mechanistic-Interpretability are comparing it to the libraries listed below.
- Applying SAEs for fine-grained control ☆25 · Dec 15, 2024 · Updated last year
- ☆23 · Apr 23, 2024 · Updated last year
- ☆13 · Apr 10, 2025 · Updated 10 months ago
- This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity… ☆30 · Oct 27, 2025 · Updated 3 months ago
- Accompanying codebase for neuroscope.io, a website for displaying max activating dataset examples for language model neurons ☆13 · Feb 13, 2023 · Updated 3 years ago
- ☆16 · Jun 19, 2023 · Updated 2 years ago
- A tiny easily hackable implementation of a feature dashboard. ☆15 · Oct 21, 2025 · Updated 3 months ago
- ☆13 · Feb 24, 2025 · Updated 11 months ago
- Memory-Based Meta-Learning on Non-Stationary Distributions ☆17 · Mar 14, 2024 · Updated last year
- ☆17 · Jun 8, 2019 · Updated 6 years ago
- ☆16 · Mar 5, 2024 · Updated last year
- Code for reproducing our paper "Low Rank Adapting Models for Sparse Autoencoder Features" ☆17 · Mar 31, 2025 · Updated 10 months ago
- ☆17 · Feb 14, 2024 · Updated 2 years ago
- ☆51 · Oct 23, 2023 · Updated 2 years ago
- ☆17 · Updated this week
- [ICML 24] A novel automated neuron explanation framework that can accurately describe poly-semantic concepts in deep neural networks ☆14 · May 2, 2025 · Updated 9 months ago
- Repository for PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits, accepted at CVPR 2024 XAI4CV Works… ☆20 · May 29, 2024 · Updated last year
- PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind) ☆20 · Jan 19, 2025 · Updated last year
- Quantum Fast Approximate Synthesis Tool ☆19 · Jan 23, 2023 · Updated 3 years ago
- Extensive Self-Contrast Enables Feedback-Free Language Model Alignment ☆21 · Apr 2, 2024 · Updated last year
- ☆23 · Jun 13, 2024 · Updated last year
- A collection of different ways to implement accessing and modifying internal model activations for LLMs ☆21 · Oct 18, 2024 · Updated last year
- A quantum circuit optimizer based on sum-over-paths representations ☆26 · Nov 8, 2019 · Updated 6 years ago
- Data for our paper "Defending ChatGPT against Jailbreak Attack via Self-Reminder" ☆20 · Oct 26, 2023 · Updated 2 years ago
- Official code for "Can We Talk Models Into Seeing the World Differently?" (ICLR 2025). ☆27 · Jan 26, 2025 · Updated last year
- Code for "Counterfactual Token Generation in Large Language Models", Arxiv 2024. ☆32 · Nov 7, 2024 · Updated last year
- ☆24 · Jan 28, 2025 · Updated last year
- (NeurIPS '22) LISA: Learning Interpretable Skill Abstractions - A framework for unsupervised skill learning using Imitation ☆29 · Feb 22, 2023 · Updated 2 years ago
- Improving Steering Vectors by Targeting Sparse Autoencoder Features ☆27 · Nov 20, 2024 · Updated last year
- ☆27 · Oct 6, 2024 · Updated last year
- ☆23 · Apr 17, 2022 · Updated 3 years ago
- A Mechanistic Interpretability Analysis of Grokking ☆27 · Sep 26, 2022 · Updated 3 years ago
- ☆27 · Oct 22, 2024 · Updated last year
- ☆267 · Oct 1, 2024 · Updated last year
- Algebraic value editing in pretrained language models ☆68 · Nov 1, 2023 · Updated 2 years ago
- ☆66 · Feb 16, 2023 · Updated 2 years ago
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training". ☆127 · Mar 9, 2024 · Updated last year
- ☆29 · Apr 30, 2024 · Updated last year
- Optim4RL is a Jax framework of learning to optimize for reinforcement learning. ☆28 · Nov 27, 2024 · Updated last year