shauli-ravfogel / adv-kernel-removal
☆12 Updated 3 years ago
Alternatives and similar repositories for adv-kernel-removal
Users who are interested in adv-kernel-removal are comparing it to the libraries listed below.
- ☆36 Updated 3 years ago
- TACL 2025: Investigating Adversarial Trigger Transfer in Large Language Models ☆19 Updated 5 months ago
- ☆37 Updated last year
- ☆16 Updated last year
- ☆44 Updated 2 years ago
- Data for "Datamodels: Predicting Predictions with Training Data" ☆97 Updated 2 years ago
- PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024) ☆42 Updated last week
- ☆51 Updated 2 years ago
- PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind) ☆20 Updated last year
- ☆30 Updated 2 years ago
- ☆32 Updated last year
- What do we learn from inverting CLIP models? ☆58 Updated last year
- Spurious Features Everywhere - Large-Scale Detection of Harmful Spurious Features in ImageNet ☆32 Updated 2 years ago
- ☆35 Updated 2 years ago
- ☆20 Updated 2 months ago
- This is the official repository for the "Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP" paper acce… ☆25 Updated last year
- Evaluate interpretability methods on localizing and disentangling concepts in LLMs. ☆57 Updated 3 months ago
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] ☆43 Updated last year
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆162 Updated 7 months ago
- Official PyTorch Implementation for Meaning Representations from Trajectories in Autoregressive Models (ICLR 2024) ☆22 Updated last year
- The official repository of the paper "On the Exploitability of Instruction Tuning". ☆68 Updated last year
- A modern look at the relationship between sharpness and generalization [ICML 2023] ☆43 Updated 2 years ago
- ☆32 Updated 2 years ago
- [NeurIPS 2024] Goldfish Loss: Mitigating Memorization in Generative LLMs ☆94 Updated last year
- Sparse Autoencoder Training Library ☆56 Updated 9 months ago
- ☆17 Updated 2 years ago
- ModelDiff: A Framework for Comparing Learning Algorithms ☆58 Updated 2 years ago
- Code to reproduce key results accompanying "SAEs (usually) Transfer Between Base and Chat Models" ☆13 Updated last year
- Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p… ☆12 Updated last year
- Gemstones: A Model Suite for Multi-Faceted Scaling Laws (NeurIPS 2025) ☆32 Updated 4 months ago