thestephencasper / benchmarking_interpretability
☆34 Updated last year
Alternatives and similar repositories for benchmarking_interpretability
Users who are interested in benchmarking_interpretability are comparing it to the libraries listed below.
- Privacy backdoors ☆51 Updated last year
- ☆22 Updated 11 months ago
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024. ☆113 Updated last year
- ☆54 Updated 2 years ago
- ☆40 Updated 8 months ago
- ModelDiff: A Framework for Comparing Learning Algorithms ☆57 Updated last year
- ☆71 Updated 2 years ago
- ☆29 Updated 2 years ago
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] ☆43 Updated last year
- Sparse Autoencoder Training Library ☆52 Updated last month
- Spurious Features Everywhere - Large-Scale Detection of Harmful Spurious Features in ImageNet ☆32 Updated last year
- Code for the paper "The Journey, Not the Destination: How Data Guides Diffusion Models" ☆22 Updated last year
- Repository for PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits, accepted at CVPR 2024 XAI4CV Works… ☆15 Updated last year
- What do we learn from inverting CLIP models? ☆55 Updated last year
- ☆35 Updated 6 months ago
- Starter kit and data loading code for the Trojan Detection Challenge NeurIPS 2022 competition ☆33 Updated last year
- ☆43 Updated 2 years ago
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆95 Updated 3 weeks ago
- Code for the paper "Evading Black-box Classifiers Without Breaking Eggs" [SaTML 2024] ☆20 Updated last year
- Official Repository for Dataset Inference for LLMs ☆34 Updated 11 months ago
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆59 Updated 2 weeks ago
- Official Repository for ICML 2023 paper "Can Neural Network Memorization Be Localized?" ☆18 Updated last year
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆27 Updated last year
- Recycling diverse models ☆44 Updated 2 years ago
- ☆14 Updated last year
- ☆18 Updated 4 months ago
- WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆127 Updated 3 weeks ago
- Code for the paper "A Light Recipe to Train Robust Vision Transformers" [SaTML 2023] ☆52 Updated 2 years ago
- Steering vectors for transformer language models in PyTorch / Hugging Face ☆108 Updated 4 months ago
- Data for "Datamodels: Predicting Predictions with Training Data" ☆97 Updated 2 years ago