thestephencasper / benchmarking_interpretability
☆31 · Updated last year
Related projects
Alternatives and complementary repositories for benchmarking_interpretability
- ☆13 · Updated 8 months ago
- ☆19 · Updated 3 months ago
- Privacy backdoors ☆47 · Updated 6 months ago
- Spurious Features Everywhere - Large-Scale Detection of Harmful Spurious Features in ImageNet ☆29 · Updated last year
- Starter kit and data loading code for the Trojan Detection Challenge NeurIPS 2022 competition ☆33 · Updated last year
- What do we learn from inverting CLIP models? ☆45 · Updated 8 months ago
- ☆34 · Updated last year
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024. ☆107 · Updated 5 months ago
- ☆16 · Updated last year
- Code for the paper "The Journey, Not the Destination: How Data Guides Diffusion Models" ☆19 · Updated 11 months ago
- ☆26 · Updated 3 weeks ago
- ☆22 · Updated last year
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] ☆42 · Updated 6 months ago
- Code for the paper "Evading Black-box Classifiers Without Breaking Eggs" [SaTML 2024] ☆19 · Updated 7 months ago
- ☆49 · Updated last year
- ☆32 · Updated 11 months ago
- Code for safety test in "Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates" ☆17 · Updated 8 months ago
- ☆27 · Updated last month
- Distilling Model Failures as Directions in Latent Space ☆45 · Updated last year
- Data for "Datamodels: Predicting Predictions with Training Data" ☆90 · Updated last year
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆64 · Updated 9 months ago
- ☆64 · Updated last year
- ModelDiff: A Framework for Comparing Learning Algorithms ☆53 · Updated last year
- A modern look at the relationship between sharpness and generalization [ICML 2023] ☆43 · Updated last year
- Intriguing Properties of Data Attribution on Diffusion Models (ICLR 2024) ☆23 · Updated 9 months ago
- Recycling diverse models ☆44 · Updated last year
- This is the starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition. ☆79 · Updated 6 months ago
- Do input gradients highlight discriminative features? [NeurIPS 2021] (https://arxiv.org/abs/2102.12781) ☆13 · Updated last year
- This is an official repository for "LAVA: Data Valuation without Pre-Specified Learning Algorithms" (ICLR 2023). ☆44 · Updated 5 months ago
- Independent robustness evaluation of Improving Alignment and Robustness with Short Circuiting ☆12 · Updated 3 months ago