shauli-ravfogel / adv-kernel-removal
☆12 Updated 2 years ago
Alternatives and similar repositories for adv-kernel-removal
Users who are interested in adv-kernel-removal are comparing it to the libraries listed below.
- ☆36 Updated 3 years ago
- This is the official repository for the "Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP" paper acce… ☆22 Updated last year
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆112 Updated last month
- ☆15 Updated last year
- Distilling Model Failures as Directions in Latent Space ☆47 Updated 2 years ago
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] ☆43 Updated last year
- Code to reproduce key results accompanying "SAEs (usually) Transfer Between Base and Chat Models" ☆12 Updated last year
- ☆47 Updated last year
- ModelDiff: A Framework for Comparing Learning Algorithms ☆59 Updated last year
- Code for "Universal Adversarial Triggers Are Not Universal." ☆17 Updated last year
- Sparse Autoencoder Training Library ☆54 Updated 3 months ago
- ☆22 Updated last year
- ☆29 Updated 2 years ago
- ☆17 Updated last year
- PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024) ☆39 Updated 9 months ago
- Code to enable layer-level steering in LLMs using sparse auto encoders ☆23 Updated 3 months ago
- ☆34 Updated last year
- Official PyTorch Implementation for Meaning Representations from Trajectories in Autoregressive Models (ICLR 2024) ☆21 Updated last year
- Spurious Features Everywhere - Large-Scale Detection of Harmful Spurious Features in ImageNet ☆32 Updated last year
- AIR-Bench 2024 is a safety benchmark that aligns with emerging government regulations and company policies ☆23 Updated 11 months ago
- A modern look at the relationship between sharpness and generalization [ICML 2023] ☆43 Updated last year
- Algebraic value editing in pretrained language models ☆65 Updated last year
- ☆103 Updated 6 months ago
- Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025] ☆31 Updated 6 months ago
- Code for reproducing our paper "Not All Language Model Features Are Linear" ☆77 Updated 8 months ago
- Data for "Datamodels: Predicting Predictions with Training Data" ☆97 Updated 2 years ago
- ☆43 Updated 2 years ago
- Implementation of Influence Function approximations for differently sized ML models, using PyTorch ☆15 Updated last year
- ☆20 Updated last year
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆70 Updated last year