shauli-ravfogel / adv-kernel-removal

☆12

Alternatives and similar repositories for adv-kernel-removal

Users who are interested in adv-kernel-removal are comparing it to the libraries listed below.

shauli-ravfogel / rlace-icml
☆36 Updated 2 years ago
milesaturpin / cot-unfaithfulness
☆44 Updated last year
jiahai-feng / binding-iclr
☆14 Updated last year
EleutherAI / elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆27 Updated last year
MaheepChaudhary / SAE-Ravel
Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…
☆11 Updated 5 months ago
ethz-spylab / superhuman-ai-consistency
☆29 Updated 2 years ago
vedantpalit / Towards-Vision-Language-Mechanistic-Interpretability
This is the official repository for the "Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP" paper acce…
☆22 Updated last year
tml-epfl / icl-alignment
Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025]
☆30 Updated 5 months ago
katiekang1998 / reasoning_generalization
☆32 Updated 5 months ago
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆95 Updated 3 weeks ago
peterljq / Parsimonious-Concept-Engineering
PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024)
☆37 Updated 7 months ago
Jiaxin-Wen / MisleadLM
Official Code for our paper: "Language Models Learn to Mislead Humans via RLHF"
☆14 Updated 8 months ago
safety-research / open-source-alignment-faking
Open Source Replication of Anthropic's Alignment Faking Paper
☆13 Updated 2 months ago
dtch1997 / steering-bench
Official codebase for "Analyzing the Generalization and Reliability of Steering Vectors"
☆11 Updated 6 months ago
Butanium / nnterp
A small package implementing some useful wrapping around nnsight
☆13 Updated this week
ApolloResearch / e2e_sae
Sparse Autoencoder Training Library
☆52 Updated last month
MadryLab / rethinking-backdoor-attacks
☆16 Updated last year
lingo-mit / lm-truthfulness
☆17 Updated last year
rtaori / data_feedback
Code for the paper "Data Feedback Loops: Model-driven Amplification of Dataset Biases"
☆16 Updated 2 years ago
locuslab / acr-memorization
☆35 Updated 6 months ago
McGill-NLP / AdversarialTriggers
Code for "Universal Adversarial Triggers Are Not Universal."
☆17 Updated last year
explanare / ravel
Evaluate interpretability methods on localizing and disentangling concepts in LLMs.
☆47 Updated 8 months ago
thestephencasper / explore_establish_exploit_llms
☆31 Updated last year
MadryLab / datamodels-data
Data for "Datamodels: Predicting Predictions with Training Data"
☆97 Updated 2 years ago
MadryLab / modeldiff
ModelDiff: A Framework for Comparing Learning Algorithms
☆57 Updated last year
mcleish7 / gemstone-scaling-laws
☆26 Updated 4 months ago
jmerullo / lm_vector_arithmetic
☆35 Updated 2 years ago
EmpathYang / ADEPT
Source code and data for ADEPT: A DEbiasing PrompT Framework (AAAI-23).
☆15 Updated 6 months ago
stanford-crfm / air-bench-2024
AIR-Bench 2024 is a safety benchmark that aligns with emerging government regulations and company policies
☆23 Updated 10 months ago
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆67 Updated 2 months ago