locuslab / tofu
Landing Page for TOFU
☆79 Updated 3 months ago
Related projects:
- DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models (ICLR 2024) ☆46 Updated 5 months ago
- ☆47 Updated last year
- Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep ☆22 Updated 2 months ago
- ☆21 Updated 2 months ago
- ☆37 Updated 10 months ago
- Source code of "Task arithmetic in the tangent space: Improved editing of pre-trained models". ☆79 Updated last year
- ☆23 Updated 4 months ago
- This is the starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition. ☆77 Updated 4 months ago
- ☆60 Updated 2 years ago
- Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆34 Updated 3 weeks ago
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆46 Updated last month
- WMDP is a LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆71 Updated 4 months ago
- AI Logging for Interpretability and Explainability🔬 ☆74 Updated 3 months ago
- Official Implementation of the paper "Three Bricks to Consolidate Watermarks for LLMs" ☆41 Updated 7 months ago
- Official code for the paper: Evaluating Copyright Takedown Methods for Language Models ☆14 Updated 2 months ago
- [ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models" ☆34 Updated 4 months ago
- A resource repository for machine unlearning in large language models ☆128 Updated last week
- Official Code for Paper: Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications ☆55 Updated 2 months ago
- Official repository of "Localizing Task Information for Improved Model Merging and Compression" [ICML 2024] ☆26 Updated 3 months ago
- ☆63 Updated 10 months ago
- [NeurIPS'23] Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors ☆64 Updated 6 months ago
- ☆31 Updated 11 months ago
- ☆14 Updated 2 months ago
- Parsimonious Concept Engineering (PaCE) uses sparse coding on a large-scale concept dictionary to effectively improve the trustworthiness… ☆25 Updated 3 months ago
- Röttger et al. (2023): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆54 Updated 8 months ago
- A resource repository for representation engineering in large language models ☆35 Updated last week
- Code for watermarking language models ☆68 Updated last week
- Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning [ICML 2024] ☆12 Updated 4 months ago
- Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models" ☆64 Updated 2 weeks ago
- Official code implementation of SKU, Accepted by ACL 2024 Findings ☆11 Updated 4 months ago