XuandongZhao / Unigram-Watermark
[ICLR 2024] Provable Robust Watermarking for AI-Generated Text
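For readers new to the scheme: Unigram-Watermark biases generation toward a single fixed, key-derived "green" subset of the vocabulary and detects the watermark with a z-test on the green-token fraction of a text. The sketch below illustrates that idea only; it is not the repository's code, and the function names and the SHA-256-based partition are assumptions made for the example (the paper uses a keyed pseudorandom partition).

```python
import hashlib
import math

def make_green_list(vocab_size: int, key: bytes, gamma: float = 0.5) -> set[int]:
    """Derive one fixed 'green' subset of the vocabulary from a secret key.

    The unigram scheme uses a single global partition (rather than reseeding
    per token), which is what makes it robust to local edits of the text.
    The SHA-256 ranking here is an illustrative stand-in, not the paper's exact construction.
    """
    def rank(token_id: int) -> int:
        digest = hashlib.sha256(key + token_id.to_bytes(4, "big")).digest()
        return int.from_bytes(digest[:8], "big")

    order = sorted(range(vocab_size), key=rank)
    return set(order[: int(gamma * vocab_size)])

def bias_logits(logits: list[float], green: set[int], delta: float = 2.0) -> list[float]:
    """Generation side: add delta to every green token's logit before sampling."""
    return [x + delta if i in green else x for i, x in enumerate(logits)]

def detection_z_score(token_ids: list[int], green: set[int], gamma: float = 0.5) -> float:
    """Detection side: one-proportion z-test on the fraction of green tokens."""
    n = len(token_ids)
    n_green = sum(t in green for t in token_ids)
    return (n_green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

Unwatermarked text scores near z ≈ 0, while watermarked text drifts positive as length grows; a detector would flag text whose z-score exceeds a chosen threshold.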
Related projects
Alternatives and complementary repositories for Unigram-Watermark
- Official repository for Dataset Inference for LLMs
- [ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models"
- Official repository for the paper "Safety Alignment Should Be Made More Than Just a Few Tokens Deep"
- Code and data for the paper "A Semantic Invariant Robust Watermark for Large Language Models" (ICLR 2024)
- Official code for the paper "Evaluating Copyright Takedown Methods for Language Models"
- Röttger et al. (2023): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
- [ICLR 2024] RAIN: Your Language Models Can Align Themselves without Finetuning
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
- [EMNLP 2024] Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue
- Weak-to-Strong Jailbreaking on Large Language Models
- Code for "Universal Adversarial Triggers Are Not Universal."☆15Updated 6 months ago
- Restore safety in fine-tuned language models through task arithmetic
- Official code implementation of SKU, accepted to ACL 2024 Findings
- [NeurIPS 2024] RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models
- Official implementation of the paper "Three Bricks to Consolidate Watermarks for LLMs"
- Min-K%++: an improved baseline for detecting pre-training data of LLMs (https://arxiv.org/abs/2404.02936)
- Code for the paper "Defending ChatGPT against Jailbreak Attack via Self-Reminder" (Nature Machine Intelligence)
- Official repository for the ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models"
- Code for watermarking language models
- Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model
- Official code for the paper "Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications"
- [ECCV 2024] Official PyTorch implementation of "How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs"
- [EMNLP 2024] Official repository for the paper "Course-Correction: Safety Alignment Using Synthetic Preferences"