pillowsofwind / Course-Correction
[EMNLP 2024] The official GitHub repo for the paper "Course-Correction: Safety Alignment Using Synthetic Preferences"
☆19 · Updated 7 months ago
Alternatives and similar repositories for Course-Correction:
Users who are interested in Course-Correction are comparing it to the repositories listed below
- [ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning ☆92 · Updated 11 months ago
- ☆27 · Updated 10 months ago
- ☆18 · Updated 6 months ago
- Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model ☆68 · Updated 2 years ago
- BeHonest: Benchmarking Honesty in Large Language Models ☆31 · Updated 8 months ago
- ☆25 · Updated 7 months ago
- Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks ☆25 · Updated 9 months ago
- Code & Data for our Paper "Alleviating Hallucinations of Large Language Models through Induced Hallucinations" ☆63 · Updated last year
- [NeurIPS 2024 D&B] Evaluating Copyright Takedown Methods for Language Models ☆17 · Updated 9 months ago
- Evaluating the Ripple Effects of Knowledge Editing in Language Models ☆55 · Updated last year
- ☆37 · Updated last year
- ICLR 2024 paper showing the properties of safety tuning and exaggerated safety ☆82 · Updated 11 months ago
- Official implementation of ICLR'24 paper, "Curiosity-driven Red Teaming for Large Language Models" (https://openreview.net/pdf?id=4KqkizX…) ☆75 · Updated last year
- ☆36 · Updated 7 months ago
- R-Judge: Benchmarking Safety Risk Awareness for LLM Agents (EMNLP Findings 2024) ☆76 · Updated 3 weeks ago
- ☆73 · Updated 11 months ago
- Code for our paper "Defending ChatGPT against Jailbreak Attack via Self-Reminder" in NMI ☆47 · Updated last year
- [ACL 2024] Code for the paper "ALaRM: Align Language Models via Hierarchical Rewards Modeling" ☆25 · Updated last year
- Weak-to-Strong Jailbreaking on Large Language Models ☆73 · Updated last year
- Mostly recording papers about models' trustworthy applications. Intending to include topics like model evaluation & analysis, security, c… ☆21 · Updated last year
- Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models" ☆89 · Updated 8 months ago
- [EMNLP 2024] The official GitHub repo for the survey paper "Knowledge Conflicts for LLMs: A Survey" ☆114 · Updated 7 months ago
- Codes and datasets of the paper Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment ☆97 · Updated last year
- ☆41 · Updated last year
- Code for the paper "Defending against LLM Jailbreaking via Backtranslation" ☆29 · Updated 8 months ago
- [EMNLP 2023] MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions ☆109 · Updated 7 months ago
- 【ACL 2024】 SALAD benchmark & MD-Judge ☆145 · Updated last month
- Public code repo for COLING 2025 paper "Aligning LLMs with Individual Preferences via Interaction" ☆26 · Updated last month
- GSM-Plus: Data, Code, and Evaluation for Enhancing Robust Mathematical Reasoning in Math Word Problems ☆60 · Updated 9 months ago
- ☆44 · Updated 6 months ago