OSU-NLP-Group / AgentAttack
☆13 Updated 6 months ago
Related projects:
- The official GitHub repo for the paper "Course-Correction: Safety Alignment Using Synthetic Preferences" ☆19 Updated last month
- [FCS'24] LVLM Safety paper ☆11 Updated 5 months ago
- ☆37 Updated 10 months ago
- Model Editing Can Hurt General Abilities of Large Language Models ☆29 Updated 7 months ago
- ☆11 Updated 2 weeks ago
- [ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning ☆79 Updated 3 months ago
- ☆32 Updated 10 months ago
- This repository contains data, code and models for contextual noncompliance. ☆17 Updated 2 months ago
- Restore safety in fine-tuned language models through task arithmetic ☆25 Updated 5 months ago
- Official code of the paper Large Language Models Are Implicitly Topic Models: Explaining and Finding Good Demonstrations for In-Context Le… ☆66 Updated 6 months ago
- ☆46 Updated 2 weeks ago
- ☆42 Updated 5 months ago
- ☆24 Updated 3 months ago
- Grade-School Math with Irrelevant Context (GSM-IC) benchmark is an arithmetic reasoning dataset built upon GSM8K, by adding irrelevant se… ☆51 Updated last year
- The repository of the project "Fine-tuning Large Language Models with Sequential Instructions", code base comes from open-instruct and LA… ☆30 Updated 2 months ago
- Code of paper "Accelerating Greedy Coordinate Gradient via Probe Sampling" ☆16 Updated 5 months ago
- ☆40 Updated last year
- Directional Preference Alignment ☆44 Updated 3 months ago
- The code of “Improving Weak-to-Strong Generalization with Scalable Oversight and Ensemble Learning” ☆15 Updated 6 months ago
- ☆44 Updated 2 weeks ago
- ☆25 Updated 3 months ago
- ☆20 Updated 2 months ago
- Knowledge Circuits in Pretrained Transformers ☆46 Updated this week
- Evaluate the Quality of Critique ☆35 Updated 3 months ago
- Röttger et al. (2023): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆55 Updated 8 months ago
- [ICLR 2024] Unveiling the Pitfalls of Knowledge Editing for Large Language Models ☆22 Updated 3 months ago
- A resource repository for representation engineering in large language models ☆36 Updated last week
- [ICLR 2024] Provable Robust Watermarking for AI-Generated Text ☆25 Updated 9 months ago
- Codebase for Inference-Time Policy Adapters ☆19 Updated 10 months ago
- ☆47 Updated last year