snu-mllab / Bayesian-Red-Teaming
About: Official PyTorch implementation of "Query-Efficient Black-Box Red Teaming via Bayesian Optimization" (ACL'23)
☆12 · Updated last year
Related projects
Alternatives and complementary repositories for Bayesian-Red-Teaming
- [ACL 2023] Knowledge Unlearning for Mitigating Privacy Risks in Language Models ☆76 · Updated 2 months ago
- [ICLR 2022] Towards Continual Knowledge Learning of Language Models ☆93 · Updated 2 years ago
- This repository contains the official code for the paper: "Prompt Injection: Parameterization of Fixed Inputs" ☆32 · Updated 2 months ago
- ☆23 · Updated 11 months ago
- [EMNLP 2022] TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models ☆66 · Updated 6 months ago
- ☆24 · Updated last year
- ☆20 · Updated last year
- Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep ☆28 · Updated 4 months ago
- Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning [ICML 2024] ☆15 · Updated 6 months ago
- [ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models" ☆47 · Updated last month
- Restore safety in fine-tuned language models through task arithmetic ☆26 · Updated 7 months ago
- [EMNLP Findings 2024 & ACL 2024 NLRSE Oral] Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards ☆44 · Updated 6 months ago
- ☆23 · Updated 2 months ago
- ☆15 · Updated 8 months ago
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆57 · Updated 2 weeks ago
- [TACL 2024] Improving Probability-based Prompt Selection Through Unified Evaluation and Analysis ☆10 · Updated last week
- 🤫 Code and benchmark for our ICLR 2024 spotlight paper: "Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Con… ☆34 · Updated 11 months ago
- ☆16 · Updated 4 months ago
- The git repository of Modular Prompted Chatbot paper ☆33 · Updated last year
- ☆36 · Updated last year
- ☆48 · Updated last year
- [NeurIPS 2022 Workshop] A Case Study with Negated Prompts using T0 (3B, 11B), InstructGPT (350M-175B), GPT-3 (350M - 175B) & OPT (125M - … ☆23 · Updated 2 years ago
- ☆20 · Updated 4 months ago
- Code for "Universal Adversarial Triggers Are Not Universal." ☆16 · Updated 6 months ago
- [EMNLP 2023 Findings] Efficiently Enhancing Zero-Shot Performance of Instruction Following Model via Retrieval of Soft Prompt ☆20 · Updated last year
- ☆33 · Updated 9 months ago
- [ACL 2021] Learning to Perturb Word Embeddings for Out-of-distribution QA ☆16 · Updated 2 years ago
- ☆49 · Updated last year
- Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model ☆65 · Updated 2 years ago
- ☆26 · Updated 6 months ago