snu-mllab / Bayesian-Red-Teaming
About: Official PyTorch implementation of "Query-Efficient Black-Box Red Teaming via Bayesian Optimization" (ACL'23)
☆12 · Updated last year
Related projects
Alternatives and complementary repositories for Bayesian-Red-Teaming
- [ACL 2023] Knowledge Unlearning for Mitigating Privacy Risks in Language Models ☆76 · Updated 2 months ago
- This repository contains the official code for the paper: "Prompt Injection: Parameterization of Fixed Inputs" ☆32 · Updated 2 months ago
- [ICLR 2022] Towards Continual Knowledge Learning of Language Models ☆93 · Updated 2 years ago
- [ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models" ☆45 · Updated last month
- Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning [ICML 2024] ☆14 · Updated 6 months ago
- [EMNLP 2022] TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models ☆66 · Updated 5 months ago
- [NeurIPS 2022 Workshop] A Case Study with Negated Prompts using T0 (3B, 11B), InstructGPT (350M-175B), GPT-3 (350M - 175B) & OPT (125M - … ☆23 · Updated 2 years ago
- 🤫 Code and benchmark for our ICLR 2024 spotlight paper: "Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Con… ☆34 · Updated 10 months ago
- [EMNLP Findings 2024 & ACL 2024 NLRSE Oral] Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards ☆44 · Updated 6 months ago
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆52 · Updated last week
- [EMNLP 2023 Findings] Efficiently Enhancing Zero-Shot Performance of Instruction Following Model via Retrieval of Soft Prompt ☆20 · Updated last year
- Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep ☆26 · Updated 4 months ago
- The git repository of Modular Prompted Chatbot paper ☆33 · Updated last year
- Code for "Universal Adversarial Triggers Are Not Universal." ☆15 · Updated 6 months ago
- Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model ☆62 · Updated 2 years ago
- Align your LM to express calibrated verbal statements of confidence in its long-form generations. ☆19 · Updated 5 months ago
- Restore safety in fine-tuned language models through task arithmetic ☆26 · Updated 7 months ago