snu-mllab / Bayesian-Red-Teaming
About Official PyTorch implementation of "Query-Efficient Black-Box Red Teaming via Bayesian Optimization" (ACL'23)
☆15 Updated 2 years ago
Alternatives and similar repositories for Bayesian-Red-Teaming
Users that are interested in Bayesian-Red-Teaming are comparing it to the libraries listed below
- [ACL 2023] Knowledge Unlearning for Mitigating Privacy Risks in Language Models ☆86 Updated last year
- 🤫 Code and benchmark for our ICLR 2024 spotlight paper: "Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Con… ☆50 Updated 2 years ago
- This repository contains the official code for the paper: "Prompt Injection: Parameterization of Fixed Inputs" ☆32 Updated last year
- Restore safety in fine-tuned language models through task arithmetic ☆31 Updated last year
- [EMNLP Findings 2024 & ACL 2024 NLRSE Oral] Enhancing Mathematical Reasonin… ☆51 Updated last year
- ☆29 Updated last year
- ☆22 Updated 5 months ago
- [EMNLP 2024] Official implementation of "Hierarchical Deconstruction of LLM Reasoning: A Graph-Based Framework for Analyzing Knowledge Ut… ☆23 Updated last year
- [ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models" ☆66 Updated last year
- [ICLR 2025 Oral] Knowledge Entropy Decay during Language Model Pretraining Hinders New Knowledge Acquisition ☆17 Updated last year
- [NeurIPS 2025] Reasoning Models Better Express Their Confidence ☆22 Updated 2 months ago
- Source codes for "Preference-grounded Token-level Guidance for Language Model Fine-tuning" (NeurIPS 2023). ☆17 Updated last year
- ☆46 Updated 2 years ago
- Self-Supervised Alignment with Mutual Information ☆20 Updated last year
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆85 Updated 11 months ago
- ☆24 Updated 2 years ago
- ☆27 Updated 2 years ago
- Code for Representation Bending Paper ☆16 Updated 6 months ago
- Official implementation of Bootstrapping Language Models via DPO Implicit Rewards ☆47 Updated 9 months ago
- This repository contains data, code and models for contextual noncompliance. ☆25 Updated last year
- Official implementation of Privacy Implications of Retrieval-Based Language Models (EMNLP 2023). https://arxiv.org/abs/2305.14888 ☆37 Updated last year
- Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model ☆71 Updated 3 years ago
- Official code of the paper Large Language Models Are Implicitly Topic Models: Explaining and Finding Good Demonstrations for In-Context Le… ☆75 Updated last year
- Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025] ☆32 Updated last year
- ☆15 Updated last year
- ☆13 Updated 7 months ago
- TACL 2025: Investigating Adversarial Trigger Transfer in Large Language Models ☆19 Updated 5 months ago
- Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning [ICML 2024] ☆21 Updated last year
- ☆43 Updated last year
- The official repository of "Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint" ☆39 Updated 2 years ago