collinzrj / output2prompt
☆37 · Updated 2 weeks ago
Alternatives and similar repositories for output2prompt:
Users interested in output2prompt are comparing it to the repositories listed below.
- Weak-to-Strong Jailbreaking on Large Language Models ☆72 · Updated last year
- Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization" ☆41 · Updated 2 months ago
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆67 · Updated last year
- R-Judge: Benchmarking Safety Risk Awareness for LLM Agents (EMNLP Findings 2024) ☆68 · Updated last month
- [ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning ☆89 · Updated 10 months ago
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆95 · Updated last month
- Does Refusal Training in LLMs Generalize to the Past Tense? [ICLR 2025] ☆66 · Updated 2 months ago
- Public code repo for paper "SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales" ☆102 · Updated 6 months ago
- ☆94 · Updated last year
- Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions" ☆67 · Updated 9 months ago
- Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs ☆68 · Updated 3 months ago
- ☆42 · Updated last month
- [ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use ☆133 · Updated last year
- [NeurIPS 2024] Goldfish Loss: Mitigating Memorization in Generative LLMs ☆84 · Updated 4 months ago
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) ☆59 · Updated 2 months ago
- The Paper List on Data Contamination for Large Language Models Evaluation ☆90 · Updated last month
- Official Repository for ACL 2024 Paper "SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding" ☆124 · Updated 8 months ago
- Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks ☆25 · Updated 8 months ago
- ☆31 · Updated last year
- Improving Alignment and Robustness with Circuit Breakers ☆192 · Updated 6 months ago
- An official implementation of "Catastrophic Failure of LLM Unlearning via Quantization" (ICLR 2025) ☆26 · Updated last month
- ☆37 · Updated last year
- Scalable Meta-Evaluation of LLMs as Evaluators ☆42 · Updated last year
- Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique ☆14 · Updated 7 months ago
- ☆17 · Updated 5 months ago
- A novel approach to improving the safety of large language models, enabling them to transition effectively from an unsafe to a safe state ☆58 · Updated 2 months ago
- Official Repository for Dataset Inference for LLMs ☆32 · Updated 8 months ago
- ConceptVectors Benchmark and Code for the paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces" ☆33 · Updated last month
- ☆53 · Updated 2 years ago
- A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs ☆85 · Updated last year