Libr-AI / do-not-answer
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
☆224 · Updated 9 months ago
Alternatives and similar repositories for do-not-answer:
Users interested in do-not-answer are comparing it to the libraries listed below.
- This is the repository of HaluEval, a large-scale hallucination evaluation benchmark for Large Language Models. ☆444 · Updated last year
- Papers about red teaming LLMs and Multimodal models. ☆104 · Updated 3 months ago
- 【ACL 2024】 SALAD benchmark & MD-Judge ☆132 · Updated this week
- GitHub repository for "RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models" ☆156 · Updated 3 months ago
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20… ☆280 · Updated last year
- ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors [EMNLP 2024 Findings] ☆176 · Updated 5 months ago
- [NDSS'25 Poster] A collection of automated evaluators for assessing jailbreak attempts. ☆120 · Updated this week
- [NAACL 2024] Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey ☆89 · Updated 7 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆189 · Updated 5 months ago
- This repository provides an original implementation of Detecting Pretraining Data from Large Language Models by *Weijia Shi, *Anirudh Aji… ☆218 · Updated last year
- The official implementation of our NAACL 2024 paper "A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Lang… ☆94 · Updated last month
- Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding ☆120 · Updated 7 months ago
- Run safety benchmarks against AI models and view detailed reports showing how well they performed. ☆81 · Updated this week
- Official GitHub repo for SafetyBench, a comprehensive benchmark to evaluate LLMs' safety. [ACL 2024] ☆197 · Updated 8 months ago
- A Comprehensive Assessment of Trustworthiness in GPT Models ☆276 · Updated 5 months ago
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆85 · Updated 2 weeks ago
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Language Models [NeurIPS 2024 Datasets and Benchmarks Track] ☆307 · Updated 5 months ago
- SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models ☆499 · Updated 8 months ago
- TAP: An automated jailbreaking method for black-box LLMs ☆150 · Updated 3 months ago
- Official implementation of AdvPrompter https://arxiv.org/abs/2404.16873 ☆144 · Updated 10 months ago
- ☆165 · Updated last year
- A package to evaluate factuality of long-form generation. Original implementation of our EMNLP 2023 paper "FActScore: Fine-grained Atomic… ☆325 · Updated 9 months ago
- Generative Judge for Evaluating Alignment ☆230 · Updated last year
- [ICML 2024] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability ☆138 · Updated 2 months ago
- [ICLR 2024] The official implementation of our ICLR 2024 paper "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language M… ☆305 · Updated last month
- Weak-to-Strong Jailbreaking on Large Language Models ☆72 · Updated last year
- Codes and datasets of the paper Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment ☆95 · Updated last year
- ToolQA, a new dataset to evaluate the capabilities of LLMs in answering challenging questions with external tools. It offers two levels … ☆254 · Updated last year
- PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to a… ☆341 · Updated last year
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction". ☆190 · Updated 5 months ago