aiverify-foundation / moonshot-data
Contains all assets needed to run with the Moonshot Library (Connectors, Datasets, and Metrics).
☆15, updated this week
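For orientation, the sketch below shows how one might read a dataset asset from this repository in plain Python. The file path and the field names (`examples`, `input`, `target`) are assumptions for illustration only, not the documented Moonshot schema; check the actual JSON files in the repository before relying on them.

```python
import json
from pathlib import Path

# Hypothetical path: moonshot-data ships datasets as JSON files.
# The layout and schema used here are assumptions for illustration.
DATASET_PATH = Path("moonshot-data/datasets/example-dataset.json")

def load_prompts(path: Path) -> list[dict]:
    """Return prompt/target pairs from a Moonshot-style dataset file.

    The field names ("examples", "input", "target") are assumed,
    not a documented contract.
    """
    with path.open(encoding="utf-8") as f:
        data = json.load(f)
    return [
        {"prompt": ex.get("input", ""), "expected": ex.get("target", "")}
        for ex in data.get("examples", [])
    ]

if __name__ == "__main__":
    if DATASET_PATH.exists():
        # Print the first few prompt/target pairs as a sanity check.
        for pair in load_prompts(DATASET_PATH)[:3]:
            print(pair["prompt"], "->", pair["expected"])
```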
Related projects:
- Röttger et al. (2023): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" (☆55, updated 8 months ago)
- Code and datasets for the paper "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment" (☆76, updated 6 months ago)
- Run safety benchmarks against AI models and view detailed reports showing how well they performed. (☆50, updated this week)
- [ACL 2024] SALAD benchmark & MD-Judge (☆81, updated this week)
- Moonshot: A simple and modular tool to evaluate and red-team any LLM application. (☆144, updated this week)
- Python package for measuring memorization in LLMs. (☆107, updated this week)
- Papers about red-teaming LLMs and multimodal models. (☆66, updated this week)
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal (☆275, updated last month)
- Dataset for the Tensor Trust project (☆29, updated 6 months ago)
- Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs (☆29, updated 2 months ago)
- Weak-to-Strong Jailbreaking on Large Language Models (☆62, updated 7 months ago)
- An Open Robustness Benchmark for Jailbreaking Language Models [arXiv 2024] (☆169, updated last month)
- Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [arXiv, Apr 2024] (☆181, updated last month)
- AI Verify (☆111, updated this week)
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] (☆41, updated 4 months ago)
- Official repository for the ACL 2024 paper "SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding" (☆89, updated 2 months ago)
- Improving Alignment and Robustness with Circuit Breakers (☆124, updated 2 months ago)
- [ICLR 2024] Data for "Multilingual Jailbreak Challenges in Large Language Models" (☆50, updated 6 months ago)
- Datasets from the paper "Towards Understanding Sycophancy in Language Models" (☆59, updated 10 months ago)
- A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use (☆106, updated 6 months ago)
- Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs (☆156, updated 3 months ago)
- Official implementation of AdvPrompter (https://arxiv.org/abs/2404.16873) (☆110, updated 4 months ago)
- Code for the paper "Defending against LLM Jailbreaking via Backtranslation" (☆20, updated last month)
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction" (☆76, updated 3 weeks ago)
- Explore, Establish, Exploit: Red Teaming Language Models from Scratch (☆10, updated last year)
- Official code for the paper "Evaluating Copyright Takedown Methods for Language Models" (☆14, updated 2 months ago)
- AmpleGCG: Learning a Universal and Transferable Generator of Adversarial Attacks on Both Open and Closed LLMs (☆33, updated 3 weeks ago)