Test LLMs against jailbreaks and unprecedented harms
☆40Oct 19, 2024Updated last year
Alternatives and similar repositories for walledeval
Users who are interested in walledeval are comparing it to the libraries listed below
- Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique☆18Aug 22, 2024Updated last year
- The official repository for the guided jailbreak benchmark☆28Jul 28, 2025Updated 7 months ago
- ☆25Sep 3, 2025Updated 6 months ago
- Our EMNLP 2022 paper on VIP-Based Prompting for Parameter-Efficient Learning☆10Oct 22, 2022Updated 3 years ago
- ☆13Aug 26, 2024Updated last year
- SQL-RL-GEN is an algorithm based on a reinforcement learning approach with a reward function generated by an LLM to guide the agent's …☆19Sep 18, 2025Updated 5 months ago
- Constructing a community of LLM-based agents in Minecraft☆16Nov 27, 2025Updated 3 months ago
- A novel jailbreak attack unveiling an overlooked attack surface inherently in the chain-of-thought reasoning trajectory of LLMs☆22Sep 18, 2025Updated 5 months ago
- Example Agents for DIAMBRA Arena Environments☆17Sep 3, 2024Updated last year
- ☆22Dec 16, 2024Updated last year
- Official PyTorch implementation of "Query-Efficient Black-Box Red Teaming via Bayesian Optimization" (ACL'23)☆15Jul 9, 2023Updated 2 years ago
- Revisiting Character-level Adversarial Attacks for Language Models, ICML 2024☆19Feb 12, 2025Updated last year
- [ACL 2025] The official code for "AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection".☆33Aug 4, 2025Updated 6 months ago
- A gdb for fuzzing☆22Nov 26, 2021Updated 4 years ago
- ☆24Jun 17, 2025Updated 8 months ago
- Our EMNLP 2022 paper on MCQA☆23Jan 15, 2023Updated 3 years ago
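- Global AI Offense and Defense Challenge, Track 1: second-place solution for safety vaccine injection in large-model image generation☆26Nov 7, 2024Updated last year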
- ☆22Mar 16, 2023Updated 2 years ago
- This repository contains the dataset and the PyTorch implementations of the models from the paper CIDER: Commonsense Inference for Dialog…☆27Oct 30, 2022Updated 3 years ago
- [NDSS'25 Best Technical Poster] A collection of automated evaluators for assessing jailbreak attempts.☆187Apr 1, 2025Updated 11 months ago
- Restore safety in fine-tuned language models through task arithmetic☆32Mar 28, 2024Updated last year
- Codes and datasets of the paper Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment☆108Mar 8, 2024Updated last year
- [NeurIPS 2024] Accelerating Greedy Coordinate Gradient and General Prompt Optimization via Probe Sampling☆34Nov 8, 2024Updated last year
- This repository provides a benchmark for prompt injection attacks and defenses in LLMs☆396Oct 29, 2025Updated 4 months ago
- Simultaneously Optimizing Perturbations and Positions for Black-box Adversarial Patch Attacks (TPAMI 2022)☆35Feb 9, 2023Updated 3 years ago
- ☆39Apr 15, 2024Updated last year
- ☆43Feb 9, 2026Updated 3 weeks ago
- [KO-Platy🥮] KO-platypus model, fine-tuned from llama-2-ko using Korean-Open-platypus☆73Aug 24, 2025Updated 6 months ago
- DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling☆36Jul 12, 2024Updated last year
- Indexing framework designed for the automated creation of structured knowledge bases in Azure AI Search☆14Jun 18, 2025Updated 8 months ago
- Evaluation of Oasis Platform - simple install, UI and API☆14Feb 9, 2026Updated 3 weeks ago
- Entry for the 2020 First Hunan Province Artificial Intelligence Competition☆11Feb 17, 2022Updated 4 years ago
- The Pair App is employed by the Agency of Learning for team management and communication.☆10Apr 13, 2024Updated last year
- YOLO object detection algorithm☆15Jul 27, 2025Updated 7 months ago
- ☆14May 1, 2023Updated 2 years ago
- [USENIX'25] HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns☆13Mar 1, 2025Updated last year
- Precision Knowledge Editing (PKE): A novel method to reduce toxicity in LLMs while preserving performance, with robust evaluations and ha…☆11Nov 26, 2024Updated last year
- ☆11Jul 10, 2024Updated last year
- This repo contains documentation related to the operation of the OpenBytes project.☆13Oct 29, 2021Updated 4 years ago