☆31Jul 14, 2023Updated 2 years ago
Alternatives and similar repositories for explore_establish_exploit_llms
Users that are interested in explore_establish_exploit_llms are comparing it to the libraries listed below
Sorting:
- Explore, Establish, Exploit: Red Teaming Language Models from Scratch☆13Jun 21, 2023Updated 2 years ago
- All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks☆18Apr 24, 2024Updated last year
- Code and data for the NAACL 2021 paper: "XFORMAL: A Benchmark for Multilingual Formality Style Transfer"☆12Jun 7, 2021Updated 4 years ago
- A re-implementation of the "Red Teaming Language Models with Language Models" paper by Perez et al., 2022☆35Oct 9, 2023Updated 2 years ago
- ☆22Mar 31, 2022Updated 3 years ago
- Temporal Neural Networks☆28Updated this week
- Implementation of our paper, "MM-Forecast: A Multimodal Approach to Temporal Event Forecasting with Large Language Models".☆18Apr 16, 2025Updated 10 months ago
- ☆28Mar 20, 2024Updated last year
- [ACL 2025] The official code for "AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection".☆33Aug 4, 2025Updated 6 months ago
- ☆75Jul 2, 2021Updated 4 years ago
- [ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning☆98May 23, 2024Updated last year
- ☆70Feb 4, 2024Updated 2 years ago
- A library of translation-based text similarity measures☆25Dec 11, 2023Updated 2 years ago
- ☆29Aug 31, 2025Updated 6 months ago
- The repository contains code for Adaptive Data Optimization☆32Dec 9, 2024Updated last year
- ☆70Feb 16, 2025Updated last year
- Repository for the Paper (AAAI 2024, Oral) --- Visual Adversarial Examples Jailbreak Large Language Models☆266May 13, 2024Updated last year
- [CIKM 2024] Trojan Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment.☆29Jul 29, 2024Updated last year
- Codes and datasets of the paper Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment☆108Mar 8, 2024Updated last year
- ☆22Sep 20, 2023Updated 2 years ago
- The original Backpack Language Model implementation, a fork of FlashAttention☆71May 29, 2023Updated 2 years ago
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024.☆116Jun 13, 2024Updated last year
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024)☆65Jan 11, 2025Updated last year
- This is a boilerplate project for building mobile applications using Expo, React, and Redux. It provides a solid foundation for creating …☆12Apr 6, 2025Updated 10 months ago
- Beyond imagenet attack (accepted by ICLR 2022) towards crafting adversarial examples for black-box domains.☆61Jun 15, 2022Updated 3 years ago
- ☆37Oct 17, 2024Updated last year
- A Human-LLM Collaborative Dataset for Generative Information-seeking with Attribution☆35Aug 2, 2023Updated 2 years ago
- Repo for "Centaur: Robust Multimodal Fusion for Human Activity Recognition"☆10Jan 9, 2024Updated 2 years ago
- official code for unigame☆19Nov 26, 2025Updated 3 months ago
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20…☆341Feb 23, 2024Updated 2 years ago
- [EMNLP 2025 Main] ConceptVectors Benchmark and Code for the paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces"☆39Aug 20, 2025Updated 6 months ago
- Reading comprehension based question-answering model for news articles.☆11Jun 22, 2022Updated 3 years ago
- https://icml.cc/virtual/2023/poster/24354☆10Aug 15, 2023Updated 2 years ago
- [NeurIPS 2025] The official implementation of the paper "DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agen…☆39Feb 14, 2026Updated 2 weeks ago
- Lightweight library for creating services using just Python☆11Aug 1, 2023Updated 2 years ago
- [ICCV 2023] "TRM-UAP: Enhancing the Transferability of Data-Free Universal Adversarial Perturbation via Truncated Ratio Maximization", Yi…☆13Jul 17, 2024Updated last year
- Starter kit and data loading code for the Trojan Detection Challenge NeurIPS 2022 competition☆33Jul 26, 2023Updated 2 years ago
- ☆10Jun 5, 2023Updated 2 years ago
- A framework for evaluating Machine Translation models.☆12May 26, 2025Updated 9 months ago