aounon / certified-llm-safety
☆51 · Updated Aug 10, 2024 (last year)
Alternatives and similar repositories for certified-llm-safety
Users interested in certified-llm-safety are comparing it to the repositories listed below.
- Data for our paper "Defending ChatGPT against Jailbreak Attack via Self-Reminder" ☆20 · Updated Oct 26, 2023 (2 years ago)
- ☆122 · Updated Nov 13, 2023 (2 years ago)
- LLM Self Defense: By Self Examination, LLMs know they are being tricked ☆48 · Updated May 21, 2024 (last year)
- Implementation of the paper "Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing" ☆22 · Updated Jun 9, 2024 (last year)
- Code for our paper "Defending ChatGPT against Jailbreak Attack via Self-Reminder" in NMI ☆56 · Updated Nov 13, 2023 (2 years ago)
- ☆15 · Updated Oct 5, 2020 (5 years ago)
- Official repository for the ACL 2024 paper "SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding" ☆151 · Updated Jul 19, 2024 (last year)
- ☆20 · Updated Nov 4, 2025 (3 months ago)
- PAL: Proxy-Guided Black-Box Attack on Large Language Models ☆57 · Updated Aug 17, 2024 (last year)
- ☆57 · Updated Jun 5, 2024 (last year)
- Code for the ICLR 2025 paper "Failures to Find Transferable Image Jailbreaks Between Vision-Language Models" ☆37 · Updated Jun 1, 2025 (8 months ago)
- Official implementation of the pre-print "Automatic and Universal Prompt Injection Attacks against Large Language Models" ☆68 · Updated Oct 23, 2024 (last year)
- Point cloud analysis ☆23 · Updated Apr 11, 2023 (2 years ago)
- ☆32 · Updated Feb 13, 2024 (2 years ago)
- ☆31 · Updated Oct 7, 2021 (4 years ago)
- [ACL 2024] Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization ☆29 · Updated Jul 9, 2024 (last year)
- A benchmark for prompt injection attacks and defenses in LLMs ☆391 · Updated Oct 29, 2025 (3 months ago)
- Official implementation of the NAACL 2024 paper "A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Lang…" ☆152 · Updated Sep 2, 2025 (5 months ago)
- ☆55 · Updated May 21, 2025 (8 months ago)
- Improving Alignment and Robustness with Circuit Breakers ☆258 · Updated Sep 24, 2024 (last year)
- [arXiv:2311.03191] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker" ☆173 · Updated Feb 20, 2024 (last year)
- [USENIX '25] HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns ☆13 · Updated Mar 1, 2025 (11 months ago)
- Maps Medicare LDS claims data to the Tuva Input Layer so you can easily run the Tuva Project ☆11 · Updated Dec 15, 2025 (2 months ago)
- ☆16 · Updated Aug 15, 2022 (3 years ago)
- Python package for Geometric / Clifford Algebra with PyTorch ☆14 · Updated Jan 25, 2026 (3 weeks ago)
- A benchmark for evaluating safety and trustworthiness in web agents for enterprise scenarios ☆19 · Updated Feb 8, 2026 (last week)
- ☆11 · Updated Sep 2, 2024 (last year)
- Debiasing Through Data Attribution ☆12 · Updated May 23, 2024 (last year)
- Demo repository to lambda-fy your dbt runs ☆11 · Updated Sep 7, 2023 (2 years ago)
- Detection of adversarial examples using influence functions and nearest neighbors ☆37 · Updated Nov 22, 2022 (3 years ago)
- ☆14 · Updated Feb 2, 2025 (last year)
- Official implementation of "Learning to Refuse: Towards Mitigating Privacy Risks in LLMs" ☆10 · Updated Dec 13, 2024 (last year)
- ☆44 · Updated Oct 1, 2024 (last year)
- A fast, lightweight implementation of the GCG algorithm in PyTorch ☆317 · Updated May 13, 2025 (9 months ago)
- [ICLR 2024] Official implementation of "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language M…" ☆427 · Updated Jan 22, 2025 (last year)
- ☆11 · Updated Oct 8, 2021 (4 years ago)
- ☆10 · Updated Mar 6, 2022 (3 years ago)
- Code for "Rethinking Prompt Optimizers: From Prompt Merits to Optimization" ☆12 · Updated Jan 12, 2026 (last month)
- ☆10 · Updated Sep 7, 2022 (3 years ago)