A curated list of safety-related papers, articles, and resources focused on Large Language Models (LLMs). This repository aims to provide researchers, practitioners, and enthusiasts with insights into the safety implications, challenges, and advancements surrounding these powerful models.
☆1,789 · Updated Mar 7, 2026
Alternatives and similar repositories for Awesome-LLM-Safety
Users interested in Awesome-LLM-Safety are comparing it to the repositories listed below.
- A reading list for large models safety, security, and privacy (including Awesome LLM Security, Safety, etc.). (☆1,879, updated this week)
- A curation of awesome tools, documents and projects about LLM Security. (☆1,537, updated Aug 20, 2025)
- Papers and resources related to the security and privacy of LLMs 🤖 (☆570, updated Jun 8, 2025)
- Accepted by the IJCAI-24 Survey Track. (☆231, updated Aug 25, 2024)
- An easy-to-use Python framework to generate adversarial jailbreak prompts. (☆820, updated Mar 27, 2025)
- [NAACL 2024] Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey. (☆110, updated Aug 7, 2024)
- ☆58, updated Jun 13, 2024
- Universal and Transferable Attacks on Aligned Language Models. (☆4,534, updated Aug 2, 2024)
- 😎 An up-to-date, curated list of papers, methods, and resources on attacks against Large Vision-Language Models. (☆505, updated Feb 17, 2026)
- [ICLR 2024] The official implementation of "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language M… (☆430, updated Jan 22, 2025)
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20… (☆343, updated Feb 23, 2024)
- A survey on harmful fine-tuning attacks for large language models. (☆233, updated Feb 25, 2026)
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Language Models [NeurIPS 2024 Datasets and Benchmarks Track]. (☆540, updated Apr 4, 2025)
- Awesome-Jailbreak-on-LLMs: a collection of state-of-the-art, novel jailbreak methods on LLMs. It contains papers, codes, data… (☆1,231, updated Feb 6, 2026)
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. (☆875, updated Aug 16, 2024)
- Official repository for the ACL 2024 paper "SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding". (☆151, updated Jul 19, 2024)
- A lightweight library for large language model (LLM) jailbreaking defense. (☆61, updated Sep 11, 2025)
- Repository for the paper (AAAI 2024, Oral) "Visual Adversarial Examples Jailbreak Large Language Models". (☆266, updated May 13, 2024)
- Awesome Large Reasoning Model (LRM) Safety. This repository collects security-related research on large reasoning models such as … (☆82, updated this week)
- Safety at Scale: A Comprehensive Survey of Large Model Safety. (☆228, updated Feb 3, 2026)
- ☆701, updated Jul 2, 2025
- Accepted by ECCV 2024. (☆192, updated Oct 15, 2024)
- ☆75, updated Jan 21, 2026
- ☆178, updated Oct 31, 2025
- [ICML 2024] TrustLLM: Trustworthiness in Large Language Models. (☆618, updated Jun 24, 2025)
- A Survey on Jailbreak Attacks and Defenses against Multimodal Generative Models. (☆308, updated Jan 11, 2026)
- Chinese safety prompts for evaluating and improving the safety of LLMs. (☆1,132, updated Feb 27, 2024)
- [ACL 2024] SALAD benchmark & MD-Judge. (☆171, updated Mar 8, 2025)
- Official repo for GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts. (☆571, updated Feb 27, 2026)
- [NeurIPS 2024] Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs, with empirical tricks for LLM jailbreaking. (☆163, updated Nov 30, 2024)
- A fast and lightweight implementation of the GCG algorithm in PyTorch. (☆319, updated May 13, 2025)
- Code and data for the paper "Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents" [NeurIPS 2024]. (☆109, updated Sep 27, 2024)
- Official GitHub repo for SafetyBench, a comprehensive benchmark to evaluate LLMs' safety. [ACL 2024] (☆273, updated Jul 28, 2025)
- Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback. (☆1,589, updated Nov 24, 2025)
- ☆197, updated Nov 26, 2023
- Reading list on hallucination in LLMs. Check out our new survey paper: "Siren’s Song in the AI Ocean: A Survey on Hallucination in Large … (☆1,076, updated Sep 27, 2025)
- Awesome-LLM-Robustness: a curated list of Uncertainty, Reliability and Robustness in Large Language Models. (☆812, updated May 21, 2025)
- Official repository for the paper "Safety Alignment Should Be Made More Than Just a Few Tokens Deep". (☆174, updated Apr 23, 2025)
- [AAAI'25 (Oral)] Jailbreaking Large Vision-Language Models via Typographic Visual Prompts. (☆192, updated Jun 26, 2025)
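Several of the attack repositories above (the GCG implementation, AutoDAN, GPTFUZZER) build on the same core idea: greedy, gradient-guided search over a discrete prompt suffix. A minimal sketch of that inner loop is below, using a toy linear "model" in place of a real LLM; the embedding table `E`, weights `w`, `target`, and the loss are illustrative stand-ins invented for this sketch, not any listed repository's actual code or API.

```python
# Toy greedy coordinate-gradient search (GCG-style), with a stand-in objective:
# loss(suffix) = || sum_i w[i] * E[suffix[i]] - target ||^2  (all values synthetic).
import numpy as np

rng = np.random.default_rng(0)
V, D, L, K = 50, 8, 6, 4          # vocab size, embed dim, suffix length, top-k
E = rng.standard_normal((V, D))    # toy token-embedding table
target = rng.standard_normal(D)    # toy "adversarial" objective vector
w = rng.random(L) + 0.5            # per-position weights

def loss_fn(idx):
    s = (w[:, None] * E[idx]).sum(axis=0)
    return float(((s - target) ** 2).sum())

suffix = rng.integers(0, V, L)     # random initial token ids
init_loss = loss_fn(suffix)
best_loss = init_loss

for _ in range(50):
    s = (w[:, None] * E[suffix]).sum(axis=0)
    resid = s - target
    # Gradient of the loss w.r.t. the one-hot indicator at (pos, tok):
    # d loss / d H[pos, tok] = 2 * w[pos] * (E[tok] . resid)
    grad = 2.0 * np.outer(w, E @ resid)        # shape (L, V)
    # Top-k candidate substitutions per position: most negative gradient entries.
    topk = np.argsort(grad, axis=1)[:, :K]
    # Evaluate each candidate single-token swap exactly; keep the best.
    best = suffix.copy()
    for pos in range(L):
        for tok in topk[pos]:
            cand = suffix.copy()
            cand[pos] = tok
            l = loss_fn(cand)
            if l < best_loss:
                best_loss, best = l, cand
    suffix = best
```

Real implementations differ mainly in scale and objective: the gradient is taken through a full LLM's cross-entropy loss on a harmful target completion, and candidate swaps are batch-evaluated on GPU, but the propose-with-gradients, verify-exactly structure is the same.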