A curated list of safety-related papers, articles, and resources focused on Large Language Models (LLMs). This repository aims to provide researchers, practitioners, and enthusiasts with insights into the safety implications, challenges, and advancements surrounding these powerful models.
☆1,835 · Apr 18, 2026 · Updated 2 weeks ago
Alternatives and similar repositories for Awesome-LLM-Safety
Users interested in Awesome-LLM-Safety are comparing it to the repositories listed below.
- A reading list for large model safety, security, and privacy (including Awesome LLM Security, Safety, etc.). ☆1,949 · Apr 2, 2026 · Updated last month
- A curation of awesome tools, documents and projects about LLM Security. ☆1,578 · Aug 20, 2025 · Updated 8 months ago
- ☆60 · Jun 13, 2024 · Updated last year
- An easy-to-use Python framework to generate adversarial jailbreak prompts. ☆848 · Mar 30, 2026 · Updated last month
- Papers and resources related to the security and privacy of LLMs 🤖 ☆576 · Jun 8, 2025 · Updated 10 months ago
- [NAACL 2024] Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey ☆111 · Aug 7, 2024 · Updated last year
- Accepted by IJCAI-24 Survey Track ☆231 · Aug 25, 2024 · Updated last year
- Universal and Transferable Attacks on Aligned Language Models ☆4,644 · Aug 2, 2024 · Updated last year
- 😎 An up-to-date & curated list of awesome Attacks on Large-Vision-Language-Models papers, methods & resources. ☆541 · Apr 17, 2026 · Updated 2 weeks ago
- Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, exciting jailbreak methods on LLMs. It contains papers, codes, data… ☆1,338 · Mar 30, 2026 · Updated last month
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20… ☆348 · Feb 23, 2024 · Updated 2 years ago
- [ICLR 2024] The official implementation of our ICLR 2024 paper "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language M…" ☆440 · Jan 22, 2025 · Updated last year
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Language Models [NeurIPS 2024 Datasets and Benchmarks Track] ☆584 · Apr 4, 2025 · Updated last year
- A survey on harmful fine-tuning attacks for large language models (ACM CSUR) ☆239 · Updated this week
- Official repository for the ACL 2024 paper "SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding" ☆152 · Jul 19, 2024 · Updated last year
- ☆732 · Jul 2, 2025 · Updated 10 months ago
- A lightweight library for large language model (LLM) jailbreaking defense. ☆60 · Sep 11, 2025 · Updated 7 months ago
- Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety ☆263 · Apr 12, 2026 · Updated 3 weeks ago
- Awesome Large Reasoning Model (LRM) Safety. This repository is used to collect security-related research on large reasoning models such as … ☆82 · Updated this week
- ☆77 · Jan 21, 2026 · Updated 3 months ago
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal ☆936 · Aug 16, 2024 · Updated last year
- Accepted by ECCV 2024 ☆203 · Oct 15, 2024 · Updated last year
- Repository for the AAAI 2024 (Oral) paper "Visual Adversarial Examples Jailbreak Large Language Models" ☆273 · May 13, 2024 · Updated last year
- Chinese safety prompts for evaluating and improving the safety of LLMs. ☆1,153 · Feb 27, 2024 · Updated 2 years ago
- ☆185 · Oct 31, 2025 · Updated 6 months ago
- Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs. Empirical tricks for LLM jailbreaking. (NeurIPS 2024) ☆161 · Nov 30, 2024 · Updated last year
- [ACL 2024] SALAD benchmark & MD-Judge ☆175 · Mar 8, 2025 · Updated last year
- [ICML 2024] TrustLLM: Trustworthiness in Large Language Models ☆623 · Jun 24, 2025 · Updated 10 months ago
- Official repo for GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts ☆579 · Feb 27, 2026 · Updated 2 months ago
- A Survey on Jailbreak Attacks and Defenses against Multimodal Generative Models ☆317 · Jan 11, 2026 · Updated 3 months ago
- A fast and lightweight implementation of the GCG algorithm in PyTorch ☆331 · May 13, 2025 · Updated 11 months ago
- Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback ☆1,600 · Nov 24, 2025 · Updated 5 months ago
- Official GitHub repo for SafetyBench, a comprehensive benchmark to evaluate LLMs' safety. [ACL 2024] ☆283 · Jul 28, 2025 · Updated 9 months ago
- ☆199 · Nov 26, 2023 · Updated 2 years ago
- Code & data for the paper "Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents" [NeurIPS 2024] ☆112 · Sep 27, 2024 · Updated last year
- ☆166 · Sep 2, 2024 · Updated last year
- Awesome-LLM-Robustness: a curated list of Uncertainty, Reliability and Robustness in Large Language Models ☆818 · Apr 23, 2026 · Updated last week
- [AAAI'25 (Oral)] Jailbreaking Large Vision-language Models via Typographic Visual Prompts ☆202 · Jun 26, 2025 · Updated 10 months ago
- Official repository for the paper "Safety Alignment Should Be Made More Than Just a Few Tokens Deep" ☆181 · Apr 23, 2025 · Updated last year