[ICML 2024] TrustLLM: Trustworthiness in Large Language Models
☆618 · Jun 24, 2025 · Updated 8 months ago
Alternatives and similar repositories for TrustLLM
Users interested in TrustLLM are comparing it to the libraries listed below.
- [ICLR'26, NAACL'25 Demo] Toolkit & Benchmark for evaluating the trustworthiness of generative foundation models. ☆127 · Aug 22, 2025 · Updated 6 months ago
- [NeurIPS 2024] HonestLLM: Toward an Honest and Helpful Large Language Model ☆29 · Jun 10, 2025 · Updated 8 months ago
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20… ☆341 · Feb 23, 2024 · Updated 2 years ago
- A curated list of safety-related papers, articles, and resources focused on Large Language Models (LLMs). This repository aims to provide… ☆1,783 · Feb 1, 2026 · Updated last month
- A reading list for large model safety, security, and privacy (including Awesome LLM Security, Safety, etc.). ☆1,870 · Feb 23, 2026 · Updated last week
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal ☆864 · Aug 16, 2024 · Updated last year
- A Comprehensive Assessment of Trustworthiness in GPT Models ☆314 · Sep 16, 2024 · Updated last year
- Awesome Large Reasoning Model (LRM) Safety. This repository collects security-related research on large reasoning models such as … ☆82 · Feb 27, 2026 · Updated last week
- This repository provides an original implementation of Detecting Pretraining Data from Large Language Models by *Weijia Shi, *Anirudh Aji… ☆242 · Nov 3, 2023 · Updated 2 years ago
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Language Models [NeurIPS 2024 Datasets and Benchmarks Track] ☆535 · Apr 4, 2025 · Updated 11 months ago
- Papers and resources related to the security and privacy of LLMs 🤖 ☆568 · Jun 8, 2025 · Updated 8 months ago
- [ICML 2024] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability ☆176 · Dec 18, 2024 · Updated last year
- Accepted by ECCV 2024 ☆192 · Oct 15, 2024 · Updated last year
- RewardBench: the first evaluation tool for reward models. ☆697 · Feb 16, 2026 · Updated 2 weeks ago
- ☆24 · Dec 8, 2024 · Updated last year
- The papers are organized according to our survey "Evaluating Large Language Models: A Comprehensive Survey". ☆793 · May 8, 2024 · Updated last year
- [NeurIPS 2024 Oral] Aligner: Efficient Alignment by Learning to Correct ☆191 · Jan 16, 2025 · Updated last year
- Representation Engineering: A Top-Down Approach to AI Transparency ☆953 · Aug 14, 2024 · Updated last year
- This is the starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition. ☆90 · May 19, 2024 · Updated last year
- [ICLR 2024] The official implementation of our ICLR 2024 paper "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language M… ☆430 · Jan 22, 2025 · Updated last year
- A curated list of trustworthy deep learning papers, updated daily. ☆382 · Feb 20, 2026 · Updated 2 weeks ago
- Universal and Transferable Attacks on Aligned Language Models ☆4,534 · Aug 2, 2024 · Updated last year
- [AAAI'25 (Oral)] Jailbreaking Large Vision-language Models via Typographic Visual Prompts ☆192 · Jun 26, 2025 · Updated 8 months ago
- TruthfulQA: Measuring How Models Imitate Human Falsehoods ☆886 · Jan 16, 2025 · Updated last year
- ☆197 · Nov 26, 2023 · Updated 2 years ago
- Improving Alignment and Robustness with Circuit Breakers ☆258 · Sep 24, 2024 · Updated last year
- [ICML 2024] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models. ☆85 · Jan 19, 2025 · Updated last year
- [ICLR'25] DataGen: Unified Synthetic Dataset Generation via Large Language Models ☆66 · Mar 8, 2025 · Updated 11 months ago
- Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [ICLR 2025] ☆379 · Jan 23, 2025 · Updated last year
- The repository for the survey paper "Survey on Large Language Models Factuality: Knowledge, Retrieval and Domain-Specificity" ☆341 · Apr 25, 2024 · Updated last year
- Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models" ☆544 · Jan 17, 2025 · Updated last year
- ☆14 · Feb 26, 2025 · Updated last year
- ☆12 · Apr 22, 2024 · Updated last year
- Code for the paper "Universal Jailbreak Backdoors from Poisoned Human Feedback" ☆66 · Apr 24, 2024 · Updated last year
- ☆320 · Sep 18, 2024 · Updated last year
- ☆313 · Jun 9, 2024 · Updated last year
- ☆17 · Dec 21, 2023 · Updated 2 years ago
- Aligning Large Language Models with Human: A Survey ☆741 · Sep 11, 2023 · Updated 2 years ago
- A framework for few-shot evaluation of language models. ☆11,540 · Updated this week