koo-ec / Awesome-LLM-Explainability
A curated list of explainability-related papers, articles, and resources focused on Large Language Models (LLMs). This repository aims to provide researchers, practitioners, and enthusiasts with insights into the explainability implications, challenges, and advancements surrounding these powerful models.
☆37 · Updated last month
Alternatives and similar repositories for Awesome-LLM-Explainability
Users that are interested in Awesome-LLM-Explainability are comparing it to the libraries listed below.
- Using Explanations as a Tool for Advanced LLMs ☆66 · Updated 11 months ago
- TrustAgent: Towards Safe and Trustworthy LLM-based Agents ☆50 · Updated 6 months ago
- [NeurIPS 2024] Official implementation for "AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning" ☆138 · Updated 4 months ago
- ☆46 · Updated 3 months ago
- ☆148 · Updated last year
- A novel approach to improve the safety of large language models, enabling them to transition effectively from an unsafe to a safe state. ☆63 · Updated 2 months ago
- [ICML 2024] TrustLLM: Trustworthiness in Large Language Models ☆588 · Updated last month
- [NDSS'25 Best Technical Poster] A collection of automated evaluators for assessing jailbreak attempts. ☆166 · Updated 4 months ago
- ☆180 · Updated last year
- Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs. Empirical tricks for LLM Jailbreaking. (NeurIPS 2024) ☆144 · Updated 8 months ago
- ☆157 · Updated 2 weeks ago
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models ☆135 · Updated 2 weeks ago
- Code repo for ICLR 2024 paper "Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs" ☆126 · Updated last year
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20… ☆317 · Updated last year
- Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding ☆141 · Updated last year
- A Comprehensive Assessment of Trustworthiness in GPT Models ☆300 · Updated 10 months ago
- Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization" ☆64 · Updated 3 weeks ago
- [ICLR 2025] Dissecting adversarial robustness of multimodal language model agents ☆99 · Updated 5 months ago
- Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs ☆87 · Updated 8 months ago
- Toolkit for evaluating the trustworthiness of generative foundation models. ☆109 · Updated last week
- 【ACL 2024】 SALAD benchmark & MD-Judge ☆156 · Updated 5 months ago
- [NeurIPS 2024] Knowledge Circuits in Pretrained Transformers ☆153 · Updated 5 months ago
- LLM-Check: Investigating Detection of Hallucinations in Large Language Models (NeurIPS 2024) ☆25 · Updated 8 months ago
- Code and Results of the Paper: On the Resilience of Multi-Agent Systems with Malicious Agents ☆25 · Updated 6 months ago
- ☆100 · Updated 3 months ago
- [ICML 2024] Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast ☆111 · Updated last year
- (ACL 2025 Main) Code for MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents https://www.arxiv.org/pdf/2503.019… ☆142 · Updated last week
- Official implementation of AdvPrompter https://arxiv.org/abs/2404.16873 ☆159 · Updated last year
- LLM Unlearning ☆172 · Updated last year
- A resource repository for representation engineering in large language models ☆129 · Updated 9 months ago