koo-ec / Awesome-LLM-Explainability
A curated list of explainability-related papers, articles, and resources focused on Large Language Models (LLMs). This repository aims to provide researchers, practitioners, and enthusiasts with insights into the explainability implications, challenges, and advancements surrounding these powerful models.
☆37 · Updated last month
Alternatives and similar repositories for Awesome-LLM-Explainability
Users that are interested in Awesome-LLM-Explainability are comparing it to the libraries listed below.
- Using Explanations as a Tool for Advanced LLMs ☆66 · Updated 11 months ago
- TrustAgent: Towards Safe and Trustworthy LLM-based Agents ☆50 · Updated 6 months ago
- [NeurIPS 2024] Official implementation for "AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning" ☆138 · Updated 4 months ago
- ☆46 · Updated 3 months ago
- ☆148 · Updated last year
- A novel approach to improve the safety of large language models, enabling them to transition effectively from an unsafe to a safe state. ☆63 · Updated 2 months ago
- [ICML 2024] TrustLLM: Trustworthiness in Large Language Models ☆588 · Updated last month
- [NDSS'25 Best Technical Poster] A collection of automated evaluators for assessing jailbreak attempts. ☆166 · Updated 4 months ago
- ☆180 · Updated last year
- Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs. Empirical tricks for LLM Jailbreaking. (NeurIPS 2024) ☆144 · Updated 8 months ago
- ☆157 · Updated 2 weeks ago
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models ☆135 · Updated 2 weeks ago
- Code repo for ICLR 2024 paper "Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs" ☆126 · Updated last year
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20… ☆317 · Updated last year
- Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding ☆141 · Updated last year
- A Comprehensive Assessment of Trustworthiness in GPT Models ☆300 · Updated 10 months ago
- Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization" ☆64 · Updated 3 weeks ago
- [ICLR 2025] Dissecting adversarial robustness of multimodal language model agents ☆99 · Updated 5 months ago
- Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs ☆87 · Updated 8 months ago
- Toolkit for evaluating the trustworthiness of generative foundation models. ☆109 · Updated last week
- 【ACL 2024】 SALAD benchmark & MD-Judge ☆156 · Updated 5 months ago
- [NeurIPS 2024] Knowledge Circuits in Pretrained Transformers ☆153 · Updated 5 months ago
- LLM-Check: Investigating Detection of Hallucinations in Large Language Models (NeurIPS 2024) ☆25 · Updated 8 months ago
- Code and Results of the Paper: On the Resilience of Multi-Agent Systems with Malicious Agents ☆25 · Updated 6 months ago
- ☆100 · Updated 3 months ago
- [ICML 2024] Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast ☆111 · Updated last year
- (ACL 2025 Main) Code for MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents https://www.arxiv.org/pdf/2503.019… ☆142 · Updated last week
- Official implementation of AdvPrompter https://arxiv.org/abs/2404.16873 ☆159 · Updated last year
- LLM Unlearning ☆172 · Updated last year
- A resource repository for representation engineering in large language models ☆129 · Updated 9 months ago