koo-ec / Awesome-LLM-Explainability
A curated list of explainability-related papers, articles, and resources focused on Large Language Models (LLMs). This repository aims to provide researchers, practitioners, and enthusiasts with insights into the explainability implications, challenges, and advancements surrounding these powerful models.
☆36 · Updated 3 weeks ago
Alternatives and similar repositories for Awesome-LLM-Explainability
Users interested in Awesome-LLM-Explainability are comparing it to the repositories listed below.
- ☆147 · Updated last year
- Using Explanations as a Tool for Advanced LLMs ☆65 · Updated 10 months ago
- [NeurIPS 2024] Official implementation for "AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning" ☆130 · Updated 3 months ago
- Can Knowledge Editing Really Correct Hallucinations? (ICLR 2025) ☆19 · Updated last month
- TrustAgent: Towards Safe and Trustworthy LLM-based Agents ☆47 · Updated 5 months ago
- Toolkit for evaluating the trustworthiness of generative foundation models. ☆105 · Updated 3 weeks ago
- Code for "TrustRAG: Enhancing Robustness and Trustworthiness in RAG" ☆41 · Updated 3 months ago
- ☆152 · Updated 3 months ago
- Code for the paper "Are Large Language Models Post Hoc Explainers?" ☆33 · Updated 11 months ago
- [ICLR 2025] Dissecting adversarial robustness of multimodal language model agents ☆96 · Updated 4 months ago
- Official implementation of AdvPrompter (https://arxiv.org/abs/2404.16873) ☆157 · Updated last year
- [ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use ☆150 · Updated last year
- A curated list of resources for activation engineering ☆91 · Updated last month
- A benchmark list for evaluating large language models. ☆130 · Updated 2 weeks ago
- Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization" ☆58 · Updated 2 weeks ago
- [NeurIPS 2024] HonestLLM: Toward an Honest and Helpful Large Language Model ☆26 · Updated last month
- A comprehensive resource hub compiling all LLM papers accepted at the International Conference on Learning Representations (ICLR) i… ☆64 · Updated last year
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆99 · Updated 3 weeks ago
- [ICML 2024] TrustLLM: Trustworthiness in Large Language Models ☆580 · Updated 3 weeks ago
- Public code repo for the paper "SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales" ☆107 · Updated 9 months ago
- ☆113 · Updated 4 months ago
- [ACL'25 Oral] What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective ☆70 · Updated 3 weeks ago
- Code and data for the paper "SelfIE: Self-Interpretation of Large Language Model Embeddings" by Haozhe Chen,… ☆50 · Updated 7 months ago
- ☆122 · Updated last month
- Python package for measuring memorization in LLMs. ☆159 · Updated this week
- [COLING'25] Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers? ☆79 · Updated 5 months ago
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) ☆62 · Updated 6 months ago
- [NeurIPS 2024] Knowledge Circuits in Pretrained Transformers ☆149 · Updated 4 months ago
- ☆175 · Updated last year
- [NDSS'25 Best Technical Poster] A collection of automated evaluators for assessing jailbreak attempts. ☆161 · Updated 3 months ago