koo-ec / Awesome-LLM-Explainability
A curated list of explainability-related papers, articles, and resources focused on Large Language Models (LLMs). This repository aims to give researchers, practitioners, and enthusiasts insight into the implications, challenges, and advancements of explainability for these powerful models.
☆43 · Updated 3 months ago
Alternatives and similar repositories for Awesome-LLM-Explainability
Users interested in Awesome-LLM-Explainability are comparing it to the libraries listed below.
- ☆153 · Updated last year
- Using Explanations as a Tool for Advanced LLMs ☆67 · Updated last year
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models ☆236 · Updated last month
- [NeurIPS 2024] Official implementation for "AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning" ☆155 · Updated 5 months ago
- [NeurIPS 2024] Knowledge Circuits in Pretrained Transformers ☆157 · Updated 7 months ago
- ☆167 · Updated last month
- ☆39 · Updated 10 months ago
- Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs. Empirical tricks for LLM Jailbreaking. (NeurIPS 2024) ☆149 · Updated 9 months ago
- [ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use ☆165 · Updated last year
- Public code repo for paper "SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales" ☆108 · Updated 11 months ago
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆129 · Updated 3 months ago
- Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs ☆91 · Updated 9 months ago
- Codes and datasets for the paper Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Ref… ☆66 · Updated 6 months ago
- [EMNLP 2023] Poisoning Retrieval Corpora by Injecting Adversarial Passages https://arxiv.org/abs/2310.19156 ☆37 · Updated last year
- TrustAgent: Towards Safe and Trustworthy LLM-based Agents ☆53 · Updated 7 months ago
- Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization" ☆70 · Updated 2 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆235 · Updated last year
- A Comprehensive Assessment of Trustworthiness in GPT Models ☆303 · Updated last year
- [ICLR 2025] Dissecting adversarial robustness of multimodal language model agents ☆106 · Updated 7 months ago
- [NDSS'25 Best Technical Poster] A collection of automated evaluators for assessing jailbreak attempts. ☆170 · Updated 5 months ago
- Code repo for ICLR 2024 paper "Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs" ☆132 · Updated last year
- Code for paper: Are Large Language Models Post Hoc Explainers? ☆33 · Updated last year
- Can Knowledge Editing Really Correct Hallucinations? (ICLR 2025) ☆25 · Updated last month
- Official implementation of AdvPrompter https://arxiv.org/abs/2404.16873 ☆164 · Updated last year
- ☆183 · Updated last year
- ☆26 · Updated 5 months ago
- A resource repository for representation engineering in large language models ☆136 · Updated 10 months ago
- A comprehensive resource hub compiling all LLM papers accepted at the International Conference on Learning Representations (ICLR) i… ☆63 · Updated last year
- Codebase for reproducing the experiments of the semantic uncertainty paper (short-phrase and sentence-length experiments). ☆368 · Updated last year
- Source code for NeurIPS'24 paper "HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection" ☆55 · Updated 5 months ago