koo-ec / Awesome-LLM-Explainability
A curated list of explainability-related papers, articles, and resources focused on Large Language Models (LLMs). This repository aims to give researchers, practitioners, and enthusiasts insight into the explainability challenges, implications, and advancements surrounding these powerful models.
☆49 · Updated 5 months ago
Alternatives and similar repositories for Awesome-LLM-Explainability
Users who are interested in Awesome-LLM-Explainability are comparing it to the repositories listed below.
- ☆158 · Updated last year
- Using Explanations as a Tool for Advanced LLMs · ☆68 · Updated last year
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models · ☆314 · Updated 4 months ago
- Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization" · ☆76 · Updated 5 months ago
- ☆39 · Updated last year
- A curated list of resources for activation engineering · ☆119 · Updated 2 months ago
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods · ☆158 · Updated 6 months ago
- Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs. Empirical tricks for LLM jailbreaking. (NeurIPS 2024) · ☆156 · Updated last year
- Code for the paper "Uncertainty Estimation and Quantification for LLMs: A Simple Supervised Approach" · ☆24 · Updated last year
- ☆178 · Updated last month
- This repository contains the code and data for the paper "SelfIE: Self-Interpretation of Large Language Model Embeddings" by Haozhe Chen,… · ☆55 · Updated last year
- Can Knowledge Editing Really Correct Hallucinations? (ICLR 2025) · ☆27 · Updated 4 months ago
- [ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use · ☆174 · Updated last year
- Python package for measuring memorization in LLMs · ☆176 · Updated 5 months ago
- Code for the paper "Are Large Language Models Post Hoc Explainers?" · ☆34 · Updated last year
- [NeurIPS 2024] Official implementation for "AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning" · ☆181 · Updated 8 months ago
- [NeurIPS 2024] Knowledge Circuits in Pretrained Transformers · ☆160 · Updated last month
- ☆83 · Updated 4 months ago
- Improving Alignment and Robustness with Circuit Breakers · ☆251 · Updated last year
- Official repository for the ACL 2024 paper "SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding" · ☆152 · Updated last year
- Code repo for the ICLR 2024 paper "Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs" · ☆137 · Updated last year
- A curated list of Awesome-LLM-Ensemble papers for the survey "Harnessing Multiple Large Language Models: A Survey on LLM Ensemble" · ☆176 · Updated this week
- [ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models" · ☆65 · Updated last year
- Official implementation of the ICLR'24 paper "Curiosity-driven Red Teaming for Large Language Models" (https://openreview.net/pdf?id=4KqkizX…) · ☆84 · Updated last year
- [ICML 2024] TrustLLM: Trustworthiness in Large Language Models · ☆619 · Updated 6 months ago
- ☆191 · Updated 2 years ago
- Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs · ☆100 · Updated last year
- The official repository for the guided jailbreak benchmark · ☆26 · Updated 4 months ago
- A novel approach to improving the safety of large language models, enabling them to transition effectively from an unsafe to a safe state · ☆73 · Updated 7 months ago
- [EMNLP 2023] Poisoning Retrieval Corpora by Injecting Adversarial Passages (https://arxiv.org/abs/2310.19156) · ☆45 · Updated 2 years ago