koo-ec / Awesome-LLM-Explainability
A curated list of explainability-related papers, articles, and resources focused on Large Language Models (LLMs). This repository aims to give researchers, practitioners, and enthusiasts insight into the explainability challenges, implications, and advancements surrounding these models.
☆50 · Updated 7 months ago
Alternatives and similar repositories for Awesome-LLM-Explainability
Users interested in Awesome-LLM-Explainability are comparing it to the libraries listed below.
- ☆158 · Updated 2 years ago
- Using Explanations as a Tool for Advanced LLMs ☆69 · Updated last year
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models ☆344 · Updated 6 months ago
- Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs. Empirical tricks for LLM jailbreaking. (NeurIPS 2024) ☆160 · Updated last year
- TrustAgent: Towards Safe and Trustworthy LLM-based Agents ☆56 · Updated 11 months ago
- [NeurIPS 2024] Knowledge Circuits in Pretrained Transformers ☆163 · Updated 2 months ago
- A novel approach to improving the safety of large language models, enabling them to transition effectively from an unsafe to a safe state. ☆71 · Updated 8 months ago
- ☆47 · Updated last week
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆163 · Updated 7 months ago
- ☆184 · Updated 2 months ago
- [EMNLP 2023] Poisoning Retrieval Corpora by Injecting Adversarial Passages https://arxiv.org/abs/2310.19156 ☆47 · Updated 2 years ago
- Code and data for the paper: On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents ☆41 · Updated last month
- Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs ☆104 · Updated last year
- JAILJUDGE: A comprehensive evaluation benchmark that includes a wide range of risk scenarios with complex malicious prompts (e.g., synth… ☆58 · Updated last year
- LLM-Check: Investigating Detection of Hallucinations in Large Language Models (NeurIPS 2024) ☆36 · Updated last year
- [ICML 2024] TrustLLM: Trustworthiness in Large Language Models ☆618 · Updated 7 months ago
- [ICLR 2025] Dissecting adversarial robustness of multimodal language model agents ☆123 · Updated 11 months ago
- ☆141 · Updated 10 months ago
- ☆193 · Updated 2 years ago
- Code repo for the ICLR 2024 paper "Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs" ☆143 · Updated last year
- Survey of Small Language Models from Penn State, ... ☆240 · Updated 2 months ago
- [NeurIPS 2024] Official implementation for "AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning" ☆193 · Updated 9 months ago
- A curated list of Awesome-LLM-Ensemble papers for the survey "Harnessing Multiple Large Language Models: A Survey on LLM Ensemble" ☆191 · Updated last month
- ☆89 · Updated 5 months ago
- A curated list of resources for activation engineering ☆122 · Updated 3 months ago
- [ICLR'26, NAACL'25 Demo] Toolkit & benchmark for evaluating the trustworthiness of generative foundation models ☆125 · Updated 5 months ago
- Code and datasets for the paper Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Ref… ☆71 · Updated 10 months ago
- Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization" ☆83 · Updated 6 months ago
- R-Judge: Benchmarking Safety Risk Awareness for LLM Agents (EMNLP Findings 2024) ☆98 · Updated 3 weeks ago
- A comprehensive resource hub compiling all LLM papers accepted at the International Conference on Learning Representations (ICLR) i… ☆67 · Updated last year