Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆71 · Jun 19, 2024 · Updated last year
Alternatives and similar repositories for LLM-LieDetector
Users that are interested in LLM-LieDetector are comparing it to the libraries listed below
- ☆52 · Oct 23, 2023 · Updated 2 years ago
- Code and Data Repo for the CoNLL Paper -- Future Lens: Anticipating Subsequent Tokens from a Single Hidden State ☆20 · Oct 24, 2025 · Updated 4 months ago
- A benchmark for mechanistic discovery of circuits in Transformers ☆16 · Dec 15, 2024 · Updated last year
- ☆35 · Jun 13, 2023 · Updated 2 years ago
- Keeping language models honest by directly eliciting knowledge encoded in their activations. ☆217 · Feb 23, 2026 · Updated last week
- Minimum Description Length probing for neural network representations ☆20 · Jan 28, 2025 · Updated last year
- Official code for the paper: "Metadata Archaeology" ☆19 · May 10, 2023 · Updated 2 years ago
- Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models … ☆243 · Feb 23, 2026 · Updated last week
- Pile Deduplication Code ☆18 · May 15, 2023 · Updated 2 years ago
- Codes and files for the paper Are Emergent Abilities in Large Language Models just In-Context Learning ☆33 · Jan 9, 2025 · Updated last year
- ☆24 · Jul 25, 2024 · Updated last year
- ☆10 · Nov 17, 2022 · Updated 3 years ago
- ☆48 · Sep 29, 2024 · Updated last year
- ☆284 · Mar 2, 2024 · Updated 2 years ago
- ☆24 · Jan 28, 2025 · Updated last year
- Reference implementation of models from Nyonic Model Factory ☆12 · May 13, 2024 · Updated last year
- ☆11 · Apr 10, 2024 · Updated last year
- NeuroSurgeon is a package that enables researchers to uncover and manipulate subnetworks within models in Huggingface Transformers ☆43 · Feb 12, 2025 · Updated last year
- Codes for "Benchmarking the Generation of Fact Checking Explanations" ☆10 · Aug 16, 2024 · Updated last year
- Radiantloom Email Assist 7B is an email-assistant large language model fine-tuned from Zephyr-7B-Beta, over a custom-curated dataset of 1… ☆14 · Jan 19, 2024 · Updated 2 years ago
- ☆15 · Jan 9, 2026 · Updated last month
- Evaluate interpretability methods on localizing and disentangling concepts in LLMs. ☆57 · Oct 30, 2025 · Updated 4 months ago
- Tools for understanding how transformer predictions are built layer-by-layer ☆567 · Aug 7, 2025 · Updated 6 months ago
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆28 · May 23, 2024 · Updated last year
- ☆31 · Nov 7, 2024 · Updated last year
- Code to support the guide to logical induction for software engineers ☆11 · Mar 24, 2025 · Updated 11 months ago
- ☆10 · Feb 3, 2025 · Updated last year
- Scalable Computation of Hessian Diagonals ☆14 · Jun 2, 2024 · Updated last year
- ☆19 · Jul 31, 2025 · Updated 7 months ago
- ☆16 · Mar 22, 2025 · Updated 11 months ago
- Utilities for the HuggingFace transformers library ☆75 · Jan 21, 2023 · Updated 3 years ago
- This repository includes code for the paper "Does Localization Inform Editing? Surprising Differences in Where Knowledge Is Stored vs. Ca… ☆61 · May 9, 2023 · Updated 2 years ago
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆85 · Mar 7, 2025 · Updated 11 months ago
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024. ☆116 · Jun 13, 2024 · Updated last year
- ☆23 · Jan 27, 2026 · Updated last month
- [ICML 2023] "Robust Weight Signatures: Gaining Robustness as Easy as Patching Weights?" by Ruisi Cai, Zhenyu Zhang, Zhangyang Wang ☆16 · May 4, 2023 · Updated 2 years ago
- A python sdk for LLM finetuning and inference on runpod infrastructure ☆19 · Feb 16, 2026 · Updated 2 weeks ago
- Repo for ICML23 "Why do Nearest Neighbor Language Models Work?" ☆59 · Jan 12, 2023 · Updated 3 years ago
- ☆25 · Nov 14, 2022 · Updated 3 years ago