rimon15 / evil_twins

Code for the paper: Prompts have evil twins (EMNLP 2024)

☆18

Alternatives and similar repositories for evil_twins

Users who are interested in evil_twins are comparing it to the repositories listed below.

shengliu66 / ICV
Code for In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering
☆185 Updated 6 months ago
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆125 Updated 2 months ago
LoryPack / LLM-LieDetector
Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆72 Updated last year
Vaidehi99 / InfoDeletionAttacks
☆44 Updated 6 months ago
UW-Madison-Lee-Lab / LanguageInterfacedFineTuning
Code for Language-Interfaced FineTuning for Non-Language Machine Learning Tasks.
☆130 Updated 9 months ago
tianyang-x / SaySelf
Public code repo for paper "SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales"
☆110 Updated 11 months ago
bloomberg / dataless-model-merging
Code release for Dataless Knowledge Fusion by Merging Weights of Language Models (https://openreview.net/forum?id=FCnohuR6AnM)
☆90 Updated 2 years ago
neelsjain / BYOD
The Official Repository for "Bring Your Own Data! Self-Supervised Evaluation for Large Language Models"
☆107 Updated last year
probabilistic-inference-scaling / probabilistic-inference-scaling
☆51 Updated 5 months ago
jonhue / activeft
PyTorch library for Active Fine-Tuning
☆90 Updated last week
zlin7 / UQ-NLG
☆99 Updated last year
prateeky2806 / ties-merging
☆188 Updated last year
EleutherAI / elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆28 Updated last year
hartvigsen-group / composable-interventions
☆28 Updated 6 months ago
ucl-dark / llm_debate
Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"
☆116 Updated last year
QingruZhang / PASTA
PASTA: Post-hoc Attention Steering for LLMs
☆122 Updated 9 months ago
zjunlp / KnowledgeCircuits
[NeurIPS 2024] Knowledge Circuits in Pretrained Transformers
☆155 Updated 6 months ago
ejones313 / auditing-llms
☆57 Updated 2 years ago
UCSB-NLP-Chang / llm_uncertainty
☆34 Updated last year
declare-lab / trust-align
Codes and datasets for the paper Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Ref…
☆63 Updated 6 months ago
MiaoXiong2320 / llm-uncertainty
code repo for ICLR 2024 paper "Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs"
☆131 Updated last year
Thartvigsen / GRACE
[NeurIPS'23] Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors
☆79 Updated 8 months ago
XiangLi1999 / AutoBencher
☆31 Updated last year
deeplearning-wisc / haloscope
source code for NeurIPS'24 paper "HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection"
☆53 Updated 4 months ago
msclar / formatspread
Code accompanying "How I learned to start worrying about prompt formatting".
☆109 Updated 2 months ago
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆178 Updated last year
HannahKirk / prism-alignment
The Prism Alignment Project
☆79 Updated last year
milesaturpin / cot-unfaithfulness
☆47 Updated last year
thestephencasper / explore_establish_exploit_llms
☆31 Updated 2 years ago
IBM / activation-steering
[ICLR 2025] General-purpose activation steering library
☆99 Updated last week