redwoodresearch / Text-Steganography-Benchmark
Code for "Preventing Language Models From Hiding Their Reasoning", which evaluates defenses against LLM steganography.
☆13 · Updated 9 months ago
Related projects
Alternatives and complementary repositories for Text-Steganography-Benchmark
- Weak-to-Strong Jailbreaking on Large Language Models ☆67 · Updated 9 months ago
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers" ☆84 · Updated 8 months ago
- This is the official repository for the "Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP" paper acce… ☆17 · Updated 7 months ago
- ☆28 · Updated last year
- The official repository of the paper "On the Exploitability of Instruction Tuning". ☆57 · Updated 9 months ago
- Codebase for Inference-Time Policy Adapters ☆22 · Updated last year
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] ☆43 · Updated 6 months ago
- Measuring the situational awareness of language models ☆33 · Updated 9 months ago
- ConceptVectors Benchmark and Code for the paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces" ☆30 · Updated last month
- WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆83 · Updated 6 months ago
- ☆26 · Updated 2 months ago
- Llemma formal2formal (tactic prediction) theorem proving experiments ☆18 · Updated last year
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆64 · Updated 9 months ago
- ☆32 · Updated last year
- ☆14 · Updated last month
- Evaluating the Moral Beliefs Encoded in LLMs ☆21 · Updated 9 months ago
- This is the official repository for "Safer-Instruct: Aligning Language Models with Automated Preference Data" ☆17 · Updated 9 months ago
- ☆153 · Updated last year
- Repo for: When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment ☆38 · Updated last year
- Official PyTorch implementation for "Meaning Representations from Trajectories in Autoregressive Models" (ICLR 2024) ☆18 · Updated 6 months ago
- [NeurIPS 2023] PyTorch code for "Can Language Models Teach? Teacher Explanations Improve Student Performance via Theory of Mind" ☆67 · Updated 11 months ago
- ☆31 · Updated last year
- This repository contains the code and data for the paper "SelfIE: Self-Interpretation of Large Language Model Embeddings" by Haozhe Chen,… ☆39 · Updated 8 months ago
- Repo for the research paper "Aligning LLMs to Be Robust Against Prompt Injection" ☆19 · Updated 3 weeks ago
- ☆31 · Updated last year
- Improving Alignment and Robustness with Circuit Breakers ☆154 · Updated 2 months ago
- ☆21 · Updated 3 weeks ago
- A collection of papers tackling automatic fact-checking (particularly of AI-generated content) ☆14 · Updated last year
- [NeurIPS 2024] Official implementation for "AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning" ☆61 · Updated 3 months ago
- Finding semantically meaningful and accurate prompts. ☆46 · Updated last year