microsoft / TaskTracker
TaskTracker is an approach to detecting task drift in Large Language Models (LLMs) by analysing their internal activations. It provides a simple linear probe-based method and a more sophisticated metric learning method to achieve this. The project also releases the computationally expensive activation data to stimulate further AI safety research…
☆48 · Updated last week
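As a rough illustration of the linear-probe idea in the description above, the sketch below trains a simple classifier on the difference between hidden activations taken before and after a model reads untrusted text. The data is synthetic and the variable names are illustrative assumptions; this is not TaskTracker's released code, API, or activation dataset.

```python
# Minimal sketch (assumed setup, not TaskTracker's actual pipeline):
# a linear probe that flags "task drift" from activation deltas.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_dim = 256
n_per_class = 500

# Synthetic stand-ins for activation deltas: clean examples cluster near zero,
# "drifted" examples (e.g. after an injected instruction) are shifted.
clean_deltas = rng.normal(0.0, 1.0, size=(n_per_class, hidden_dim))
drifted_deltas = rng.normal(0.5, 1.0, size=(n_per_class, hidden_dim))

X = np.vstack([clean_deltas, drifted_deltas])
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The "simple linear probe" corresponds to a linear classifier over the deltas.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"held-out probe accuracy: {probe.score(X_test, y_test):.3f}")
```

In practice the deltas would come from a specific transformer layer of the instrumented LLM rather than synthetic Gaussians, and the metric-learning variant mentioned in the description would replace the logistic-regression probe with a learned embedding and a distance threshold.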
Alternatives and similar repositories for TaskTracker:
Users who are interested in TaskTracker are comparing it to the libraries listed below.
- A benchmark for evaluating the robustness of LLMs and defenses to indirect prompt injection attacks. ☆60 · Updated 10 months ago
- Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization" ☆39 · Updated last month
- A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents. ☆104 · Updated last week
- PAL: Proxy-Guided Black-Box Attack on Large Language Models ☆49 · Updated 6 months ago
- [ICML 2024] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability ☆138 · Updated 2 months ago
- The official implementation of our pre-print paper "Automatic and Universal Prompt Injection Attacks against Large Language Models". ☆42 · Updated 4 months ago
- ☆31 · Updated 4 months ago
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆67 · Updated last year
- LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked ☆31 · Updated 9 months ago
- ☆91 · Updated last year
- ☆85 · Updated last week
- Fluent student-teacher redteaming ☆19 · Updated 7 months ago
- Code to generate NeuralExecs (prompt injection for LLMs) ☆20 · Updated 3 months ago
- A repository of Language Model Vulnerabilities and Exposures (LVEs). ☆108 · Updated last year
- TAP: An automated jailbreaking method for black-box LLMs ☆150 · Updated 3 months ago
- AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks ☆38 · Updated 9 months ago
- Accompanying code and SEP dataset for the paper "Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?" ☆49 · Updated this week
- [NDSS'25 Poster] A collection of automated evaluators for assessing jailbreak attempts. ☆120 · Updated this week
- [ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use ☆130 · Updated 11 months ago
- [NeurIPS 2024] Official implementation of "AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning" ☆100 · Updated last month
- Official implementation of [USENIX Sec'25] StruQ: Defending Against Prompt Injection with Structured Queries ☆29 · Updated 2 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆189 · Updated 5 months ago
- ☆54 · Updated 8 months ago
- ☆51 · Updated 2 months ago
- Official repository for the ACL 2024 paper "SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding" ☆120 · Updated 7 months ago
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training". ☆93 · Updated last year
- Dataset for the Tensor Trust project ☆37 · Updated 11 months ago
- ☆81 · Updated last year
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] ☆43 · Updated 10 months ago
- Whispers in the Machine: Confidentiality in LLM-integrated Systems ☆34 · Updated last week