microsoft / TaskTracker
TaskTracker is an approach to detecting task drift in Large Language Models (LLMs) by analysing their internal activations. It provides both a simple linear probe-based method and a more sophisticated metric learning method for this detection. The project also releases the computationally expensive activation data to stimulate further AI safety research…
☆43 · Updated last month
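The description above mentions a linear probe over internal activations; the sketch below only illustrates that general idea on synthetic data. It is not TaskTracker's released code: the activation deltas, dimensions, and the use of scikit-learn's LogisticRegression are illustrative assumptions standing in for the per-layer hidden states the project actually extracts.

```python
# Minimal sketch (assumptions, not TaskTracker's actual pipeline): train a
# linear probe that flags task drift from the change in an LLM's internal
# activations. The vectors below are synthetic stand-ins for the hidden
# states extracted before and after the model reads untrusted text.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_dim = 256          # stand-in for the model's hidden size
n_examples = 2000

# "Clean" examples: activations barely move after the extra text is read.
clean_delta = rng.normal(0.0, 0.1, size=(n_examples // 2, hidden_dim))

# "Drifted" examples: an injected instruction shifts the activations along
# some direction; here that shift is faked with a constant offset.
drift_direction = rng.normal(0.0, 1.0, size=hidden_dim)
drift_delta = rng.normal(0.0, 0.1, size=(n_examples // 2, hidden_dim)) + 0.3 * drift_direction

X = np.vstack([clean_delta, drift_delta])
y = np.concatenate([np.zeros(n_examples // 2), np.ones(n_examples // 2)])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# The linear probe: a logistic regression over activation deltas.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

scores = probe.predict_proba(X_test)[:, 1]
print(f"ROC-AUC of the linear drift probe: {roc_auc_score(y_test, scores):.3f}")
```

With the released activation dataset, the synthetic arrays would be replaced by the stored activation differences for clean versus injected prompts; the metric learning variant mentioned above would learn an embedding rather than a single linear decision boundary.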
Alternatives and similar repositories for TaskTracker:
Users interested in TaskTracker are comparing it to the repositories listed below
- A benchmark for evaluating the robustness of LLMs and defenses to indirect prompt injection attacks. ☆55 · Updated 9 months ago
- Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization" ☆33 · Updated last week
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆66 · Updated 11 months ago
- A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents. ☆79 · Updated this week
- Accompanying code and SEP dataset for the "Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?" paper. ☆46 · Updated 7 months ago
- The official implementation of our pre-print paper "Automatic and Universal Prompt Injection Attacks against Large Language Models". ☆39 · Updated 3 months ago
- A collection of automated evaluators for assessing jailbreak attempts. ☆102 · Updated this week
- PAL: Proxy-Guided Black-Box Attack on Large Language Models ☆47 · Updated 5 months ago
- Red-Teaming Language Models with DSPy ☆154 · Updated 9 months ago
- [NeurIPS 2024] Official implementation for "AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning" ☆90 · Updated this week
- Jailbreak artifacts for JailbreakBench ☆47 · Updated 2 months ago
- A repository of Language Model Vulnerabilities and Exposures (LVEs). ☆108 · Updated 10 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆176 · Updated 4 months ago
- Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding ☆113 · Updated 6 months ago
- Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [ICLR 2025] ☆251 · Updated last week
- Fluent student-teacher redteaming ☆19 · Updated 6 months ago
- Dataset for the Tensor Trust project ☆36 · Updated 10 months ago
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] ☆43 · Updated 9 months ago
- ☆70 · Updated 2 months ago
- ☆49 · Updated last month
- LLM Self Defense: By Self Examination, LLMs know they are being tricked ☆31 · Updated 8 months ago
- A prompt injection game to collect data for robust ML research ☆50 · Updated this week
- ☆89 · Updated last year
- official implementation of [USENIX Sec'25] StruQ: Defending Against Prompt Injection with Structured Queries ☆25 · Updated last month
- This repository provides implementation to formalize and benchmark Prompt Injection attacks and defenses ☆167 · Updated last week
- ☆47 · Updated 6 months ago
- TAP: An automated jailbreaking method for black-box LLMs ☆138 · Updated last month
- Thorn in a HaizeStack test for evaluating long-context adversarial robustness. ☆26 · Updated 5 months ago
- ☆26 · Updated 2 months ago
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆44 · Updated last week