microsoft / TaskTracker
TaskTracker is an approach to detecting task drift in Large Language Models (LLMs) by analysing their internal activations. It provides a simple linear-probe method and a more sophisticated metric-learning method for this purpose. The project also releases the computationally expensive activation data to stimulate further AI safety research.
☆27 · Updated last month
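For intuition, the probing idea can be sketched in a few lines. The snippet below trains a linear probe (here, a plain logistic-regression classifier) on the change in hidden-state activations between a point before and a point after the model reads external text. The random arrays, shapes, layer choice, and labels are placeholders for illustration only; this is not TaskTracker's actual code, data format, or API.

```python
# Minimal sketch of a linear activation probe for task-drift detection.
# Assumes activations have already been extracted from the LLM at two
# points per conversation: before and after the external/untrusted text
# is processed. All data below is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_examples, hidden_dim = 2000, 4096

# Placeholder activations; in practice these would come from the model's
# hidden states (e.g. one layer's activations at selected token positions).
act_before = rng.normal(size=(n_examples, hidden_dim))
act_after = rng.normal(size=(n_examples, hidden_dim))

# Placeholder labels: 1 = task drift (e.g. an injected instruction), 0 = clean.
labels = rng.integers(0, 2, size=n_examples)

# The probe operates on the activation delta: how much the task
# representation moved after reading the external text.
features = act_after - act_before

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

scores = probe.predict_proba(X_test)[:, 1]
print("drift-detection ROC AUC:", roc_auc_score(y_test, scores))
```

In the more sophisticated variant mentioned above, the logistic probe would be replaced by a learned embedding compared against a distance threshold (a standard metric-learning setup); the activation-based feature construction is the shared starting point.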
Related projects:
- Accompanying code and SEP dataset for the "Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?" paper. ☆44 · Updated 3 months ago
- A benchmark for evaluating the robustness of LLMs and defenses to indirect prompt injection attacks. ☆43 · Updated 5 months ago
- Thorn in a HaizeStack: a test for evaluating long-context adversarial robustness. ☆26 · Updated last month
- A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents. ☆48 · Updated last week
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆59 · Updated 6 months ago
- ☆59 · Updated 11 months ago
- Does Refusal Training in LLMs Generalize to the Past Tense? [arXiv, July 2024] ☆49 · Updated 2 months ago
- Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [arXiv, Apr 2024] ☆181 · Updated last month
- ☆34 · Updated last week
- Red-Teaming Language Models with DSPy ☆116 · Updated 5 months ago
- Official repository for the ACL 2024 paper "SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding" ☆89 · Updated 2 months ago
- PAL: Proxy-Guided Black-Box Attack on Large Language Models ☆45 · Updated last month
- Mark web pages for use with vision-language models ☆14 · Updated last week
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] ☆41 · Updated 4 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆124 · Updated 2 months ago
- ☆30 · Updated last year
- ☆38 · Updated this week
- Code to break Llama Guard ☆27 · Updated 9 months ago
- Run embedding models using ONNX ☆23 · Updated 7 months ago
- Python package for measuring memorization in LLMs. ☆107 · Updated this week
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training". ☆81 · Updated 6 months ago
- Code for the paper "Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation". ☆33 · Updated 5 months ago
- Utilities for loading and running text embeddings with ONNX ☆39 · Updated last month
- A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use ☆106 · Updated 6 months ago
- This dataset contains results from all rounds of Adversarial Nibbler. This data includes adversarial prompts fed into public generative t… ☆16 · Updated 3 months ago
- Independent robustness evaluation of "Improving Alignment and Robustness with Short Circuiting" ☆10 · Updated last month
- RepoQA: Evaluating Long-Context Code Understanding ☆96 · Updated this week
- A novel approach to improving the safety of large language models, enabling them to transition effectively from an unsafe to a safe state. ☆48 · Updated 3 weeks ago
- Small, simple agent task environments for training and evaluation ☆13 · Updated last week
- GraphRag vs Embeddings ☆12 · Updated 2 months ago