liamdugan / raid
RAID is the largest and most challenging benchmark for machine-generated text detectors. (ACL 2024)
☆48 · Updated this week
Alternatives and similar repositories for raid:
Users interested in raid are comparing it to the repositories listed below.
- ☆15 · Updated 3 months ago
- ☆118 · Updated last year
- Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers". ☆64 · Updated 10 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆174 · Updated 3 months ago
- Weak-to-Strong Jailbreaking on Large Language Models ☆73 · Updated 10 months ago
- Official code for the paper "Evaluating Copyright Takedown Methods for Language Models" ☆16 · Updated 6 months ago
- Official implementation of AdvPrompter (https://arxiv.org/abs/2404.16873) ☆134 · Updated 8 months ago
- Röttger et al. (2023): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆77 · Updated last year
- Official repository for our NeurIPS 2023 paper "Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense…" ☆150 · Updated last year
- Repository for the Bias Benchmark for QA dataset. ☆94 · Updated last year
- ☆158 · Updated last year
- Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models" ☆83 · Updated 4 months ago
- ICLR 2024 paper showing properties of safety tuning and exaggerated safety. ☆75 · Updated 8 months ago
- ☆57 · Updated 3 weeks ago
- Code and datasets for the paper "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment" ☆88 · Updated 10 months ago
- Can AI-Generated Text be Reliably Detected? ☆65 · Updated last year
- Official repository for ACL 2024 paper "SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding" ☆112 · Updated 5 months ago
- ☆36 · Updated last year
- Python package for measuring memorization in LLMs. ☆134 · Updated last month
- Repo for the paper "Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge" ☆12 · Updated 10 months ago
- LLM experiments done during SERI MATS, focusing on activation steering / interpreting activation spaces ☆85 · Updated last year
- ☆44 · Updated 6 months ago
- AmpleGCG: Learning a Universal and Transferable Generator of Adversarial Attacks on Both Open and Closed LLM ☆51 · Updated 2 months ago
- The latest papers about detection of LLM-generated text and code ☆244 · Updated last week
- ☆44 · Updated last year
- ☆24 · Updated 3 months ago
- "Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning" by Chongyu Fan*, Jiancheng Liu*, Licong Lin*, Jingh… ☆21 · Updated this week
- For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research. ☆82 · Updated this week
- [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following ☆120 · Updated 6 months ago
- SeqXGPT: An advanced method for sentence-level AI-generated text detection. ☆79 · Updated last year