google-research / lm-extraction-benchmark (☆268)

Related projects:
- Training data extraction on GPT-2 (☆166)
- Repository for research in the field of Responsible NLP at Meta. (☆180)
- Differentially private transformers using HuggingFace and Opacus (☆108)
- Python package for measuring memorization in LLMs. (☆107)
- A codebase that makes differentially private training of transformers easy. (☆151)
- Code for generating the ToxiGen dataset, published at ACL 2022. (☆270)
- Starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition. (☆77)
- MEND: Fast Model Editing at Scale (☆229)
- Landing page for TOFU (☆79)
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal (☆275)
- Aligning AI With Shared Human Values (ICLR 2021) (☆233)
- A survey of privacy problems in Large Language Models (LLMs), with a summary of the corresponding paper and relevant code. (☆58)
- Repo for the arXiv preprint "Gradient-based Adversarial Attacks against Text Transformers" (☆98)
- A Comprehensive Assessment of Trustworthiness in GPT Models (☆250)
- WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. Also includes code for RMU, an unlearning method. (☆72)
- A fast, effective data attribution method for neural networks in PyTorch (☆170)
- Improving Alignment and Robustness with Circuit Breakers (☆124)
- A resource repository for machine unlearning in large language models (☆131)
- Official repository for the NeurIPS 2023 paper "Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense" (☆131)
- Jailbreaking GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20. (☆219)
- AI Logging for Interpretability and Explainability 🔬 (☆74)
- ICLR 2024 paper showing properties of safety tuning and exaggerated safety. (☆61)
- A re-implementation of the "Extracting Training Data from Large Language Models" paper by Carlini et al., 2020 (☆30)
- PyTorch + HuggingFace code for RetoMaton: "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval" (ICML 2022) (☆269)
- Code and datasets for the paper "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment" (☆76)
- Official implementation of AdvPrompter (https://arxiv.org/abs/2404.16873) (☆110)
- [EMNLP 2023] Poisoning Retrieval Corpora by Injecting Adversarial Passages (https://arxiv.org/abs/2310.19156) (☆21)