RapidResponseBench / rapidresponsebench

☆35

Alternatives and similar repositories for rapidresponsebench

Users who are interested in rapidresponsebench are comparing it to the repositories listed below.

ShanglunFengatETHZ / PrivacyBackdoor
Privacy backdoors
☆51 Updated last year
max-andr / adversarial-random-search-gpt4
Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023]
☆43 Updated last year
JonasGeiping / carving
Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives
☆70 Updated last year
safety-research / open-source-alignment-faking
Open Source Replication of Anthropic's Alignment Faking Paper
☆51 Updated 7 months ago
centerforaisafety / wmdp
WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m…
☆155 Updated 5 months ago
tml-epfl / llm-past-tense
Does Refusal Training in LLMs Generalize to the Past Tense? [ICLR 2025]
☆77 Updated 10 months ago
azshue / AutoPoison
The official repository of the paper "On the Exploitability of Instruction Tuning".
☆65 Updated last year
anthropics / sleeper-agents-paper
Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".
☆122 Updated last year
andyzoujm / breaking-llama-guard
Code to break Llama Guard
☆32 Updated last year
ahans30 / goldfish-loss
[NeurIPS 2024] Goldfish Loss: Mitigating Memorization in Generative LLMs
☆92 Updated last year
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆141 Updated 4 months ago
JoshEngels / SAE-Probes
Code for reproducing our paper "Are Sparse Autoencoders Useful? A Case Study in Sparse Probing"
☆31 Updated 7 months ago
arobey1 / advbench
☆44 Updated 2 years ago
google-deepmind / dangerous-capability-evaluations
☆62 Updated last month
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆242 Updated last year
cohere-ai / magikarp
Code for the paper "Fishing for Magikarp"
☆174 Updated 6 months ago
LukeBailey181 / obfuscated-activations
Codebase for Obfuscated Activations Bypass LLM Latent-Space Defenses
☆25 Updated 9 months ago
safety-research / SHADE-Arena
☆20 Updated 5 months ago
allenai / wildteaming
☆37 Updated last year
facebookresearch / SecAlign
Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization"
☆75 Updated 3 months ago
chawins / pal
PAL: Proxy-Guided Black-Box Attack on Large Language Models
☆55 Updated last year
amudide / switch_sae
Efficient Dictionary Learning with Switch Sparse Autoencoders (SAEs)
☆25 Updated 11 months ago
sail-sg / Cheating-LLM-Benchmarks
[ICLR 2025] Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates (Oral)
☆84 Updated last year
LoryPack / LLM-LieDetector
Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆71 Updated last year
ethz-spylab / rlhf_trojan_competition
Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024.
☆115 Updated last year
katiekang1998 / reasoning_generalization
☆33 Updated 10 months ago
tianyu139 / meaning-as-trajectories
Official PyTorch Implementation for Meaning Representations from Trajectories in Autoregressive Models (ICLR 2024)
☆22 Updated last year
sail-sg / Rigging-ChatbotArena
Improving Your Model Ranking on Chatbot Arena by Vote Rigging (ICML 2025)
☆24 Updated 8 months ago
XuandongZhao / weak-to-strong
[ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models
☆89 Updated 6 months ago
rishub-tamirisa / tamper-resistance
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
☆63 Updated 5 months ago