mdoumbouya / h4rm3lLinks
A Domain-Specific Language, Jailbreak Attack Synthesizer and Dynamic LLM Redteaming Toolkit
☆26Updated last year
Alternatives and similar repositories for h4rm3l
Users that are interested in h4rm3l are comparing it to the libraries listed below
Sorting:
- Codes and datasets of the paper Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment☆109Updated last year
- [ICLR 2024]Data for "Multilingual Jailbreak Challenges in Large Language Models"☆96Updated last year
- TACL 2025: Investigating Adversarial Trigger Transfer in Large Language Models☆19Updated 4 months ago
- ☆32Updated last year
- ☆44Updated last year
- [ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models☆90Updated 7 months ago
- Restore safety in fine-tuned language models through task arithmetic☆31Updated last year
- ☆57Updated last year
- Align your LM to express calibrated verbal statements of confidence in its long-form generations.☆28Updated last year
- Repo for paper: Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge☆14Updated last year
- This repository contains data, code and models for contextual noncompliance.☆24Updated last year
- Code for "Tracing Knowledge in Language Models Back to the Training Data"☆39Updated 2 years ago
- [ICLR 2025] General-purpose activation steering library☆130Updated 3 months ago
- Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".☆80Updated last year
- ICLR2024 Paper. Showing properties of safety tuning and exaggerated safety.☆89Updated last year
- Code for the paper "Fishing for Magikarp"☆176Updated 7 months ago
- The official repository of the paper "On the Exploitability of Instruction Tuning".☆66Updated last year
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"☆117Updated 9 months ago
- Official Repository for Dataset Inference for LLMs☆43Updated last year
- Github repository for "FELM: Benchmarking Factuality Evaluation of Large Language Models" (NeurIPS 2023)☆62Updated last year
- Data and code for the preprint "In-Context Learning with Long-Context Models: An In-Depth Exploration"☆41Updated last year
- ☆28Updated last year
- [NeurIPS 2023 D&B Track] Code and data for paper "Revisiting Out-of-distribution Robustness in NLP: Benchmarks, Analysis, and LLMs Evalua…☆36Updated 2 years ago
- Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…☆12Updated 10 months ago
- ☆59Updated 2 years ago
- Code for paper "Defending aginast LLM Jailbreaking via Backtranslation"☆33Updated last year
- A2T: Towards Improving Adversarial Training of NLP Models (EMNLP 2021 Findings)☆26Updated 4 years ago
- Inspecting and Editing Knowledge Representations in Language Models☆119Updated 2 years ago
- Teaching Models to Express Their Uncertainty in Words☆39Updated 3 years ago
- [ICML 2024] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability☆173Updated last year