shadowkiller33 / Language_attack
A repository for LLM jailbreaking.
☆14 · Updated 2 years ago
Alternatives and similar repositories for Language_attack
Users interested in Language_attack are comparing it to the repositories listed below.
- [ICLR 2024] Data for "Multilingual Jailbreak Challenges in Large Language Models" · ☆94 · Updated last year
- ☆18 · Updated 7 months ago
- ☆28 · Updated last year
- [ICLR 2024] Showing properties of safety tuning and exaggerated safety · ☆89 · Updated last year
- Code and datasets of the paper "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment" · ☆107 · Updated last year
- Repo for the paper "Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge" · ☆14 · Updated last year
- ☆22 · Updated last year
- ☆87 · Updated 11 months ago
- Multilingual safety benchmark for Large Language Models · ☆54 · Updated last year
- ☆188 · Updated 2 years ago
- Official repository for the ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models" · ☆100 · Updated 6 months ago
- Code & data for the paper "Alleviating Hallucinations of Large Language Models through Induced Hallucinations" · ☆69 · Updated last year
- Code for the Findings-EMNLP 2023 paper "Multi-step Jailbreaking Privacy Attacks on ChatGPT" · ☆35 · Updated 2 years ago
- ☆57 · Updated 2 years ago
- Augmenting Statistical Models with Natural Language Parameters · ☆29 · Updated last year
- [ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models · ☆89 · Updated 6 months ago
- ☆25 · Updated 5 months ago
- Official code for the EMNLP 2023 Main Conference paper "KCTS: Knowledge-Constrained Tree Search Decoding with Token-Level Hallucination Detec…" · ☆30 · Updated 2 years ago
- [TACL 2025] Investigating Adversarial Trigger Transfer in Large Language Models · ☆19 · Updated 3 months ago
- Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model · ☆69 · Updated 3 years ago
- Mostly recording papers about models' trustworthy applications. Intending to include topics like model evaluation & analysis, security, c… · ☆21 · Updated 2 years ago
- Code for "Goodtriever: Toxicity Mitigation with Retrieval-augmented Language Models" · ☆23 · Updated last year
- [ACL 2023] Knowledge Unlearning for Mitigating Privacy Risks in Language Models · ☆84 · Updated last year
- Dataset and official implementation for "Discursive Socratic Questioning: Evaluating the Faithfulness of Language Models' Understandi…" · ☆18 · Updated last year
- Repository for the Bias Benchmark for QA dataset · ☆132 · Updated last year
- Official implementation of the paper "DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers" · ☆66 · Updated last year
- Restore safety in fine-tuned language models through task arithmetic · ☆29 · Updated last year
- Code for "Mitigating Catastrophic Forgetting in Large Language Models with Self-Synthesized Rehearsal" (ACL 2024) · ☆16 · Updated last year
- [NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering · ☆67 · Updated last year
- The most comprehensive and accurate LLM jailbreak attack benchmark by far · ☆21 · Updated 8 months ago