BatsResearch / cross-lingual-detox
Code for "Preference Tuning For Toxicity Mitigation Generalizes Across Languages." Paper accepted at Findings of EMNLP 2024
☆16 Updated 4 months ago
Alternatives and similar repositories for cross-lingual-detox:
Users who are interested in cross-lingual-detox compare it to the repositories listed below.
- ☆30 Updated 9 months ago
- ☆61 Updated last year
- [NeurIPS 2024 D&B] Evaluating Copyright Takedown Methods for Language Models ☆17 Updated 7 months ago
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆62 Updated 3 months ago
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆83 Updated last week
- AI Logging for Interpretability and Explainability🔬 ☆103 Updated 8 months ago
- ☆52 Updated last year
- Restore safety in fine-tuned language models through task arithmetic ☆26 Updated 10 months ago
- Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning [ICML 2024] ☆17 Updated 9 months ago
- ☆44 Updated 5 months ago
- Github repository for "FELM: Benchmarking Factuality Evaluation of Large Language Models" (NeurIPS 2023) ☆57 Updated last year
- ☆49 Updated last year
- ☆20 Updated 7 months ago
- ☆30 Updated 2 months ago
- Augmenting Statistical Models with Natural Language Parameters ☆22 Updated 5 months ago
- Align your LM to express calibrated verbal statements of confidence in its long-form generations. ☆22 Updated 8 months ago
- [NAACL'25] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering ☆46 Updated 2 months ago
- Code for "Universal Adversarial Triggers Are Not Universal." ☆16 Updated 9 months ago
- Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025] ☆29 Updated 3 weeks ago
- [NeurIPS 2023] Github repository for "Composing Parameter-Efficient Modules with Arithmetic Operations" ☆60 Updated last year
- Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model ☆66 Updated 2 years ago
- ☆85 Updated 2 years ago
- Repo for the paper "Large Language Models Struggle to Learn Long-Tail Knowledge" ☆74 Updated last year
- LoFiT: Localized Fine-tuning on LLM Representations ☆32 Updated last month
- [ACL 2023] Knowledge Unlearning for Mitigating Privacy Risks in Language Models ☆79 Updated 5 months ago
- [ICLR 2023] Code for our paper "Selective Annotation Makes Language Models Better Few-Shot Learners" ☆108 Updated last year
- Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers". ☆65 Updated 11 months ago
- A library for efficient patching and automatic circuit discovery. ☆53 Updated this week
- This repository contains data, code and models for contextual noncompliance. ☆20 Updated 7 months ago