[ACL 2024 main] Aligning Large Language Models with Human Preferences through Representation Engineering (https://aclanthology.org/2024.acl-long.572/)
☆28Sep 25, 2024Updated last year
Alternatives and similar repositories for RAHF
Users that are interested in RAHF are comparing it to the libraries listed below
- Implementation code for ACL 2024: Advancing Parameter Efficiency in Fine-tuning via Representation Editing☆15Apr 20, 2024Updated last year
- Code for paper: Aligning Large Language Models with Representation Editing: A Control Perspective☆35Jan 31, 2025Updated last year
- Code for the "Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning" paper.☆16Nov 21, 2025Updated 3 months ago
- Your finetuned model's back to its original safety standards faster than you can say "SafetyLock"!☆11Oct 16, 2024Updated last year
- Experiments with representation engineering☆13Feb 28, 2024Updated 2 years ago
- ☆20Jun 16, 2025Updated 8 months ago
- Code for the paper: The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence☆26Jul 31, 2025Updated 7 months ago
- ☆44Oct 1, 2024Updated last year
- Reproduction resources for the linear alignment paper (work in progress)☆18May 19, 2024Updated last year
- ☆32Aug 9, 2024Updated last year
- ☆24Jun 17, 2025Updated 8 months ago
- ☆57Jun 13, 2024Updated last year
- Repo to reproduce results for Where to Begin? On the Impact of Pre-Training and Initialization in Federated Learning☆25Apr 14, 2023Updated 2 years ago
- ☆27Oct 6, 2024Updated last year
- This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity…☆30Oct 27, 2025Updated 4 months ago
- Code repo for the model organisms and convergent directions of EM papers.☆53Sep 22, 2025Updated 5 months ago
- [EMNLP 2024] "Revisiting Who's Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective"☆32Jul 22, 2024Updated last year
- Algebraic value editing in pretrained language models☆69Nov 1, 2023Updated 2 years ago
- ☆73Jul 15, 2024Updated last year
- [NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering☆73Jan 16, 2026Updated last month
- [NeurIPS 2024] Large Language Model Unlearning via Embedding-Corrupted Prompts☆38Sep 26, 2024Updated last year
- A resource repository for representation engineering in large language models☆148Nov 14, 2024Updated last year
- [EMNLP 2022] Code and data for "Controllable Dialogue Simulation with In-Context Learning"☆34Feb 22, 2023Updated 3 years ago
- Package for computing multiparameter persistence summaries☆10Oct 27, 2023Updated 2 years ago
- NAACL 2022 paper on Analyzing Modality Robustness in Multimodal Sentiment Analysis☆31Jan 21, 2023Updated 3 years ago
- [NeurIPS25] Official repo for "Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning"☆42Oct 3, 2025Updated 5 months ago
- Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep☆174Apr 23, 2025Updated 10 months ago
- This repo is for the safety topic, including attacks, defenses and studies related to reasoning and RL☆61Sep 5, 2025Updated 6 months ago
- Official code for NeurIPS 2020 paper "Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D"☆32Dec 12, 2024Updated last year
- code for progressive gsl☆12Jan 15, 2026Updated last month
- [KDD'23] This is the code repo for our KDD'23 paper "DyGen: Learning from Noisy Labels via Dynamics-Enhanced Generative Modeling".☆11Jun 14, 2023Updated 2 years ago
- [ICLR 2024] This is the official implementation for the paper: "Beyond imitation: Leveraging fine-grained quality signals for alignment"☆10May 5, 2024Updated last year
- ☆24Feb 18, 2026Updated 2 weeks ago
- Code for experiments on self-prediction as a way to measure introspection in LLMs☆16Dec 10, 2024Updated last year
- Residual Quantization Autoencoder, used for interpreting LLMs☆14Jan 1, 2025Updated last year
- ☆12Aug 15, 2023Updated 2 years ago
- Official repository for "DYPLOC: Dynamic Planning of Content Using Mixed Language Models for Opinion Text Generation"☆10May 20, 2022Updated 3 years ago
- Math24o: High School Olympiad Mathematics Chinese Benchmark☆11Mar 27, 2025Updated 11 months ago
- ☆44Jan 21, 2025Updated last year