BunsenFeng / modular_pluralism
Modular Pluralism @ EMNLP 2024
☆18 · Updated 9 months ago
Alternatives and similar repositories for modular_pluralism
Users interested in modular_pluralism are comparing it to the repositories listed below.
- A resource repository for representation engineering in large language models ☆127 · Updated 8 months ago
- ☆171 · Updated last year
- ☆51 · Updated 2 years ago
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆74 · Updated 4 months ago
- This repo contains code for our NeurIPS 2023 spotlight paper: Evaluating and Inducing Personality in Pre-trained Language Models ☆52 · Updated last year
- ☆95 · Updated last year
- ☆110 · Updated last year
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces ☆95 · Updated last year
- ☆18 · Updated last year
- The Prism Alignment Project ☆79 · Updated last year
- ☆219 · Updated last year
- General-purpose activation steering library ☆84 · Updated 2 months ago
- Steering Llama 2 with Contrastive Activation Addition ☆164 · Updated last year
- ☆140 · Updated last year
- ☆43 · Updated last year
- ☆29 · Updated last year
- ☆95 · Updated last year
- Code for the ICML 2024 paper "Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment" ☆73 · Updated last month
- Official repository for the paper "High-Dimension Human Value Representation in Large Language Models" (NAACL'25 Main) ☆23 · Updated last year
- ☆25 · Updated last month
- ☆25 · Updated 8 months ago
- Function Vectors in Large Language Models (ICLR 2024) ☆172 · Updated 3 months ago
- Sparse probing paper full code. ☆58 · Updated last year
- datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆82 · Updated last year
- ☆163 · Updated 7 months ago
- The Paper List on Data Contamination for Large Language Models Evaluation. ☆95 · Updated 3 months ago
- Repository for the Bias Benchmark for QA dataset. ☆123 · Updated last year
- Official code for "Decoding-Time Language Model Alignment with Multiple Objectives". ☆25 · Updated 8 months ago
- For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research. ☆136 · Updated this week
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆101 · Updated 4 months ago