DLR-SC / style-vectors-for-steering-llms
Code release for the paper "Style Vectors for Steering Generative Large Language Models", accepted to Findings of EACL 2024.
☆19 Updated last month
Related projects
Alternatives and complementary repositories for style-vectors-for-steering-llms
- A resource repository for representation engineering in large language models ☆50 Updated 2 months ago
- Function Vectors in Large Language Models (ICLR 2024) ☆118 Updated last month
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆52 Updated last week
- Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers". ☆57 Updated 8 months ago
- Code & Data for our Paper "Alleviating Hallucinations of Large Language Models through Induced Hallucinations" ☆59 Updated 8 months ago
- Code for In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering ☆143 Updated 3 weeks ago
- [NAACL 2024 Outstanding Paper] Source code for the NAACL 2024 paper entitled "R-Tuning: Instructing Large Language Models to Say 'I Don't… ☆83 Updated 4 months ago
- Röttger et al. (2023): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆61 Updated 10 months ago
- For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research. ☆45 Updated this week
- LLM experiments done during SERI MATS, focusing on activation steering / interpreting activation spaces ☆78 Updated last year
- Code associated with Tuning Language Models by Proxy (Liu et al., 2024) ☆96 Updated 7 months ago
- Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering ☆21 Updated 3 weeks ago
- [NeurIPS'23] Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors ☆69 Updated 8 months ago
- Official repository for paper "Weak-to-Strong Extrapolation Expedites Alignment" ☆67 Updated 5 months ago
- Weak-to-Strong Jailbreaking on Large Language Models ☆65 Updated 8 months ago
- 【ACL 2024】 SALAD benchmark & MD-Judge ☆103 Updated last month
- The Paper List on Data Contamination for Large Language Models Evaluation. ☆74 Updated this week
- Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models" ☆70 Updated 2 months ago
- Algebraic value editing in pretrained language models ☆57 Updated last year
- [ICLR'24 Spotlight] "Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts" ☆59 Updated 7 months ago
- ICLR 2024 paper. Showing properties of safety tuning and exaggerated safety. ☆70 Updated 6 months ago
- [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following ☆115 Updated 4 months ago
- In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024) ☆45 Updated 7 months ago