HenryCai11 / LLM-Self-Control
The official repo of paper "Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller"
☆18 Updated 7 months ago
Alternatives and similar repositories for LLM-Self-Control:
Users who are interested in LLM-Self-Control are comparing it to the libraries listed below
- [NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering ☆52 Updated 4 months ago
- Code for EMNLP 2024 paper: Neuron-Level Knowledge Attribution in Large Language Models ☆29 Updated 4 months ago
- Code for paper: Aligning Large Language Models with Representation Editing: A Control Perspective ☆25 Updated 2 months ago
- In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024) ☆57 Updated last year
- Code associated with Tuning Language Models by Proxy (Liu et al., 2024) ☆107 Updated last year
- [ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications ☆73 Updated last month
- A resource repository for representation engineering in large language models ☆116 Updated 4 months ago
- Collection of Reverse Engineering in Large Model ☆32 Updated 2 months ago
- awesome SAE papers ☆24 Updated last month
- ☆19 Updated last month
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆71 Updated 3 weeks ago
- [NeurIPS 2024] "Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?" ☆35 Updated 2 months ago
- Code for Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities (NeurIPS'24) ☆17 Updated 3 months ago
- ☆64 Updated 2 months ago
- ☆30 Updated last year
- ☆93 Updated last year
- LoFiT: Localized Fine-tuning on LLM Representations ☆34 Updated 2 months ago
- Official code for ICML 2024 paper on Persona In-Context Learning (PICLe) ☆23 Updated 9 months ago
- The code of “Improving Weak-to-Strong Generalization with Scalable Oversight and Ensemble Learning” ☆16 Updated last year
- This repository contains the code and data for the paper "SelfIE: Self-Interpretation of Large Language Model Embeddings" by Haozhe Chen,… ☆48 Updated 3 months ago
- ☆49 Updated 7 months ago
- For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research. ☆105 Updated 2 weeks ago
- ☆21 Updated last year
- A curated list of resources for activation engineering ☆52 Updated 2 weeks ago
- Function Vectors in Large Language Models (ICLR 2024) ☆153 Updated 2 weeks ago
- Implementation code for ACL 2024: Advancing Parameter Efficiency in Fine-tuning via Representation Editing ☆13 Updated 11 months ago
- ☆82 Updated 7 months ago
- Lightweight Adapting for Black-Box Large Language Models ☆23 Updated last year
- ☆162 Updated 9 months ago
- ☆34 Updated 5 months ago