zepingyu0512 / awesome-SAE
awesome SAE papers
☆11 · Updated last week
Alternatives and similar repositories for awesome-SAE:
Users interested in awesome-SAE are comparing it to the repositories listed below
- ☆14 · Updated 10 months ago
- A resource repository for representation engineering in large language models ☆85 · Updated last month
- ☆20 · Updated last year
- Official repository for the paper "Safety Alignment Should Be Made More Than Just a Few Tokens Deep" ☆61 · Updated 6 months ago
- Code for the EMNLP 2024 paper "Neuron-Level Knowledge Attribution in Large Language Models" ☆25 · Updated last month
- Official code for the paper "Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications" ☆65 · Updated 3 months ago
- ☆20 · Updated 5 months ago
- [NeurIPS 2024] "Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?" ☆28 · Updated 2 weeks ago
- This paper list focuses on the theoretical and empirical analysis of language models, especially large language models (LLMs). The papers… ☆69 · Updated last month
- [NeurIPS'23] Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors ☆70 · Updated 2 weeks ago
- ☆47 · Updated last year
- ☆151 · Updated 6 months ago
- Official repository for the ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models" ☆83 · Updated 4 months ago
- DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models (ICLR 2024) ☆57 · Updated 3 months ago
- [ACL 2024] Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models ☆41 · Updated 4 months ago
- Official code for the ICML 2024 paper on Persona In-Context Learning (PICLe) ☆22 · Updated 6 months ago
- Official repository for the paper "Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller" ☆18 · Updated 4 months ago
- ☆44 · Updated last week
- LLM Unlearning ☆139 · Updated last year
- Official code for the paper "Vaccine: Perturbation-aware Alignment for Large Language Models" (NeurIPS 2024) ☆28 · Updated last month
- ☆10 · Updated 8 months ago
- In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024) ☆49 · Updated 9 months ago
- A survey on harmful fine-tuning attacks for large language models ☆119 · Updated 2 weeks ago
- ☆42 · Updated 5 months ago
- A curated list of LLM interpretability material: tutorials, libraries, surveys, papers, blogs, etc. ☆196 · Updated 2 months ago
- ☆37 · Updated 6 months ago
- ICLR 2024 paper showing properties of safety tuning and exaggerated safety ☆74 · Updated 8 months ago
- FeatureAlignment = Alignment + Mechanistic Interpretability ☆26 · Updated 2 weeks ago
- ☆24 · Updated 3 months ago
- Code for "Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities" (NeurIPS'24) ☆13 · Updated 3 weeks ago