aypan17 / latentqa
☆18 Updated last week
Alternatives and similar repositories for latentqa:
Users interested in latentqa are comparing it to the libraries listed below:
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆71 Updated last month
- ☆13 Updated last year
- PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024) ☆35 Updated 5 months ago
- [NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering ☆53 Updated 4 months ago
- ☆29 Updated 11 months ago
- ☆38 Updated last year
- Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p… ☆11 Updated 2 months ago
- Augmenting Statistical Models with Natural Language Parameters ☆24 Updated 6 months ago
- ☆82 Updated 8 months ago
- ☆14 Updated 9 months ago
- A library for efficient patching and automatic circuit discovery. ☆62 Updated last month
- ☆93 Updated last year
- General-purpose activation steering library ☆57 Updated 3 months ago
- This repository contains the code and data for the paper "SelfIE: Self-Interpretation of Large Language Model Embeddings" by Haozhe Chen,… ☆48 Updated 4 months ago
- The code of “Improving Weak-to-Strong Generalization with Scalable Oversight and Ensemble Learning” ☆16 Updated last year
- ☆38 Updated last year
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆63 Updated last week
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆51 Updated last month
- ☆10 Updated last month
- ☆26 Updated 9 months ago
- This is an official implementation of the Reward rAnked Fine-Tuning Algorithm (RAFT), also known as iterative best-of-n fine-tuning or re… ☆27 Updated 6 months ago
- ☆49 Updated 8 months ago
- Official code for "Decoding-Time Language Model Alignment with Multiple Objectives". ☆19 Updated 5 months ago
- ☆53 Updated 2 years ago
- Function Vectors in Large Language Models (ICLR 2024) ☆154 Updated 3 weeks ago
- Algebraic value editing in pretrained language models ☆64 Updated last year
- ☆23 Updated last month
- This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity… ☆25 Updated last year
- Align your LM to express calibrated verbal statements of confidence in its long-form generations. ☆22 Updated 10 months ago
- Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025] ☆29 Updated 2 months ago