sciai-lab / Truth_is_Universal
☆22 Updated 4 months ago
Alternatives and similar repositories for Truth_is_Universal:
Users who are interested in Truth_is_Universal are comparing it to the libraries listed below.
- [NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering ☆52 Updated 4 months ago
- ☆38 Updated last year
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces ☆91 Updated last year
- A resource repository for representation engineering in large language models ☆116 Updated 4 months ago
- ☆23 Updated last month
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆71 Updated 3 weeks ago
- A library for efficient patching and automatic circuit discovery. ☆62 Updated last month
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆95 Updated last month
- ☆128 Updated last year
- code for EMNLP 2024 paper: Neuron-Level Knowledge Attribution in Large Language Models ☆29 Updated 4 months ago
- Steering Llama 2 with Contrastive Activation Addition ☆134 Updated 10 months ago
- Repository for the Bias Benchmark for QA dataset. ☆106 Updated last year
- General-purpose activation steering library ☆54 Updated 2 months ago
- ☆17 Updated 3 weeks ago
- Sparse probing paper full code. ☆55 Updated last year
- ☆29 Updated 11 months ago
- [NeurIPS'23] Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors ☆74 Updated 3 months ago
- Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p… ☆11 Updated 2 months ago
- ☆47 Updated last year
- LoFiT: Localized Fine-tuning on LLM Representations ☆34 Updated 2 months ago
- Materials for EACL2024 tutorial: Transformer-specific Interpretability ☆45 Updated last year
- ☆82 Updated 7 months ago
- AI Logging for Interpretability and Explainability🔬 ☆110 Updated 9 months ago
- Conformal Language Modeling ☆28 Updated last year
- ☆80 Updated this week
- Steering vectors for transformer language models in Pytorch / Huggingface ☆91 Updated last month
- Augmenting Statistical Models with Natural Language Parameters ☆23 Updated 6 months ago
- ☆14 Updated 9 months ago
- Evaluate interpretability methods on localizing and disentangling concepts in LLMs. ☆42 Updated 5 months ago
- Using sparse coding to find distributed representations used by neural networks. ☆226 Updated last year