AIRI-Institute / SAE-Reasoning
☆84 Updated 5 months ago
Alternatives and similar repositories for SAE-Reasoning
Users who are interested in SAE-Reasoning compare it to the libraries listed below.
- Function Vectors in Large Language Models (ICLR 2024) ☆179 Updated 5 months ago
- For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research. ☆146 Updated last week
- ☆62 Updated 6 months ago
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆128 Updated 2 months ago
- ☆168 Updated 10 months ago
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers" ☆116 Updated last year
- Collection of Reverse Engineering in Large Model ☆34 Updated 8 months ago
- A Sober Look at Language Model Reasoning ☆83 Updated this week
- Open source replication of Anthropic's Crosscoders for Model Diffing ☆59 Updated 10 months ago
- ☆97 Updated last year
- This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity… ☆28 Updated last year
- FeatureAlignment = Alignment + Mechanistic Interpretability ☆29 Updated 6 months ago
- Steering Llama 2 with Contrastive Activation Addition ☆180 Updated last year
- ☆52 Updated 5 months ago
- [ACL'25 Oral] What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective ☆72 Updated 2 months ago
- [ICLR 2025] Monet: Mixture of Monosemantic Experts for Transformers ☆70 Updated 2 months ago
- A library for efficient patching and automatic circuit discovery. ☆76 Updated last month
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆78 Updated 6 months ago
- Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models … ☆211 Updated last week
- [NeurIPS 2024 Spotlight] Code and data for the paper "Finding Transformer Circuits with Edge Pruning". ☆59 Updated last month
- Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers". ☆79 Updated last year
- ☆44 Updated last year
- Multi-Layer Sparse Autoencoders (ICLR 2025) ☆24 Updated 7 months ago
- [COLING'25] Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers? ☆80 Updated 7 months ago
- [NeurIPS 2024] How do Large Language Models Handle Multilingualism? ☆39 Updated 10 months ago
- ☆186 Updated 2 months ago
- [ICLR 2025] General-purpose activation steering library ☆102 Updated 2 weeks ago
- A brief and partial summary of RLHF algorithms. ☆132 Updated 6 months ago
- [NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering ☆63 Updated 9 months ago
- Code for ICLR 2025 Paper "What is Wrong with Perplexity for Long-context Language Modeling?" ☆99 Updated last month