openai / sparse_autoencoder
☆267 · Updated 2 months ago
Related projects:
- Using sparse coding to find distributed representations used by neural networks. ☆162 · Updated 10 months ago
- Training Sparse Autoencoders on Language Models ☆367 · Updated this week
- Sparse autoencoders ☆297 · Updated last week
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆133 · Updated last month
- ☆174 · Updated 4 months ago
- Sparse Autoencoder for Mechanistic Interpretability ☆173 · Updated last month
- RewardBench: the first evaluation tool for reward models. ☆352 · Updated last week
- ☆159 · Updated 6 months ago
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model ☆436 · Updated 3 weeks ago
- This repository collects all relevant resources about interpretability in LLMs ☆230 · Updated last week
- ☆91 · Updated last month
- Representation Engineering: A Top-Down Approach to AI Transparency ☆693 · Updated last month
- Editing Models with Task Arithmetic ☆405 · Updated 8 months ago
- Mechanistic Interpretability Visualizations using React ☆175 · Updated 2 months ago
- Steering vectors for transformer language models in PyTorch / Hugging Face ☆52 · Updated last month
- ☆110 · Updated 3 weeks ago
- Mass-editing thousands of facts into a transformer memory (ICLR 2023) ☆423 · Updated 7 months ago
- ☆239 · Updated 10 months ago
- ☆99 · Updated 10 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆124 · Updated 2 months ago
- The nnsight package enables interpreting and manipulating the internals of deep learned models. ☆356 · Updated this week
- LLM-Merging: Building LLMs Efficiently through Merging ☆165 · Updated last week
- Steering Llama 2 with Contrastive Activation Addition ☆83 · Updated 3 months ago
- Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions ☆601 · Updated last week
- RuLES: a benchmark for evaluating rule-following in language models ☆209 · Updated this week
- ☆75 · Updated this week
- A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions (HALOs). ☆695 · Updated last week
- Code to reproduce "Transformers Can Do Arithmetic with the Right Embeddings", McLeish et al. (2024) ☆169 · Updated 3 months ago
- Function Vectors in Large Language Models (ICLR 2024) ☆107 · Updated last month