wesg52 / world-models
Extracting spatial and temporal world models from LLMs
☆248 Updated last year
Alternatives and similar repositories for world-models:
Users who are interested in world-models compare it to the libraries listed below.
- ☆404 Updated 5 months ago
- Representation Engineering: A Top-Down Approach to AI Transparency ☆775 Updated 5 months ago
- Code for Arxiv 2023: Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback ☆204 Updated last year
- Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization' ☆175 Updated last month
- This repository collects all relevant resources about interpretability in LLMs ☆305 Updated 2 months ago
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model ☆493 Updated 3 months ago
- Sparse autoencoders ☆407 Updated this week
- Using sparse coding to find distributed representations used by neural networks. ☆207 Updated last year
- Sotopia: an Open-ended Social Learning Environment (ICLR 2024 spotlight) ☆176 Updated this week
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆176 Updated last month
- RewardBench: the first evaluation tool for reward models. ☆491 Updated last week
- ☆264 Updated 6 months ago
- ☆201 Updated 3 months ago
- Inspecting and Editing Knowledge Representations in Language Models ☆111 Updated last year
- ☆106 Updated 5 months ago
- Emergent world representations: Exploring a sequence model trained on a synthetic task ☆173 Updated last year
- ☆168 Updated last year
- Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models" ☆452 Updated 8 months ago
- A toolkit for describing model features and intervening on those features to steer behavior. ☆149 Updated 2 months ago
- ICML 2024: Improving Factuality and Reasoning in Language Models through Multiagent Debate ☆386 Updated last year
- [NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking ☆260 Updated 6 months ago
- ☆277 Updated last year
- ☆75 Updated 5 months ago
- Tools for understanding how transformer predictions are built layer-by-layer ☆459 Updated 7 months ago
- A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions (HALOs). ☆785 Updated 2 weeks ago
- Mass-editing thousands of facts into a transformer memory (ICLR 2023) ☆458 Updated 11 months ago
- For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research. ☆82 Updated this week
- Interpretable text embeddings by asking LLMs yes/no questions (NeurIPS 2024) ☆32 Updated 2 months ago
- Repo for "Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture" ☆541 Updated 2 weeks ago