wesg52 / world-models
Extracting spatial and temporal world models from LLMs
☆248 · Updated last year
Alternatives and similar repositories for world-models:
Users interested in world-models are comparing it to the libraries listed below.
- Emergent world representations: Exploring a sequence model trained on a synthetic task · ☆172 · Updated last year
- ☆104 · Updated 4 months ago
- Representation Engineering: A Top-Down Approach to AI Transparency · ☆744 · Updated 4 months ago
- ☆256 · Updated 9 months ago
- ☆256 · Updated 5 months ago
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model · ☆483 · Updated 2 months ago
- Code for the arXiv 2023 paper "Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback" · ☆202 · Updated last year
- Sotopia: an Open-ended Social Learning Environment (ICLR 2024 spotlight) · ☆170 · Updated this week
- Reasoning with Language Model is Planning with World Model · ☆153 · Updated last year
- Code for NeurIPS'24 paper "Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization" · ☆170 · Updated 2 weeks ago
- ☆161 · Updated 11 months ago
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). · ☆170 · Updated this week
- RewardBench: the first evaluation tool for reward models. · ☆462 · Updated this week
- Using sparse coding to find distributed representations used by neural networks. · ☆196 · Updated last year
- ☆194 · Updated 2 months ago
- A toolkit for describing model features and intervening on those features to steer behavior. · ☆132 · Updated last month
- (ICML 2024) Alphazero-like Tree-Search can guide large language model decoding and training · ☆233 · Updated 6 months ago
- ☆380 · Updated 4 months ago
- For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research. · ☆64 · Updated this week
- ☆124 · Updated this week
- Sparse autoencoders · ☆379 · Updated last week
- Code for STaR: Bootstrapping Reasoning With Reasoning (NeurIPS 2022) · ☆173 · Updated last year
- Simple next-token-prediction for RLHF · ☆219 · Updated last year
- ☆115 · Updated 2 months ago
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark · ☆201 · Updated 2 months ago
- Self-Alignment with Principle-Following Reward Models · ☆147 · Updated 9 months ago
- [NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking · ☆253 · Updated 5 months ago
- Datasets from the paper "Towards Understanding Sycophancy in Language Models" · ☆64 · Updated last year
- This repository collects all relevant resources about interpretability in LLMs · ☆295 · Updated last month