wesg52 / world-models
Extracting spatial and temporal world models from LLMs
☆249 Updated last year
Alternatives and similar repositories for world-models:
Users who are interested in world-models compare it to the libraries listed below
- ☆421 Updated 7 months ago
- Tools for understanding how transformer predictions are built layer-by-layer ☆474 Updated 8 months ago
- ☆262 Updated 11 months ago
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆182 Updated 2 months ago
- This repository collects all relevant resources about interpretability in LLMs ☆321 Updated 3 months ago
- ☆269 Updated 8 months ago
- Sparsify transformers with SAEs and transcoders ☆464 Updated this week
- Code for arXiv 2023: Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback ☆204 Updated last year
- Using sparse coding to find distributed representations used by neural networks. ☆213 Updated last year
- ☆499 Updated 6 months ago
- ☆130 Updated last year
- RewardBench: the first evaluation tool for reward models. ☆508 Updated this week
- Representation Engineering: A Top-Down Approach to AI Transparency ☆789 Updated 6 months ago
- ☆205 Updated 4 months ago
- Emergent world representations: Exploring a sequence model trained on a synthetic task ☆175 Updated last year
- ☆122 Updated last week
- Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization' ☆182 Updated 2 months ago
- ☆282 Updated last year
- ☆171 Updated last year
- ☆109 Updated 6 months ago
- For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research. ☆93 Updated this week
- ☆150 Updated last year
- Extract full next-token probabilities via language model APIs ☆229 Updated 11 months ago
- Scaling Data-Constrained Language Models ☆333 Updated 5 months ago
- A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions (HALOs). ☆804 Updated last week
- Sotopia: an Open-ended Social Learning Environment (ICLR 2024 spotlight) ☆186 Updated this week
- Interpretable text embeddings by asking LLMs yes/no questions (NeurIPS 2024) ☆35 Updated 3 months ago
- datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆71 Updated last year
- Public repository for "The Surprising Effectiveness of Test-Time Training for Abstract Reasoning" ☆291 Updated 3 months ago
- Mechanistic Interpretability Visualizations using React ☆232 Updated 2 months ago