Algorithmic-Alignment-Lab / contracts
Formal Contracts for Multi-Agent Reinforcement Learning
☆16 · Updated 10 months ago
Related projects:
- Interpreting how transformers simulate agents performing RL tasks ☆62 · Updated 10 months ago
- Universal Neurons in GPT2 Language Models ☆25 · Updated 3 months ago
- Sparse and discrete interpretability tool for neural networks ☆51 · Updated 7 months ago
- Materials for ConceptARC paper ☆71 · Updated 4 months ago
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers" ☆70 · Updated 5 months ago
- Redwood Research's transformer interpretability tools ☆11 · Updated 2 years ago
- Language-annotated Abstraction and Reasoning Corpus ☆76 · Updated last year
- Sparse Autoencoder Training Library ☆18 · Updated last month
- Dataset and benchmark for assessing LLMs in translating natural language descriptions of planning problems into PDDL ☆19 · Updated last week
- Code and Data Repo for the CoNLL Paper -- Future Lens: Anticipating Subsequent Tokens from a Single Hidden State ☆14 · Updated 8 months ago
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction" ☆76 · Updated 3 weeks ago
- Measuring the situational awareness of language models ☆31 · Updated 7 months ago
- Can Language Models Solve Olympiad Programming? ☆92 · Updated last month
- Repository for the code of the "PPL-MCTS: Constrained Textual Generation Through Discriminator-Guided Decoding" paper, NAACL'22 ☆62 · Updated last year
- Evaluation of neuro-symbolic engines ☆29 · Updated last month
- Q-Probe: A Lightweight Approach to Reward Maximization for Language Models ☆37 · Updated 3 months ago
- Intrinsic Motivation from Artificial Intelligence Feedback ☆117 · Updated 10 months ago
- Repo to reproduce the First-Explore paper results ☆36 · Updated last year
- 🧠 Starter templates for doing interpretability research ☆59 · Updated last year
- A domain-specific probabilistic programming language for modeling and inference with language models ☆111 · Updated 11 months ago
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆23 · Updated 3 months ago