understanding-search / maze-dataset
maze datasets for investigating OOD behavior of ML systems
☆46 · Updated last week
Alternatives and similar repositories for maze-dataset
Users interested in maze-dataset are comparing it to the libraries listed below.
- Rewarded Soups official implementation ☆58 · Updated last year
- ☆93 · Updated 11 months ago
- Code for most of the experiments in the paper Understanding the Effects of RLHF on LLM Generalisation and Diversity ☆43 · Updated last year
- A library for efficient patching and automatic circuit discovery. ☆65 · Updated last month
- Code for "Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining" ☆16 · Updated last month
- Reinforcement Learning via Regressing Relative Rewards ☆32 · Updated 5 months ago
- ☆40 · Updated last year
- Official PyTorch implementation of "Interpreting the Second-Order Effects of Neurons in CLIP" ☆39 · Updated 6 months ago
- Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision ☆120 · Updated 8 months ago
- ☆93 · Updated 3 months ago
- ☆85 · Updated last year
- Official PyTorch implementation of the Longhorn Deep State Space Model ☆50 · Updated 6 months ago
- Code for "Reasoning to Learn from Latent Thoughts" ☆104 · Updated 2 months ago
- ☆14 · Updated last year
- ☆51 · Updated last month
- What Makes a Reward Model a Good Teacher? An Optimization Perspective ☆31 · Updated last month
- A repo for RLHF training and BoN over LLMs, with support for reward model ensembles. ☆43 · Updated 4 months ago
- Universal Neurons in GPT2 Language Models ☆29 · Updated last year
- ☆23 · Updated 4 months ago
- Official repository for the paper Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Mode… ☆16 · Updated 6 months ago
- A repo built to facilitate the training and analysis of autoregressive transformers on maze-solving tasks. ☆29 · Updated 9 months ago
- Code for the paper "VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment" ☆159 · Updated last week
- ☆27 · Updated 9 months ago
- [ICLR 2025] When Attention Sink Emerges in Language Models: An Empirical View (Spotlight) ☆85 · Updated 7 months ago
- BASALT Benchmark datasets, evaluation code, and agent training example. ☆20 · Updated last year
- Codebase for "On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback". This repo implements a generative multi-tur… ☆15 · Updated 6 months ago
- ☆31 · Updated last year
- [ICLR 2025] Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization ☆28 · Updated 4 months ago
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers" ☆107 · Updated last year
- Official code for the paper Probing the Decision Boundaries of In-context Learning in Large Language Models. https://arxiv.org/abs/2406.11233… ☆18 · Updated 9 months ago