MichaelEinhorn / trl-textworld
☆12 Updated last year
Alternatives and similar repositories for trl-textworld:
Users who are interested in trl-textworld are comparing it to the repositories listed below.
- Redwood Research's transformer interpretability tools ☆14 Updated 2 years ago
- A repository for transformer critique learning and generation ☆88 Updated last year
- Official Code for M-RᴇᴡᴀʀᴅBᴇɴᴄʜ: Evaluating Reward Models in Multilingual Settings ☆24 Updated 2 weeks ago
- A library for efficient patching and automatic circuit discovery. ☆54 Updated 2 weeks ago
- Experiments with representation engineering ☆11 Updated last year
- Measuring the situational awareness of language models ☆34 Updated last year
- Code and Data Repo for the CoNLL Paper -- Future Lens: Anticipating Subsequent Tokens from a Single Hidden State ☆18 Updated last year
- ☆18 Updated last year
- Code accompanying the paper Pretraining Language Models with Human Preferences ☆180 Updated last year
- 👻 Code and benchmark for our EMNLP 2023 paper - "FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions" ☆52 Updated 9 months ago
- ☆160 Updated last year
- ☆22 Updated last year
- Utilities for the HuggingFace transformers library ☆64 Updated 2 years ago
- Experiments with generating opensource language model assistants ☆97 Updated last year
- ☆26 Updated 10 months ago
- Super fast implementations of common benchmark text world games ☆45 Updated 2 months ago
- ☆73 Updated last year
- Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery" ☆29 Updated 9 months ago
- Minimum Bayes Risk Decoding for Hugging Face Transformers ☆55 Updated 8 months ago
- ☆79 Updated 8 months ago
- A repo for RLHF training and BoN over LLMs, with support for reward model ensembles. ☆35 Updated last month
- Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions" ☆65 Updated 8 months ago
- ☆36 Updated 9 months ago
- ☆27 Updated 11 months ago
- Mechanistic Interpretability for Transformer Models ☆49 Updated 2 years ago
- The official code of EMNLP 2022, "SCROLLS: Standardized CompaRison Over Long Language Sequences". ☆69 Updated last year
- ☆11 Updated 8 months ago
- FeedbackQA: Improving Question Answering Post-Deployment with Interactive Feedback ☆11 Updated 2 years ago
- ☆60 Updated last month
- A framework for few-shot evaluation of autoregressive language models. ☆102 Updated last year