aypan17 / reward-misspecification
☆10Updated 2 years ago
Alternatives and similar repositories for reward-misspecification
Users who are interested in reward-misspecification are comparing it to the libraries listed below.
- ☆19Updated 11 months ago
- AutoLibra: Metric Induction for Agents from Open-Ended Human Feedback☆15Updated 2 weeks ago
- ☆20Updated last year
- ☆31Updated 2 years ago
- This repository contains some of the code used in the paper "Training Language Models with Language Feedback at Scale"☆27Updated 2 years ago
- Learning from preferences is a common paradigm for fine-tuning language models. Yet, many algorithmic design decisions come into play. Ou…☆32Updated last year
- ☆13Updated 3 months ago
- ☆12Updated 5 years ago
- Group-conditional DRO to alleviate spurious correlations☆15Updated 4 years ago
- Self-Supervised Alignment with Mutual Information☆21Updated last year
- TaskMet: Task-driven Metric Learning for Model Learning☆19Updated last year
- Post-processing for fair classification☆16Updated 4 months ago
- MUA-RL: MULTI-TURN USER-INTERACTING AGENT REINFORCEMENT LEARNING FOR AGENTIC TOOL USE☆38Updated last month
- Representation Learning in RL☆13Updated 3 years ago
- ☆15Updated last year
- Code Release for the 2023 NeurIPS Paper How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained langua…☆15Updated 10 months ago
- ☆27Updated 2 years ago
- Code for reproducing our paper "Low Rank Adapting Models for Sparse Autoencoder Features"☆17Updated 7 months ago
- ☆17Updated 2 years ago
- Code for the paper "Distinguishing the Knowable from the Unknowable with Language Models"☆10Updated last year
- ☆34Updated 2 years ago
- Code for Paper (Policy Optimization in RLHF: The Impact of Out-of-preference Data)☆28Updated last year
- Code for "The Expressive Power of Low-Rank Adaptation".☆20Updated last year
- Code for the paper "Optimal Off-Policy Evaluation from Multiple Logging Policies"☆15Updated 4 years ago
- ☆20Updated 5 years ago
- Code for our paper: "GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models"☆57Updated 2 years ago
- [ACL 2023 Findings] What In-Context Learning “Learns” In-Context: Disentangling Task Recognition and Task Learning☆21Updated 2 years ago
- ☆20Updated 11 months ago
- Tools for robustness evaluation in interpretability methods☆10Updated 4 years ago
- A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models☆17Updated 5 months ago