wschella / llm-reliability
Code for the paper "Larger and more instructable language models become less reliable"
☆29 Updated 7 months ago
Alternatives and similar repositories for llm-reliability
Users that are interested in llm-reliability are comparing it to the libraries listed below
- ☆21 Updated 2 months ago
- Dataset and evaluation suite enabling LLM instruction-following for scientific literature understanding. ☆40 Updated 2 months ago
- ☆40 Updated 10 months ago
- [ACL 2024] <Large Language Models for Automated Open-domain Scientific Hypotheses Discovery>. It has also received the best poster award … ☆40 Updated 6 months ago
- Source code for the collaborative reasoner research project at Meta FAIR. ☆74 Updated last month
- Code release for "CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning", ICLR 2025 ☆21 Updated 3 weeks ago
- Code, datasets, and checkpoints for the paper "CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval an… ☆29 Updated 8 months ago
- Official Code Release for "Training a Generally Curious Agent" ☆20 Updated last month
- CiteME is a benchmark designed to test the abilities of language models in finding papers that are cited in scientific texts. ☆44 Updated 6 months ago
- Verifiers for LLM Reinforcement Learning ☆50 Updated last month
- ☆21 Updated 7 months ago
- Official Implementation of the Baby-AIGS system ☆23 Updated 5 months ago
- Code and data for the paper "Why think step by step? Reasoning emerges from the locality of experience" ☆60 Updated last month
- implementation of dualformer ☆17 Updated 2 months ago
- PyTorch implementation for MRL ☆18 Updated last year
- [ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery ☆85 Updated 2 weeks ago
- ☆64 Updated last month
- SCREWS: A Modular Framework for Reasoning with Revisions ☆27 Updated last year
- Aioli: A unified optimization framework for language model data mixing ☆25 Updated 3 months ago
- Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment ☆57 Updated 8 months ago
- Synthetic data generation and benchmark implementation for "Episodic Memories Generation and Evaluation Benchmark for Large Language Mode… ☆42 Updated last month
- ☆27 Updated this week
- Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers ☆18 Updated 2 months ago
- ☆42 Updated last month
- Exploration of automated dataset selection approaches at large scales. ☆40 Updated 2 months ago
- Official Repository of Are Your LLMs Capable of Stable Reasoning? ☆25 Updated last month
- ☆50 Updated 2 months ago
- Code, results and other artifacts from the paper introducing the WildChat-50m dataset and the Re-Wild model family. ☆29 Updated last month
- LitQA Eval: A difficult set of scientific questions that require context of full-text research papers to answer ☆39 Updated 4 months ago
- ☆17 Updated last year