wschella / llm-reliability
Code for the paper "Larger and more instructable language models become less reliable"
☆31 · Updated last year
Alternatives and similar repositories for llm-reliability
Users interested in llm-reliability are comparing it to the repositories listed below.
- Discovering Data-driven Hypotheses in the Wild ☆129 · Updated 8 months ago
- SCREWS: A Modular Framework for Reasoning with Revisions ☆27 · Updated 2 years ago
- [ACL 2024] <Large Language Models for Automated Open-domain Scientific Hypotheses Discovery>. It has also received the best poster award … ☆42 · Updated last year
- CiteME is a benchmark designed to test the abilities of language models in finding papers that are cited in scientific texts. ☆48 · Updated 3 months ago
- ☆19 · Updated 6 months ago
- ReBase: Training Task Experts through Retrieval Based Distillation ☆29 · Updated last year
- Code release for "CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning", ICLR 2025 ☆29 · Updated 9 months ago
- [ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery ☆124 · Updated 5 months ago
- Official Implementation of the Baby-AIGS system ☆24 · Updated last year
- Understanding the correlation between different LLM benchmarks ☆29 · Updated 2 years ago
- A virtual environment for developing and evaluating automated scientific discovery agents. ☆199 · Updated 11 months ago
- Tree prompting: easy-to-use scikit-learn interface for improved prompting. ☆41 · Updated 2 years ago
- [NeurIPS'24 LanGame workshop] On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability ☆42 · Updated 7 months ago
- ☆25 · Updated 8 months ago
- The code implementation of MAGDi: Structured Distillation of Multi-Agent Interaction Graphs Improves Reasoning in Smaller Language Models … ☆40 · Updated 2 years ago
- Source code for the collaborative reasoner research project at Meta FAIR. ☆112 · Updated 9 months ago
- Learning to Retrieve by Trying - Source code for Grounding by Trying: LLMs with Reinforcement Learning-Enhanced Retrieval ☆51 · Updated last year
- Dataset and evaluation suite enabling LLM instruction-following for scientific literature understanding. ☆47 · Updated 10 months ago
- Co-LLM: Learning to Decode Collaboratively with Multiple Language Models ☆126 · Updated last year
- ☆49 · Updated 2 years ago
- Data from the BioPlanner: Automatic Evaluation of LLMs on Protocol Planning in Biology paper ☆27 · Updated last year
- Analysis code for the NeurIPS 2025 paper "SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks" ☆56 · Updated 6 months ago
- Codebase accompanying the Summary of a Haystack paper. ☆80 · Updated last year
- Official implementation of the ACL 2024 paper: Scientific Inspiration Machines Optimized for Novelty ☆93 · Updated last year
- (ICLR 2026) Optimas: Optimizing Compound AI Systems ☆68 · Updated this week
- [ICLR 2025] ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning https://arxiv.org/abs/2501.06590 ☆80 · Updated 6 months ago
- This repository contains the expert evaluation interface and data evaluation script for the OpenScholar project. ☆32 · Updated last year
- ☆141 · Updated 4 months ago
- Code release for "SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers" [NeurIPS D&B, 2024] ☆72 · Updated last year
- A framework for pitting LLMs against each other in an evolving library of games ⚔ ☆35 · Updated 9 months ago