scicode-bench / SciCode
A benchmark that challenges language models to code solutions for scientific problems
☆97 Updated this week
Alternatives and similar repositories for SciCode:
Users who are interested in SciCode are comparing it to the repositories listed below
- Repository for the paper Stream of Search: Learning to Search in Language ☆125 Updated 5 months ago
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆156 Updated 3 months ago
- ☆56 Updated last week
- 🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap… ☆136 Updated last month
- Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym ☆230 Updated 2 weeks ago
- Code for the paper 🌳 Tree Search for Language Model Agents ☆167 Updated 6 months ago
- ☆111 Updated 6 months ago
- ☆116 Updated 3 months ago
- Repository for NPHardEval, a quantified-dynamic benchmark of LLMs ☆51 Updated 10 months ago
- ☆94 Updated 7 months ago
- ☆80 Updated this week
- Can Language Models Solve Olympiad Programming? ☆108 Updated 2 weeks ago
- Implementation of the Quiet-STAR paper (https://arxiv.org/pdf/2403.09629.pdf) ☆50 Updated 5 months ago
- Flow of Reasoning: Training LLMs for Divergent Problem Solving with Minimal Examples ☆58 Updated last week
- Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization' ☆176 Updated last month
- Replicating O1 inference-time scaling laws ☆73 Updated last month
- [ACL'24] Code and data of paper "When is Tree Search Useful for LLM Planning? It Depends on the Discriminator" ☆54 Updated 11 months ago
- Toy implementation of Strawberry ☆30 Updated 4 months ago
- Code for the paper "VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment" ☆114 Updated 2 months ago
- 🌾 OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc. ☆98 Updated this week
- Benchmarking LLMs with Challenging Tasks from Real Users ☆208 Updated 2 months ago
- Functional Benchmarks and the Reasoning Gap ☆82 Updated 3 months ago
- ☆142 Updated last week
- ☆98 Updated this week
- A simple unified framework for evaluating LLMs ☆172 Updated this week
- [NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898 ☆203 Updated 8 months ago
- OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding. ☆158 Updated 2 weeks ago
- Reproducible, flexible LLM evaluations ☆127 Updated last month
- Official github repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024] ☆130 Updated 4 months ago