scicode-bench / SciCode
A benchmark that challenges language models to code solutions for scientific problems
☆87Updated this week
Related projects
Alternatives and complementary repositories for SciCode
- Can Language Models Solve Olympiad Programming?☆100Updated 3 months ago
- Discovering Data-driven Hypotheses in the Wild☆41Updated this week
- Repository for the paper Stream of Search: Learning to Search in Language☆91Updated 3 months ago
- Implementation of the Quiet-STaR paper (https://arxiv.org/pdf/2403.09629.pdf)☆42Updated 3 months ago
- [EMNLP 2024] A Retrieval Benchmark for Scientific Literature Search☆61Updated 4 months ago
- 🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap…☆110Updated 3 weeks ago
- A benchmark list for evaluation of large language models.☆68Updated 4 months ago
- Replicating O1 inference-time scaling laws☆49Updated last month
- Official implementation for <Large Language Models for Automated Open-domain Scientific Hypotheses Discovery>, accepted by ACL 2024. It a…☆36Updated 3 weeks ago
- Evaluating LLMs with CommonGen-Lite☆85Updated 8 months ago
- Scalable Meta-Evaluation of LLMs as Evaluators☆41Updated 9 months ago
- [EMNLP'23] Execution-Based Evaluation for Open Domain Code Generation☆44Updated 10 months ago
- Code and example data for the paper: Rule Based Rewards for Language Model Safety☆158Updated 4 months ago
- RepoQA: Evaluating Long-Context Code Understanding☆100Updated 2 weeks ago
- Repository for paper Tools Are Instrumental for Language Agents in Complex Environments☆32Updated last month
- InstructCoder: Instruction Tuning Large Language Models for Code Editing | Oral ACL-2024 srw☆52Updated last month
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024]☆124Updated 3 weeks ago
- Functional Benchmarks and the Reasoning Gap☆78Updated last month
- Code for the paper 🌳 Tree Search for Language Model Agents☆138Updated 3 months ago
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file.☆128Updated 3 weeks ago
- CodeUltraFeedback: aligning large language models to coding preferences☆65Updated 4 months ago
- Benchmarking LLMs with Challenging Tasks from Real Users☆195Updated 2 weeks ago
- Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore".☆129Updated this week
- datasets from the paper "Towards Understanding Sycophancy in Language Models"☆62Updated last year