scicode-bench / SciCodeLinks

A benchmark that challenges language models to code solutions for scientific problems

☆127

Alternatives and similar repositories for SciCode

Users that are interested in SciCode are comparing it to the libraries listed below

Sorting:

kanishkg / stream-of-search
Repository for the paper Stream of Search: Learning to Search in Language
☆149Updated 6 months ago
allenai / WildBench
Benchmarking LLMs with Challenging Tasks from Real Users
☆233Updated 9 months ago
ScalingIntelligence / Archon
Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file.
☆175Updated 4 months ago
WildEval / ZeroEval
A simple unified framework for evaluating LLMs
☆235Updated 3 months ago
Yu-Fangxu / FoR
[ICML 2025] Flow of Reasoning: Training LLMs for Divergent Reasoning with Minimal Examples
☆103Updated last week
hughbzhang / o1_inference_scaling_laws
Replicating O1 inference-time scaling laws
☆89Updated 8 months ago
princeton-nlp / USACO
Can Language Models Solve Olympiad Programming?
☆119Updated 6 months ago
princeton-nlp / intercode
[NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898
☆223Updated last year
RulinShao / retrieval-scaling
Official repository for "Scaling Retrieval-Based Langauge Models with a Trillion-Token Datastore".
☆207Updated 2 months ago
LeonGuertler / TextArena
A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning
☆225Updated this week
SalesforceAIResearch / LaTRO
☆118Updated 5 months ago
METR / RE-Bench
☆95Updated 3 months ago
huggingface / ioi
☆38Updated 4 months ago
OSU-NLP-Group / ScienceAgentBench
[ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery
☆94Updated last month
eth-sri / matharena
Evaluation of LLMs on latest math competitions
☆155Updated 2 weeks ago
aorwall / moatless-tree-search
☆99Updated 2 months ago
da03 / Internalize_CoT_Step_by_Step
☆187Updated 3 months ago
ConsequentAI / fneval
Functional Benchmarks and the Reasoning Gap
☆88Updated 10 months ago
OSU-NLP-Group / GrokkedTransformer
Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization'
☆226Updated 2 weeks ago
allenai / discoveryworld
A virtual environment for developing and evaluating automated scientific discovery agents.
☆166Updated 4 months ago
R2E-Gym / R2E-Gym
Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents
☆139Updated 3 weeks ago
openai / safety-rbr-code-and-data
Code and example data for the paper: Rule Based Rewards for Language Model Safety
☆190Updated last year
BigComputer-Project / SWE-Arena
SWE Arena
☆33Updated last month
Berkeley-NLP / Agent-Eval-Refine
Code for Paper: Autonomous Evaluation and Refinement of Digital Agents [COLM 2024]
☆139Updated 8 months ago
MLE-Dojo / MLE-Dojo
☆61Updated last week
goodfire-ai / r1-interpretability
Open source interpretability artefacts for R1.
☆157Updated 3 months ago
zhangxjohn / LLM-Agent-Benchmark-List
A banchmark list for evaluation of large language models.
☆134Updated last month
InternLM / SWE-Fixer
☆108Updated 2 months ago
StonyBrookNLP / appworld
🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap…
☆232Updated 2 months ago
THUDM / T1
RL Scaling and Test-Time Scaling (ICML'25)
☆109Updated 6 months ago