mariushobbhahn / SWEBench-verified-mini

☆21

Alternatives and similar repositories for SWEBench-verified-mini

Users who are interested in SWEBench-verified-mini are comparing it to the repositories listed below.

princeton-nlp / intercode
[NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898
☆227 Updated last year
R2E-Gym / R2E-Gym
[COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents
☆176 Updated 3 months ago
facebookresearch / cruxeval
CRUXEval: Code Reasoning, Understanding, and Execution Evaluation
☆154 Updated last year
scicode-bench / SciCode
A benchmark that challenges language models to code solutions for scientific problems
☆145 Updated last week
r2e-project / r2e
[ICML '24] R2E: Turn any GitHub Repository into a Programming Agent Environment
☆133 Updated 6 months ago
gso-bench / gso
[NeurIPS '25] Challenging Software Optimization Tasks for Evaluating SWE-Agents
☆54 Updated this week
aorwall / SWE-bench-docker
☆99 Updated last year
Leolty / repobench
✨ RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems - ICLR 2024
☆174 Updated last year
METR / RE-Bench
☆113 Updated last week
SWE-bench / experiments
Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task.
☆218 Updated this week
qishenghu / InstructCoder
InstructCoder: Instruction Tuning Large Language Models for Code Editing | Oral ACL-2024 srw
☆62 Updated last year
princeton-nlp / USACO
Can Language Models Solve Olympiad Programming?
☆119 Updated 9 months ago
Yu-Fangxu / FoR
[ICML 2025] Flow of Reasoning: Training LLMs for Divergent Reasoning with Minimal Examples
☆108 Updated 3 months ago
ucl-dark / llm_debate
Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"
☆117 Updated last year
BigComputer-Project / SWE-Arena
SWE Arena
☆35 Updated 3 months ago
meg-tong / sycophancy-eval
datasets from the paper "Towards Understanding Sycophancy in Language Models"
☆94 Updated 2 years ago
amazon-science / cceval
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023)
☆159 Updated 2 months ago
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆78 Updated 3 months ago
TransluceAI / observatory
A toolkit for describing model features and intervening on those features to steer behavior.
☆209 Updated 11 months ago
StonyBrookNLP / appworld
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and Interactive Coding Agent, ACL'24 Best Resource…
☆290 Updated last week
huggingface / ioi
☆40 Updated 7 months ago
SWE-Gym / SWE-Gym
Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025]
☆553 Updated 2 months ago
suzgunmirac / dynamic-cheatsheet
Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory
☆159 Updated 5 months ago
WildEval / ZeroEval
A simple unified framework for evaluating LLMs
☆254 Updated 6 months ago
ntunlp / ExecEval
A distributed, extensible, secure solution for evaluating machine generated code with unit tests in multiple programming languages.
☆56 Updated last year
QwenLM / CodeElo
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
☆55 Updated 8 months ago
aorwall / moatless-tree-search
☆120 Updated 4 months ago
ars22 / scaling-LLM-math-synthetic-data
Code and data used in the paper: "Training on Incorrect Synthetic Data via RL Scales LLM Math Reasoning Eight-Fold"
☆31 Updated last year
eth-sri / matharena
Evaluation of LLMs on latest math competitions
☆172 Updated last week
evalplus / repoqa
RepoQA: Evaluating Long-Context Code Understanding
☆119 Updated 11 months ago