gso-bench/gso

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/gso-bench/gso)

gso-bench / gso

[NeurIPS '25] GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

☆87

Alternatives and similar repositories for gso

Users that are interested in gso are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

r2e-project / r2e
View on GitHub
[ICML '24] R2E: Turn any GitHub Repository into a Programming Agent Environment
☆149Apr 20, 2025Updated last year
MaoZiming / papers
View on GitHub
Paper-reading notes for Berkeley OS prelim exam.
☆14Aug 28, 2024Updated last year
swefficiency / swefficiency
View on GitHub
Benchmark harness and code for "SWE-fficiency: Can Language Models Optimize Real World Repositories on Real World Workloads?"
☆21Feb 17, 2026Updated 5 months ago
SWE-Gym / SWE-Bench-Fork
View on GitHub
☆13Mar 5, 2025Updated last year
SWE-Gym / SWE-Gym
View on GitHub
Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025]
☆711Jul 29, 2025Updated 11 months ago
Wordpress hosting with auto-scaling - Free Trial Offer • Ad
Fully Managed hosting for WordPress and WooCommerce businesses that need reliable, auto-scalable performance. Cloudways SafeUpdates now available.
R2E-Gym / R2E-Gym
View on GitHub
[COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents
☆310Jul 13, 2025Updated last year
FrontierCS / Frontier-CS
View on GitHub
A benchmark for evaluating LLMs on open-ended CS problems. Exploring the Next Frontier of Computer Science.
☆286Updated this week
SWE-bench / SWE-smith
View on GitHub
[NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents
☆711Updated this week
multi-swe-bench / multi-swe-bench
View on GitHub
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
☆354Dec 18, 2025Updated 7 months ago
LearningOpt / pie
View on GitHub
☆59Jul 18, 2024Updated 2 years ago
SWE-Perf / SWE-Perf
View on GitHub
☆52Oct 28, 2025Updated 8 months ago
oripress / AlgoTune
View on GitHub
AlgoTune is a NeurIPS 2025 benchmark made up of 154 math, physics, and computer science problems. The goal is write code that solves each…
☆110Jun 24, 2026Updated last month
facebookresearch / swe-rl
View on GitHub
[NeurIPS'25] Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution"
☆712Mar 16, 2025Updated last year
SakanaAI / robust-kbench
View on GitHub
☆101Nov 22, 2025Updated 8 months ago
Managed Database hosting by DigitalOcean • Ad
PostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
farukakgul / ReasonMaxxer
View on GitHub
☆18May 8, 2026Updated 2 months ago
google-research / pynsy
View on GitHub
Heavyweight Python dynamic analysis framework
☆18Apr 17, 2024Updated 2 years ago
Deep-Learning-Profiling-Tools / fasten
View on GitHub
☆14Apr 24, 2024Updated 2 years ago
open-thoughts / OpenThoughts-TBLite
View on GitHub
A Difficulty-Calibrated Benchmark for Building Terminal Agents
☆27Feb 20, 2026Updated 5 months ago
thinking-machines-lab / tinker-project-ideas
View on GitHub
Ideas for projects related to Tinker
☆191Nov 6, 2025Updated 8 months ago
lm-sys / llm-decontaminator
View on GitHub
Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"
☆325Dec 20, 2023Updated 2 years ago
microsoft / verus-copilot-vscode
View on GitHub
☆16Jan 23, 2026Updated 6 months ago
DeepCommit-ai / SWE-Milestone
View on GitHub
A Continuous Task Evaluation Playground for AI Harness
☆64Updated this week
JiayiGeng / CAID
View on GitHub
Code repo for paper: Effective Strategies for Asynchronous Software Engineering Agents
☆65Apr 2, 2026Updated 3 months ago
AI Agents on DigitalOcean Gradient AI Platform • Ad
Build production-ready AI agents using customizable tools or access multiple LLMs through a single endpoint. Create custom knowledge bases or connect external data.
Proximal-Labs / frontier-swe
View on GitHub
FrontierSWE is an ultra long-horizon coding agent benchmark that tests implementation, performance eng and ML research
☆193Jul 17, 2026Updated last week
JoshuaPurtell / SmallBench
View on GitHub
Small, simple agent task environments for training and evaluation
☆20Nov 1, 2024Updated last year
scaleapi / SWE-Atlas
View on GitHub
open source SWE-Atlas
☆57Updated this week
He-Ren / OJBench
View on GitHub
☆32Feb 28, 2026Updated 4 months ago
Jingyu6 / speculative_prefill
View on GitHub
☆63May 19, 2025Updated last year
METR / RE-Bench
View on GitHub
☆145Oct 16, 2025Updated 9 months ago
MineAnyBuild / MineAnyBuild
View on GitHub
Code and benchmark of the paper "MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents" (NeurIPS D&B 2025)
☆15Oct 13, 2025Updated 9 months ago
Intelligent-CAT-Lab / PLTranslationEmpirical
View on GitHub
Artifact repository for the paper "Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code", In P…
☆54Apr 12, 2025Updated last year
facebookresearch / llm-speedrunner
View on GitHub
The Automated LLM Speedrunning Benchmark measures how well LLM agents can reproduce previous innovations and discover new ones in languag…
☆145May 6, 2026Updated 2 months ago
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
microsoft / SWE-bench-Live
View on GitHub
[NeurIPS 2025 D&B] 🚀 SWE-bench Goes Live!
☆212Jun 11, 2026Updated last month
evalplus / repoqa
View on GitHub
RepoQA: Evaluating Long-Context Code Understanding
☆136Nov 1, 2024Updated last year
f-t-s / CGD
View on GitHub
This repository contains the Julia code for the paper "Competitive Gradient Descent"
☆25Dec 18, 2019Updated 6 years ago
Trinity-data-store / Trinity
View on GitHub
EuroSys '24: "Trinity: A Fast Compressed Multi-attribute Data Store"
☆18Mar 8, 2025Updated last year
suquark / llm4phd
View on GitHub
Examples and instructions about use LLMs (especially ChatGPT) for PhD
☆104Mar 18, 2023Updated 3 years ago
Infini-AI-Lab / vortex_torch
View on GitHub
Vortex: Programmable Sparse Attention for Agents as Algorithm Designers
☆67Jun 24, 2026Updated last month
letta-ai / letta-terminalbench
View on GitHub
letta integration for terminalbench (#1 open source agent, in under 200 lines of code)
☆19Oct 22, 2025Updated 9 months ago