gso-bench / gso
GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
☆30 Updated last week
Alternatives and similar repositories for gso
Users who are interested in gso are comparing it to the repositories listed below
- r2e: turn any github repository into a programming agent environment ☆128 Updated 3 months ago
- Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents ☆128 Updated last week
- A benchmark for LLMs on complicated tasks in the terminal ☆260 Updated last week
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆173 Updated 4 months ago
- A scalable asynchronous reinforcement learning implementation with in-flight weight updates. ☆130 Updated this week
- Scaling Data for SWE-agents ☆309 Updated this week
- ☆20 Updated 6 months ago
- RepoQA: Evaluating Long-Context Code Understanding ☆112 Updated 8 months ago
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆149 Updated 9 months ago
- Can Language Models Solve Olympiad Programming? ☆119 Updated 6 months ago
- Evaluation of LLMs on latest math competitions ☆151 Updated this week
- Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task. ☆194 Updated 2 weeks ago
- Training and Benchmarking LLMs for Code Preference. ☆34 Updated 8 months ago
- Async pipelined version of Verl ☆110 Updated 3 months ago
- ☆41 Updated 5 months ago
- InstructCoder: Instruction Tuning Large Language Models for Code Editing | Oral ACL-2024 srw ☆61 Updated 9 months ago
- 🚀 SWE-bench Goes Live! ☆100 Updated last week
- Replicating O1 inference-time scaling laws ☆89 Updated 7 months ago
- Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025] ☆506 Updated 2 months ago
- ☆28 Updated last week
- [COLM 2025] Code for Paper: Learning Adaptive Parallel Reasoning with Language Models ☆114 Updated 3 months ago
- ☆35 Updated 4 months ago
- ☆192 Updated last month
- ☆97 Updated last month
- A simple unified framework for evaluating LLMs ☆227 Updated 3 months ago
- ☆57 Updated this week
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆206 Updated last month
- ✨ RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems - ICLR 2024 ☆168 Updated 11 months ago
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023) ☆150 Updated 11 months ago
- ☆36 Updated 2 months ago