openai / SWELancer-BenchmarkLinks

This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?"

☆1,433

Alternatives and similar repositories for SWELancer-Benchmark

Users that are interested in SWELancer-Benchmark are comparing it to the libraries listed below

Sorting:

openai / preparedness
Releases from OpenAI Preparedness
☆815Updated last week
openai / harmony
Renderer for the harmony response format to be used with gpt-oss
☆2,637Updated this week
OpenAutoCoder / Agentless
Agentless🐱: an agentless approach to automatically solve software development problems
☆1,846Updated 7 months ago
facebookresearch / swe-rl
Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution"
☆573Updated 4 months ago
TheAgentCompany / TheAgentCompany
An agent benchmark with tasks in a simulated software company.
☆515Updated last week
microsoft / WindowsAgentArena
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of multi-modal AI agents.
☆745Updated 3 months ago
augmentcode / augment-swebench-agent
The #1 open-source SWE-bench Verified implementation
☆783Updated 2 months ago
openai / mle-bench
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering
☆823Updated last month
xingyaoww / code-act
Official Repo for ICML 2024 paper "Executable Code Actions Elicit Better LLM Agents" by Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhan…
☆1,317Updated last year
willccbb / verifiers
Verifiers for LLM Reinforcement Learning
☆1,690Updated last week
bespokelabsai / curator
Synthetic data curation for post-training and structured data extraction
☆1,468Updated last week
e2b-dev / desktop
E2B Desktop Sandbox for LLMs. E2B Sandbox with desktop graphical environment that you can connect to any LLM for secure computer use.
☆1,036Updated 2 weeks ago
jennyzzt / dgm
Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents
☆1,550Updated last month
sierra-research / tau-bench
Code and Data for Tau-Bench
☆742Updated 3 weeks ago
SWE-Gym / SWE-Gym
Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025]
☆516Updated last week
aorwall / moatless-tools
☆533Updated last month
codelion / openevolve
Open-source implementation of AlphaEvolve
☆3,506Updated this week
NovaSky-AI / SkyThought
Sky-T1: Train your own O1 preview model within $450
☆3,320Updated 3 weeks ago
deepseek-ai / DeepSeek-Prover-V2
☆1,178Updated 3 weeks ago
LiveCodeBench / LiveCodeBench
Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"
☆616Updated 3 weeks ago
SWE-bench / SWE-bench
SWE-bench [Multimodal]: Can Language Models Resolve Real-world Github Issues?
☆3,273Updated last week
ServiceNow / BrowserGym
🌎💪 BrowserGym, a Gym environment for web task automation
☆843Updated this week
centerforaisafety / hle
Humanity's Last Exam
☆998Updated 2 weeks ago
qixucen / atom
Atom of Thoughts for Markov LLM Test-Time Scaling
☆580Updated last month
safety-research / circuit-tracer
☆2,238Updated this week
google / langfun
OO for LLMs
☆835Updated last week
facebookresearch / coconut
Training Large Language Model to Reason in a Continuous Latent Space
☆1,224Updated 6 months ago
ShengranHu / ADAS
[ICLR 2025] Automated Design of Agentic Systems
☆1,395Updated 6 months ago
openai / openai-cua-sample-app
Learn how to use CUA (our Computer Using Agent) via the API on multiple computer environments.
☆1,043Updated 3 months ago
MinorJerry / WebVoyager
Code for "WebVoyager: WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models"
☆877Updated last year