openai / SWELancer-Benchmark
This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?"
☆1,408 · Updated last month
Alternatives and similar repositories for SWELancer-Benchmark
Users interested in SWELancer-Benchmark are comparing it to the libraries listed below.
- Releases from OpenAI Preparedness ☆783 · Updated 3 weeks ago
- The #1 open-source SWE-bench Verified implementation ☆747 · Updated 2 weeks ago
- Agentless🐱: an agentless approach to automatically solve software development problems ☆1,743 · Updated 6 months ago
- E2B Desktop Sandbox for LLMs: a sandbox with a desktop graphical environment that you can connect to any LLM for secure computer use ☆978 · Updated last week
- Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution" ☆551 · Updated 3 months ago
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆760 · Updated last week
- A Model Context Protocol Server connector for the Perplexity API, to enable web search without leaving the MCP ecosystem ☆1,271 · Updated 2 months ago
- AI computer use powered by open-source LLMs and E2B Desktop Sandbox ☆1,291 · Updated 3 weeks ago
- An agent benchmark with tasks in a simulated software company ☆407 · Updated this week
- OctoTools: an agentic framework with extensible tools for complex reasoning ☆1,197 · Updated this week
- II-Agent: a new open-source framework to build and deploy intelligent agents ☆2,511 · Updated this week
- Sky-T1: Train your own O1 preview model within $450 ☆3,272 · Updated last month
- [ICLR 2025] Automated Design of Agentic Systems ☆1,345 · Updated 4 months ago
- Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents ☆1,346 · Updated 2 weeks ago
- Democratizing Reinforcement Learning for LLMs ☆3,396 · Updated last month
- SWE-bench [Multimodal]: Can Language Models Resolve Real-World GitHub Issues? ☆3,107 · Updated this week
- Build effective agents using Model Context Protocol and simple workflow patterns ☆5,930 · Updated this week
- ⚖️ The First Coding Agent-as-a-Judge ☆562 · Updated last month
- Keeps searching, reading webpages, and reasoning until it finds the answer (or exceeds the token budget) ☆4,492 · Updated last week
- Open-source alternative to Gemini Deep Research: generates reports with AI based on search results ☆1,973 · Updated 3 months ago
- LiveBench: A Challenging, Contamination-Free LLM Benchmark ☆789 · Updated 2 weeks ago
- Synthetic data curation for post-training and structured data extraction ☆1,414 · Updated last week
- MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model ☆2,118 · Updated last week
- Sidecar is the AI brains for the Aide editor and works alongside it, locally on your machine ☆571 · Updated last month
- Atom of Thoughts for Markov LLM Test-Time Scaling ☆574 · Updated last week
- Code for the paper "Training Software Engineering Agents and Verifiers with SWE-Gym" [ICML 2025] ☆489 · Updated last month
- Training Large Language Models to Reason in a Continuous Latent Space ☆1,162 · Updated 5 months ago