openai / SWELancer-Benchmark
This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?"
☆1,291 Updated 3 weeks ago
Alternatives and similar repositories for SWELancer-Benchmark:
Users who are interested in SWELancer-Benchmark are comparing it to the libraries listed below:
- Agentless🐱: an agentless approach to automatically solve software development problems ☆1,603 Updated 3 months ago
- Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution" ☆477 Updated 2 weeks ago
- AI computer use powered by open source LLMs and E2B Desktop Sandbox ☆981 Updated 3 weeks ago
- Official Repo for ICML 2024 paper "Executable Code Actions Elicit Better LLM Agents" by Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhan… ☆1,006 Updated 10 months ago
- [ICLR 2025] Agent S: an open agentic framework that uses computers like a human ☆1,407 Updated last week
- ☆526 Updated last week
- The specification of the Model Context Protocol ☆1,198 Updated this week
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆657 Updated 2 months ago
- E2B Desktop Sandbox for LLMs. E2B Sandbox with desktop graphical environment that you can connect to any LLM for secure computer use. ☆557 Updated this week
- Learn how to use CUA (our Computer Using Agent) via the API on multiple computer environments. ☆644 Updated 2 weeks ago
- Keep searching, reading webpages, reasoning until it finds the answer (or exceeding the token budget) ☆3,725 Updated last week
- Out-of-the-box (OOTB) GUI Agent for Windows and macOS ☆1,473 Updated last week
- Democratizing Reinforcement Learning for LLMs ☆2,158 Updated last month
- procedural reasoning datasets ☆541 Updated this week
- An agent benchmark with tasks in a simulated software company. ☆274 Updated 2 weeks ago
- Search-o1: Agentic Search-Enhanced Large Reasoning Models ☆759 Updated this week
- ☆438 Updated 6 months ago
- ☆2,685 Updated last week
- RAGEN leverages reinforcement learning to train LLM reasoning agents in interactive, stochastic environments. ☆1,265 Updated last week
- Synthetic data curation for post-training and structured data extraction ☆1,097 Updated last week
- ☆2,014 Updated this week
- Training Large Language Model to Reason in a Continuous Latent Space ☆1,015 Updated 2 months ago
- Sidecar is the AI brains for the Aide editor and works alongside it, locally on your machine ☆534 Updated last week
- Sky-T1: Train your own O1 preview model within $450 ☆3,167 Updated last week
- Verifiers for LLM Reinforcement Learning ☆727 Updated last week
- Open source alternative to Gemini Deep Research. Generate reports with AI based on search results. ☆1,726 Updated 2 weeks ago
- Search-R1: An Efficient, Scalable RL Training Framework for Reasoning & Search Engine Calling interleaved LLM based on veRL ☆1,466 Updated last week
- An open source deep research clone. AI Agent that reasons large amounts of web data extracted with Firecrawl ☆5,208 Updated last month
- Reasoning Augmented Generation ☆772 Updated last month
- LiveBench: A Challenging, Contamination-Free LLM Benchmark ☆628 Updated this week