swe-bench / SWE-bench
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
★3,796 · Updated last month
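For orientation, here is a minimal sketch of loading the benchmark's task instances from the Hugging Face Hub. The dataset id `princeton-nlp/SWE-bench` matches the public release; the exact field names and the sample instance id are assumptions about the current schema:

```python
# Minimal sketch: load SWE-bench task instances from the Hugging Face Hub.
# Assumes the public princeton-nlp/SWE-bench release; field names may vary
# across dataset versions.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench", split="test")

instance = ds[0]
print(instance["instance_id"])              # e.g. "astropy__astropy-12907" (illustrative)
print(instance["problem_statement"][:300])  # the GitHub issue text a model must resolve
```

Each instance pairs a real GitHub issue with the repository state it was filed against; a system under evaluation must produce a patch that makes the instance's tests pass.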
Alternatives and similar repositories for SWE-bench
Users interested in SWE-bench are comparing it to the libraries listed below.
- Agentless🐱: an agentless approach to automatically solving software development problems (★1,951 · Updated 10 months ago)
- A project-structure-aware autonomous software engineer aiming for autonomous program improvement. Resolved 37.3% of tasks (pass@1) in SWE-be… (★3,029 · Updated 6 months ago)
- [NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (★2,308 · Updated last week)
- Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024 (★1,622 · Updated last month)
- Official implementation for the paper "Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering" (★3,900 · Updated 11 months ago)
- ★4,166 · Updated 3 months ago
- This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software E… (★1,439 · Updated 4 months ago)
- Official repo for the ICML 2024 paper "Executable Code Actions Elicit Better LLM Agents" by Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhan… (★1,441 · Updated last year)
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering (★1,144 · Updated last week)
- Code for the paper "Evaluating Large Language Models Trained on Code" (★3,016 · Updated 10 months ago)
- A framework for serving and evaluating LLM routers - save LLM costs without compromising quality (★4,393 · Updated last year)
- A self-improving embodied conversational agent seamlessly integrated into the operating system to automate daily tasks (★1,692 · Updated last year)
- Simple retrieval from LLMs at various context lengths to measure accuracy (★2,068 · Updated last year)
- LDB: A Large Language Model Debugger via Verifying Runtime Execution Step by Step (ACL'24) (★564 · Updated last year)
- SWE-agent takes a GitHub issue and tries to automatically fix it, using your LM of choice. It can also be employed for offensive cybersec… (★17,754 · Updated last week)
- 👨‍💻 An awesome and curated list of the best code LLMs for research (★1,250 · Updated 11 months ago)
- [ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct (★2,054 · Updated last year)
- [ICLR 2025] Automated Design of Agentic Systems (★1,459 · Updated 9 months ago)
- Together Mixture-Of-Agents (MoA) – 65.1% on AlpacaEval with OSS models (★2,835 · Updated 10 months ago)
- Sky-T1: Train your own o1-preview-style model for under $450 (★3,352 · Updated 4 months ago)
- ★611 · Updated 2 months ago
- Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents" (★1,218 · Updated last month)
- AIOS: AI Agent Operating System (★4,785 · Updated 3 weeks ago)
- AllenAI's post-training codebase (★3,294 · Updated this week)
- LiveBench: A Challenging, Contamination-Free LLM Benchmark (★916 · Updated last week)
- [ICLR 2023] ReAct: Synergizing Reasoning and Acting in Language Models (★3,193 · Updated last year)
- ★4,103 · Updated last year
- Code and data for Tau-Bench (★942 · Updated 2 months ago)
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" (★708 · Updated 4 months ago)
- Code for "WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models" (★957 · Updated last year)