swe-bench / SWE-bench
SWE-bench [Multimodal]: Can Language Models Resolve Real-World GitHub Issues?
★3,173 · Updated 2 weeks ago
Alternatives and similar repositories for SWE-bench
Users interested in SWE-bench are comparing it to the libraries listed below.
- Agentless: an agentless approach to automatically solve software development problems · ★1,793 · Updated 6 months ago
- A project-structure-aware autonomous software engineer aiming for autonomous program improvement. Resolved 37.3% of tasks (pass@1) in SWE-be… · ★2,967 · Updated 2 months ago
- This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software E… · ★1,435 · Updated 2 months ago
- [NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments · ★1,978 · Updated this week
- Rigorous evaluation of LLM-synthesized code (NeurIPS 2023 & COLM 2024) · ★1,510 · Updated last week
- Official implementation for the paper "Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering" · ★3,869 · Updated 7 months ago
- Sky-T1: Train your own O1 preview model within $450 · ★3,305 · Updated this week
- ★3,821 · Updated last week
- Code for the paper "Evaluating Large Language Models Trained on Code" · ★2,828 · Updated 6 months ago
- LDB: A Large Language Model Debugger via Verifying Runtime Execution Step by Step (ACL'24) · ★549 · Updated 10 months ago
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering · ★800 · Updated 3 weeks ago
- Official repo for the ICML 2024 paper "Executable Code Actions Elicit Better LLM Agents" by Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhan… · ★1,296 · Updated last year
- 👨‍💻 An awesome and curated list of the best code LLMs for research · ★1,215 · Updated 7 months ago
- SWE-agent takes a GitHub issue and tries to automatically fix it, using your LM of choice. It can also be employed for offensive cybersec… · ★16,669 · Updated this week
- [ICLR 2025] Automated Design of Agentic Systems · ★1,373 · Updated 5 months ago
- [ICML 2024] LLMCompiler: An LLM Compiler for Parallel Function Calling · ★1,712 · Updated last year
- Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents" · ★1,048 · Updated 5 months ago
- Code and Data for Tau-Bench · ★666 · Updated this week
- [ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct · ★2,018 · Updated 8 months ago
- Home of StarCoder2! · ★1,936 · Updated last year
- LiveBench: A Challenging, Contamination-Free LLM Benchmark · ★823 · Updated this week
- A framework for serving and evaluating LLM routers — save LLM costs without compromising quality · ★4,099 · Updated 11 months ago
- Code for "WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models" · ★847 · Updated last year
- AIDE: AI-Driven Exploration in the Space of Code. The machine learning engineering agent that automates AI R&D. · ★956 · Updated last week
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" · ★590 · Updated last week
- ★490 · Updated 3 weeks ago
- Doing simple retrieval from LLMs at various context lengths to measure accuracy · ★1,934 · Updated 11 months ago
- A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24) · ★2,683 · Updated 5 months ago
- AIOS: AI Agent Operating System · ★4,358 · Updated last week
- AllenAI's post-training codebase · ★3,061 · Updated this week
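Several of the benchmarks above report results as pass@1 or pass@k resolution rates (e.g. "Resolved 37.3% tasks (pass@1)"). A minimal sketch of the standard unbiased pass@k estimator introduced in "Evaluating Large Language Models Trained on Code" (the function name here is my own):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: the probability that at least one of k samples,
    drawn without replacement from n generated samples of which c are
    correct, passes the tests."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must include at least one correct sample.
        return 1.0
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per task, 3 of which resolve the issue.
print(pass_at_k(10, 3, 1))  # pass@1 = 1 - 7/10 = 0.3
```

Averaging this quantity over all tasks in a benchmark gives the headline pass@k score; computing it combinatorially rather than by naive sampling avoids variance from the draw itself.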