swe-bench / SWE-bench
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
⭐ 3,665 · Updated last week
Alternatives and similar repositories for SWE-bench
Users interested in SWE-bench are comparing it to the libraries listed below.
- Agentless🐱: an agentless approach to automatically solve software development problems ⭐ 1,931 · Updated 9 months ago
- A project structure aware autonomous software engineer aiming for autonomous program improvement. Resolved 37.3% tasks (pass@1) in SWE-be… ⭐ 3,016 · Updated 5 months ago
- Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024 ⭐ 1,606 · Updated 2 weeks ago
- [NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments ⭐ 2,226 · Updated this week
- ⭐ 4,116 · Updated 2 months ago
- SWE-agent takes a GitHub issue and tries to automatically fix it, using your LM of choice. It can also be employed for offensive cybersec… ⭐ 17,560 · Updated last week
- Official Repo for ICML 2024 paper "Executable Code Actions Elicit Better LLM Agents" by Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhan… ⭐ 1,397 · Updated last year
- 👨‍💻 An awesome and curated list of the best code LLMs for research. ⭐ 1,237 · Updated 10 months ago
- Code for the paper "Evaluating Large Language Models Trained on Code" ⭐ 2,973 · Updated 9 months ago
- Official implementation for the paper: "Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering" ⭐ 3,895 · Updated 10 months ago
- ⭐ 603 · Updated last month
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ⭐ 1,012 · Updated last week
- LDB: A Large Language Model Debugger via Verifying Runtime Execution Step by Step (ACL'24) ⭐ 560 · Updated last year
- This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software E… ⭐ 1,436 · Updated 3 months ago
- LiveBench: A Challenging, Contamination-Free LLM Benchmark ⭐ 890 · Updated last week
- Code and Data for Tau-Bench ⭐ 891 · Updated last month
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" ⭐ 680 · Updated 3 months ago
- AIDE: AI-Driven Exploration in the Space of Code. The machine learning engineering agent that automates AI R&D. ⭐ 1,048 · Updated 3 weeks ago
- Doing simple retrieval from LLM models at various context lengths to measure accuracy ⭐ 2,050 · Updated last year
- AIOS: AI Agent Operating System ⭐ 4,714 · Updated this week
- Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents" ⭐ 1,175 · Updated 2 weeks ago
- [ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct ⭐ 2,049 · Updated 11 months ago
- [EMNLP'23, ACL'24] To speed up LLM inference and enhance LLMs' perception of key information, compress the prompt and KV-Cache, which ach… ⭐ 5,481 · Updated 7 months ago
- TextGrad: Automatic "Differentiation" via Text -- using large language models to backpropagate textual gradients. Published in Nature. ⭐ 3,012 · Updated 2 months ago
- A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24) ⭐ 2,844 · Updated last week
- [ICML 2024] LLMCompiler: An LLM Compiler for Parallel Function Calling ⭐ 1,767 · Updated last year
- [ICLR 2025] Automated Design of Agentic Systems ⭐ 1,435 · Updated 8 months ago
- AllenAI's post-training codebase ⭐ 3,252 · Updated this week
- [TMLR] A curated list of language modeling research for code (and other software engineering activities), plus related datasets. ⭐ 2,944 · Updated this week
- Arena-Hard-Auto: An automatic LLM benchmark. ⭐ 940 · Updated 3 months ago