swe-bench / SWE-bench
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
☆4,267 · Updated last week
Alternatives and similar repositories for SWE-bench
Users interested in SWE-bench are comparing it to the repositories listed below.
- Agentless🐱: an agentless approach to automatically solve software development problems ☆2,006 · Updated last year
- A project structure aware autonomous software engineer aiming for autonomous program improvement. Resolved 37.3% tasks (pass@1) in SWE-be… ☆3,053 · Updated 9 months ago
- [NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments ☆2,552 · Updated last week
- This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software E… ☆1,439 · Updated 6 months ago
- ☆4,346 · Updated 6 months ago
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆1,301 · Updated 3 weeks ago
- Official implementation for the paper "Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering" ☆3,922 · Updated last year
- Code for the paper "Evaluating Large Language Models Trained on Code" ☆3,127 · Updated last year
- LDB: A Large Language Model Debugger via Verifying Runtime Execution Step by Step (ACL'24) ☆576 · Updated last year
- SWE-agent takes a GitHub issue and tries to automatically fix it, using your LM of choice. It can also be employed for offensive cybersec… ☆18,430 · Updated this week
- Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024 ☆1,687 · Updated 4 months ago
- Official Repo for ICML 2024 paper "Executable Code Actions Elicit Better LLM Agents" by Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhan… ☆1,579 · Updated last year
- The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—b… ☆2,864 · Updated this week
- LiveBench: A Challenging, Contamination-Free LLM Benchmark ☆1,032 · Updated last week
- A self-improving embodied conversational agent seamlessly integrated into the operating system to automate our daily tasks. ☆1,748 · Updated last year
- AIOS: AI Agent Operating System ☆5,060 · Updated 3 weeks ago
- A framework for serving and evaluating LLM routers - save LLM costs without compromising quality ☆4,581 · Updated last year
- A benchmark for LLMs on complicated tasks in the terminal ☆1,540 · Updated 3 weeks ago
- AgentCoder: multi-agent code generation framework ☆376 · Updated 2 months ago
- A unified evaluation framework for large language models ☆2,773 · Updated 3 weeks ago
- Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents" ☆1,327 · Updated 2 months ago
- [ICML 2024] LLMCompiler: An LLM Compiler for Parallel Function Calling ☆1,822 · Updated last year
- Code and Data for Tau-Bench ☆1,094 · Updated 5 months ago
- [ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct ☆2,076 · Updated last year
- AIDE: AI-Driven Exploration in the Space of Code. The machine learning engineering agent that automates AI R&D. ☆1,135 · Updated 3 months ago
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" ☆796 · Updated 6 months ago
- OpenAI Frontier Evals ☆994 · Updated 2 months ago
- Renderer for the harmony response format to be used with gpt-oss ☆4,184 · Updated last month
- Code for "WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models" ☆1,020 · Updated last year
- ☆626 · Updated 5 months ago