swe-bench / SWE-bench
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
★3,796 · Updated last month
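For orientation, here is a minimal sketch of loading the benchmark's task instances from the Hugging Face Hub. The dataset id `princeton-nlp/SWE-bench` matches the public release; the exact field names and the sample instance id are assumptions about the current schema:

```python
# Minimal sketch: load SWE-bench task instances from the Hugging Face Hub.
# Assumes the public princeton-nlp/SWE-bench release; field names may vary
# across dataset versions.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench", split="test")

instance = ds[0]
print(instance["instance_id"])              # e.g. "astropy__astropy-12907" (illustrative)
print(instance["problem_statement"][:300])  # the GitHub issue text a model must resolve
```

Each instance pairs a real GitHub issue with the repository state it was filed against; a system under evaluation must produce a patch that makes the instance's tests pass.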
Alternatives and similar repositories for SWE-bench
Users interested in SWE-bench are comparing it to the libraries listed below.
- Agentless🐱: an agentless approach to automatically solving software development problems (★1,951 · Updated 10 months ago)
- A project-structure-aware autonomous software engineer aiming for autonomous program improvement. Resolved 37.3% of tasks (pass@1) in SWE-be… (★3,029 · Updated 6 months ago)
- [NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (★2,308 · Updated last week)
- Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024 (★1,622 · Updated last month)
- Official implementation for the paper "Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering" (★3,900 · Updated 11 months ago)
- ★4,166 · Updated 3 months ago
- This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software E… (★1,439 · Updated 4 months ago)
- Official repo for the ICML 2024 paper "Executable Code Actions Elicit Better LLM Agents" by Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhan… (★1,441 · Updated last year)
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering (★1,144 · Updated last week)
- Code for the paper "Evaluating Large Language Models Trained on Code" (★3,016 · Updated 10 months ago)
- A framework for serving and evaluating LLM routers - save LLM costs without compromising quality (★4,393 · Updated last year)
- A self-improving embodied conversational agent seamlessly integrated into the operating system to automate daily tasks (★1,692 · Updated last year)
- Simple retrieval from LLMs at various context lengths to measure accuracy (★2,068 · Updated last year)
- LDB: A Large Language Model Debugger via Verifying Runtime Execution Step by Step (ACL'24) (★564 · Updated last year)
- SWE-agent takes a GitHub issue and tries to automatically fix it, using your LM of choice. It can also be employed for offensive cybersec… (★17,754 · Updated last week)
- 👨‍💻 An awesome and curated list of the best code LLMs for research (★1,250 · Updated 11 months ago)
- [ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct (★2,054 · Updated last year)
- [ICLR 2025] Automated Design of Agentic Systems (★1,459 · Updated 9 months ago)
- Together Mixture-Of-Agents (MoA) – 65.1% on AlpacaEval with OSS models (★2,835 · Updated 10 months ago)
- Sky-T1: Train your own o1-preview-style model for under $450 (★3,352 · Updated 4 months ago)
- ★611 · Updated 2 months ago
- Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents" (★1,218 · Updated last month)
- AIOS: AI Agent Operating System (★4,785 · Updated 3 weeks ago)
- AllenAI's post-training codebase (★3,294 · Updated this week)
- LiveBench: A Challenging, Contamination-Free LLM Benchmark (★916 · Updated last week)
- [ICLR 2023] ReAct: Synergizing Reasoning and Acting in Language Models (★3,193 · Updated last year)
- ★4,103 · Updated last year
- Code and data for Tau-Bench (★942 · Updated 2 months ago)
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" (★708 · Updated 4 months ago)
- Code for "WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models" (★957 · Updated last year)