amazon-science / SWE-PolyBenchLinks

SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents

☆77

Alternatives and similar repositories for SWE-PolyBench

Users that are interested in SWE-PolyBench are comparing it to the libraries listed below

Sorting:

SWE-bench / experiments
Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task.
☆246Updated last week
SWE-agent / SWE-ReX
Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more.
☆430Updated this week
SWE-bench / SWE-smith
[NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents
☆538Updated this week
laude-institute / harbor
Harbor is a framework for running agent evaluations and creating and using RL environments.
☆542Updated this week
princeton-pli / hal-harness
☆223Updated this week
amazon-science / CodeSage
CodeSage: Code Representation Learning At Scale (ICLR 2024)
☆116Updated last year
SWE-bench / sb-cli
Run SWE-bench evaluations remotely
☆53Updated 5 months ago
METR / eval-analysis-public
Public repository containing METR's DVC pipeline for eval data analysis
☆199Updated last week
bigcode-project / bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
☆477Updated last month
google / lmeval
☆238Updated 2 months ago
Aider-AI / aider-swe-bench
Harness used to benchmark aider against SWE Bench benchmarks
☆79Updated last year
DeepSoftwareAnalytics / Awesome-Agent4SE
☆106Updated last year
haizelabs / verdict
Inference-time scaling for LLMs-as-a-judge.
☆328Updated 3 months ago
aorwall / moatless-tree-search
☆132Updated 8 months ago
SWE-Gym / SWE-Gym
Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025]
☆625Updated 6 months ago
sierra-research / tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
☆717Updated last week
chroma-core / context-rot
This repository contains the toolkit for replicating results from our technical report.
☆200Updated 5 months ago
ScalingIntelligence / codemonkeys
☆59Updated last year
All-Hands-AI / openhands-resolver
A system that tries to resolve all issues on a github repo with OpenHands.
☆117Updated last year
laude-institute / terminal-bench
A benchmark for LLMs on complicated tasks in the terminal
☆1,494Updated 2 weeks ago
PrimeIntellect-ai / community-environments
Lightly-reviewed collection of community environments
☆210Updated last week
anthropic-experimental / automated-auditing
Prompts used in the Automated Auditing Blog Post
☆137Updated 6 months ago
All-Hands-AI / openhands-aci
Agent computer interface for AI software engineer.
☆116Updated 2 months ago
UKGovernmentBEIS / inspect_evals
Collection of evals for Inspect AI
☆357Updated this week
facebookresearch / swe-rl
[NeurIPS'25] Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution"
☆675Updated 10 months ago
open-thoughts / OpenThoughts-Agent
Data recipes and robust infrastructure for training AI agents
☆94Updated this week
OpenAutoCoder / live-swe-agent
Live-SWE-agent: live, runtime self-evolving software engineering agent
☆240Updated 3 weeks ago
anyscale / llm-router
Tutorial for building LLM router
☆244Updated last year
aymeric-roucher / GAIA
Beating the GAIA benchmark with Transformers Agents. 🚀
☆146Updated 11 months ago
guidance-ai / jsonschemabench
☆76Updated 7 months ago