amazon-science / SWE-PolyBench
SWE-PolyBench: A multi-language benchmark for repository-level evaluation of coding agents
☆75 · Updated this week
Alternatives and similar repositories for SWE-PolyBench
Users who are interested in SWE-PolyBench are comparing it to the repositories listed below.
- Harbor is a framework for running agent evaluations and creating and using RL environments. ☆213 · Updated this week
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ☆396 · Updated this week
- ☆207 · Updated last week
- [NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents ☆487 · Updated this week
- Open-sourced predictions, execution logs, trajectories, and results from model inference and evaluation runs on the SWE-bench task. ☆228 · Updated last week
- Run SWE-bench evaluations remotely ☆47 · Updated 4 months ago
- τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment ☆549 · Updated last week
- ☆59 · Updated 10 months ago
- Public repository containing METR's DVC pipeline for eval data analysis ☆164 · Updated 8 months ago
- ☆235 · Updated 3 weeks ago
- A clean, modular SDK for building AI agents with OpenHands V1. ☆360 · Updated this week
- A Text-Based Environment for Interactive Debugging ☆286 · Updated this week
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆190 · Updated 9 months ago
- Curated collection of community environments ☆195 · Updated last week
- An agent benchmark with tasks in a simulated software company. ☆604 · Updated last month
- Matrix (Multi-Agent daTa geneRation Infra and eXperimentation framework) is a versatile engine for multi-agent conversational data genera… ☆241 · Updated 2 weeks ago
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? ☆233 · Updated last month
- ☆128 · Updated 6 months ago
- ☆234 · Updated 5 months ago
- Accompanying material for the sleep-time compute paper ☆118 · Updated 7 months ago
- Tutorial for building an LLM router ☆239 · Updated last year
- CodeSage: Code Representation Learning At Scale (ICLR 2024) ☆114 · Updated last year
- A benchmark for LLMs on complicated tasks in the terminal ☆1,235 · Updated this week
- Agent-computer interface for an AI software engineer. ☆115 · Updated 2 weeks ago
- Code for the paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025] ☆601 · Updated 4 months ago
- The Granite Guardian models are designed to detect risks in prompts and responses. ☆123 · Updated 2 months ago
- Coding problems used in aider's polyglot benchmark ☆198 · Updated last year
- ☆136 · Updated 9 months ago
- ☆79 · Updated 2 months ago
- This repository contains the toolkit for replicating results from our technical report. ☆181 · Updated 3 months ago