swe-bench / experiments
Open-sourced predictions, execution logs, trajectories, and results from model inference and evaluation runs on the SWE-bench task.
☆103 · Updated this week
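The predictions collected in this repository follow the JSONL schema consumed by the SWE-bench evaluation harness: one JSON object per line with `instance_id`, `model_name_or_path`, and `model_patch` keys. A minimal sketch of indexing such a file by instance ID, assuming that schema (the file name `all_preds.jsonl`, the sample instance ID, and the patch content below are placeholders, not real run artifacts):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample entry in the SWE-bench predictions format.
# The patch text is a placeholder, not a real model output.
sample = {
    "instance_id": "astropy__astropy-12907",
    "model_name_or_path": "example-model",
    "model_patch": "diff --git a/file.py b/file.py\n",
}

def load_predictions(path: Path) -> dict:
    """Index a JSONL predictions file by instance_id."""
    preds = {}
    with path.open() as f:
        for line in f:
            entry = json.loads(line)
            preds[entry["instance_id"]] = entry
    return preds

with tempfile.TemporaryDirectory() as tmp:
    pred_file = Path(tmp) / "all_preds.jsonl"  # assumed file name
    pred_file.write_text(json.dumps(sample) + "\n")
    preds = load_predictions(pred_file)
    print(len(preds))  # 1
    print(preds["astropy__astropy-12907"]["model_name_or_path"])  # example-model
```

Indexing by `instance_id` mirrors how evaluation matches each prediction to its benchmark task instance.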
Related projects
Alternatives and complementary repositories for experiments
- Enhancing AI Software Engineering with Repository-level Code Graph ☆96 · Updated 2 months ago
- r2e: Turn any GitHub repository into a programming agent environment ☆89 · Updated 3 weeks ago
- RepoQA: Evaluating Long-Context Code Understanding ☆100 · Updated 3 weeks ago
- Harness used to benchmark aider against SWE-bench ☆53 · Updated 4 months ago
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" ☆222 · Updated last month
- [NeurIPS 2023 D&B] Code repository for the InterCode benchmark (https://arxiv.org/abs/2306.14898) ☆194 · Updated 6 months ago
- ✨ RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems (ICLR 2024) ☆133 · Updated 3 months ago
- Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions ☆40 · Updated 3 months ago
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file ☆128 · Updated last month
- [NeurIPS'24] SelfCodeAlign: Self-Alignment for Code Generation ☆270 · Updated 3 weeks ago
- EvoEval: Evolving Coding Benchmarks via LLM ☆60 · Updated 7 months ago
- Code for the paper "🌳 Tree Search for Language Model Agents" ☆140 · Updated 3 months ago
- 🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents" (ACL 2024 Best Resource Paper) ☆111 · Updated last month
- AWM: Agent Workflow Memory ☆210 · Updated last month
- BigCodeBench: Benchmarking Code Generation Towards AGI ☆231 · Updated this week
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆115 · Updated last month
- Graph-based method for end-to-end code completion with repository-level context awareness ☆47 · Updated 2 months ago
- InstructCoder: Instruction Tuning Large Language Models for Code Editing (ACL 2024 SRW, oral) ☆52 · Updated last month
- Code and data for Tau-Bench ☆204 · Updated this week
- A collection of benchmark logs for different LLMs ☆116 · Updated 3 months ago
- Cognition's results and methodology on SWE-bench ☆118 · Updated 8 months ago
- A multi-programming-language benchmark for LLMs ☆208 · Updated this week
- An Analytical Evaluation Board of Multi-turn LLM Agents ☆250 · Updated 6 months ago
- A trace-analysis tool for AI agents ☆124 · Updated last month
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023) ☆122 · Updated 3 months ago