swe-bench / experiments
Open-sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task.
☆96 Updated this week
Related projects
Alternatives and complementary repositories for experiments
- ☆80 Updated 3 months ago
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" ☆200 Updated last month
- ☆253 Updated last month
- Enhancing AI Software Engineering with Repository-level Code Graph ☆90 Updated 2 months ago
- [NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898 ☆194 Updated 6 months ago
- ☆152 Updated 2 months ago
- RepoQA: Evaluating Long-Context Code Understanding ☆99 Updated last week
- Harness used to benchmark aider against SWE Bench benchmarks ☆52 Updated 4 months ago
- Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions ☆40 Updated 3 months ago
- r2e: turn any github repository into a programming agent environment ☆87 Updated last week
- Just a bunch of benchmark logs for different LLMs ☆113 Updated 3 months ago
- Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents ☆109 Updated 4 months ago
- ☆38 Updated 3 months ago
- A multi-programming language benchmark for LLMs ☆206 Updated 2 weeks ago
- Functional Benchmarks and the Reasoning Gap ☆78 Updated last month
- ☆144 Updated 3 months ago
- ✨ RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems - ICLR 2024 ☆133 Updated 2 months ago
- 🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap… ☆106 Updated 2 weeks ago
- EvoEval: Evolving Coding Benchmarks via LLM ☆60 Updated 7 months ago
- Code for the paper 🌳 Tree Search for Language Model Agents ☆138 Updated 3 months ago
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆111 Updated 3 weeks ago
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆119 Updated 2 weeks ago
- AWM: Agent Workflow Memory ☆203 Updated last month
- ☆72 Updated last year
- [NeurIPS'24] SelfCodeAlign: Self-Alignment for Code Generation ☆259 Updated last week
- A trace analysis tool for AI agents. ☆118 Updated 3 weeks ago
- Sphynx Hallucination Induction ☆47 Updated 3 months ago
- Official code for the paper "ADaPT: As-Needed Decomposition and Planning with Language Models" ☆71 Updated 10 months ago
- BigCodeBench: Benchmarking Code Generation Towards AGI ☆223 Updated this week
- Can Language Models Solve Olympiad Programming? ☆100 Updated 3 months ago