SWE-bench / sb-cli
Run SWE-bench evaluations remotely
☆41 Updated 2 months ago
Alternatives and similar repositories for sb-cli
Users who are interested in sb-cli are comparing it to the repositories listed below:
- RepoQA: Evaluating Long-Context Code Understanding ☆121 Updated last year
- Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task. ☆218 Updated 2 weeks ago
- Harness used to benchmark aider against SWE Bench benchmarks ☆77 Updated last year
- [ACL'25 Findings] SWE-Dev is an SWE agent with a scalable test case construction pipeline. ☆56 Updated 3 months ago
- ☆125 Updated 5 months ago
- Implementation of the paper: "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?" ☆63 Updated 10 months ago
- ☆59 Updated 9 months ago
- ☆121 Updated 4 months ago
- Code for the paper: CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models ☆29 Updated 7 months ago
- ☆58 Updated 4 months ago
- Systematic evaluation framework that automatically rates overthinking behavior in large language models. ☆93 Updated 5 months ago
- ☆30 Updated last year
- Training and Benchmarking LLMs for Code Preference. ☆36 Updated 11 months ago
- Accompanying material for the sleep-time compute paper ☆117 Updated 6 months ago
- SWE Arena ☆35 Updated 3 months ago
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆188 Updated 7 months ago
- ☆28 Updated 9 months ago
- InstructCoder: Instruction Tuning Large Language Models for Code Editing | Oral ACL-2024 SRW ☆63 Updated last year
- ☆54 Updated last year
- Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory ☆178 Updated 5 months ago
- Moatless Testbeds allows you to create isolated testbed environments in a Kubernetes cluster where you can apply code changes through git… ☆14 Updated 6 months ago
- A framework for pitting LLMs against each other in an evolving library of games ⚔ ☆34 Updated 6 months ago
- ☆101 Updated last year
- [EMNLP'23] Execution-Based Evaluation for Open Domain Code Generation ☆49 Updated last year
- [ACL 2025] Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems ☆108 Updated 4 months ago
- Small, simple agent task environments for training and evaluation ☆18 Updated last year
- ☆80 Updated 2 weeks ago
- Official code for the paper "CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules" ☆47 Updated 9 months ago
- Computer Agent Arena: Test & compare AI agents in real desktop apps & web environments. Code/data coming soon! ☆50 Updated 6 months ago
- Data and evaluation scripts for "CodePlan: Repository-level Coding using LLMs and Planning", FSE 2024 ☆75 Updated last year