SWE-bench / swe-bench.github.io

Landing page + leaderboard for SWE-Bench benchmark

☆6

Alternatives and similar repositories for swe-bench.github.io

Users who are interested in swe-bench.github.io are comparing it to the repositories listed below.

aorwall / moatless-testbeds
Moatless Testbeds allows you to create isolated testbed environments in a Kubernetes cluster where you can apply code changes through git…
☆12 Updated last month
allenai / olmo-cookbook
OLMost every training recipe you need to perform data interventions with the OLMo family of models.
☆30 Updated this week
ZeroSumEval / ZeroSumEval
A framework for pitting LLMs against each other in an evolving library of games ⚔
☆32 Updated last month
clinicalml / realhumaneval
☆19 Updated 7 months ago
HazyResearch / aioli
Aioli: A unified optimization framework for language model data mixing
☆27 Updated 4 months ago
OSU-NLP-Group / SeeActChromeExtension
☆16 Updated 5 months ago
VITA-Group / ChainCoder
[ICML 2023] "Outline, Then Details: Syntactically Guided Coarse-To-Fine Code Generation", Wenqing Zheng, S P Sharan, Ajay Kumar Jaiswal, …
☆40 Updated last year
felixbinder / introspection_self_prediction
Code for experiments on self-prediction as a way to measure introspection in LLMs
☆13 Updated 5 months ago
dxhou / CoAct
☆26 Updated 10 months ago
Aider-AI / aider-swe-bench
Harness used to benchmark aider against SWE Bench benchmarks
☆72 Updated 11 months ago
amazon-science / llm-code-preference
Training and Benchmarking LLMs for Code Preference.
☆33 Updated 6 months ago
SalesforceAIResearch / CodeTree
Code for the paper: CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models
☆21 Updated 2 months ago
mandyyyyii / east
☆17 Updated last month
plastic-labs / dspy-opentom
Exploration using DSPy to optimize modules to maximize performance on the OpenToM dataset
☆16 Updated last year
logic-star-ai / swt-bench
[NeurIPS 2024] Evaluation harness for SWT-Bench, a benchmark for evaluating LLM repository-level test-generation
☆49 Updated last month
Zyphra / zcookbook
Training hybrid models for dummies.
☆21 Updated 4 months ago
yueqis / API-Based-Agent
☆50 Updated last week
JoshuaPurtell / SmallBench
Small, simple agent task environments for training and evaluation
☆18 Updated 7 months ago
allenai / codenav
CodeNav is an LLM agent that navigates and leverages previously unseen code repositories to solve user queries.
☆48 Updated 9 months ago
aorwall / moatless-tree-search
☆83 Updated last month
huggingface / wikirace-llms
☆21 Updated 3 weeks ago
nuprl / CanItEdit
Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions
☆42 Updated 10 months ago
ethz-spylab / superhuman-ai-consistency
☆29 Updated last year
sorendunn / Agentless-Lite
Agentless Lite: RAG-based SWE-Bench software engineering scaffold
☆29 Updated last month
logikon-ai / cot-eval
A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.
☆17 Updated 4 months ago
All-Hands-AI / trajectory-visualizer
☆25 Updated this week
luohongyin / LangCode
LangCode - Improving alignment and reasoning of large language models (LLMs) with natural language embedded program (NLEP).
☆42 Updated last year
kiddyboots216 / lottery-ticket-adaptation
Lottery Ticket Adaptation
☆39 Updated 6 months ago
dinobby / MAGDi
The code implementation of MAGDi: Structured Distillation of Multi-Agent Interaction Graphs Improves Reasoning in Smaller Language Models…
☆34 Updated last year
TheDuckAI / arb
Advanced Reasoning Benchmark Dataset for LLMs
☆46 Updated last year