open-compass / DevEval
A Comprehensive Benchmark for Software Development.
☆115 · Updated last year
Alternatives and similar repositories for DevEval
Users interested in DevEval are comparing it to the libraries listed below.
- The repository for the paper "DebugBench: Evaluating Debugging Capability of Large Language Models" ☆83 · Updated last year
- Reproducing R1 for Code with Reliable Rewards ☆259 · Updated 5 months ago
- [COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents ☆176 · Updated 3 months ago
- [NeurIPS 2025 D&B] 🚀 SWE-bench Goes Live! ☆128 · Updated 3 weeks ago
- 🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and Interactive Coding Agent, ACL'24 Best Resource… ☆290 · Updated this week
- ☆239 · Updated last year
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆154 · Updated last year
- CodeRAG-Bench: Can Retrieval Augment Code Generation? ☆156 · Updated 11 months ago
- InstructCoder: Instruction Tuning Large Language Models for Code Editing | ACL 2024 SRW Oral ☆62 · Updated last year
- SWE-Swiss: A Multi-Task Fine-Tuning and RL Recipe for High-Performance Issue Resolution ☆89 · Updated last month
- ✨ RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems - ICLR 2024 ☆174 · Updated last year
- NaturalCodeBench (Findings of ACL 2024) ☆67 · Updated last year
- ☆139 · Updated last week
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023) ☆159 · Updated 2 months ago
- ☆53 · Updated last year
- [ICLR 2025] Benchmarking Agentic Workflow Generation ☆129 · Updated 8 months ago
- A Comprehensive Survey on Long Context Language Modeling ☆193 · Updated 3 months ago
- [NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents ☆126 · Updated 6 months ago
- [ICML 2023] Data and code release for the paper "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation" ☆256 · Updated 11 months ago
- A new tool learning benchmark aiming at well-balanced stability and reality, based on ToolBench ☆191 · Updated 6 months ago
- [TMLR] Cumulative Reasoning With Large Language Models (https://arxiv.org/abs/2308.04371) ☆302 · Updated 2 months ago
- ☆30 · Updated 4 months ago
- InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks (ICML 2024) ☆153 · Updated 4 months ago
- [ICML 2025] Teaching Language Models to Critique via Reinforcement Learning ☆114 · Updated 5 months ago
- An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral] ☆355 · Updated last year
- A lightweight reproduction of DeepSeek-R1-Zero with in-depth analysis of self-reflection behavior ☆247 · Updated 6 months ago
- Official repository for our paper "FullStack Bench: Evaluating LLMs as Full Stack Coders" ☆106 · Updated 5 months ago
- Benchmark and research code for the paper "SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks" ☆246 · Updated 5 months ago
- ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings - NeurIPS 2023 (Oral) ☆262 · Updated last year
- [NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? ☆132 · Updated last year