open-compass / DevEval
A Comprehensive Benchmark for Software Development.
☆115 · Updated last year
Alternatives and similar repositories for DevEval
Users interested in DevEval are comparing it to the libraries listed below.
- The repository for the paper "DebugBench: Evaluating Debugging Capability of Large Language Models" ☆83 · Updated last year
- Reproducing R1 for Code with Reliable Rewards ☆259 · Updated 5 months ago
- [COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents ☆176 · Updated 3 months ago
- [NeurIPS 2025 D&B] 🚀 SWE-bench Goes Live! ☆128 · Updated 3 weeks ago
- 🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and Interactive Coding Agent, ACL'24 Best Resource… ☆290 · Updated this week
- ☆239 · Updated last year
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆154 · Updated last year
- CodeRAG-Bench: Can Retrieval Augment Code Generation? ☆156 · Updated 11 months ago
- InstructCoder: Instruction Tuning Large Language Models for Code Editing | ACL 2024 SRW Oral ☆62 · Updated last year
- SWE-Swiss: A Multi-Task Fine-Tuning and RL Recipe for High-Performance Issue Resolution ☆89 · Updated last month
- ✨ RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems - ICLR 2024 ☆174 · Updated last year
- NaturalCodeBench (Findings of ACL 2024) ☆67 · Updated last year
- ☆139 · Updated last week
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023) ☆159 · Updated 2 months ago
- ☆53 · Updated last year
- [ICLR 2025] Benchmarking Agentic Workflow Generation ☆129 · Updated 8 months ago
- A Comprehensive Survey on Long Context Language Modeling ☆193 · Updated 3 months ago
- [NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents ☆126 · Updated 6 months ago
- [ICML 2023] Data and code release for the paper "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation" ☆256 · Updated 11 months ago
- A new tool learning benchmark aiming at well-balanced stability and reality, based on ToolBench ☆191 · Updated 6 months ago
- [TMLR] Cumulative Reasoning With Large Language Models (https://arxiv.org/abs/2308.04371) ☆302 · Updated 2 months ago
- ☆30 · Updated 4 months ago
- InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks (ICML 2024) ☆153 · Updated 4 months ago
- [ICML 2025] Teaching Language Models to Critique via Reinforcement Learning ☆114 · Updated 5 months ago
- An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral] ☆355 · Updated last year
- A lightweight reproduction of DeepSeek-R1-Zero with in-depth analysis of self-reflection behavior ☆247 · Updated 6 months ago
- Official repository for our paper "FullStack Bench: Evaluating LLMs as Full Stack Coders" ☆106 · Updated 5 months ago
- Benchmark and research code for the paper "SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks" ☆246 · Updated 5 months ago
- ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings - NeurIPS 2023 (Oral) ☆262 · Updated last year
- [NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? ☆132 · Updated last year