open-compass / DevEval
A Comprehensive Benchmark for Software Development.
☆96 Updated 9 months ago
Alternatives and similar repositories for DevEval:
Users who are interested in DevEval are comparing it to the repositories listed below.
- The repository for the paper "DebugBench: Evaluating Debugging Capability of Large Language Models". ☆66 Updated 8 months ago
- [ACL 2024] AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning ☆213 Updated 2 months ago
- NaturalCodeBench (Findings of ACL 2024) ☆62 Updated 4 months ago
- A new tool learning benchmark aiming at well-balanced stability and reality, based on ToolBench. ☆132 Updated last week
- Official Implementation of Dynamic LLM-Agent Network: An LLM-agent Collaboration Framework with Agent Team Optimization ☆132 Updated 9 months ago
- [ICLR 2025] Benchmarking Agentic Workflow Generation ☆58 Updated 3 weeks ago
- InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks (ICML 2024) ☆111 Updated 2 months ago
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆131 Updated 5 months ago
- [NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? ☆118 Updated 6 months ago
- Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents (ACL 2024 Main Conference) ☆127 Updated 4 months ago
- xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval ☆78 Updated 5 months ago
- Official Repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale" ☆226 Updated 3 weeks ago
- CodeRAG-Bench: Can Retrieval Augment Code Generation? ☆117 Updated 3 months ago
- Official implementation of paper "On the Diagram of Thought" (https://arxiv.org/abs/2409.10038) ☆174 Updated 5 months ago
- ☆103 Updated last month
- ☆101 Updated 3 months ago
- Implementation for the research paper "Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision". ☆51 Updated 3 months ago
- Code and data for OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis ☆111 Updated last week
- ☆214 Updated 6 months ago
- Reformatted Alignment ☆114 Updated 5 months ago
- ✨ RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems - ICLR 2024 ☆146 Updated 6 months ago
- ☆123 Updated 2 months ago
- On Memorization of Large Language Models in Logical Reasoning ☆53 Updated 4 months ago
- R-Judge: Benchmarking Safety Risk Awareness for LLM Agents (EMNLP Findings 2024) ☆66 Updated last month
- Towards Large Multimodal Models as Visual Foundation Agents ☆192 Updated last month