open-compass / DevEval
A Comprehensive Benchmark for Software Development.
☆93 Updated 8 months ago
Alternatives and similar repositories for DevEval:
Users who are interested in DevEval are comparing it to the libraries listed below:
- The repository for the paper "DebugBench: Evaluating Debugging Capability of Large Language Models". ☆62 Updated 7 months ago
- [ACL 2024] AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning ☆207 Updated last month
- A new tool learning benchmark aiming at well-balanced stability and reality, based on ToolBench. ☆129 Updated 5 months ago
- CodeRAG-Bench: Can Retrieval Augment Code Generation? ☆109 Updated 3 months ago
- Official implementation of paper "On the Diagram of Thought" (https://arxiv.org/abs/2409.10038) ☆172 Updated 4 months ago
- Official Repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale" ☆217 Updated this week
- [ICLR 2025] Benchmarking Agentic Workflow Generation ☆47 Updated this week
- Official Implementation of Dynamic LLM-Agent Network: An LLM-agent Collaboration Framework with Agent Team Optimization ☆128 Updated 9 months ago
- Reformatted Alignment ☆114 Updated 4 months ago
- NaturalCodeBench (Findings of ACL 2024) ☆62 Updated 4 months ago
- ☆57 Updated 2 months ago
- ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings - NeurIPS 2023 (oral) ☆256 Updated 10 months ago
- ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models ☆175 Updated 4 months ago
- ☆210 Updated 6 months ago
- InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks (ICML 2024) ☆107 Updated 2 months ago
- ☆130 Updated 2 months ago
- ☆98 Updated 2 months ago
- [ICML 2023] Data and code release for the paper "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation". ☆233 Updated 3 months ago
- The official repository of the Omni-MATH benchmark. ☆71 Updated last month
- ✨ RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems - ICLR 2024 ☆144 Updated 6 months ago
- Curation of resources for LLM mathematical reasoning, most of which are screened by @tongyx361 to ensure high quality and accompanied wit… ☆114 Updated 7 months ago
- Code implementation of synthetic continued pretraining ☆88 Updated last month
- Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents (ACL 2024 Main Conference) ☆116 Updated 3 months ago
- ☆60 Updated 7 months ago
- [COLING 2025] ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios ☆65 Updated 2 months ago
- Generative Judge for Evaluating Alignment ☆228 Updated last year
- ☆45 Updated 4 months ago
- [NeurIPS 2024] Agent Planning with World Knowledge Model ☆110 Updated 2 months ago
- ☆258 Updated 6 months ago