open-compass / DevEval
A Comprehensive Benchmark for Software Development.
☆85 · Updated 6 months ago
Alternatives and similar repositories for DevEval:
Users that are interested in DevEval are comparing it to the libraries listed below
- The repository for the paper "DebugBench: Evaluating Debugging Capability of Large Language Models". ☆59 · Updated 5 months ago
- [ACL 2024] AUTOACT: Automatic Agent Learning from Scratch for QA via Self-Planning ☆186 · Updated 2 months ago
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨ ☆117 · Updated 7 months ago
- A new tool learning benchmark aiming at well-balanced stability and reality, based on ToolBench. ☆121 · Updated 3 months ago
- Curation of resources for LLM mathematical reasoning, most of which are screened by @tongyx361 to ensure high quality and accompanied wit… ☆92 · Updated 5 months ago
- CodeRAG-Bench: Can Retrieval Augment Code Generation? ☆90 · Updated last month
- Official Repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale" ☆201 · Updated 2 months ago
- [COLING 2025] ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios ☆64 · Updated 2 weeks ago
- Official github repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024] ☆127 · Updated 2 months ago
- ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models ☆170 · Updated 2 months ago
- The official repository of the Omni-MATH benchmark. ☆59 · Updated last month
- NaturalCodeBench (Findings of ACL 2024) ☆59 · Updated 2 months ago
- Code and Data for "Long-context LLMs Struggle with Long In-context Learning" ☆92 · Updated 5 months ago
- [NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? ☆112 · Updated 3 months ago
- ☆149 · Updated 4 months ago
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆116 · Updated 2 months ago
- Code for the paper "SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning" ☆46 · Updated last year
- A series of technical reports on Slow Thinking with LLMs ☆23 · Updated this week
- [ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset ☆86 · Updated 5 months ago
- The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks" ☆51 · Updated 7 months ago
- ☆200 · Updated 4 months ago
- [NeurIPS 2024] The official implementation of the paper "Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs" ☆73 · Updated last month
- This is the repo for our paper "Mr-Ben: A Comprehensive Meta-Reasoning Benchmark for Large Language Models" ☆43 · Updated last month
- Codes and Data for Scaling Relationship on Learning Mathematical Reasoning with Large Language Models ☆225 · Updated 3 months ago
- ☆37 · Updated 3 weeks ago
- Enhancing AI Software Engineering with Repository-level Code Graph ☆101 · Updated 3 months ago
- [ACL 2024] The official codebase for the paper "Self-Distillation Bridges Distribution Gap in Language Model Fine-tuning" ☆105 · Updated last month
- ☆34 · Updated 2 months ago
- This is the official repository of the paper "OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI" ☆89 · Updated this week
- MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models ☆21 · Updated last month