codefuse-ai / codefuse-evaluationLinks

Industrial-level evaluation benchmarks for Coding LLMs in the full life-cycle of AI native software developing.企业级代码大模型评测体系,持续开放中

☆101

Alternatives and similar repositories for codefuse-evaluation

Users that are interested in codefuse-evaluation are comparing it to the libraries listed below

Sorting:

codefuse-ai / MFTCoder
High Accuracy and efficiency multi-task fine-tuning framework for Code LLMs. This work has been accepted by KDD 2024.
☆702Updated 9 months ago
CoderEval / CoderEval
A collection of practical code generation tasks and tests in open source projects. Complementary to HumanEval by OpenAI.
☆151Updated 9 months ago
yiyepiaoling0715 / codellm-data-preprocess-pipeline
代码大模型预训练&微调&DPO 数据处理业界处理pipeline sota
☆44Updated last year
bytedance / FullStackBench
Official repository for our paper "FullStack Bench: Evaluating LLMs as Full Stack Coders"
☆106Updated 5 months ago
LingmaTongyi / Lingma-SWE-GPT
Inference code of Lingma SWE-GPT
☆244Updated 10 months ago
seketeam / EvoCodeBench
An Evolving Code Generation Benchmark Aligned with Real-world Code Repositories
☆63Updated last year
thunlp / DebugBench
The repository for paper "DebugBench: "Evaluating Debugging Capability of Large Language Models".
☆82Updated last year
codefuse-ai / RepoFuse
☆63Updated 8 months ago
OpenBMB / UltraEval
[ACL 2024 Demo] Official GitHub repo for UltraEval: An open source framework for evaluating foundation models.
☆250Updated 11 months ago
amazon-science / cceval
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023)
☆157Updated last month
CodeArtsSnap / CoderEval
A collection of practical code generation tasks and tests from open source projects. Complementary to HumanEval by OpenAI.
☆24Updated 2 years ago
QiushiSun / Awesome-Code-Intelligence
Neural Code Intelligence Survey 2024; Reading lists and resources
☆274Updated 2 months ago
RepoUnderstander / RepoUnderstander
Official implementation of paper How to Understand Whole Repository? New SOTA on SWE-bench Lite (21.3%)
☆94Updated 6 months ago
aixcoder-plugin / nl2code-dataset
Aix-bench, the Java benchmark for code synthesis problem.
☆51Updated 3 years ago
THUDM / NaturalCodeBench
NaturalCodeBench (Findings of ACL 2024)
☆67Updated 11 months ago
luban-agi / Awesome-Tool-Learning
A curated list of papers and applications on tool learning.
☆123Updated last year
CLUEbenchmark / SuperCLUE-Agent
SuperCLUE-Agent: 基于中文原生任务的Agent智能体核心能力测评基准
☆92Updated last year
multi-swe-bench / multi-swe-bench
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
☆258Updated last week
open-compass / DevEval
A Comprehensive Benchmark for Software Development.
☆114Updated last year
MCEVAL / McEval
☆44Updated 10 months ago
allanj / repo-level-codegen-papers
Repo-Level Code generation papers
☆212Updated 2 months ago
onejune2018 / Awesome-LLM-Eval
Awesome-LLM-Eval: a curated list of tools, datasets/benchmark, demos, leaderboard, papers, docs and models, mainly for Evaluation on LLMs…
☆571Updated last month
microsoft / FEA-Bench
[ACL25] FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
☆28Updated this week
NL2Code / NL2Code.github.io
Large Language Models Meet NL2Code: A Survey
☆35Updated 10 months ago
codefuse-ai / CodeFuse-Embeddings
☆29Updated this week
thu-coai / BPO
☆327Updated last year
X-PLUG / Multi-LLM-Agent
☆231Updated last year
code-rag-bench / code-rag-bench
CodeRAG-Bench: Can Retrieval Augment Code Generation?
☆156Updated 10 months ago
thu-coai / CritiqueLLM
☆147Updated last year
THUDM / AlignBench
大模型多维度中文对齐评测基准 (ACL 2024)
☆413Updated last year