MCEVAL / McEvalLinks

☆44

Alternatives and similar repositories for McEval

Users that are interested in McEval are comparing it to the libraries listed below

Sorting:

THUDM / NaturalCodeBench
NaturalCodeBench (Findings of ACL 2024)
☆67Updated last year
ntunlp / xCodeEval
xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval
☆87Updated last year
amazon-science / cceval
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023)
☆159Updated 2 months ago
OpenCoder-llm / opc_data_filtering
Heuristic filtering framework for RefineCode
☆80Updated 7 months ago
CodeEditorBench / CodeEditorBench
☆53Updated last year
thunlp / DebugBench
The repository for paper "DebugBench: "Evaluating Debugging Capability of Large Language Models".
☆83Updated last year
SparksofAGI / MHPP
☆33Updated last month
newfacade / LeetCodeDataset
LeetCode Training and Evaluation Dataset
☆39Updated 6 months ago
open-compass / DevEval
A Comprehensive Benchmark for Software Development.
☆116Updated last year
facebookresearch / cruxeval
CRUXEval: Code Reasoning, Understanding, and Execution Evaluation
☆154Updated last year
icip-cas / awesome-auto-alignment
Collection of papers for scalable automated alignment.
☆94Updated last year
princeton-nlp / LLMBar
[ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following
☆132Updated last year
ganler / code-r1
Reproducing R1 for Code with Reliable Rewards
☆262Updated 5 months ago
seketeam / EvoCodeBench
An Evolving Code Generation Benchmark Aligned with Real-world Code Repositories
☆65Updated last year
qishenghu / InstructCoder
InstructCoder: Instruction Tuning Large Language Models for Code Editing | Oral ACL-2024 srw
☆63Updated last year
YJiangcm / FollowBench
[ACL 2024] FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models
☆117Updated 4 months ago
floatai / HumanEval-XL
[LREC-COLING'24] HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization
☆38Updated 7 months ago
OpenBMB / UltraEval
[ACL 2024 Demo] Official GitHub repo for UltraEval: An open source framework for evaluating foundation models.
☆251Updated last year
amazon-science / Repoformer
Repoformer: Selective Retrieval for Repository-Level Code Completion (ICML 2024)
☆61Updated 4 months ago
PremiLab-Math / MathCheck
[ICLR 2025] Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist
☆33Updated last year
thu-coai / ComplexBench
Benchmarking Complex Instruction-Following with Multiple Constraints Composition (NeurIPS 2024 Datasets and Benchmarks Track)
☆95Updated 8 months ago
yiyepiaoling0715 / codellm-data-preprocess-pipeline
代码大模型预训练&微调&DPO 数据处理业界处理pipeline sota
☆44Updated last year
CoderEval / CoderEval
A collection of practical code generation tasks and tests in open source projects. Complementary to HumanEval by OpenAI.
☆154Updated 10 months ago
ntunlp / ExecEval
A distributed, extensible, secure solution for evaluating machine generated code with unit tests in multiple programming languages.
☆56Updated last year
mtbench101 / mt-bench-101
[ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
☆122Updated last year
swtheing / WizardCoder_Instruct_Generator
Generate the WizardCoder Instruct from the CodeAlpaca
☆21Updated 2 years ago
xlang-ai / DS-1000
[ICML 2023] Data and code release for the paper "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation".
☆256Updated last year
AlphaPav / mem-kk-logic
On Memorization of Large Language Models in Logical Reasoning
☆72Updated 7 months ago
YihongDong / CDD-TED4LLMs
☆15Updated 11 months ago
open-compass / MathBench
[ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
☆107Updated 5 months ago