domaineval / DomainEvalLinks

DOMAINEVAL is an auto-constructed benchmark for multi-domain code generation that consists of 2k+ subjects (i.e., description, reference code and tests) covering six domains (i.e., Computation, Basic, Network, Cryptography, Visualization, System).

☆13

Alternatives and similar repositories for DomainEval

Users that are interested in DomainEval are comparing it to the libraries listed below

Sorting:

cxcscmu / MATES
Official repository for MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [NeurIPS 2024]
☆75Updated 11 months ago
YihongDong / CDD-TED4LLMs
☆15Updated 11 months ago
ntunlp / LLMSanitize
An open-source library for contamination detection in NLP datasets and Large Language Models (LLMs).
☆56Updated last year
liyucheng09 / Contamination_Detector
Lightweight tool to identify Data Contamination in LLMs evaluation
☆52Updated last year
lyy1994 / awesome-data-contamination
The Paper List on Data Contamination for Large Language Models Evaluation.
☆101Updated 2 months ago
GAIR-NLP / ReasonEval
[AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy
☆73Updated 3 weeks ago
halfrot / ALaRM
[ACL 2024] Code for the paper "ALaRM: Align Language Models via Hierarchical Rewards Modeling"
☆25Updated last year
KbsdJames / omni-math-rule
The rule-based evaluation subset and code implementation of Omni-MATH
☆24Updated 10 months ago
shizhediao / R-Tuning
[NAACL 2024 Outstanding Paper] Source code for the NAACL 2024 paper entitled "R-Tuning: Instructing Large Language Models to Say 'I Don't…
☆122Updated last year
ChengpengLi1003 / DotaMath
☆30Updated 10 months ago
KodCode-AI / code-r1
Reproducing R1 for Code with Reliable Rewards
☆11Updated 6 months ago
GAIR-NLP / MetaCritique
Evaluate the Quality of Critique
☆36Updated last year
SihengLi99 / LLM-Honesty-Survey
[2025-TMLR] A Survey on the Honesty of Large Language Models
☆61Updated 10 months ago
CriticBench / CriticBench
[ACL 2024 Findings] CriticBench: Benchmarking LLMs for Critique-Correct Reasoning
☆27Updated last year
pillowsofwind / Course-Correction
[EMNLP 2024] The official GitHub repo for the paper "Course-Correction: Safety Alignment Using Synthetic Preferences"
☆19Updated last year
ZHZisZZ / weak-to-strong-search
[NeurIPS'24] Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models
☆62Updated 10 months ago
SparkJiao / dpo-trajectory-reasoning
[EMNLP 2024] Source code for the paper "Learning Planning-based Reasoning with Trajectory Collection and Process Rewards Synthesizing".
☆82Updated 9 months ago
jszheng21 / RACE
RACE is a multi-dimensional benchmark for code generation that focuses on Readability, mAintainability, Correctness, and Efficiency.
☆10Updated last year
qtli / GSM-Plus
GSM-Plus: Data, Code, and Evaluation for Enhancing Robust Mathematical Reasoning in Math Word Problems.
☆63Updated last year
hkust-nlp / dart-math
[NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*
☆115Updated 10 months ago
zthang / Focus
☆20Updated last year
qiuzh20 / EMoE
Official PyTorch Implementation of EMoE: Unlocking Emergent Modularity in Large Language Models [main conference @ NAACL2024]
☆35Updated last year
TIGER-AI-Lab / AceCoder
The official repo for "AceCoder: Acing Coder RL via Automated Test-Case Synthesis" [ACL25]
☆92Updated 6 months ago
SeekingDream / Static-to-Dynamic-LLMEval
The official GitHub repository of the paper "Recent advances in large langauge model benchmarks against data contamination: From static t…
☆45Updated last month
HillZhang1999 / ICD
Code & Data for our Paper "Alleviating Hallucinations of Large Language Models through Induced Hallucinations"
☆69Updated last year
ernie-research / Tool-Augmented-Reward-Model
[ICLR'24 spotlight] Tool-Augmented Reward Modeling
☆51Updated 4 months ago
princeton-nlp / LLMBar
[ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following
☆132Updated last year
lifan-yuan / OOD_NLP
[NeurIPS 2023 D&B Track] Code and data for paper "Revisiting Out-of-distribution Robustness in NLP: Benchmarks, Analysis, and LLMs Evalua…
☆35Updated 2 years ago
KwanWaiChung / MT-Eval
Code and data for "MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models"
☆45Updated last year
thu-coai / BARREL
☆16Updated 5 months ago