MCEVAL / McEval
☆43 · Updated 7 months ago
Alternatives and similar repositories for McEval
Users interested in McEval are comparing it to the repositories listed below.
- NaturalCodeBench (Findings of ACL 2024) ☆68 · Updated 9 months ago
- ☆32 · Updated last month
- ☆51 · Updated last year
- xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval ☆86 · Updated 10 months ago
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆151 · Updated 9 months ago
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023) ☆151 · Updated last year
- Heuristic filtering framework for RefineCode ☆68 · Updated 4 months ago
- Reproducing R1 for Code with Reliable Rewards ☆243 · Updated 2 months ago
- The repository for the paper "DebugBench: Evaluating Debugging Capability of Large Language Models" ☆79 · Updated last year
- Generate WizardCoder-style instruction data from the CodeAlpaca dataset ☆21 · Updated 2 years ago
- Collection of papers on scalable automated alignment ☆93 · Updated 9 months ago
- InstructCoder: Instruction Tuning Large Language Models for Code Editing (ACL 2024 SRW, Oral) ☆62 · Updated 10 months ago
- CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models ☆41 · Updated last year
- ☆11 · Updated 2 years ago
- StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback ☆67 · Updated 11 months ago
- [ICML 2023] Data and code release for the paper "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation" ☆251 · Updated 9 months ago
- Counting-Stars (★) ☆83 · Updated 2 months ago
- [ACL 2024] LooGLE: Long-Context Evaluation for Long-Context Language Models ☆184 · Updated 9 months ago
- A Comprehensive Benchmark for Software Development ☆111 · Updated last year
- ☆71 · Updated 2 weeks ago
- Async pipelined version of Verl ☆110 · Updated 3 months ago
- [ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues ☆102 · Updated last year
- Code and data for "Scaling Relationship on Learning Mathematical Reasoning with Large Language Models" ☆268 · Updated 10 months ago
- Repository of the LV-Eval benchmark ☆67 · Updated 11 months ago
- Towards Systematic Measurement for Long Text Quality ☆37 · Updated 10 months ago
- Code for the EMNLP 2023 paper "Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks" ☆24 · Updated last year
- A distributed, extensible, secure solution for evaluating machine-generated code with unit tests in multiple programming languages ☆56 · Updated 9 months ago
- ☆144 · Updated last year
- LeetCode Training and Evaluation Dataset ☆28 · Updated 3 months ago
- ☆298 · Updated last year