MCEVAL / McEval
☆38 · Updated 4 months ago
Alternatives and similar repositories for McEval:
Users interested in McEval are comparing it to the repositories listed below
- NaturalCodeBench (Findings of ACL 2024) ☆62 · Updated 6 months ago
- Heuristic filtering framework for RefineCode ☆59 · Updated last month
- xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval ☆79 · Updated 6 months ago
- Reproducing R1 for Code with Reliable Rewards ☆163 · Updated this week
- Collection of papers for scalable automated alignment. ☆87 · Updated 5 months ago
- Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving ☆28 · Updated last week
- On Memorization of Large Language Models in Logical Reasoning ☆62 · Updated 2 weeks ago
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023) ☆136 · Updated 8 months ago
- Repository of LV-Eval Benchmark ☆61 · Updated 7 months ago
- The repository for paper "DebugBench: "Evaluating Debugging Capability of Large Language Models".☆72Updated 9 months ago
- ☆98 · Updated 6 months ago
- Benchmarking Complex Instruction-Following with Multiple Constraints Composition (NeurIPS 2024 Datasets and Benchmarks Track) ☆76 · Updated last month
- ☆44 · Updated 10 months ago
- Official GitHub repo for AutoDetect, an automated weakness detection framework for LLMs. ☆42 · Updated 9 months ago
- An Evolving Code Generation Benchmark Aligned with Real-world Code Repositories ☆55 · Updated 7 months ago
- [ICLR 2025] 🧬 RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight) ☆122 · Updated last month
- ☆148 · Updated 3 months ago
- [COLING 2025] ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios ☆65 · Updated 4 months ago
- [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following ☆124 · Updated 9 months ago
- [ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset ☆97 · Updated 9 months ago
- [ACL 2024] FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models ☆97 · Updated 4 months ago
- Official Repo for ICLR 2024 paper MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang*, Ziha… ☆122 · Updated 10 months ago
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆135 · Updated 6 months ago
- StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback ☆64 · Updated 7 months ago
- Code for Paper: Teaching Language Models to Critique via Reinforcement Learning ☆88 · Updated last month
- CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models ☆41 · Updated last year
- We introduce ScaleQuest, a scalable, novel and cost-effective data synthesis method to unleash the reasoning capability of LLMs. ☆61 · Updated 5 months ago
- [ICML 2024] Selecting High-Quality Data for Training Language Models ☆164 · Updated 9 months ago
- ☆265 · Updated 8 months ago
- Data processing pipeline for code LLM pre-training, fine-tuning, and DPO (industry SOTA) ☆38 · Updated 8 months ago