FSoft-AI4Code / CodeMMLU
[ICLR 2025] CodeMMLU Evaluator: a framework for evaluating language models on the CodeMMLU multiple-choice (MCQ) benchmark.
★23 · Updated last month
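The description above frames CodeMMLU as a multiple-choice (MCQ) code-understanding benchmark with a companion evaluator. As a rough, non-authoritative sketch of what such an evaluation loop involves, the snippet below formats each question with lettered choices, queries a model, and scores the first letter of the reply. The Hugging Face dataset ID, the `question`/`choices`/`answer` field names, and the `ask_model` helper are illustrative assumptions, not the repository's actual API.

```python
# Minimal MCQ-evaluation sketch in the CodeMMLU spirit (not the official evaluator).
# Assumptions: the benchmark is hosted on the Hugging Face Hub under an ID like
# "Fsoft-AIC/CodeMMLU" (hypothetical here), each record exposes "question",
# "choices", and "answer", and ask_model() wraps whatever LLM client you use.
from datasets import load_dataset

LETTERS = "ABCDEFGH"

def format_prompt(question: str, choices: list[str]) -> str:
    # Lay out the question followed by lettered options, then ask for one letter.
    lines = [question]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def ask_model(prompt: str) -> str:
    # Placeholder: call your LLM of choice and return its raw text reply.
    raise NotImplementedError

def evaluate(split: str = "test", limit: int = 100) -> float:
    ds = load_dataset("Fsoft-AIC/CodeMMLU", split=split)  # hypothetical dataset ID
    n = min(limit, len(ds))
    correct = 0
    for row in ds.select(range(n)):
        reply = ask_model(format_prompt(row["question"], row["choices"]))
        predicted = reply.strip()[:1].upper()  # first letter of the model's reply
        if predicted == str(row["answer"]).strip().upper():
            correct += 1
    return correct / n
```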
Alternatives and similar repositories for CodeMMLU
Users interested in CodeMMLU are comparing it to the repositories listed below.
- [EMNLP 2023] The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation (★95, updated 9 months ago)
- [FORGE 2025] Predicting Program Behavior with Dynamic Dependencies Learning (★24, updated 9 months ago)
- InstructCoder: Instruction Tuning Large Language Models for Code Editing | Oral, ACL 2024 (★62, updated 8 months ago)
- SWE-bench Goes Live! (★24, updated last week)
- A distributed, extensible, secure solution for evaluating machine-generated code with unit tests in multiple programming languages (★54, updated 7 months ago)
- ★81, updated 7 months ago
- Training and Benchmarking LLMs for Code Preference (★33, updated 6 months ago)
- Astraios: Parameter-Efficient Instruction Tuning Code Language Models (★58, updated last year)
- Official code for the paper "CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules" (★45, updated 4 months ago)
- ★26, updated 4 months ago
- Code for the paper "LEVER: Learning to Verify Language-to-Code Generation with Execution" (ICML'23; ★87, updated last year)
- [NeurIPS 2024] Goldfish Loss: Mitigating Memorization in Generative LLMs (★87, updated 6 months ago)
- XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts (★31, updated 11 months ago)
- [LREC-COLING'24] HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization (★39, updated 2 months ago)
- [NAACL 2025] Benchmark for Repository-Level Code Generation, focus on Executability, Correctness from Test Cases and Usage of Contexts fr… (★29, updated 3 months ago)
- xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval (★82, updated 8 months ago)
- ★107, updated 2 weeks ago
- [EMNLP'23] Execution-Based Evaluation for Open Domain Code Generation (★48, updated last year)
- [ACL 2025 Findings] Official repo for "HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task…" (★27, updated last month)
- ★39, updated 11 months ago
- Replicating O1 inference-time scaling laws (★87, updated 6 months ago)
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods (★92, updated this week)
- ★13, updated 2 months ago
- Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents (★73, updated last month)
- Systematic evaluation framework that automatically rates overthinking behavior in large language models (★89, updated 3 weeks ago)
- ★110, updated 10 months ago
- [ACL 2024] Novel reranking method to select the best solutions for code generation (★16, updated 11 months ago)
- ★46, updated last year
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers" (★107, updated last year)
- ★42, updated 2 months ago