FSoft-AI4Code / CodeMMLU
[ICLR 2025] CodeMMLU Evaluator: a framework for evaluating language models on the CodeMMLU multiple-choice (MCQ) benchmark.
★23 · Updated last month
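The description above frames CodeMMLU as a multiple-choice (MCQ) code-understanding benchmark with a companion evaluator. As a rough, non-authoritative sketch of what such an evaluation loop involves, the snippet below formats each question with lettered choices, queries a model, and scores the first letter of the reply. The Hugging Face dataset ID, the `question`/`choices`/`answer` field names, and the `ask_model` helper are illustrative assumptions, not the repository's actual API.

```python
# Minimal MCQ-evaluation sketch in the CodeMMLU spirit (not the official evaluator).
# Assumptions: the benchmark is hosted on the Hugging Face Hub under an ID like
# "Fsoft-AIC/CodeMMLU" (hypothetical here), each record exposes "question",
# "choices", and "answer", and ask_model() wraps whatever LLM client you use.
from datasets import load_dataset

LETTERS = "ABCDEFGH"

def format_prompt(question: str, choices: list[str]) -> str:
    # Lay out the question followed by lettered options, then ask for one letter.
    lines = [question]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def ask_model(prompt: str) -> str:
    # Placeholder: call your LLM of choice and return its raw text reply.
    raise NotImplementedError

def evaluate(split: str = "test", limit: int = 100) -> float:
    ds = load_dataset("Fsoft-AIC/CodeMMLU", split=split)  # hypothetical dataset ID
    n = min(limit, len(ds))
    correct = 0
    for row in ds.select(range(n)):
        reply = ask_model(format_prompt(row["question"], row["choices"]))
        predicted = reply.strip()[:1].upper()  # first letter of the model's reply
        if predicted == str(row["answer"]).strip().upper():
            correct += 1
    return correct / n
```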
Alternatives and similar repositories for CodeMMLU
Users interested in CodeMMLU are comparing it to the repositories listed below.
- [EMNLP 2023] The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation (★95, updated 9 months ago)
- [FORGE 2025] Predicting Program Behavior with Dynamic Dependencies Learning (★24, updated 9 months ago)
- InstructCoder: Instruction Tuning Large Language Models for Code Editing | Oral, ACL 2024 (★62, updated 8 months ago)
- SWE-bench Goes Live! (★24, updated last week)
- A distributed, extensible, secure solution for evaluating machine-generated code with unit tests in multiple programming languages (★54, updated 7 months ago)
- ★81, updated 7 months ago
- Training and Benchmarking LLMs for Code Preference (★33, updated 6 months ago)
- Astraios: Parameter-Efficient Instruction Tuning Code Language Models (★58, updated last year)
- Official code for the paper "CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules" (★45, updated 4 months ago)
- ★26, updated 4 months ago
- Code for the paper "LEVER: Learning to Verify Language-to-Code Generation with Execution" (ICML'23; ★87, updated last year)
- [NeurIPS 2024] Goldfish Loss: Mitigating Memorization in Generative LLMs (★87, updated 6 months ago)
- XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts (★31, updated 11 months ago)
- [LREC-COLING'24] HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization (★39, updated 2 months ago)
- [NAACL 2025] Benchmark for Repository-Level Code Generation, focus on Executability, Correctness from Test Cases and Usage of Contexts fr… (★29, updated 3 months ago)
- xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval (★82, updated 8 months ago)
- ★107, updated 2 weeks ago
- [EMNLP'23] Execution-Based Evaluation for Open Domain Code Generation (★48, updated last year)
- [ACL 2025 Findings] Official repo for "HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task…" (★27, updated last month)
- ★39, updated 11 months ago
- Replicating O1 inference-time scaling laws (★87, updated 6 months ago)
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods (★92, updated this week)
- ★13, updated 2 months ago
- Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents (★73, updated last month)
- Systematic evaluation framework that automatically rates overthinking behavior in large language models (★89, updated 3 weeks ago)
- ★110, updated 10 months ago
- [ACL 2024] Novel reranking method to select the best solutions for code generation (★16, updated 11 months ago)
- ★46, updated last year
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers" (★107, updated last year)
- ★42, updated 2 months ago