ntunlp / ExecEval

A distributed, extensible, secure solution for evaluating machine-generated code with unit tests in multiple programming languages.

☆56

Alternatives and similar repositories for ExecEval

Users interested in ExecEval are comparing it to the repositories listed below.


amazon-science / cceval
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023)
☆151 Updated last year
ntunlp / xCodeEval
xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval
☆86 Updated 10 months ago
facebookresearch / cruxeval
CRUXEval: Code Reasoning, Understanding, and Execution Evaluation
☆151 Updated 9 months ago
Leolty / repobench
✨ RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems - ICLR 2024
☆168 Updated 11 months ago
qishenghu / InstructCoder
InstructCoder: Instruction Tuning Large Language Models for Code Editing | Oral ACL-2024 srw
☆62 Updated 9 months ago
shrivastavadisha / repo_level_prompt_generation
☆124 Updated 2 years ago
reddy-lab-code-research / PPOCoder
Code for the TMLR 2023 paper "PPOCoder: Execution-based Code Generation using Deep Reinforcement Learning"
☆114 Updated last year
xlang-ai / DS-1000
[ICML 2023] Data and code release for the paper "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation".
☆251 Updated 9 months ago
Zyq-scut / RLTF
Accepted by Transactions on Machine Learning Research (TMLR)
☆130 Updated 9 months ago
CodeEditorBench / CodeEditorBench
☆49 Updated last year
SparksofAGI / MHPP
☆32 Updated last month
zorazrw / odex
[EMNLP'23] Execution-Based Evaluation for Open Domain Code Generation
☆48 Updated last year
thunlp / DebugBench
The repository for the paper "DebugBench: Evaluating Debugging Capability of Large Language Models".
☆79 Updated last year
Ablustrund / APPS_Plus
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback
☆67 Updated 11 months ago
nuprl / MultiPL-E
A multi-programming language benchmark for LLMs
☆265 Updated 2 weeks ago
theblackcat102 / evol-dataset
evol augment any dataset online
☆59 Updated last year
amazon-science / Repoformer
Repoformer: Selective Retrieval for Repository-Level Code Completion (ICML 2024)
☆55 Updated last month
amazon-science / mxeval
☆110 Updated last year
THUDM / NaturalCodeBench
NaturalCodeBench (Findings of ACL 2024)
☆68 Updated 9 months ago
bigcode-project / astraios
Astraios: Parameter-Efficient Instruction Tuning Code Language Models
☆59 Updated last year
rmshin / llm-mcts
☆41 Updated last year
R2E-Gym / R2E-Gym
Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents
☆136 Updated 2 weeks ago
nickrosh / evol-teacher
Open Source WizardCoder Dataset
☆159 Updated 2 years ago
niansong1996 / lever
Code for paper "LEVER: Learning to Verify Language-to-Code Generation with Execution" (ICML'23)
☆89 Updated 2 years ago
crux-eval / eval-arena
☆28 Updated 2 weeks ago
logic-star-ai / swt-bench
[NeurIPS 2024] Evaluation harness for SWT-Bench, a benchmark for evaluating LLM repository-level test-generation
☆51 Updated this week
WHGTyen / BIG-Bench-Mistake
A dataset of LLM-generated chain-of-thought steps annotated with mistake location.
☆81 Updated 11 months ago
evalplus / repoqa
RepoQA: Evaluating Long-Context Code Understanding
☆113 Updated 9 months ago
terryyz / ice-score
[EACL 2024] ICE-Score: Instructing Large Language Models to Evaluate Code
☆76 Updated last year
princeton-nlp / intercode
[NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898
☆223 Updated last year