crux-eval / eval-arenaLinks

☆33

Alternatives and similar repositories for eval-arena

Users that are interested in eval-arena are comparing it to the libraries listed below

Sorting:

qishenghu / InstructCoder
InstructCoder: Instruction Tuning Large Language Models for Code Editing | Oral ACL-2024 srw
☆64Updated last year
amazon-science / llm-code-preference
Training and Benchmarking LLMs for Code Preference.
☆37Updated last year
bigcode-project / astraios
Astraios: Parameter-Efficient Instruction Tuning Code Language Models
☆63Updated last year
Ablustrund / APPS_Plus
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback
☆74Updated last year
zorazrw / odex
[EMNLP'23] Execution-Based Evaluation for Open Domain Code Generation
☆49Updated 2 years ago
rmshin / llm-mcts
☆41Updated last year
CodeEditorBench / CodeEditorBench
☆56Updated last year
shunzh / Code-AI-Tree-Search
☆119Updated last year
ise-uiuc / xft
XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts
☆35Updated last year
facebookresearch / cruxeval
CRUXEval: Code Reasoning, Understanding, and Execution Evaluation
☆165Updated last year
SparksofAGI / MHPP
☆33Updated 4 months ago
martin-wey / CodeUltraFeedback
CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025)
☆73Updated last year
WHGTyen / BIG-Bench-Mistake
A dataset of LLM-generated chain-of-thought steps annotated with mistake location.
☆85Updated last year
SalesforceAIResearch / swecomm
☆28Updated 3 months ago
yaof20 / ReaL
Implementation and datasets for "Training Language Models to Generate Quality Code with Program Analysis Feedback"
☆40Updated 6 months ago
bigcode-project / bigcodebench-annotation
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
☆25Updated last year
evalplus / repoqa
RepoQA: Evaluating Long-Context Code Understanding
☆128Updated last year
zkx06111 / ALGO
☆36Updated 2 years ago
ntunlp / xCodeEval
xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval
☆87Updated last year
ntunlp / ExecEval
A distributed, extensible, secure solution for evaluating machine generated code with unit tests in multiple programming languages.
☆62Updated last year
allenai / easy-to-hard-generalization
Code for the arXiv preprint "The Unreasonable Effectiveness of Easy Training Data"
☆48Updated 2 years ago
CriticBench / CriticBench
[ACL 2024 Findings] CriticBench: Benchmarking LLMs for Critique-Correct Reasoning
☆30Updated last year
Zayne-sprague / MuSR
☆56Updated last year
GAIR-NLP / OlympicArena
[NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
☆107Updated 11 months ago
nyu-mll / ILF-for-code-generation
☆80Updated 10 months ago
niansong1996 / lever
Code for paper "LEVER: Learning to Verifiy Language-to-Code Generation with Execution" (ICML'23)
☆90Updated 2 years ago
xufangzhi / Symbol-LLM
[ACL 2024] The project of Symbol-LLM
☆59Updated last year
hughbzhang / o1_inference_scaling_laws
Replicating O1 inference-time scaling laws
☆93Updated last year
OSU-NLP-Group / llm-planning-eval
[ACL'24] Code and data of paper "When is Tree Search Useful for LLM Planning? It Depends on the Discriminator"
☆54Updated last year
gonglinyuan / safim
☆44Updated 9 months ago