LiveCodeBench / LiveCodeBenchLinks

Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"

☆688

Alternatives and similar repositories for LiveCodeBench

Users that are interested in LiveCodeBench are comparing it to the libraries listed below

Sorting:

bigcode-project / bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
☆440Updated 2 weeks ago
facebookresearch / swe-rl
[NeurIPS'25] Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution"
☆610Updated 7 months ago
SWE-Gym / SWE-Gym
Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025]
☆553Updated 3 months ago
hkust-nlp / CodeIO
[ICML 2025 Oral] CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction
☆556Updated 5 months ago
SWE-bench / SWE-smith
[NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents
☆432Updated last week
NVIDIA-NeMo / Skills
A project to improve skills of large language models
☆594Updated this week
multi-swe-bench / multi-swe-bench
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
☆268Updated this week
lmarena / arena-hard-auto
Arena-Hard-Auto: An automatic LLM benchmark.
☆948Updated 4 months ago
TIGER-AI-Lab / MMLU-Pro
The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024]
☆305Updated 8 months ago
mlfoundations / evalchemy
Automatic evals for LLMs
☆550Updated 4 months ago
huggingface / Math-Verify
☆981Updated 3 months ago
SimpleBerry / LLaMA-O1
Large Reasoning Models
☆805Updated 10 months ago
bigcode-project / bigcode-evaluation-harness
A framework for the evaluation of autoregressive code generation language models.
☆986Updated 3 months ago
zhentingqi / rStar
☆964Updated 9 months ago
abacaj / code-eval
Run evaluation on LLMs using human-eval benchmark
☆420Updated 2 years ago
bigcode-project / selfcodealign
[NeurIPS'24] SelfCodeAlign: Self-Alignment for Code Generation
☆319Updated 8 months ago
idavidrein / gpqa
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
☆420Updated last year
WildEval / ZeroEval
A simple unified framework for evaluating LLMs
☆254Updated 6 months ago
NovaSky-AI / SkyRL
SkyRL: A Modular Full-stack RL Library for LLMs
☆1,101Updated this week
Leolty / repobench
✨ RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems - ICLR 2024
☆174Updated last year
ADaM-BJTU / O1-CODER
AN O1 REPLICATION FOR CODING
☆336Updated 10 months ago
magpie-align / magpie
[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data …
☆782Updated 7 months ago
SWE-bench / experiments
Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task.
☆218Updated last week
OpenBMB / Eurus
☆319Updated last year
OpenBMB / InfiniteBench
Codes for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718
☆353Updated last year
StonyBrookNLP / appworld
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and Interactive Coding Agent, ACL'24 Best Resource…
☆296Updated this week
ezelikman / quiet-star
Code for Quiet-STaR
☆739Updated last year
sierra-research / tau-bench
Code and Data for Tau-Bench
☆901Updated 2 months ago
allenai / olmes
Reproducible, flexible LLM evaluations
☆257Updated last week
LiveBench / LiveBench
LiveBench: A Challenging, Contamination-Free LLM Benchmark
☆902Updated 2 weeks ago