LiveCodeBench / LiveCodeBench
Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"
☆770 · Updated 6 months ago
Alternatives and similar repositories for LiveCodeBench
Users interested in LiveCodeBench are comparing it to the libraries listed below.
- [ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI ☆473 · Updated 3 weeks ago
- [NeurIPS'25] Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution" ☆667 · Updated 10 months ago
- Code for the paper "Training Software Engineering Agents and Verifiers with SWE-Gym" [ICML 2025] ☆624 · Updated 5 months ago
- [ICML 2025 Oral] CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction ☆566 · Updated 8 months ago
- Arena-Hard-Auto: An automatic LLM benchmark. ☆988 · Updated 7 months ago
- The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024] ☆327 · Updated 2 months ago
- [NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents ☆519 · Updated last week
- A framework for the evaluation of autoregressive code generation language models. ☆1,017 · Updated 6 months ago
- Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving ☆311 · Updated last month
- A project to improve the skills of large language models ☆786 · Updated this week
- Open-sourced predictions, execution logs, trajectories, and results from model inference and evaluation runs on the SWE-bench task. ☆241 · Updated 2 weeks ago
- Large Reasoning Models ☆807 · Updated last year
- Code and Data for Tau-Bench ☆1,070 · Updated 5 months ago
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark ☆461 · Updated last year
- An O1 replication for coding ☆334 · Updated last year
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆1,286 · Updated 2 weeks ago
- [ICML 2023] Data and code release for the paper "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation". ☆264 · Updated last year
- [ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data … ☆820 · Updated 10 months ago
- Code for Quiet-STaR ☆742 · Updated last year
- RewardBench: the first evaluation tool for reward models. ☆683 · Updated last week
- xLAM: A Family of Large Action Models to Empower AI Agent Systems ☆599 · Updated 5 months ago
- A simple unified framework for evaluating LLMs ☆258 · Updated 9 months ago
- LongBench v2 and LongBench (ACL '25 & '24) ☆1,078 · Updated last year
- 🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and Interactive Coding Agent, ACL'24 Best Resource… ☆362 · Updated 2 months ago
- Building Open LLM Web Agents with Self-Evolving Online Curriculum RL ☆497 · Updated 7 months ago
- Automatic evals for LLMs ☆575 · Updated last month
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search (NeurIPS 2024) ☆688 · Updated last year