bigcode-project/bigcodebench

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/bigcode-project/bigcodebench)

bigcode-project / bigcodebench

[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI

☆515

Alternatives and similar repositories for bigcodebench

Users that are interested in bigcodebench are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

bigcode-project / bigcodebench-annotation
View on GitHub
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
☆26Aug 8, 2024Updated last year
LiveCodeBench / LiveCodeBench
View on GitHub
Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"
☆910Jul 16, 2025Updated last year
evalplus / evalplus
View on GitHub
Rigourous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024
☆1,782Oct 2, 2025Updated 9 months ago
all-the-noises / eval-arena
View on GitHub
☆34Mar 21, 2026Updated 3 months ago
ganler / code-r1
View on GitHub
Reproducing R1 for Code with Reliable Rewards
☆313May 5, 2025Updated last year
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
facebookresearch / cruxeval
View on GitHub
CRUXEval: Code Reasoning, Understanding, and Execution Evaluation
☆171Oct 11, 2024Updated last year
xlang-ai / DS-1000
View on GitHub
[ICML 2023] Data and code release for the paper "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation".
☆275Oct 30, 2024Updated last year
ise-uiuc / blazedit
View on GitHub
Making code edting up to 7.7x faster using multi-layer speculation
☆23Feb 20, 2025Updated last year
TIGER-AI-Lab / AceCoder
View on GitHub
The official repo for "AceCoder: Acing Coder RL via Automated Test-Case Synthesis" [ACL25]
☆100Apr 9, 2025Updated last year
bigcode-project / bigcode-evaluation-harness
View on GitHub
A framework for the evaluation of autoregressive code generation language models.
☆1,052Jul 22, 2025Updated 11 months ago
amazon-science / llm-code-preference
View on GitHub
Training and Benchmarking LLMs for Code Preference.
☆38Nov 15, 2024Updated last year
bigcode-project / selfcodealign
View on GitHub
[NeurIPS'24] SelfCodeAlign: Self-Alignment for Code Generation
☆323Feb 24, 2025Updated last year
bigcode-project / the-stack-v2
View on GitHub
Code for the curation of The Stack v2 and StarCoder2 training data
☆136Apr 11, 2024Updated 2 years ago
GAIR-NLP / OlympicArena
View on GitHub
[NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
☆106Mar 6, 2025Updated last year
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
SWE-bench / SWE-bench
View on GitHub
SWE-bench: Can Language Models Resolve Real-world Github Issues?
☆5,452Apr 1, 2026Updated 3 months ago
SWE-Gym / SWE-Gym
View on GitHub
Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025]
☆708Jul 29, 2025Updated 11 months ago
bigcode-project / bigcodearena
View on GitHub
BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution
☆61Oct 13, 2025Updated 9 months ago
SWE-Perf / SWE-Perf
View on GitHub
☆51Oct 28, 2025Updated 8 months ago
KbsdJames / omni-math-rule
View on GitHub
The rule-based evaluation subset and code implementation of Omni-MATH
☆28Dec 23, 2024Updated last year
nuprl / MultiPL-E
View on GitHub
A multi-programming language benchmark for LLMs
☆312Apr 12, 2026Updated 3 months ago
openai / human-eval
View on GitHub
Code for the paper "Evaluating Large Language Models Trained on Code"
☆3,310Jan 17, 2025Updated last year
BigComputer-Project / SWE-Arena
View on GitHub
SWE Arena
☆36Jul 6, 2025Updated last year
multi-swe-bench / multi-swe-bench
View on GitHub
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
☆354Dec 18, 2025Updated 7 months ago
AI Agents on DigitalOcean Gradient AI Platform • Ad
Build production-ready AI agents using customizable tools or access multiple LLMs through a single endpoint. Create custom knowledge bases or connect external data.
OpenAutoCoder / Agentless
View on GitHub
Agentless🐱: an agentless approach to automatically solve software development problems
☆2,083Dec 22, 2024Updated last year
FudanSELab / ClassEval
View on GitHub
Benchmark ClassEval for class-level code generation.
☆151Oct 24, 2024Updated last year
evalplus / repoqa
View on GitHub
RepoQA: Evaluating Long-Context Code Understanding
☆136Nov 1, 2024Updated last year
MCEVAL / McEval
View on GitHub
☆48Dec 12, 2024Updated last year
sail-sg / regmix
View on GitHub
[ICLR 2025] 🧬 RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight)
☆194Feb 17, 2025Updated last year
lm-sys / llm-decontaminator
View on GitHub
Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"
☆324Dec 20, 2023Updated 2 years ago
facebookresearch / swe-rl
View on GitHub
[NeurIPS'25] Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution"
☆709Mar 16, 2025Updated last year
bigcode-project / astraios
View on GitHub
Astraios: Parameter-Efficient Instruction Tuning Code Language Models
☆63Apr 10, 2024Updated 2 years ago
KodCode-AI / code-r1
View on GitHub
Reproducing R1 for Code with Reliable Rewards
☆13Apr 9, 2025Updated last year
Bare Metal GPUs on DigitalOcean Gradient AI • Ad
Purpose-built for serious AI teams training foundational models, running large-scale inference, and pushing the boundaries of what's possible.
evo-eval / evoeval
View on GitHub
EvoEval: Evolving Coding Benchmarks via LLM
☆84Apr 6, 2024Updated 2 years ago
floatai / HumanEval-XL
View on GitHub
[LREC-COLING'24] HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization
☆42Mar 7, 2025Updated last year
yangzhou6666 / Beyond-Accurate-Code-Models
View on GitHub
A collection of publications that works on code models but beyond focusing on the accuracies.
☆12Jun 30, 2023Updated 3 years ago
Leolty / repobench
View on GitHub
✨ RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems - ICLR 2024
☆212Aug 16, 2024Updated last year
THUDM / NaturalCodeBench
View on GitHub
NaturalCodeBench (Findings of ACL 2024)
☆70Oct 14, 2024Updated last year
codefuse-ai / Awesome-Code-LLM
View on GitHub
[TMLR] A curated list of language modeling researches for code (and other software engineering activities), plus related datasets.
☆3,412May 20, 2026Updated 2 months ago
lmarena / arena-hard-auto
View on GitHub
Arena-Hard-Auto: An automatic LLM benchmark.
☆1,050Jun 21, 2025Updated last year