openai/frontier-evals

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/openai/frontier-evals)

openai / frontier-evals

OpenAI Frontier Evals

☆1,262

Alternatives and similar repositories for frontier-evals

Users that are interested in frontier-evals are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

openai / mle-bench
View on GitHub
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering
☆1,655Apr 24, 2026Updated 3 months ago
openai / SWELancer-Benchmark
View on GitHub
This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software E…
☆1,435Jul 18, 2025Updated last year
facebookresearch / swe-rl
View on GitHub
[NeurIPS'25] Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution"
☆712Mar 16, 2025Updated last year
SWE-Gym / SWE-Gym
View on GitHub
Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025]
☆711Jul 29, 2025Updated 11 months ago
NovaSky-AI / SkyRL
View on GitHub
SkyRL: A Modular Full-stack RL Library for LLMs
☆2,093Updated this week
GPU virtual machines on DigitalOcean Gradient AI • Ad
Get to production fast with high-performance AMD and NVIDIA GPUs you can spin up in seconds. The definition of operational simplicity.
SWE-bench / SWE-smith
View on GitHub
[NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents
☆711Updated this week
rllm-org / rllm
View on GitHub
Democratizing Reinforcement Learning for LLMs
☆5,731Updated this week
facebookresearch / MLGym
View on GitHub
MLGym A New Framework and Benchmark for Advancing AI Research Agents
☆612Aug 10, 2025Updated 11 months ago
openai / simple-evals
View on GitHub
☆4,583Apr 22, 2026Updated 3 months ago
harbor-framework / terminal-bench
View on GitHub
A benchmark for LLMs on complicated tasks in the terminal
☆2,483Jul 11, 2026Updated 2 weeks ago
SWE-bench / SWE-bench
View on GitHub
SWE-bench: Can Language Models Resolve Real-world Github Issues?
☆5,483Apr 1, 2026Updated 3 months ago
PrimeIntellect-ai / verifiers
View on GitHub
Our library for RL environments + evals
☆4,400Updated this week
METR / RE-Bench
View on GitHub
☆145Oct 16, 2025Updated 9 months ago
THUDM / slime
View on GitHub
slime is an LLM post-training framework for RL Scaling.
☆7,629Updated this week
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
thinking-machines-lab / tinker-cookbook
View on GitHub
Post-training with Tinker
☆3,911Updated this week
R2E-Gym / R2E-Gym
View on GitHub
[COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents
☆310Jul 13, 2025Updated last year
openai / harmony
View on GitHub
Renderer for the harmony response format to be used with gpt-oss
☆4,466Apr 8, 2026Updated 3 months ago
verl-project / verl
View on GitHub
verl/HybridFlow: A Flexible and Efficient RL Post-Training Framework
☆22,654Updated this week
aisa-group / PostTrainBench
View on GitHub
Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours
☆467Updated this week
harbor-framework / harbor
View on GitHub
Framework for evaluating and improving agents
☆3,504Updated this week
NVIDIA-NeMo / RL
View on GitHub
Scalable toolkit for efficient model reinforcement
☆1,849Updated this week
thinking-machines-lab / tinker-project-ideas
View on GitHub
Ideas for projects related to Tinker
☆191Nov 6, 2025Updated 8 months ago
princeton-pli / hal-harness
View on GitHub
☆310Jul 1, 2026Updated 3 weeks ago
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
allenai / open-instruct
View on GitHub
AllenAI's post-training codebase
☆3,809Updated this week
hkust-nlp / simpleRL-reason
View on GitHub
Simple RL training for reasoning
☆3,870Dec 23, 2025Updated 7 months ago
mll-lab-nu / RAGEN
View on GitHub
RAGEN leverages reinforcement learning to train LLM reasoning agents in interactive, stochastic environments.
☆2,756Updated this week
PeterGriffinJin / Search-R1
View on GitHub
Search-R1: An Efficient, Scalable RL Training Framework for Reasoning & Search Engine Calling interleaved LLM based on veRL
☆5,153Nov 13, 2025Updated 8 months ago
siegelz / core-bench
View on GitHub
☆78Nov 23, 2025Updated 8 months ago
PrimeIntellect-ai / prime-rl
View on GitHub
Agentic RL Training at Scale
☆1,724Updated this week
BytedTsinghua-SIA / DAPO
View on GitHub
An Open-source RL System from ByteDance Seed and Tsinghua AIR
☆1,846May 11, 2025Updated last year
SWE-agent / mini-swe-agent
View on GitHub
The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—b…
☆6,017Updated this week
facebookresearch / meta-agents-research-environments
View on GitHub
Meta Agents Research Environments is a comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike stat…
☆531Updated this week
Serverless GPU API endpoints on Runpod - Get Bonus Credits • Ad
Skip the infrastructure headaches. Auto-scaling, pay-as-you-go, no-ops approach lets you focus on innovating your application.
microsoft / SWE-bench-Live
View on GitHub
[NeurIPS 2025 D&B] 🚀 SWE-bench Goes Live!
☆212Jun 11, 2026Updated last month
open-thought / reasoning-gym
View on GitHub
[NeurIPS 2025 Spotlight] Reasoning Environments for Reinforcement Learning with Verifiable Rewards
☆1,468Apr 17, 2026Updated 3 months ago
sierra-research / tau-bench
View on GitHub
Code and Data for Tau-Bench
☆1,345Mar 18, 2026Updated 4 months ago
SWE-Perf / SWE-Perf
View on GitHub
☆52Oct 28, 2025Updated 8 months ago
LiveCodeBench / LiveCodeBench
View on GitHub
Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"
☆913Jul 16, 2025Updated last year
huggingface / open-r1
View on GitHub
Fully open reproduction of DeepSeek-R1
☆26,414Apr 2, 2026Updated 3 months ago
huggingface / trl
View on GitHub
Train transformer language models with reinforcement learning.
☆18,927Updated this week