centerforaisafety / hle

Humanity's Last Exam

☆1,151

Alternatives and similar repositories for hle

Users who are interested in hle are comparing it to the repositories listed below.

LiveBench / LiveBench
LiveBench: A Challenging, Contamination-Free LLM Benchmark
☆902 Updated 2 weeks ago
safety-research / circuit-tracer
☆2,395 Updated last week
aw31 / openai-imo-2025-proofs
☆476 Updated 3 months ago
openai / frontier-evals
OpenAI Frontier Evals
☆924 Updated last week
idavidrein / gpqa
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
☆420 Updated last year
deepseek-ai / DeepSeek-Prover-V2
☆1,197 Updated 3 months ago
hijohnnylin / neuronpedia
open source interpretability platform 🧠
☆455 Updated 2 weeks ago
openai / mle-bench
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering
☆1,042 Updated last week
openai / harmony
Renderer for the harmony response format to be used with gpt-oss
☆3,926 Updated 2 months ago
GAIR-NLP / LIMO
[COLM 2025] LIMO: Less is More for Reasoning
☆1,038 Updated 3 months ago
microsoft / rStar
☆1,320 Updated last month
openai / SWELancer-Benchmark
This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software E…
☆1,438 Updated 3 months ago
lmarena / arena-hard-auto
Arena-Hard-Auto: An automatic LLM benchmark.
☆948 Updated 4 months ago
laude-institute / terminal-bench
A benchmark for LLMs on complicated tasks in the terminal
☆961 Updated this week
google-deepmind / alphaevolve_results
☆231 Updated 4 months ago
lyang36 / IMO25
An AI agent system for solving International Mathematical Olympiad (IMO) problems using Google's Gemini, OpenAI, and XAI APIs.
☆810 Updated 3 weeks ago
thinking-machines-lab / tinker-cookbook
Post-training with Tinker
☆1,096 Updated last week
jennyzzt / dgm
Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents
☆1,705 Updated 2 months ago
lmgame-org / GamingAgent
LLM/VLM gaming agents and model evaluation through games.
☆783 Updated last month
LiveCodeBench / LiveCodeBench
Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"
☆688 Updated 3 months ago
SakanaAI / AI-Scientist-v2
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search
☆1,709 Updated this week
facebookresearch / swe-rl
[NeurIPS'25] Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution"
☆610 Updated 7 months ago
SakanaAI / self-adaptive-llms
A Self-adaptation Framework🐙 that adapts LLMs for unseen tasks in real-time!
☆1,159 Updated 9 months ago
open-thought / reasoning-gym
[NeurIPS 2025 Spotlight] Reasoning Environments for Reinforcement Learning with Verifiable Rewards
☆1,202 Updated 3 weeks ago
facebookresearch / MLGym
MLGym: A New Framework and Benchmark for Advancing AI Research Agents
☆564 Updated 2 months ago
AkariAsai / OpenScholar
This repository includes the official implementation of OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs.
☆725 Updated 2 months ago
anthropic-experimental / agentic-misalignment
☆494 Updated 4 months ago
facebookresearch / coconut
Training Large Language Model to Reason in a Continuous Latent Space
☆1,313 Updated 2 months ago
qixucen / atom
[NeurIPS 2025] Atom of Thoughts for Markov LLM Test-Time Scaling
☆591 Updated 4 months ago
seal-rg / recurrent-pretraining
Pretraining and inference code for a large-scale depth-recurrent language model
☆838 Updated 2 weeks ago