centerforaisafety / hle
Humanity's Last Exam
☆1,151 Updated 3 weeks ago
Alternatives and similar repositories for hle
Users who are interested in hle are comparing it to the repositories listed below.
- LiveBench: A Challenging, Contamination-Free LLM Benchmark ☆902 Updated 2 weeks ago
- ☆2,395 Updated last week
- ☆476 Updated 3 months ago
- OpenAI Frontier Evals ☆924 Updated last week
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark ☆420 Updated last year
- ☆1,197 Updated 3 months ago
- open source interpretability platform 🧠 ☆455 Updated 2 weeks ago
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆1,042 Updated last week
- Renderer for the harmony response format to be used with gpt-oss ☆3,926 Updated 2 months ago
- [COLM 2025] LIMO: Less is More for Reasoning ☆1,038 Updated 3 months ago
- ☆1,320 Updated last month
- This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software E… ☆1,438 Updated 3 months ago
- Arena-Hard-Auto: An automatic LLM benchmark. ☆948 Updated 4 months ago
- A benchmark for LLMs on complicated tasks in the terminal ☆961 Updated this week
- ☆231 Updated 4 months ago
- An AI agent system for solving International Mathematical Olympiad (IMO) problems using Google's Gemini, OpenAI, and XAI APIs. ☆810 Updated 3 weeks ago
- Post-training with Tinker ☆1,096 Updated last week
- Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents ☆1,705 Updated 2 months ago
- LLM/VLM gaming agents and model evaluation through games. ☆783 Updated last month
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" ☆688 Updated 3 months ago
- The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search ☆1,709 Updated this week
- [NeurIPS'25] Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution" ☆610 Updated 7 months ago
- A Self-adaptation Framework🐙 that adapts LLMs for unseen tasks in real-time! ☆1,159 Updated 9 months ago
- [NeurIPS 2025 Spotlight] Reasoning Environments for Reinforcement Learning with Verifiable Rewards ☆1,202 Updated 3 weeks ago
- MLGym A New Framework and Benchmark for Advancing AI Research Agents ☆564 Updated 2 months ago
- This repository includes the official implementation of OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs. ☆725 Updated 2 months ago
- ☆494 Updated 4 months ago
- Training Large Language Model to Reason in a Continuous Latent Space ☆1,313 Updated 2 months ago
- [NeurIPS 2025] Atom of Thoughts for Markov LLM Test-Time Scaling ☆591 Updated 4 months ago
- Pretraining and inference code for a large-scale depth-recurrent language model ☆838 Updated 2 weeks ago