centerforaisafety / hle
Humanity's Last Exam
☆1,323 Updated 3 months ago
Alternatives and similar repositories for hle
Users who are interested in hle are comparing it to the repositories listed below.
- LiveBench: A Challenging, Contamination-Free LLM Benchmark ☆1,029 Updated this week
- ☆2,568 Updated this week
- OpenAI Frontier Evals ☆990 Updated last month
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark ☆463 Updated last year
- This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software E… ☆1,439 Updated 6 months ago
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆1,295 Updated 2 weeks ago
- ☆1,230 Updated 6 months ago
- A benchmark for LLMs on complicated tasks in the terminal ☆1,442 Updated last week
- ☆483 Updated 6 months ago
- open source interpretability platform 🧠 ☆675 Updated this week
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" ☆780 Updated 6 months ago
- Arena-Hard-Auto: An automatic LLM benchmark. ☆991 Updated 7 months ago
- τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment ☆690 Updated this week
- [COLM 2025] LIMO: Less is More for Reasoning ☆1,061 Updated 6 months ago
- [ICLR 2026] LLM/VLM gaming agents and model evaluation through games. ☆854 Updated 2 months ago
- ☆615 Updated 8 months ago
- ☆1,385 Updated 4 months ago
- An agent benchmark with tasks in a simulated software company. ☆631 Updated 2 months ago
- [NeurIPS'25] Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution" ☆673 Updated 10 months ago
- ☆258 Updated 3 weeks ago
- Renderer for the harmony response format to be used with gpt-oss ☆4,159 Updated last month
- A series of math-specific large language models of our Qwen2 series. ☆1,061 Updated last year
- An AI agent system for solving International Mathematical Olympiad (IMO) problems using Google's Gemini, OpenAI, and XAI APIs. ☆907 Updated 4 months ago
- Pretraining and inference code for a large-scale depth-recurrent language model ☆861 Updated last month
- ☆557 Updated 7 months ago
- Testing baseline LLMs performance across various models ☆336 Updated this week
- Sky-T1: Train your own O1 preview model within $450 ☆3,369 Updated 6 months ago
- Code release for Best-of-N Jailbreaking ☆552 Updated 11 months ago
- ☆4,316 Updated 6 months ago
- Synthetic data curation for post-training and structured data extraction ☆1,618 Updated last week