centerforaisafety / hle
Humanity's Last Exam
☆596 · Updated last month
Alternatives and similar repositories for hle:
Users who are interested in hle compare it to the repositories listed below.
- LiveBench: A Challenging, Contamination-Free LLM Benchmark ☆620 · Updated last week
- Understanding R1-Zero-Like Training: A Critical Perspective ☆725 · Updated this week
- An Open Large Reasoning Model for Real-World Solutions ☆1,477 · Updated 3 weeks ago
- Search-R1: An Efficient, Scalable RL Training Framework for Reasoning & Search Engine Calling interleaved LLM based on veRL ☆1,466 · Updated last week
- This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software E… ☆1,291 · Updated 2 weeks ago
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark ☆318 · Updated 6 months ago
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" ☆401 · Updated this week
- Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution" ☆477 · Updated 2 weeks ago
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆656 · Updated 2 months ago
- Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym ☆410 · Updated 3 weeks ago
- ☆526 · Updated last week
- Testing baseline LLMs performance across various models ☆244 · Updated last week
- Pretraining code for a large-scale depth-recurrent language model ☆709 · Updated 2 weeks ago
- LIMO: Less is More for Reasoning ☆875 · Updated last month
- Recipes to scale inference-time compute of open models ☆1,048 · Updated last month
- Arena-Hard-Auto: An automatic LLM benchmark. ☆771 · Updated 2 weeks ago
- Synthetic data curation for post-training and structured data extraction ☆1,097 · Updated last week
- [ICLR 2025] Automated Design of Agentic Systems ☆1,241 · Updated 2 months ago
- A collection of prompts to challenge the reasoning abilities of large language models in presence of misguiding information ☆386 · Updated last week
- Democratizing Reinforcement Learning for LLMs ☆2,158 · Updated last month
- Scalable RL solution for advanced reasoning of language models ☆1,445 · Updated 2 weeks ago
- ☆420 · Updated 8 months ago
- ☆438 · Updated 5 months ago
- Prompt-to-Leaderboard ☆205 · Updated 2 weeks ago
- OLMoE: Open Mixture-of-Experts Language Models ☆698 · Updated 2 weeks ago
- Search-o1: Agentic Search-Enhanced Large Reasoning Models ☆748 · Updated 3 weeks ago
- ☆493 · Updated last week
- Sky-T1: Train your own O1 preview model within $450 ☆3,167 · Updated last week
- procedural reasoning datasets ☆541 · Updated this week
- ☆141 · Updated 3 weeks ago