centerforaisafety / hle
Humanity's Last Exam
☆1,256 Updated 2 months ago
Alternatives and similar repositories for hle
Users who are interested in hle are comparing it to the repositories listed below
- LiveBench: A Challenging, Contamination-Free LLM Benchmark ☆958 Updated this week
- OpenAI Frontier Evals ☆957 Updated this week
- ☆1,224 Updated 4 months ago
- ☆2,477 Updated last month
- Renderer for the harmony response format to be used with gpt-oss ☆4,050 Updated last month
- open source interpretability platform 🧠 ☆515 Updated last week
- ☆478 Updated 4 months ago
- This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software E… ☆1,438 Updated 4 months ago
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆1,209 Updated 2 weeks ago
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark ☆436 Updated last year
- ☆533 Updated 5 months ago
- A benchmark for LLMs on complicated tasks in the terminal ☆1,162 Updated last week
- Post-training with Tinker ☆2,357 Updated this week
- Testing baseline LLMs performance across various models ☆325 Updated last week
- ☆246 Updated 5 months ago
- Arena-Hard-Auto: An automatic LLM benchmark. ☆965 Updated 5 months ago
- A Self-adaptation Framework🐙 that adapts LLMs for unseen tasks in real-time! ☆1,174 Updated 10 months ago
- ☆1,355 Updated 3 months ago
- ☆1,416 Updated last week
- [COLM 2025] LIMO: Less is More for Reasoning ☆1,054 Updated 4 months ago
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" ☆732 Updated 4 months ago
- Training Large Language Model to Reason in a Continuous Latent Space ☆1,382 Updated 4 months ago
- Pretraining and inference code for a large-scale depth-recurrent language model ☆852 Updated last month
- ☆569 Updated 6 months ago
- [NeurIPS 2025 Spotlight] Reasoning Environments for Reinforcement Learning with Verifiable Rewards ☆1,262 Updated 3 weeks ago
- Large Concept Models: Language modeling in a sentence representation space ☆2,309 Updated 10 months ago
- Dream 7B, a large diffusion language model ☆1,099 Updated 3 weeks ago
- Code and Data for Tau-Bench ☆987 Updated 3 months ago
- ☆3,465 Updated 9 months ago
- Sky-T1: Train your own O1 preview model within $450 ☆3,358 Updated 5 months ago