centerforaisafety / hle
Humanity's Last Exam
☆ 1,032 · Updated 2 weeks ago
Alternatives and similar repositories for hle
Users interested in hle are comparing it to the repositories listed below.
- LiveBench: A Challenging, Contamination-Free LLM Benchmark ☆ 847 · Updated this week
- ☆ 466 · Updated 3 weeks ago
- ☆ 2,238 · Updated this week
- ☆ 1,178 · Updated 3 weeks ago
- Renderer for the harmony response format to be used with gpt-oss ☆ 2,637 · Updated this week
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark ☆ 388 · Updated 10 months ago
- Releases from OpenAI Preparedness ☆ 833 · Updated last week
- This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software E… ☆ 1,435 · Updated 3 weeks ago
- Arena-Hard-Auto: An automatic LLM benchmark ☆ 892 · Updated last month
- Testing baseline LLM performance across various models ☆ 293 · Updated 2 weeks ago
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" ☆ 616 · Updated 3 weeks ago
- ☆ 215 · Updated last month
- Open-source interpretability platform 🧠 ☆ 316 · Updated this week
- A benchmark for LLMs on complicated tasks in the terminal ☆ 358 · Updated this week
- The OpenAI Model Spec ☆ 551 · Updated 3 months ago
- ☆ 402 · Updated 2 months ago
- Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution" ☆ 573 · Updated 4 months ago
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆ 840 · Updated last month
- Procedural reasoning datasets ☆ 1,030 · Updated this week
- ☆ 494 · Updated 2 weeks ago
- ☆ 377 · Updated last month
- Code for the paper "Training Software Engineering Agents and Verifiers with SWE-Gym" [ICML 2025] ☆ 516 · Updated last week
- Seed-Coder is a family of lightweight open-source code LLMs comprising base, instruct, and reasoning models, developed by ByteDance Seed ☆ 539 · Updated 2 months ago
- MLGym: A New Framework and Benchmark for Advancing AI Research Agents ☆ 541 · Updated 2 weeks ago
- MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model ☆ 2,791 · Updated last month
- GLM-4.5: An open-source large language model designed for intelligent agents by Z.ai ☆ 1,541 · Updated last week
- Pretraining and inference code for a large-scale depth-recurrent language model ☆ 810 · Updated 3 weeks ago
- Self-Adapting Language Models ☆ 743 · Updated last week
- [COLM 2025] LIMO: Less is More for Reasoning ☆ 1,000 · Updated last week
- Training Large Language Models to Reason in a Continuous Latent Space ☆ 1,235 · Updated 6 months ago