centerforaisafety / hle
Humanity's Last Exam
☆810 Updated 3 weeks ago
Alternatives and similar repositories for hle
Users that are interested in hle are comparing it to the libraries listed below
- LiveBench: A Challenging, Contamination-Free LLM Benchmark ☆789 Updated 2 weeks ago
- This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software E… ☆1,408 Updated last month
- ☆197 Updated last week
- Atom of Thoughts for Markov LLM Test-Time Scaling ☆574 Updated last week
- Testing baseline LLMs performance across various models ☆275 Updated last week
- open source interpretability platform 🧠 ☆276 Updated this week
- Releases from OpenAI Preparedness ☆783 Updated 3 weeks ago
- Fully open data curation for reasoning models ☆1,935 Updated 3 weeks ago
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark ☆367 Updated 8 months ago
- Arena-Hard-Auto: An automatic LLM benchmark. ☆851 Updated last week
- ☆2,065 Updated this week
- A collection of prompts to challenge the reasoning abilities of large language models in presence of misguiding information ☆427 Updated last week
- Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution" ☆551 Updated 3 months ago
- AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and re… ☆350 Updated this week
- Code release for Best-of-N Jailbreaking ☆524 Updated 4 months ago
- Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models". ☆620 Updated last week
- Training Large Language Model to Reason in a Continuous Latent Space ☆1,162 Updated 5 months ago
- Dream 7B, a large diffusion language model ☆774 Updated last week
- ☆211 Updated last week
- ☆570 Updated 2 months ago
- Automatic evals for LLMs ☆437 Updated 3 weeks ago
- LIMO: Less is More for Reasoning ☆963 Updated 2 months ago
- [ICML 2025] A platform for developers to simulate collaborative research activities ☆161 Updated this week
- ZeroSearch: Incentivize the Search Capability of LLMs without Searching ☆1,020 Updated 2 weeks ago
- ☆1,153 Updated last month
- Open source interpretability artefacts for R1. ☆149 Updated 2 months ago
- ☆504 Updated last week
- This repository includes the official implementation of OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs. ☆695 Updated 2 months ago
- OO for LLMs ☆801 Updated last week
- procedural reasoning datasets ☆872 Updated last week