ysy-phoenix / evalhub
All-in-one benchmarking platform for evaluating LLMs.
☆13 Updated last week
Alternatives and similar repositories for evalhub:
Users who are interested in evalhub are comparing it to the repositories listed below
- The OlymMATH dataset ☆11 Updated this week
- ☆54 Updated last week
- This repository contains a regularly updated paper list for LLMs-reasoning-in-latent-space. ☆72 Updated this week
- Reproducing R1 for Code with Reliable Rewards ☆179 Updated this week
- Official implementation for Yuan & Liu & Zhong et al., KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark o… ☆72 Updated last month
- ☆125 Updated 3 weeks ago
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning ☆191 Updated last month
- [NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving* ☆101 Updated 4 months ago
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨ ☆203 Updated last year
- Research Code for preprint "Optimizing Test-Time Compute via Meta Reinforcement Finetuning". ☆95 Updated last month
- Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill? ☆25 Updated 2 weeks ago
- ☆187 Updated 2 months ago
- ☆63 Updated 5 months ago
- ☆39 Updated 5 months ago
- ☆25 Updated this week
- Homepage for ProLong (Princeton long-context language models) and paper "How to Train Long-Context Language Models (Effectively)" ☆175 Updated last month
- ☆90 Updated 3 months ago
- xVerify: Efficient Answer Verifier for Reasoning Model Evaluations ☆75 Updated last week
- Repo of paper "Free Process Rewards without Process Labels" ☆143 Updated last month
- ☆22 Updated 3 weeks ago
- [ICLR 2024 Spotlight] Code for the paper "Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy" ☆81 Updated 10 months ago
- A platform to develop CTM-motivated AI architecture. ☆12 Updated this week
- A brief and partial summary of RLHF algorithms. ☆127 Updated last month
- A lightweight reproduction of DeepSeek-R1-Zero with in-depth analysis of self-reflection behavior. ☆229 Updated last week
- A Comprehensive Survey on Long Context Language Modeling ☆131 Updated 3 weeks ago
- Enhances Overleaf by allowing article searches and BibTeX retrieval from DBLP and Google Scholar ☆62 Updated last week
- Based on the R1-Zero method, using rule-based rewards and GRPO on the Code Contests dataset. ☆17 Updated this week
- Curation of resources for LLM mathematical reasoning, most of which are screened by @tongyx361 to ensure high quality and accompanied wit… ☆122 Updated 9 months ago
- 🔥 How to efficiently and effectively compress the CoTs or directly generate concise CoTs during inference while maintaining the reasonin… ☆40 Updated this week
- ☆38 Updated last year