ysy-phoenix / evalhub

All-in-one benchmarking platform for evaluating LLMs.

☆15

Alternatives and similar repositories for evalhub

Users who are interested in evalhub are comparing it to the repositories listed below.

0xWJ / code-judge
☆9 Updated last week
phonism / CP-Zero
Based on the R1-Zero method, using rule-based rewards and GRPO on the Code Contests dataset.
☆17 Updated 2 months ago
ganler / code-r1
Reproducing R1 for Code with Reliable Rewards
☆221 Updated last month
princeton-nlp / ProLong
Homepage for ProLong (Princeton long-context language models) and paper "How to Train Long-Context Language Models (Effectively)"
☆202 Updated 3 months ago
CMU-AIRe / MRT
Research Code for preprint "Optimizing Test-Time Compute via Meta Reinforcement Finetuning".
☆94 Updated 3 months ago
bethgelab / sober-reasoning
A Sober Look at Language Model Reasoning
☆74 Updated last week
OpenSparseLLMs / Linear-MoE
☆104 Updated 3 weeks ago
hkust-nlp / dart-math
[NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*
☆108 Updated 6 months ago
ruipeterpan / specreason
PoC for "SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning" [arXiv '25]
☆39 Updated last month
TIGER-AI-Lab / verl-tool
A version of verl to support tool use
☆261 Updated this week
MingyuJ666 / Rope_with_LLM
[ICML'25] Our study systematically investigates massive values in LLMs' attention mechanisms. First, we observe massive values are concen…
☆73 Updated last week
tongyx361 / symeval
Evaluation utilities based on SymPy.
☆20 Updated 6 months ago
tmlr-group / landscape-of-thoughts
[ICLR 2025 Workshop] "Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models"
☆25 Updated last week
OS-Copilot / ScienceBoard
Code, benchmark and environment for "ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows"
☆85 Updated this week
thu-wyz / inference_scaling
☆71 Updated 7 months ago
TsinghuaC3I / MARTI
A Framework for LLM-based Multi-Agent Reinforced Training and Inference
☆140 Updated 2 weeks ago
henryzhongsc / longctx_bench
Official implementation for Yuan & Liu & Zhong et al., KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark o…
☆79 Updated 4 months ago
GAIR-NLP / ToRL
☆228 Updated last month
Jingyu6 / speculative_prefill
☆30 Updated last month
hyx1999 / SAM-Decoding
Official Implementation of SAM-Decoding: Speculative Decoding via Suffix Automaton
☆28 Updated 4 months ago
tongyx361 / Awesome-LLM-Research
Curation of resources for LLM research, screened by @tongyx361 to ensure high quality and accompanied with elaborately-written concise de…
☆55 Updated 11 months ago
ISEEKYAN / mbridge
☆43 Updated this week
sustcsonglin / linear-attention-and-beyond-slides
☆76 Updated 4 months ago
Infini-AI-Lab / Multiverse
☆58 Updated last week
UNITES-Lab / MC-SMoE
[ICLR 2024 Spotlight] Code for the paper "Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy"
☆85 Updated last week
PRIME-RL / ImplicitPRM
Repo of paper "Free Process Rewards without Process Labels"
☆154 Updated 3 months ago
cmu-l3 / l1
L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning
☆222 Updated last month
agentica-project / verl-pipeline
Async pipelined version of Verl
☆100 Updated 2 months ago
kanishkg / cognitive-behaviors
☆190 Updated 3 months ago
nightdessert / Retrieval_Head
open-source code for paper: Retrieval Head Mechanistically Explains Long-Context Factuality
☆201 Updated 10 months ago