ysy-phoenix / evalhub
All-in-one benchmarking platform for evaluating LLMs.
☆15 · Updated last month
Alternatives and similar repositories for evalhub
Users who are interested in evalhub compare it to the libraries listed below.
- PoC for "SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning" [NeurIPS '25] ☆49 · Updated 2 months ago
- Reproducing R1 for Code with Reliable Rewards ☆258 · Updated 4 months ago
- Implementation of FP8/INT8 rollout for RL training without performance drop. ☆207 · Updated last week
- Official Implementation of SAM-Decoding: Speculative Decoding via Suffix Automaton ☆32 · Updated 7 months ago
- ☆47 · Updated last month
- ☆119 · Updated 3 months ago
- SWE-Swiss: A Multi-Task Fine-Tuning and RL Recipe for High-Performance Issue Resolution ☆84 · Updated last week
- Bridge Megatron-Core to Hugging Face/Reinforcement Learning ☆129 · Updated this week
- Async pipelined version of Verl ☆117 · Updated 5 months ago
- ☆23 · Updated 3 months ago
- Research Code for preprint "Optimizing Test-Time Compute via Meta Reinforcement Finetuning". ☆105 · Updated last month
- ☆43 · Updated 4 months ago
- Official implementation for Yuan & Liu & Zhong et al., KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark o… ☆84 · Updated 7 months ago
- [NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving* ☆113 · Updated 9 months ago
- open-source code for paper: Retrieval Head Mechanistically Explains Long-Context Factuality ☆210 · Updated last year
- Based on the R1-Zero method, using rule-based rewards and GRPO on the Code Contests dataset. ☆18 · Updated 5 months ago
- ☆74 · Updated 10 months ago
- siiRL: Shanghai Innovation Institute RL Framework for Advanced LLMs and Multi-Agent Systems ☆214 · Updated this week
- LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification ☆63 · Updated 2 months ago
- [ICML 2025] Reward-guided Speculative Decoding (RSD) for efficiency and effectiveness. ☆47 · Updated 4 months ago
- ☆75 · Updated 3 months ago
- Ongoing research project for code&math LLMs ☆18 · Updated 2 months ago
- ☆289 · Updated 4 months ago
- Parallel Scaling Law for Language Model: Beyond Parameter and Inference Time Scaling ☆443 · Updated 4 months ago
- Revisiting Mid-training in the Era of Reinforcement Learning Scaling ☆176 · Updated 2 months ago
- Source code for the paper "LongGenBench: Long-context Generation Benchmark" ☆23 · Updated 11 months ago
- Homepage for ProLong (Princeton long-context language models) and paper "How to Train Long-Context Language Models (Effectively)" ☆226 · Updated 2 weeks ago
- Evaluation utilities based on SymPy. ☆20 · Updated 9 months ago
- ☆33 · Updated 7 months ago
- Curation of resources for LLM research, screened by @tongyx361 to ensure high quality and accompanied with elaborately-written concise de… ☆61 · Updated last year