ysy-phoenix / evalhub
All-in-one benchmarking platform for evaluating LLMs.
☆15Updated 3 weeks ago
Alternatives and similar repositories for evalhub
Users that are interested in evalhub are comparing it to the libraries listed below
- Reproducing R1 for Code with Reliable Rewards☆252Updated 4 months ago
- Bridge Megatron-Core to Hugging Face/Reinforcement Learning☆103Updated last week
- PoC for "SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning" [arXiv '25]☆49Updated last month
- Ongoing research project for code&math LLMs☆18Updated 2 months ago
- Async pipelined version of Verl☆117Updated 4 months ago
- Implementation for FP8/INT8 Rollout for RL training without performance drop.☆184Updated this week
- ☆117Updated 2 months ago
- ☆46Updated 2 weeks ago
- [NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*☆113Updated 8 months ago
- Official implementation for Yuan & Liu & Zhong et al., KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark o…☆82Updated 6 months ago
- Physics of Language Models, Part 4☆238Updated last month
- ☆23Updated 3 months ago
- SWE-Swiss: A Multi-Task Fine-Tuning and RL Recipe for High-Performance Issue Resolution☆78Updated this week
- Revisiting Mid-training in the Era of Reinforcement Learning Scaling☆169Updated last month
- Research Code for preprint "Optimizing Test-Time Compute via Meta Reinforcement Finetuning".☆102Updated last month
- ☆33Updated 5 months ago
- Open-source code for the paper "Retrieval Head Mechanistically Explains Long-Context Factuality"☆208Updated last year
- Official Implementation of SAM-Decoding: Speculative Decoding via Suffix Automaton☆31Updated 6 months ago
- [ICML 2025] Reward-guided Speculative Decoding (RSD) for efficiency and effectiveness.☆46Updated 4 months ago
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨☆251Updated last year
- End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning☆190Updated this week
- ☆41Updated 3 months ago
- A Sober Look at Language Model Reasoning☆81Updated 2 months ago
- ☆33Updated 6 months ago
- Source code for the paper "LongGenBench: Long-context Generation Benchmark"☆23Updated 10 months ago
- [ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation☆232Updated 8 months ago
- siiRL: Shanghai Innovation Institute RL Framework for Advanced LLMs and Multi-Agent Systems☆179Updated this week
- Based on the R1-Zero method, using rule-based rewards and GRPO on the Code Contests dataset.☆18Updated 4 months ago
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs☆148Updated 3 weeks ago
- Homepage for ProLong (Princeton long-context language models) and paper "How to Train Long-Context Language Models (Effectively)"☆221Updated 5 months ago