ysy-phoenix / evalhub
All-in-one benchmarking platform for evaluating LLMs.
☆15Updated 2 months ago
Alternatives and similar repositories for evalhub
Users who are interested in evalhub are comparing it to the libraries listed below.
- Implementation for FP8/INT8 Rollout for RL training without performance drop.☆281Updated 2 months ago
- Reproducing R1 for Code with Reliable Rewards☆278Updated 8 months ago
- ☆49Updated 4 months ago
- PoC for "SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning" [NeurIPS '25]☆61Updated 3 months ago
- ☆20Updated 3 months ago
- ☆90Updated 6 months ago
- Official implementation for Yuan & Liu & Zhong et al., KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark o…☆88Updated 10 months ago
- Research Code for preprint "Optimizing Test-Time Compute via Meta Reinforcement Finetuning".☆114Updated 5 months ago
- Physics of Language Models, Part 4☆291Updated this week
- open-source code for paper: Retrieval Head Mechanistically Explains Long-Context Factuality☆226Updated last year
- [ASPLOS'26] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter☆121Updated last month
- [NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning☆83Updated last month
- Bridge Megatron-Core to Hugging Face/Reinforcement Learning☆181Updated this week
- ☆126Updated 7 months ago
- Based on the R1-Zero method, using rule-based rewards and GRPO on the Code Contests dataset.☆18Updated 8 months ago
- ☆47Updated 6 months ago
- SWE-Swiss: A Multi-Task Fine-Tuning and RL Recipe for High-Performance Issue Resolution☆100Updated 3 months ago
- A Sober Look at Language Model Reasoning☆92Updated last month
- Async pipelined version of Verl☆124Updated 9 months ago
- ☆36Updated 10 months ago
- Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination.☆21Updated 5 months ago
- ☆49Updated 7 months ago
- Homepage for ProLong (Princeton long-context language models) and paper "How to Train Long-Context Language Models (Effectively)"☆243Updated 3 months ago
- ☆31Updated 2 months ago
- ☆41Updated 9 months ago
- Code, benchmark and environment for "ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows"☆117Updated last month
- ☆102Updated 10 months ago
- R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning☆29Updated 3 months ago
- LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification☆72Updated 5 months ago
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs☆182Updated 3 months ago