GAIR-NLP / ResearcherBench
ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry
☆24 Updated 2 weeks ago
Alternatives and similar repositories for ResearcherBench
Users who are interested in ResearcherBench are comparing it to the repositories listed below.
- xVerify: Efficient Answer Verifier for Reasoning Model Evaluations ☆127 Updated 3 months ago
- Code, benchmark and environment for "ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows" ☆100 Updated last month
- Revisiting Mid-training in the Era of Reinforcement Learning Scaling ☆163 Updated 3 weeks ago
- [EMNLP 2024 (Oral)] Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA ☆139 Updated 9 months ago
- [ICLR 2025] This is the code repo for our ICLR’25 paper "RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rew… ☆42 Updated 6 months ago
- A Comprehensive Survey on Long Context Language Modeling ☆170 Updated last month
- [ICML 2025] Teaching Language Models to Critique via Reinforcement Learning ☆108 Updated 3 months ago
- ☆66 Updated this week
- The official repository of the Omni-MATH benchmark. ☆87 Updated 7 months ago
- General Reasoner: Advancing LLM Reasoning Across All Domains ☆162 Updated 2 months ago
- [NeurIPS 2024] Source code for xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token ☆148 Updated last year
- ☆67 Updated last month
- [ICLR 2025] Benchmarking Agentic Workflow Generation ☆118 Updated 5 months ago
- Code for paper "Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System" ☆60 Updated 9 months ago
- The official repo of SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond ☆161 Updated last month
- The code and data of DPA-RAG, accepted by WWW 2025 main conference. ☆60 Updated 6 months ago
- [NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving* ☆112 Updated 8 months ago
- [NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI ☆102 Updated 5 months ago
- ☆34 Updated last month
- WideSearch: Benchmarking Agentic Broad Info-Seeking ☆50 Updated this week
- ☆85 Updated 3 weeks ago
- ☆36 Updated 3 months ago
- RL Scaling and Test-Time Scaling (ICML'25) ☆110 Updated 6 months ago
- A Comprehensive Benchmark for Software Development. ☆112 Updated last year
- Reproducing R1 for Code with Reliable Rewards ☆246 Updated 3 months ago
- [ICLR 2025] SuperCorrect: Advancing Small LLM Reasoning with Thought Template Distillation and Self-Correction ☆77 Updated 4 months ago
- [ICLR 2025] LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization ☆40 Updated 5 months ago
- The official repo for "AceCoder: Acing Coder RL via Automated Test-Case Synthesis" [ACL25] ☆86 Updated 4 months ago
- A version of verl to support tool use ☆328 Updated this week
- Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering ☆61 Updated 8 months ago