mnluzimu / WebGen-Bench
☆24 Updated 3 months ago
Alternatives and similar repositories for WebGen-Bench
Users who are interested in WebGen-Bench are comparing it to the repositories listed below.
- RM-R1: Unleashing the Reasoning Potential of Reward Models ☆127 Updated 2 months ago
- The official repo for "AceCoder: Acing Coder RL via Automated Test-Case Synthesis" [ACL25] ☆87 Updated 4 months ago
- Resources for the Enigmata Project. ☆68 Updated 3 weeks ago
- ReasonFlux-Coder: Open-Source LLM Coders with Co-Evolving Reinforcement Learning ☆111 Updated last week
- This is the repo for our paper "Mr-Ben: A Comprehensive Meta-Reasoning Benchmark for Large Language Models" ☆50 Updated 10 months ago
- The official repository of the Omni-MATH benchmark. ☆87 Updated 8 months ago
- General Reasoner: Advancing LLM Reasoning Across All Domains ☆166 Updated 2 months ago
- [NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI ☆102 Updated 5 months ago
- ☆71 Updated 5 months ago
- End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning ☆190 Updated this week
- [ICML 2025] Teaching Language Models to Critique via Reinforcement Learning ☆110 Updated 3 months ago
- [ACL 2024] Code for "MoPS: Modular Story Premise Synthesis for Open-Ended Automatic Story Generation" ☆38 Updated last year
- Revisiting Mid-training in the Era of Reinforcement Learning Scaling ☆167 Updated last month
- ☆49 Updated 10 months ago
- Mind2Web-2 Benchmark: Evaluating Agentic Search with Agent-as-a-Judge ☆75 Updated last month
- [NeurIPS'24] Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models ☆62 Updated 8 months ago
- [NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents ☆121 Updated 5 months ago
- ☆49 Updated 3 months ago
- RL Scaling and Test-Time Scaling (ICML'25) ☆112 Updated 7 months ago
- IKEA: Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent ☆64 Updated 3 months ago
- [COLING 2025] ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios ☆69 Updated 3 months ago
- Official Implementation of ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay ☆117 Updated 3 months ago
- Official repository for ACL 2025 paper "Model Extrapolation Expedites Alignment" ☆75 Updated 3 months ago
- instruction-following benchmark for large reasoning models ☆40 Updated 3 weeks ago
- Code for ICLR 2024 paper "CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets" ☆58 Updated last year
- B-STAR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners ☆82 Updated 3 months ago
- The code and data for the paper JiuZhang3.0 ☆49 Updated last year
- Repo for paper "Tell Me More! Towards Implicit User Intention Understanding of Language Model Driven Agents" ☆56 Updated last year
- [NAACL 2025] Source code for MMEvalPro, a more trustworthy and efficient benchmark for evaluating LMMs ☆24 Updated 11 months ago
- ☆28 Updated 11 months ago