VeriWeb: Verifiable Long-Chain Web Benchmark for Agentic Information-Seeking
☆86Jan 21, 2026Updated last month
Alternatives and similar repositories for VeriWeb
Users that are interested in VeriWeb are comparing it to the libraries listed below
- ☆23Feb 4, 2026Updated 3 weeks ago
- ☆19Mar 10, 2025Updated 11 months ago
- Official repo of "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents". It can be used to evaluate a GUI agent w…☆100Sep 8, 2025Updated 5 months ago
- AutoLibra: Metric Induction for Agents from Open-Ended Human Feedback☆17Oct 15, 2025Updated 4 months ago
- Official InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows☆19Nov 4, 2025Updated 3 months ago
- ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation☆25Aug 24, 2025Updated 6 months ago
- [NAACL 2024] CoE-SQL: In-Context Learning for Multi-Turn Text-to-SQL with Chain-of-Editions☆13May 7, 2024Updated last year
- TopViewRS: Vision-Language Models as Top-View Spatial Reasoners (EMNLP 2024 Oral)☆15Jun 14, 2025Updated 8 months ago
- On Path to Multimodal Generalist: General-Level and General-Bench☆18Jul 11, 2025Updated 7 months ago
- The code for HerO: a fact-checking pipeline based on open LLMs (the runner-up in AVeriTeC)☆13Mar 18, 2025Updated 11 months ago
- [ACL'25 (Findings)] Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents☆26Feb 17, 2026Updated last week
- ☆18Oct 28, 2025Updated 4 months ago
- UQ: Assessing Language Models on Unsolved Questions☆30Aug 26, 2025Updated 6 months ago
- The official code of Multi-player Nash Preference Optimization [ICLR 2026]☆31Feb 4, 2026Updated 3 weeks ago
- ☆22Sep 9, 2025Updated 5 months ago
- Dataset Release for Explanations for CommonsenseQA, ACL 2021 Paper☆20Jul 30, 2021Updated 4 years ago
- Code, Data and Model for Paper "Learning from Peers in Reasoning Models"☆27May 13, 2025Updated 9 months ago
- Compiler-R1: Towards Agentic Compiler Auto-tuning with Reinforcement Learning☆28Jul 14, 2025Updated 7 months ago
- [NeurIPS 2025 Spotlight] Official repository for "Web-Shepherd: Advancing PRMs for Reinforcing Web Agents"☆53May 21, 2025Updated 9 months ago
- ☆23Jan 19, 2026Updated last month
- 🎮Manipulates mobile phones just like how you would. Official code for "MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficien…☆27Oct 10, 2025Updated 4 months ago
- [NeurIPS 2024] AlphaTablets: A Generic Plane Representation for 3D Planar Reconstruction from Monocular Videos☆23Dec 6, 2024Updated last year
- Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models (ICLR 2026)☆42Feb 18, 2026Updated last week
- ☆17Aug 1, 2025Updated 6 months ago
- Implementation for AutoIOT: LLM-Driven Automated Natural Language Programming for AIoT Applications☆34Apr 21, 2025Updated 10 months ago
- Official code and data repository of MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Inte…☆22Jun 3, 2024Updated last year
- Suri: Multi-constraint instruction following for long-form text generation (EMNLP’24)☆27Oct 3, 2025Updated 4 months ago
- [CVPR 2025] Official implementation of the paper "Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters The…☆31Feb 27, 2025Updated last year
- Code for the paper Robot Data Curation with Mutual Information Estimators☆29Apr 22, 2025Updated 10 months ago
- Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments☆48Jan 8, 2026Updated last month
- Code for paper "SPG Sandwiched Policy Gradient for Masked Diffusion Language Models"☆49Oct 29, 2025Updated 4 months ago
- ☆33Nov 18, 2025Updated 3 months ago
- [ICLR 2026] Geometric-Mean Policy Optimization☆100Jan 26, 2026Updated last month
- The first spoken long-text dataset derived from live streams, designed to reflect the redundancy-rich and conversational nature of real-w…☆12Jun 28, 2025Updated 8 months ago
- ☆55Aug 5, 2025Updated 6 months ago
- A Text2SQL benchmark for evaluation of Large Language Models☆41Updated this week
- The official repository of "SmartAgent: Chain-of-User-Thought for Embodied Personalized Agent in Cyber World".☆27Aug 20, 2025Updated 6 months ago
- AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents☆37Oct 7, 2025Updated 4 months ago
- ☆35Jan 12, 2026Updated last month