open-compass / GTA
[NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents
☆88Updated last month
Alternatives and similar repositories for GTA:
Users who are interested in GTA are comparing it to the libraries listed below:
- [ICLR 2025] Benchmarking Agentic Workflow Generation☆85Updated 2 months ago
- [ACL 2024] Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios☆56Updated last year
- ☆144Updated last month
- ☆111Updated this week
- Official codebase for "GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning".☆71Updated last week
- Repo of paper "Free Process Rewards without Process Labels"☆145Updated last month
- ☆115Updated last week
- Code and data for OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis☆130Updated last month
- ☆29Updated 7 months ago
- Watch Every Step! LLM Agent Learning via Iterative Step-level Process Refinement (EMNLP 2024 Main Conference)☆57Updated 6 months ago
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling☆101Updated 3 months ago
- xVerify: Efficient Answer Verifier for Reasoning Model Evaluations☆90Updated 2 weeks ago
- Code for Paper: Teaching Language Models to Critique via Reinforcement Learning☆94Updated 3 weeks ago
- Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning☆175Updated last month
- ☆138Updated this week
- "Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents"☆69Updated 3 weeks ago
- ☆55Updated 6 months ago
- ☆102Updated 4 months ago
- Benchmark and research code for the paper "SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks"☆186Updated 3 weeks ago
- ☆31Updated 5 months ago
- [ICLR 2025] SuperCorrect: Advancing Small LLM Reasoning with Thought Template Distillation and Self-Correction☆69Updated last month
- MPO: Boosting LLM Agents with Meta Plan Optimization☆50Updated 2 months ago
- The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"☆53Updated last year
- ☆192Updated 2 months ago
- OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning☆135Updated 4 months ago
- Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents (ACL 2024 Main Conference)☆138Updated 6 months ago
- Implementation for the research paper "Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision".☆52Updated 5 months ago
- ☆62Updated last month
- ☆63Updated 5 months ago
- Official code of *Virgo: A Preliminary Exploration on Reproducing o1-like MLLM*☆100Updated 2 months ago