princeton-nlp / WebShop
[NeurIPS 2022] WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
☆446 · Updated last year
Alternatives and similar repositories for WebShop
Users who are interested in WebShop are comparing it to the repositories listed below.
- An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral] ☆375 · Updated last year
- VisualWebArena is a benchmark for multimodal agents. ☆413 · Updated last year
- SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks ☆323 · Updated last year
- ICML 2024: Improving Factuality and Reasoning in Language Models through Multiagent Debate ☆497 · Updated 8 months ago
- An extensible benchmark for evaluating large language models on planning ☆435 · Updated 3 months ago
- Code for the paper 🌳 Tree Search for Language Model Agents ☆216 · Updated last year
- FireAct: Toward Language Agent Fine-tuning ☆287 · Updated 2 years ago
- ☆185 · Updated 10 months ago
- AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and Interactive Coding Agent, ACL'24 Best Resource… ☆346 · Updated last month
- [NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898 ☆232 · Updated last year
- ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings - NeurIPS 2023 (oral) ☆266 · Updated last year
- LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively. ☆763 · Updated last year
- Paper collection on building and evaluating language model agents via executable language grounding ☆363 · Updated last year
- Codes for our paper "ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate" ☆316 · Updated last year
- Official implementation for "You Only Look at Screens: Multimodal Chain-of-Action Agents" (Findings of ACL 2024) ☆256 · Updated last year
- Data and Code for Program of Thoughts [TMLR 2023] ☆300 · Updated last year
- (ICML 2024) Alphazero-like Tree-Search can guide large language model decoding and training ☆283 · Updated last year
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them ☆536 · Updated last year
- Data and code for FreshLLMs (https://arxiv.org/abs/2310.03214) ☆380 · Updated last month
- Code and data for "Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs" ☆473 · Updated last year
- [TMLR] Cumulative Reasoning With Large Language Models (https://arxiv.org/abs/2308.04371) ☆307 · Updated 4 months ago
- RewardBench: the first evaluation tool for reward models. ☆670 · Updated 6 months ago
- ToolQA, a new dataset to evaluate the capabilities of LLMs in answering challenging questions with external tools. It offers two levels … ☆283 · Updated 2 years ago
- Code for STaR: Bootstrapping Reasoning With Reasoning (NeurIPS 2022) ☆217 · Updated 2 years ago
- WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks? ☆226 · Updated last week
- [ICML 2024] Official repository for "Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models" ☆810 · Updated last year
- Official Repo for ICLR 2024 paper MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang*, Ziha… ☆133 · Updated last year
- This is the repo for the paper Shepherd -- A Critic for Language Model Generation ☆220 · Updated 2 years ago
- A large-scale, fine-grained, diverse preference dataset (and models). ☆357 · Updated last year
- AWM: Agent Workflow Memory ☆372 · Updated 10 months ago