princeton-nlp / WebShop
[NeurIPS 2022] WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
★428 · Updated last year
Alternatives and similar repositories for WebShop
Users who are interested in WebShop are comparing it to the repositories listed below.
- An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral] ★360 · Updated last year
- VisualWebArena is a benchmark for multimodal agents. ★402 · Updated last year
- SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks ★317 · Updated last year
- ICML 2024: Improving Factuality and Reasoning in Language Models through Multiagent Debate ★479 · Updated 6 months ago
- An extensible benchmark for evaluating large language models on planning ★426 · Updated last month
- FireAct: Toward Language Agent Fine-tuning ★284 · Updated 2 years ago
- ★185 · Updated 9 months ago
- AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and Interactive Coding Agent, ACL'24 Best Resource… ★307 · Updated this week
- Code for the paper Tree Search for Language Model Agents ★217 · Updated last year
- LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively. ★752 · Updated last year
- ToolBench, an evaluation suite for LLM tool manipulation capabilities. ★164 · Updated last year
- ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings - NeurIPS 2023 (oral) ★263 · Updated last year
- [NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898 ★227 · Updated last year
- A benchmark list for evaluation of large language models. ★146 · Updated 2 months ago
- Paper collection on building and evaluating language model agents via executable language grounding ★362 · Updated last year
- Multi-agent Social Simulation + Efficient, Effective, and Stable alternative to RLHF. Code for the paper "Training Socially Aligned Langu… ★351 · Updated 2 years ago
- Code for STaR: Bootstrapping Reasoning With Reasoning (NeurIPS 2022) ★218 · Updated 2 years ago
- Data and code for FreshLLMs (https://arxiv.org/abs/2310.03214) ★377 · Updated 3 weeks ago
- [TMLR] Cumulative Reasoning With Large Language Models (https://arxiv.org/abs/2308.04371) ★303 · Updated 3 months ago
- An implementation of Everything of Thoughts (XoT). ★152 · Updated last year
- [ICML'24 Spotlight] "TravelPlanner: A Benchmark for Real-World Planning with Language Agents" ★441 · Updated this week
- [ICML 2024] Official repository for "Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models" ★798 · Updated last year
- [ICLR 2024] Lemur: Open Foundation Models for Language Agents ★554 · Updated 2 years ago
- Official implementation for "You Only Look at Screens: Multimodal Chain-of-Action Agents" (Findings of ACL 2024) ★253 · Updated last year
- RewardBench: the first evaluation tool for reward models. ★649 · Updated 5 months ago
- ToolQA, a new dataset to evaluate the capabilities of LLMs in answering challenging questions with external tools. It offers two levels … ★282 · Updated 2 years ago
- ★313 · Updated last year
- Official Repo for ICLR 2024 paper MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang*, Ziha… ★132 · Updated last year
- Data and Code for Program of Thoughts [TMLR 2023] ★292 · Updated last year
- (ICML 2024) Alphazero-like Tree-Search can guide large language model decoding and training ★284 · Updated last year