asappresearch / webagents-step
☆39 Updated 8 months ago
Alternatives and similar repositories for webagents-step
Users who are interested in webagents-step are comparing it to the repositories listed below:
- Implementation of the paper: "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?" ☆53 Updated 4 months ago
- Code for the paper 🌳 Tree Search for Language Model Agents ☆192 Updated 8 months ago
- Mixing Language Models with Self-Verification and Meta-Verification ☆103 Updated 4 months ago
- Flow of Reasoning: Training LLMs for Divergent Problem Solving with Minimal Examples ☆84 Updated 3 weeks ago
- ☆121 Updated 10 months ago
- Functional Benchmarks and the Reasoning Gap ☆84 Updated 6 months ago
- ☆76 Updated last week
- Scalable Meta-Evaluation of LLMs as Evaluators ☆42 Updated last year
- WebLINX is a benchmark for building web navigation agents with conversational capabilities ☆146 Updated 2 months ago
- Code for the arXiv preprint "The Unreasonable Effectiveness of Easy Training Data" ☆47 Updated last year
- ☆54 Updated last year
- Repository for NPHardEval, a quantified-dynamic benchmark of LLMs ☆52 Updated last year
- LangCode: Improving alignment and reasoning of large language models (LLMs) with natural language embedded programs (NLEP). ☆42 Updated last year
- Codebase accompanying the Summary of a Haystack paper. ☆77 Updated 6 months ago
- ☆81 Updated last month
- Repository for the paper Stream of Search: Learning to Search in Language ☆144 Updated 2 months ago
- Interaction-first method for generating demonstrations for web-agents on any website ☆35 Updated last month
- Code for Paper: Autonomous Evaluation and Refinement of Digital Agents [COLM 2024] ☆132 Updated 4 months ago
- Replicating O1 inference-time scaling laws ☆83 Updated 4 months ago
- Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment ☆55 Updated 7 months ago
- Advanced Reasoning Benchmark Dataset for LLMs ☆45 Updated last year
- ☆115 Updated last month
- [NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? ☆121 Updated 7 months ago
- B-STAR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners ☆75 Updated last week
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆167 Updated last month
- ☆118 Updated 8 months ago
- A repository for research on medium-sized language models. ☆76 Updated 10 months ago
- Systematic evaluation framework that automatically rates overthinking behavior in large language models. ☆86 Updated this week
- Official code for the paper "ADaPT: As-Needed Decomposition and Planning with Language Models" ☆75 Updated last year
- ☆14 Updated last month