oriyor/assistantbench

Implementation of the paper: "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?"

☆71

Alternatives and similar repositories for assistantbench

Users who are interested in assistantbench are comparing it to the repositories listed below.

IBM / ColPret
Efficient Scaling laws and collaborative pretraining.
☆23 · Updated this week
AtakanTekparmak / agento
Very minimal (and stateless) agent framework
☆44 · Jan 12, 2025 · Updated last year
modestyachts / cifar-10.2
Host CIFAR-10.2 Data Set
☆13 · Sep 22, 2021 · Updated 4 years ago
vatsalsaglani / local-qwen-swarm
A Python implementation of an agent swarm system that works with local LLM servers. The system allows you to create multiple agents that …
☆14 · Nov 20, 2024 · Updated last year
jonathan-roberts1 / SciFIBench
NeurIPS 2024: SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation
☆13 · May 24, 2025 · Updated last year
ARiSE-Lab / CYCLE_OOPSLA_24
Open-source repository for the OOPSLA'24 paper "CYCLE: Learning to Self-Refine Code Generation"
☆10 · Mar 8, 2024 · Updated 2 years ago
blender-nlp / SmartBook
☆28 · Oct 31, 2023 · Updated 2 years ago
allenai / faithful-nmn
Evaluating and improving the faithfulness of the interpretations offered by Neural Module Networks
☆13 · Jun 12, 2023 · Updated 3 years ago
davidbrandfonbrener / color-filter-olmo
☆13 · Dec 12, 2025 · Updated 7 months ago
McGill-NLP / weblinx
WebLINX is a benchmark for building web navigation agents with conversational capabilities
☆162 · Feb 11, 2025 · Updated last year
JasonForJoy / Model-Editing-Hurt
EMNLP 2024: Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue
☆37 · May 26, 2025 · Updated last year
itayle / diverse-demonstrations
Diverse Demonstrations Improve In-context Compositional Generalization
☆13 · Jul 7, 2023 · Updated 3 years ago
blender-nlp / NewsClaims
☆19 · Sep 10, 2022 · Updated 3 years ago
yuh-zha / Align
Align, a general text alignment function
☆16 · Dec 7, 2023 · Updated 2 years ago
assafbk / mocha_code
Mitigating Open-Vocabulary Caption Hallucinations (EMNLP 2024)
☆19 · Oct 18, 2024 · Updated last year
mandyyyyii / scibench
☆132 · Jul 8, 2024 · Updated 2 years ago
heaplax / ARMAP
☆29 · Jun 5, 2025 · Updated last year
showlab / videogui
[NeurIPS 2024 D&B] VideoGUI: A Benchmark for GUI Automation from Instructional Videos
☆53 · Feb 22, 2026 · Updated 5 months ago
sxzrt / CIFAR-10-W
CIFAR-10-Warehouse: Towards Broad and More Realistic Testbeds in Model Generalization Analysis
☆18 · Jul 15, 2024 · Updated 2 years ago
all-the-noises / eval-arena
☆34 · Mar 21, 2026 · Updated 4 months ago
Alignment-Lab-AI / Dataset-Conversion-Toolkit
a set of scripts to easily convert all training data from huggingface into alpaca instruct or sharegpt format, which should allow for eas…
☆20 · Mar 14, 2025 · Updated last year
xingjianleng / autoeval_baselines
This repository includes various baseline techniques for the label-free model evaluation task in the VDU2023 competition.
☆19 · Mar 8, 2023 · Updated 3 years ago
neulab / MultiUI
Code for Paper: Harnessing Webpage UIs For Text Rich Visual Understanding
☆54 · Dec 12, 2024 · Updated last year
niuzaisheng / ScreenExplorer
ScreenExplorer: Training a Vision-Language Model for Diverse Exploration in Open GUI World
☆26 · Jun 17, 2025 · Updated last year
rosmineb / unit_test_rl
Project code for training LLMs to write better unit tests + code
☆22 · May 19, 2025 · Updated last year
OSU-NLP-Group / Middleware
Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments (EMNLP'2024)
☆37 · Dec 29, 2024 · Updated last year
Job-Bench / job-bench-eval
Official eval scripts for JobBench
☆29 · Jul 18, 2026 · Updated last week
midas-research / speechmix
☆12 · Oct 2, 2020 · Updated 5 years ago
web-arena-x / visualwebarena
VisualWebArena is a benchmark for multimodal agents.
☆484 · Nov 9, 2024 · Updated last year
ShareChatAI / 3MASSIV
☆13 · May 10, 2022 · Updated 4 years ago
top-yun / SPARK
A benchmark dataset and simple code examples for measuring the perception and reasoning of multi-sensor Vision Language models.
☆19 · Dec 27, 2024 · Updated last year
AIM3-RUC / MPMQA
Official repository of the paper MPMQA: Multimodal Question Answering on Product Manuals (AAAI 2023)
☆21 · Nov 28, 2022 · Updated 3 years ago
yxuansu / Contrastive_Search_versus_Contrastive_Decoding
An Empirical Study On Contrastive Search And Contrastive Decoding For Open-ended Text Generation
☆27 · Jun 7, 2024 · Updated 2 years ago
allenai / super-benchmark
☆54 · Apr 4, 2025 · Updated last year
HKUST-KnowComp / PseudoReasoner
Official code repository for Findings of EMNLP 2022 paper: PseudoReasoner: Leveraging Pseudo Labels for Commonsense Knowledge Base Popula…
☆11 · Oct 18, 2022 · Updated 3 years ago
kohjingyu / search-agents
Code for the paper 🌳 Tree Search for Language Model Agents
☆223 · Jul 25, 2024 · Updated 2 years ago
xyflow / react-flow-slide-show
☆19 · Jul 22, 2024 · Updated 2 years ago
deeplearning-wisc / args
☆47 · Feb 8, 2024 · Updated 2 years ago
xlang-ai / EVOR
☆70 · Dec 15, 2024 · Updated last year