showlab / VideoGUI

[NeurIPS 2024 D&B] VideoGUI: A Benchmark for GUI Automation from Instructional Videos

☆48

Alternatives and similar repositories for VideoGUI

Users who are interested in VideoGUI are comparing it to the libraries listed below.

yihedeng9 / OpenVLThinker
OpenVLThinker: An Early Exploration to Vision-Language Reasoning via Iterative Self-Improvement
☆129 Updated 6 months ago
orrzohar / Video-STaR
[ICLR 2025] Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
☆72 Updated last year
EvolvingLMMs-Lab / VideoMMMU
Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
☆64 Updated 5 months ago
JieyuZ2 / TaskMeAnything
[NeurIPS 2024] A task generation and model evaluation system for multimodal language models.
☆73 Updated last year
TencentARC / SEED-Bench-R1
☆97 Updated 7 months ago
TIGER-AI-Lab / MEGA-Bench
This repo contains the code for "MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks" [ICLR 2025]
☆77 Updated 7 months ago
shulin16 / MMInA
[ACL2025 Findings] Benchmarking Multihop Multimodal Internet Agents
☆48 Updated 11 months ago
showlab / WorldGUI
Enable AI to control your PC. This repo includes the WorldGUI Benchmark and GUI-Thinker Agent Framework.
☆109 Updated 6 months ago
kkahatapitiya / LangRepo
Code for our ACL 2025 paper "Language Repository for Long Video Understanding"
☆34 Updated last year
VisualWebBench / VisualWebBench
Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?"
☆63 Updated last year
zeyofu / BLINK_Benchmark
This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.or…
☆159 Updated 4 months ago
TongUI-agent / TongUI-agent
[AAAI 2026] Release of code, datasets and model for our work TongUI: Internet-Scale Trajectories from Multimodal Web Tutorials for General…
☆67 Updated 2 months ago
OpenGVLab / TPO
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
☆64 Updated 6 months ago
RifleZhang / LLaVA-Reasoner-DPO
☆110 Updated last year
Yan98 / GTA1
☆122 Updated 4 months ago
xjtupanda / Sparrow
Repo for paper "T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs"
☆48 Updated 5 months ago
SalesforceAIResearch / LATTE
☆68 Updated 4 months ago
CeeZh / LLoVi
Official implementation for "A Simple LLM Framework for Long-Range Video Question-Answering"
☆106 Updated last year
facebookresearch / multimodal_rewardbench
Multimodal RewardBench
☆60 Updated 11 months ago
jihaonew / MM-Instruct
MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment
☆35 Updated last year
zzxslp / SoM-LLaVA
[COLM-2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
☆145 Updated last year
OpenGVLab / ZeroGUI
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
☆107 Updated 6 months ago
Ahnsun / merlin
[ECCV2024] Official code implementation of Merlin: Empowering Multimodal LLMs with Foresight Minds
☆96 Updated last year
yfzhang114 / SliME
✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
☆164 Updated last year
NOVAglow646 / Monet
Official codes of "Monet: Reasoning in Latent Visual Space Beyond Image and Language"
☆125 Updated this week
agents-x-project / PyVision
[MTI-LLM@NeurIPS 2025] Official implementation of "PyVision: Agentic Vision with Dynamic Tooling."
☆147 Updated 6 months ago
TIGER-AI-Lab / Mantis
Official code for Paper "Mantis: Multi-Image Instruction Tuning" [TMLR 2024 Best Paper]
☆238 Updated last month
imagegridworth / IG-VLM
☆138 Updated last year
Yushi-Hu / VisualSketchpad
Codes for Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
☆277 Updated 6 months ago
chenllliang / G1
G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning
☆96 Updated 8 months ago