showlab / VideoGUI
[NeurIPS 2024 D&B] VideoGUI: A Benchmark for GUI Automation from Instructional Videos
☆35Updated 3 weeks ago
Alternatives and similar repositories for VideoGUI:
Users that are interested in VideoGUI are comparing it to the libraries listed below
- ☆40Updated 3 weeks ago
- Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges☆67Updated 2 months ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment☆50Updated 4 months ago
- ☆41Updated last month
- Enable AI to control your PC. This repo includes the WorldGUI Benchmark and GUI-Thinker Agent Framework.☆66Updated 3 weeks ago
- Official implementation of the paper "MMInA: Benchmarking Multihop Multimodal Internet Agents"☆42Updated 2 months ago
- ☆91Updated 3 weeks ago
- Repo for paper "T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs"☆49Updated last month
- VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning☆113Updated 2 weeks ago
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, …☆111Updated last month
- [ECCV 2024] Learning Video Context as Interleaved Multimodal Sequences☆38Updated last month
- [ICLR2025] MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models☆69Updated 7 months ago
- Official repo for StableLLAVA☆95Updated last year
- ☆80Updated last month
- Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?"☆55Updated 6 months ago
- Code for paper "VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos"☆109Updated 2 months ago
- NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation☆53Updated last week
- ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration☆31Updated 4 months ago
- [EMNLP 2023] TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding☆50Updated last year
- ☆72Updated 11 months ago
- Official implementation of Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning☆15Updated 6 months ago
- VideoHallucer, The first comprehensive benchmark for hallucination detection in large video-language models (LVLMs)☆28Updated last month
- ☆40Updated 4 months ago
- This repo contains the code for "MEGA-Bench Scaling Multimodal Evaluation to over 500 Real-World Tasks" [ICLR2025]☆65Updated 2 weeks ago
- This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.or…☆124Updated 10 months ago
- OpenVLThinker: An Early Exploration to Vision-Language Reasoning via Iterative Self-Improvement☆76Updated last month
- [ECCV2024] Official code implementation of Merlin: Empowering Multimodal LLMs with Foresight Minds☆92Updated 10 months ago
- VideoNIAH: A Flexible Synthetic Method for Benchmarking Video MLLMs☆47Updated last month
- Official repo of the ICLR 2025 paper "MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos"☆25Updated 7 months ago
- [CVPR 2025 Oral] VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection☆75Updated 3 weeks ago