showlab / WorldGUI
Enable AI to control your PC. This repo includes the WorldGUI Benchmark and GUI-Thinker Agent Framework.
☆85 Updated 2 weeks ago
Alternatives and similar repositories for WorldGUI
Users who are interested in WorldGUI are comparing it to the libraries listed below.
- [NeurIPS 2024 D&B] VideoGUI: A Benchmark for GUI Automation from Instructional Videos ☆40 Updated last month
- GUI Odyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. GUI Odyssey consists of 7,735 episodes fr… ☆119 Updated 8 months ago
- ZeroGUI: Automating Online GUI Learning at Zero Human Cost ☆75 Updated last week
- Codes for Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models ☆244 Updated 8 months ago
- [CVPR2025 Highlight] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models ☆212 Updated last week
- Long Context Transfer from Language to Vision ☆384 Updated 3 months ago
- ☆50 Updated 3 weeks ago
- [ICLR'25 Oral] UGround: Universal GUI Visual Grounding for GUI Agents ☆262 Updated last month
- Code for the paper "AutoPresent: Designing Structured Visuals From Scratch" (CVPR 2025) ☆112 Updated last month
- This is the official code of VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding (ECCV 2024) ☆221 Updated 7 months ago
- [ACL 2025 🔥] Rethinking Step-by-step Visual Reasoning in LLMs ☆304 Updated last month
- [ICCV 2025] The official repository for "2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining" ☆164 Updated 4 months ago
- [ACL 2025] Code and data for OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis ☆147 Updated this week
- GUICourse: From General Vision Language Models to Versatile GUI Agents ☆119 Updated last year
- Official code for Paper "Mantis: Multi-Image Instruction Tuning" [TMLR 2024] ☆220 Updated 3 months ago
- This is the official implementation of ICCV 2025 "Flash-VStream: Efficient Real-Time Understanding for Long Video Streams" ☆207 Updated 2 weeks ago
- Pixel-Level Reasoning Model trained with RL ☆167 Updated 2 weeks ago
- VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning ☆162 Updated last month
- Code for "UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning" ☆120 Updated last month
- Official implementation of GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents ☆145 Updated 2 months ago
- ☆136 Updated 9 months ago
- Release of code, datasets and model for our work TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials ☆40 Updated this week
- Towards Large Multimodal Models as Visual Foundation Agents ☆221 Updated 2 months ago
- OpenThinkIMG is an end-to-end open-source framework that empowers LVLMs to think with images. ☆261 Updated last month
- MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning ☆126 Updated last year
- This is the official implementation of our paper "Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension" ☆208 Updated this week
- Official implementation for "Android in the Zoo: Chain-of-Action-Thought for GUI Agents" (Findings of EMNLP 2024) ☆91 Updated 9 months ago
- 💡 VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning ☆230 Updated 3 weeks ago
- [ICCV 2025] Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges ☆70 Updated 4 months ago
- Repository for the paper "InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners" ☆53 Updated last month