showlab / GUI-Thinker
Enable AI to control your PC. This repo includes the WorldGUI Benchmark and GUI-Thinker Agent Framework.
☆66 · Updated last month
Alternatives and similar repositories for GUI-Thinker
Users interested in GUI-Thinker are comparing it to the libraries listed below.
- [CVPR2025 Highlight] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models ☆190 · Updated last month
- [NeurIPS 2024] Official Implementation for Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks ☆73 · Updated last month
- [NeurIPS 2024 D&B] VideoGUI: A Benchmark for GUI Automation from Instructional Videos ☆35 · Updated last month
- ☆66 · Updated last month
- ☆43 · Updated last month
- VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning ☆121 · Updated this week
- Code and data for OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis ☆131 · Updated this week
- The official repository for "2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining" ☆153 · Updated last month
- Long Context Transfer from Language to Vision ☆374 · Updated last month
- GUI Odyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. GUI Odyssey consists of 7,735 episodes fr… ☆110 · Updated 6 months ago
- ☆95 · Updated last month
- Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges ☆67 · Updated 2 months ago
- Official code of *Virgo: A Preliminary Exploration on Reproducing o1-like MLLM* ☆100 · Updated 2 months ago
- ✨✨R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning ☆109 · Updated this week
- Codes for Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models ☆207 · Updated 6 months ago
- [ICLR'25 Oral] UGround: Universal GUI Visual Grounding for GUI Agents ☆219 · Updated last week
- A Self-Training Framework for Vision-Language Reasoning ☆77 · Updated 3 months ago
- Official repo for StableLLAVA ☆95 · Updated last year
- Official code for Paper "Mantis: Multi-Image Instruction Tuning" [TMLR2024] ☆214 · Updated last month
- This repo contains the code for "MEGA-Bench Scaling Multimodal Evaluation to over 500 Real-World Tasks" [ICLR2025] ☆65 · Updated 3 weeks ago
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs ☆82 · Updated 6 months ago
- GUICourse: From General Vision Language Models to Versatile GUI Agents ☆114 · Updated 9 months ago
- OpenVLThinker: An Early Exploration to Vision-Language Reasoning via Iterative Self-Improvement ☆83 · Updated this week
- A Large-scale Dataset for training and evaluating model's ability on Dense Text Image Generation ☆68 · Updated 2 months ago
- Official repository of "GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing" ☆237 · Updated 2 weeks ago
- ☆79 · Updated last month
- [TMLR] Public code repo for paper "A Single Transformer for Scalable Vision-Language Modeling" ☆133 · Updated 6 months ago
- The Next Step Forward in Multimodal LLM Alignment ☆153 · Updated last week
- Official implementation for "Android in the Zoo: Chain-of-Action-Thought for GUI Agents" (Findings of EMNLP 2024) ☆85 · Updated 6 months ago