showlab / GUI-Thinker
Enable AI to control your PC. This repo includes the WorldGUI Benchmark and GUI-Thinker Agent Framework.
☆61Updated last week
Alternatives and similar repositories for GUI-Thinker:
Users that are interested in GUI-Thinker are comparing it to the libraries listed below
- [CVPR2025 Highlight] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models☆182Updated 2 weeks ago
- Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges☆66Updated last month
- VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning☆105Updated last week
- [NeurIPS 2024 D&B] VideoGUI: A Benchmark for GUI Automation from Instructional Videos☆33Updated 2 weeks ago
- [NeurIPS 2024] Official Implementation for Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks☆71Updated last week
- ☆40Updated 2 weeks ago
- GUI Odyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. GUI Odyssey consists of 7,735 episodes fr…☆106Updated 5 months ago
- Code for paper "VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos"☆105Updated last month
- Official repo for StableLLAVA☆95Updated last year
- ☆72Updated 2 weeks ago
- ☆84Updated 2 weeks ago
- Codes for Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models☆199Updated 5 months ago
- ☆35Updated 2 weeks ago
- Official code for Paper "Mantis: Multi-Image Instruction Tuning" [TMLR2024]☆214Updated 3 weeks ago
- Official code of *Virgo: A Preliminary Exploration on Reproducing o1-like MLLM*☆99Updated last month
- A Self-Training Framework for Vision-Language Reasoning☆76Updated 2 months ago
- The official repository for "2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining"☆152Updated last month
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, …☆110Updated 2 weeks ago
- ☆65Updated last month
- [Neurips 24' D&B] Official Dataloader and Evaluation Scripts for LongVideoBench.☆94Updated 8 months ago
- ✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models☆155Updated 3 months ago
- [CVPR 2025] OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?☆54Updated 2 weeks ago
- [NeurIPS 2024] This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models"☆175Updated 6 months ago
- [TMLR] Public code repo for paper "A Single Transformer for Scalable Vision-Language Modeling"☆131Updated 5 months ago
- Long Context Transfer from Language to Vision☆372Updated last month
- ☆73Updated 3 months ago
- A Large-scale Dataset for training and evaluating model's ability on Dense Text Image Generation☆64Updated 2 months ago
- [CVPR 2025 Oral] VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection☆72Updated last week
- The codebase for our EMNLP24 paper: Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Mo…☆75Updated 2 months ago
- Repo for paper "T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs"☆49Updated last month