showlab / GUI-Thinker
Enable AI to control your PC. This repo includes the WorldGUI Benchmark and GUI-Thinker Agent Framework.
☆53 · Updated last week
Alternatives and similar repositories for GUI-Thinker:
Users interested in GUI-Thinker are comparing it to the repositories listed below.
- ☆62 · Updated last week
- [NeurIPS 2024 D&B] VideoGUI: A Benchmark for GUI Automation from Instructional Videos ☆33 · Updated 3 weeks ago
- ☆36 · Updated last week
- [CVPR 2025] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models ☆172 · Updated last week
- Empowering Unified MLLM with Multi-granular Visual Generation ☆119 · Updated 2 months ago
- Official repository of "GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing" ☆179 · Updated last week
- [CVPR 2025] OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? ☆40 · Updated last week
- Official repo for StableLLAVA ☆95 · Updated last year
- Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges ☆66 · Updated last month
- A Large-scale Dataset for training and evaluating model's ability on Dense Text Image Generation ☆60 · Updated last month
- [NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos ☆110 · Updated 3 months ago
- FQGAN: Factorized Visual Tokenization and Generation ☆45 · Updated 2 months ago
- Code for the paper "AutoPresent: Designing Structured Visuals From Scratch" (CVPR 2025) ☆60 · Updated last month
- PhysGame Benchmark for Physical Commonsense Evaluation in Gameplay Videos ☆40 · Updated last month
- Code release for our NeurIPS 2024 Spotlight paper "GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing" ☆113 · Updated 5 months ago
- [arXiv'25] GameFactory: Creating New Games with Generative Interactive Videos ☆275 · Updated last week
- Official repository of DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models ☆83 · Updated 6 months ago
- [CVPR 2025] Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction ☆93 · Updated last week
- [CVPR 2025] DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles ☆19 · Updated 3 weeks ago
- [ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark ☆86 · Updated 2 months ago
- [NeurIPS 2024] EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models ☆47 · Updated 5 months ago
- Evaluate Image/Video Generation like Humans - Fast, Explainable, Flexible ☆57 · Updated 3 months ago
- [CVPR 2024] Prompt Highlighter: Interactive Control for Multi-Modal LLMs ☆142 · Updated 8 months ago
- [NeurIPS 2024] The official implementation of the research paper "FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Atten…" ☆40 · Updated last month
- [TMLR] Public code repo for paper "A Single Transformer for Scalable Vision-Language Modeling" ☆131 · Updated 4 months ago
- ☆37 · Updated 3 months ago
- [NeurIPS 2024 D&B Track] Official Repo for "LVD-2M: A Long-take Video Dataset with Temporally Dense Captions" ☆53 · Updated 5 months ago
- [NeurIPS 2024] Official Implementation for Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks ☆70 · Updated 2 weeks ago
- T2VScore: Towards A Better Metric for Text-to-Video Generation ☆79 · Updated 11 months ago
- [COLM 2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs ☆140 · Updated 7 months ago