XiaoMi / mobilevlm
MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding
☆51 · Updated last month
Alternatives and similar repositories for mobilevlm:
Users interested in mobilevlm are comparing it to the libraries listed below.
- Official implementation for "Android in the Zoo: Chain-of-Action-Thought for GUI Agents" (Findings of EMNLP 2024) · ☆80 · Updated 5 months ago
- GUI Odyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. GUI Odyssey consists of 7,735 episodes fr… · ☆98 · Updated 4 months ago
- ✨✨ Latest Papers and Datasets on Mobile and PC GUI Agents · ☆117 · Updated 4 months ago
- ☆28 · Updated 6 months ago
- GUICourse: From General Vision Language Models to Versatile GUI Agents · ☆106 · Updated 8 months ago
- Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines · ☆118 · Updated 4 months ago
- Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models · ☆60 · Updated 5 months ago
- Multimodal Open-O1 (MO1) is designed to enhance the accuracy of inference models by utilizing a novel prompt-based approach. This tool wo… · ☆29 · Updated 6 months ago
- ☆199 · Updated this week
- A Simple Framework of Small-scale Large Multimodal Models for Video Understanding, Based on TinyLLaVA_Factory · ☆46 · Updated last week
- Official repository of the MMDU dataset · ☆86 · Updated 6 months ago
- ✨✨ [ICLR 2025] MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? · ☆101 · Updated 3 weeks ago
- Pruning the VLLMs · ☆89 · Updated 3 months ago
- ☆81 · Updated 10 months ago
- [NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of… · ☆113 · Updated 4 months ago
- Official code of *Virgo: A Preliminary Exploration on Reproducing o1-like MLLM* · ☆97 · Updated last month
- A Token-level Text Image Foundation Model for Document Understanding · ☆78 · Updated last week
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs · ☆75 · Updated 5 months ago
- ☆40 · Updated last year
- [ICLR 2025] LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation · ☆120 · Updated 2 months ago
- ☆36 · Updated 8 months ago
- The Next Step Forward in Multimodal LLM Alignment · ☆138 · Updated 3 weeks ago
- ZO2 (Zeroth-Order Offloading): Full Parameter Fine-Tuning 175B LLMs with 18GB GPU Memory · ☆71 · Updated last week
- Research Code for the Multimodal-Cognition Team in Ant Group · ☆139 · Updated 8 months ago
- A collection of visual instruction tuning datasets · ☆76 · Updated last year
- MMR1: Advancing the Frontiers of Multimodal Reasoning · ☆148 · Updated 2 weeks ago
- VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models · ☆49 · Updated 8 months ago
- The model, data, and code for the visual GUI agent SeeClick · ☆343 · Updated 4 months ago
- ☆73 · Updated last year
- Touchstone: Evaluating Vision-Language Models by Language Models · ☆82 · Updated last year