eric-ai-lab / Screen-Point-and-Read
Code repo for "Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding"
☆28 · Updated 10 months ago
Alternatives and similar repositories for Screen-Point-and-Read
Users interested in Screen-Point-and-Read are comparing it to the repositories listed below
- Official implementation of the paper "MMInA: Benchmarking Multihop Multimodal Internet Agents" ☆43 · Updated 3 months ago
- Evaluation framework for the paper "VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?" ☆57 · Updated 7 months ago
- [NeurIPS 2024 D&B] VideoGUI: A Benchmark for GUI Automation from Instructional Videos ☆35 · Updated 2 months ago
- ☆29 · Updated 8 months ago
- ZeroGUI: Automating Online GUI Learning at Zero Human Cost ☆43 · Updated this week
- Official implementation and dataset for the NAACL 2024 paper "ComCLIP: Training-Free Compositional Image and Text Matching" ☆34 · Updated 9 months ago
- ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration ☆37 · Updated 5 months ago
- ☆31 · Updated last year
- (ICLR 2025) The official code repository for GUI-World ☆57 · Updated 5 months ago
- GUICourse: From General Vision Language Models to Versatile GUI Agents ☆116 · Updated 10 months ago
- Code for the paper "Harnessing Webpage UIs for Text-Rich Visual Understanding" ☆51 · Updated 5 months ago
- Official implementation of "Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models"☆36Updated last year
- This is the implementation of CounterCurate, the data curation pipeline of both physical and semantic counterfactual image-caption pairs.☆18Updated 11 months ago
- The official repo for "VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search"☆25Updated last month
- [EMNLP 2023] TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding☆51Updated last year
- This repo contains code for the paper "Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM"☆14Updated 2 months ago
- ☆102Updated last month
- Official Implementation of ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay☆71Updated last week
- [ICLR 2025] Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision☆63Updated 10 months ago
- Enable AI to control your PC. This repo includes the WorldGUI Benchmark and GUI-Thinker Agent Framework.☆70Updated last month
- Improving Language Understanding from Screenshots. Paper: https://arxiv.org/abs/2402.14073☆28Updated 10 months ago
- Codes for ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding☆32Updated last month
- [ACL 2025] A Generalizable and Purely Unsupervised Self-Training Framework☆60Updated this week
- VPEval Codebase from Visual Programming for Text-to-Image Generation and Evaluation (NeurIPS 2023)☆43Updated last year
- GUI Odyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. GUI Odyssey consists of 7,735 episodes fr…☆114Updated 6 months ago
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension.☆68Updated last year
- ☆46Updated last month
- [NeurIPS 2024] A task generation and model evaluation system for multimodal language models.☆71Updated 6 months ago
- Official Code Repository for EnvGen: Generating and Adapting Environments via LLMs for Training Embodied Agents (COLM 2024)☆31Updated 10 months ago
- [ECCV 2024] Learning Video Context as Interleaved Multimodal Sequences☆39Updated 2 months ago