eric-ai-lab / Screen-Point-and-Read
Code repo for "Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding"
☆24Updated 7 months ago
Alternatives and similar repositories for Screen-Point-and-Read:
Users that are interested in Screen-Point-and-Read are comparing it to the libraries listed below
- Official implementation of the paper "MMInA: Benchmarking Multihop Multimodal Internet Agents"☆41Updated last month
- Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?"☆50Updated 5 months ago
- The official repo for "VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search"☆20Updated this week
- Official implementation and dataset for the NAACL 2024 paper "ComCLIP: Training-Free Compositional Image and Text Matching"☆35Updated 7 months ago
- ☆29Updated last year
- Official implementation of Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning☆14Updated 4 months ago
- Official implementation of "Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models"☆36Updated last year
- This is the implementation of CounterCurate, the data curation pipeline of both physical and semantic counterfactual image-caption pairs.☆18Updated 9 months ago
- [NeurIPS 2024 D&B] VideoGUI: A Benchmark for GUI Automation from Instructional Videos☆33Updated 2 weeks ago
- This repository contains the code and data for the paper "VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception o…☆22Updated 3 months ago
- DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding☆33Updated last week
- [ICLR 2025] Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision☆59Updated 8 months ago
- This repo contains code for the paper "Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM"☆13Updated last month
- Codes for ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding☆19Updated 2 months ago
- Code for Paper: Harnessing Webpage Uis For Text Rich Visual Understanding☆50Updated 3 months ago
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model☆42Updated 7 months ago
- ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration☆24Updated 2 months ago
- [EMNLP 2023] TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding☆49Updated last year
- ☆17Updated 11 months ago
- ☆13Updated 3 months ago
- The codebase for our EMNLP24 paper: Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Mo…☆73Updated 2 months ago
- VisRL: Intention-Driven Visual Perception via Reinforced Reasoning☆15Updated last week
- Official Code of IdealGPT☆34Updated last year
- The released data for paper "Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models".☆32Updated last year
- Multimodal RewardBench☆31Updated last month
- Code & Dataset for Paper: "Distill Visual Chart Reasoning Ability from LLMs to MLLMs"☆51Updated 5 months ago
- Improving Language Understanding from Screenshots. Paper: https://arxiv.org/abs/2402.14073☆28Updated 8 months ago
- [EMNLP 2024 Findings] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs☆24Updated 5 months ago
- [NAACL 2025] Source code for MMEvalPro, a more trustworthy and efficient benchmark for evaluating LMMs☆23Updated 6 months ago
- [NeurIPS 2024] A task generation and model evaluation system for multimodal language models.☆70Updated 4 months ago