google-research / pix2struct
☆644 Updated this week
Alternatives and similar repositories for pix2struct
Users who are interested in pix2struct are comparing it to the repositories listed below.
- ☆246 Updated 2 years ago
- Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task… ☆276 Updated 2 years ago
- Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding" ☆267 Updated 11 months ago
- ☆116 Updated last year
- LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills ☆744 Updated last year
- ScreenQA dataset was introduced in the "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots" paper. It contains ~86K … ☆117 Updated 3 months ago
- On the Hidden Mystery of OCR in Large Multimodal Models (OCRBench) ☆623 Updated 3 months ago
- Official repo for MM-REACT ☆949 Updated last year
- Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understan… ☆349 Updated 2 years ago
- ☆135 Updated last year
- This is the official repository for the LENS (Large Language Models Enhanced to See) system. ☆350 Updated last year
- Data and code for NeurIPS 2022 Paper "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering". ☆666 Updated 8 months ago
- VisualWebArena is a benchmark for multimodal agents. ☆347 Updated 6 months ago
- 🧀 Code and models for the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs". ☆482 Updated last year
- The Screen Annotation dataset consists of pairs of mobile screenshots and their annotations. The annotations are in text format, and desc… ☆71 Updated last year
- [NeurIPS 2023] Official implementations of "Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models" ☆520 Updated last year
- GPT-4V in Wonderland: LMMs as Smartphone Agents ☆134 Updated 10 months ago
- ☆707 Updated last year
- Implementation of the ScreenAI model from the paper: "A Vision-Language Model for UI and Infographics Understanding" ☆342 Updated 2 months ago
- 🐟 Code and models for the NeurIPS 2023 paper "Generating Images with Multimodal Language Models". ☆455 Updated last year
- Official implementation for "You Only Look at Screens: Multimodal Chain-of-Action Agents" (Findings of ACL 2024) ☆238 Updated 10 months ago
- DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis ☆345 Updated 2 years ago
- Implementation of 🦩 Flamingo, state-of-the-art few-shot visual question answering attention net out of DeepMind, in PyTorch ☆1,244 Updated 2 years ago
- [NeurIPS'23 Spotlight] "Mind2Web: Towards a Generalist Agent for the Web" -- the first LLM-based web agent and benchmark for generalist w… ☆830 Updated 2 months ago
- Official Repository of ChatCaptioner ☆463 Updated 2 years ago
- Code release for "Learning Video Representations from Large Language Models" ☆522 Updated last year
- E5-V: Universal Embeddings with Multimodal Large Language Models ☆249 Updated 5 months ago
- Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M d… ☆202 Updated 9 months ago
- ☆201 Updated last month
- Implementation of PALI3 from the paper "PALI-3 VISION LANGUAGE MODELS: SMALLER, FASTER, STRONGER" ☆145 Updated 2 months ago