google-research / pix2struct
☆635 · Updated 3 weeks ago
Alternatives and similar repositories for pix2struct:
Users interested in pix2struct are comparing it to the libraries listed below.
- ☆243 · Updated 2 years ago
- ☆111 · Updated last year
- ☆180 · Updated 8 months ago
- Code/data for the paper "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding" ☆265 · Updated 9 months ago
- Official repo for MM-REACT ☆944 · Updated last year
- Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer-based architecture for the task… ☆269 · Updated 2 years ago
- GIT: A Generative Image-to-text Transformer for Vision and Language ☆559 · Updated last year
- Data and code for the NeurIPS 2022 paper "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering" ☆644 · Updated 6 months ago
- A comprehensive benchmark for document parsing and evaluation ☆288 · Updated 3 weeks ago
- DataComp: In search of the next generation of multimodal datasets ☆687 · Updated last year
- 🧀 Code and models for the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs" ☆479 · Updated last year
- Official repository for the LENS (Large Language Models Enhanced to See) system ☆352 · Updated last year
- ☆708 · Updated last year
- Implementation of 🦩 Flamingo, DeepMind's state-of-the-art few-shot visual question answering attention network, in PyTorch ☆1,235 · Updated 2 years ago
- On the Hidden Mystery of OCR in Large Multimodal Models (OCRBench) ☆571 · Updated last month
- [ICML'24] SeeAct is a system for generalist web agents that autonomously carry out tasks on any given website, with a focus on large mult… ☆731 · Updated last month
- ☆1,692 · Updated 5 months ago
- ☆354 · Updated last year
- DocBank: A Benchmark Dataset for Document Layout Analysis ☆601 · Updated 7 months ago
- MultimodalC4 is a multimodal extension of C4 that interleaves millions of images with text ☆921 · Updated this week
- LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills ☆730 · Updated last year
- Code used for the creation of OBELICS, an open, massive, and curated collection of interleaved image-text web documents containing 141M d… ☆197 · Updated 6 months ago
- ☆131 · Updated last year
- The ScreenQA dataset was introduced in the paper "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots". It contains ~86K … ☆107 · Updated last month
- LLaVA-Interactive-Demo ☆366 · Updated 7 months ago
- ☆772 · Updated 8 months ago
- Algorithms, papers, datasets, and performance comparisons for Document AI. Continuously updated. ☆183 · Updated 3 weeks ago
- Set-of-Mark Prompting for GPT-4V and LMMs ☆1,324 · Updated 7 months ago
- ICLR 2024 Spotlight: curation/training code, metadata, distribution, and pre-trained models for MetaCLIP; CVPR 2024: MoDE: CLIP Data Expert… ☆1,367 · Updated last week
- Generative Representational Instruction Tuning ☆610 · Updated last week