google-research / pix2struct
☆675 Updated 7 months ago
Alternatives and similar repositories for pix2struct
Users who are interested in pix2struct are comparing it to the libraries listed below
- ☆249 Updated 2 years ago
- Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task… ☆286 Updated 2 years ago
- Official repo for MM-REACT ☆964 Updated last year
- Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding" ☆269 Updated last year
- The HierText dataset contains ~12k images from the Open Images dataset v6 with large amount of text entities. We provide word, line and p… ☆301 Updated last year
- My implementation of Kosmos2.5 from the paper: "KOSMOS-2.5: A Multimodal Literate Model" ☆74 Updated this week
- On the Hidden Mystery of OCR in Large Multimodal Models (OCRBench) ☆780 Updated 6 months ago
- [Open-Source Project] Combining MMOCR with Segment Anything & Stable Diffusion. Automatically detect, recognize and segment text instance… ☆576 Updated last year
- GIT: A Generative Image-to-text Transformer for Vision and Language ☆578 Updated 2 years ago
- ☆127 Updated 2 years ago
- This is the official repository for the LENS (Large Language Models Enhanced to See) system. ☆356 Updated 5 months ago
- Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understan… ☆359 Updated 3 years ago
- 🧀 Code and models for the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs". ☆484 Updated 2 years ago
- ☆234 Updated 8 months ago
- Data and code for NeurIPS 2022 Paper "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering". ☆716 Updated last year
- ☆715 Updated last year
- ☆67 Updated 2 years ago
- The Screen Annotation dataset consists of pairs of mobile screenshots and their annotations. The annotations are in text format, and desc… ☆81 Updated last year
- Code for fine-tuning Platypus fam LLMs using LoRA ☆631 Updated last year
- ☆142 Updated last year
- MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text. ☆949 Updated 9 months ago
- DataComp: In search of the next generation of multimodal datasets ☆765 Updated 8 months ago
- Algorithms, papers, datasets, performance comparisons for Document AI. Continuously updating. ☆202 Updated 10 months ago
- Doc2Graph transforms documents into graphs and exploits a GNN to solve several tasks. ☆134 Updated 2 months ago
- The ScreenQA dataset was introduced in the "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots" paper. It contains ~86K … ☆135 Updated 11 months ago
- LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills ☆763 Updated last year
- Official implementation for Dessurt: Document end-to-end self-supervised understanding and recognition transformer ☆62 Updated 3 years ago
- Object Detection for Graphical User Interface: Old Fashioned or Deep Learning or a Combination? ☆128 Updated last year
- ☆389 Updated 2 years ago
- ☆32 Updated last year