google-research / pix2struct
☆664 · Updated 4 months ago
Alternatives and similar repositories for pix2struct
Users who are interested in pix2struct are comparing it to the libraries listed below.
- ☆250 · Updated 2 years ago
- Official repo for MM-REACT ☆959 · Updated last year
- Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding" ☆268 · Updated last year
- GIT: A Generative Image-to-text Transformer for Vision and Language ☆575 · Updated last year
- On the Hidden Mystery of OCR in Large Multimodal Models (OCRBench) ☆737 · Updated 3 months ago
- [Open-Source Project] Combining MMOCR with Segment Anything & Stable Diffusion. Automatically detect, recognize and segment text instance… ☆574 · Updated last year
- ☆714 · Updated last year
- Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task… ☆287 · Updated 2 years ago
- ☆123 · Updated last year
- Code for fine-tuning Platypus fam LLMs using LoRA ☆629 · Updated last year
- LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills ☆760 · Updated last year
- 🧀 Code and models for the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs". ☆482 · Updated 2 years ago
- MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text. ☆942 · Updated 7 months ago
- This is the official repository for the LENS (Large Language Models Enhanced to See) system. ☆354 · Updated 3 months ago
- DataComp: In search of the next generation of multimodal datasets ☆745 · Updated 6 months ago
- ☆224 · Updated 6 months ago
- Salesforce open-source LLMs with 8k sequence length. ☆722 · Updated 9 months ago
- ScreenQA dataset was introduced in the "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots" paper. It contains ~86K … ☆129 · Updated 8 months ago
- Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understan… ☆357 · Updated 3 years ago
- Implementation of 🦩 Flamingo, state-of-the-art few-shot visual question answering attention net out of Deepmind, in Pytorch ☆1,266 · Updated 3 years ago
- [NeurIPS'23 Spotlight] "Mind2Web: Towards a Generalist Agent for the Web" -- the first LLM-based web agent and benchmark for generalist w… ☆890 · Updated 6 months ago
- [NeurIPS 2023] Official implementations of "Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models" ☆522 · Updated last year
- An open source implementation of "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning", an all-new multi modal … ☆361 · Updated last year
- PyTorch Implementation of "V* : Guided Visual Search as a Core Mechanism in Multimodal LLMs" ☆681 · Updated last year
- My implementation of Kosmos2.5 from the paper: "KOSMOS-2.5: A Multimodal Literate Model" ☆72 · Updated last week
- Data and code for NeurIPS 2022 Paper "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering". ☆698 · Updated last year
- ☆372 · Updated 2 years ago
- The HierText dataset contains ~12k images from the Open Images dataset v6 with large amount of text entities. We provide word, line and p… ☆299 · Updated 11 months ago
- 🐟 Code and models for the NeurIPS 2023 paper "Generating Images with Multimodal Language Models". ☆467 · Updated last year
- ☆141 · Updated last year