google-research / pix2struct
☆600 Updated last month
Related projects
Alternatives and complementary repositories for pix2struct
- ☆242 Updated last year
- Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task… ☆255 Updated last year
- Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding" ☆258 Updated 5 months ago
- ☆166 Updated 3 months ago
- ☆101 Updated 11 months ago
- 🧀 Code and models for the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs". ☆478 Updated last year
- GIT: A Generative Image-to-text Transformer for Vision and Language ☆550 Updated 11 months ago
- ☆699 Updated 8 months ago
- Algorithms, papers, datasets, performance comparisons for Document AI. Continuously updating. ☆163 Updated this week
- This is the official repository for the LENS (Large Language Models Enhanced to See) system. ☆351 Updated 11 months ago
- The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts". ☆1,298 Updated 9 months ago
- Combining MMOCR with Segment Anything & Stable Diffusion. Automatically detect, recognize and segment text instances, with several downstr… ☆533 Updated 9 months ago
- DataComp: In search of the next generation of multimodal datasets ☆652 Updated 10 months ago
- Data and code for NeurIPS 2022 Paper "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering". ☆605 Updated last month
- MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text. ☆902 Updated 5 months ago
- 🐟 Code and models for the NeurIPS 2023 paper "Generating Images with Multimodal Language Models". ☆430 Updated 9 months ago
- ScreenQA dataset was introduced in the "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots" paper. It contains ~86K … ☆90 Updated 3 months ago
- On the Hidden Mystery of OCR in Large Multimodal Models (OCRBench) ☆470 Updated 3 weeks ago
- Official repo for MM-REACT ☆931 Updated 9 months ago
- Salesforce open-source LLMs with 8k sequence length. ☆718 Updated 10 months ago
- ☆112 Updated 8 months ago
- UniTable: Towards a Unified Table Foundation Model ☆373 Updated 5 months ago
- ☆329 Updated 10 months ago
- LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions ☆811 Updated last year
- ☆50 Updated 5 months ago
- ☆411 Updated last year
- Get hundreds of millions of image+url pairs from the crawling at home dataset and preprocess them ☆205 Updated 5 months ago
- Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M d… ☆188 Updated 2 months ago
- [ICML'24] SeeAct is a system for generalist web agents that autonomously carry out tasks on any given website, with a focus on large mult… ☆639 Updated 2 weeks ago
- Open LLaMA Eyes to See the World ☆175 Updated last year