google-research / pix2struct
☆640 Updated 2 months ago
Alternatives and similar repositories for pix2struct:
Users who are interested in pix2struct are comparing it to the repositories listed below.
- ☆245 Updated 2 years ago
- Official repo for MM-REACT ☆949 Updated last year
- Data and code for NeurIPS 2022 Paper "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering". ☆658 Updated 7 months ago
- Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding" ☆266 Updated 10 months ago
- ☆193 Updated 2 weeks ago
- [NeurIPS'23 Spotlight] "Mind2Web: Towards a Generalist Agent for the Web" ☆819 Updated last month
- ☆116 Updated last year
- MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text. ☆929 Updated last month
- ☆132 Updated last year
- GIT: A Generative Image-to-text Transformer for Vision and Language ☆567 Updated last year
- Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understan… ☆347 Updated 2 years ago
- On the Hidden Mystery of OCR in Large Multimodal Models (OCRBench) ☆605 Updated 2 months ago
- Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task… ☆276 Updated 2 years ago
- ☆63 Updated last year
- Code for "Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models". ☆1,129 Updated last year
- LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions ☆820 Updated 2 years ago
- Open LLaMA Eyes to See the World ☆174 Updated 2 years ago
- LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills ☆739 Updated last year
- Doc2Graph transforms documents into graphs and exploits a GNN to solve several tasks. ☆120 Updated last year
- ☆706 Updated last year
- This is the official repository for the LENS (Large Language Models Enhanced to See) system. ☆352 Updated last year
- Public repo for the NeurIPS 2023 paper "Unlimiformer: Long-Range Transformers with Unlimited Length Input" ☆1,060 Updated last year
- DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis ☆338 Updated 2 years ago
- Salesforce open-source LLMs with 8k sequence length. ☆717 Updated 3 months ago
- GPT4Tools is an intelligent system that can automatically decide, control, and utilize different visual foundation models, allowing the u… ☆772 Updated last year
- Official implementation of our NeurIPS 2023 paper "Augmenting Language Models with Long-Term Memory". ☆792 Updated last year
- Code release for "Learning Video Representations from Large Language Models" ☆518 Updated last year
- My implementation of Kosmos2.5 from the paper: "KOSMOS-2.5: A Multimodal Literate Model" ☆73 Updated last month
- Inference code for Persimmon-8B ☆415 Updated last year
- SGPT: GPT Sentence Embeddings for Semantic Search ☆867 Updated last year