google-research / pix2struct
☆631 Updated 4 months ago
Alternatives and similar repositories for pix2struct:
Users who are interested in pix2struct are comparing it to the libraries listed below.
- ☆109 Updated last year
- ☆242 Updated 2 years ago
- LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills ☆722 Updated last year
- PyTorch Implementation of "V* : Guided Visual Search as a Core Mechanism in Multimodal LLMs" ☆565 Updated last year
- ☆178 Updated 7 months ago
- Salesforce open-source LLMs with 8k sequence length. ☆717 Updated 2 weeks ago
- This is the official repository for the LENS (Large Language Models Enhanced to See) system. ☆352 Updated last year
- Inference code for Persimmon-8B ☆416 Updated last year
- Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task… ☆267 Updated 2 years ago
- On the Hidden Mystery of OCR in Large Multimodal Models (OCRBench) ☆538 Updated this week
- DataComp: In search of the next generation of multimodal datasets ☆678 Updated last year
- [NeurIPS'23 Spotlight] "Mind2Web: Towards a Generalist Agent for the Web" ☆776 Updated 6 months ago
- 🐟 Code and models for the NeurIPS 2023 paper "Generating Images with Multimodal Language Models". ☆447 Updated last year
- Official repo for MM-REACT ☆941 Updated last year
- Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding" ☆262 Updated 8 months ago
- Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M d… ☆196 Updated 5 months ago
- ☆707 Updated 11 months ago
- 🧀 Code and models for the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs". ☆478 Updated last year
- MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text. ☆917 Updated 8 months ago
- Code for fine-tuning Platypus fam LLMs using LoRA ☆626 Updated last year
- GIT: A Generative Image-to-text Transformer for Vision and Language ☆556 Updated last year
- VisualWebArena is a benchmark for multimodal agents. ☆295 Updated 3 months ago
- ☆349 Updated last year
- GPT4Tools is an intelligent system that can automatically decide, control, and utilize different visual foundation models, allowing the u… ☆766 Updated last year
- LLaVA-Interactive-Demo ☆362 Updated 6 months ago
- Data and code for NeurIPS 2022 Paper "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering". ☆629 Updated 5 months ago
- CodeGen2 models for program synthesis ☆274 Updated last year
- [ICLR 2024] Lemur: Open Foundation Models for Language Agents ☆540 Updated last year
- Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understan… ☆346 Updated 2 years ago
- Doc2Graph transforms documents into graphs and exploits a GNN to solve several tasks. ☆117 Updated last year