google-research / pix2struct
☆650Updated 3 weeks ago
Alternatives and similar repositories for pix2struct
Users who are interested in pix2struct are comparing it to the libraries listed below.
- ☆246Updated 2 years ago
- Official repo for MM-REACT☆949Updated last year
- Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task…☆278Updated 2 years ago
- Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding"☆267Updated last year
- LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills☆744Updated last year
- [arXiv 2023] Set-of-Mark Prompting for GPT-4V and LMMs☆1,411Updated 10 months ago
- Combining MMOCR with Segment Anything & Stable Diffusion. Automatically detect, recognize and segment text instances, with several downstr…☆563Updated last year
- On the Hidden Mystery of OCR in Large Multimodal Models (OCRBench)☆636Updated 4 months ago
- [NeurIPS'23 Spotlight] "Mind2Web: Towards a Generalist Agent for the Web" -- the first LLM-based web agent and benchmark for generalist w…☆839Updated 2 months ago
- 🧀 Code and models for the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs".☆482Updated last year
- This is the official repository for the LENS (Large Language Models Enhanced to See) system.☆351Updated last year
- Implementation of 🦩 Flamingo, state-of-the-art few-shot visual question answering attention net out of Deepmind, in Pytorch☆1,248Updated 2 years ago
- ☆710Updated last year
- MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text.☆932Updated 3 months ago
- [NeurIPS 2023] Official implementations of "Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models"☆520Updated last year
- GIT: A Generative Image-to-text Transformer for Vision and Language☆568Updated last year
- DataComp: In search of the next generation of multimodal datasets☆719Updated last month
- Multimodal-GPT☆1,504Updated 2 years ago
- ☆117Updated last year
- Code for fine-tuning Platypus fam LLMs using LoRA☆629Updated last year
- LLaVA-Interactive-Demo☆374Updated 11 months ago
- GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest☆533Updated 3 weeks ago
- An open source implementation of "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning", an all-new multi modal …☆361Updated last year
- ☆782Updated 11 months ago
- GPT4Tools is an intelligent system that can automatically decide, control, and utilize different visual foundation models, allowing the u…☆774Updated last year
- LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions☆819Updated 2 years ago
- Codes for "Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models".☆1,132Updated last year
- 🐟 Code and models for the NeurIPS 2023 paper "Generating Images with Multimodal Language Models".☆456Updated last year
- Data and code for NeurIPS 2022 Paper "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering".☆670Updated 9 months ago
- Inference code for Persimmon-8B☆415Updated last year