google-research / pix2struct

☆664

Alternatives and similar repositories for pix2struct

Users who are interested in pix2struct are comparing it to the libraries listed below.

microsoft / UDOP
☆250 Updated 2 years ago
microsoft / MM-REACT
Official repo for MM-REACT
☆959 Updated last year
SALT-NLP / LLaVAR
Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding"
☆268 Updated last year
microsoft / GenerativeImage2Text
GIT: A Generative Image-to-text Transformer for Vision and Language
☆575 Updated last year
Yuliang-Liu / MultimodalOCR
On the Hidden Mystery of OCR in Large Multimodal Models (OCRBench)
☆737 Updated 3 months ago
yeungchenwa / OCR-SAM
[Open-Source Project] Combining MMOCR with Segment Anything & Stable Diffusion. Automatically detect, recognize and segment text instance…
☆574 Updated last year
SkunkworksAI / BakLLaVA
☆714 Updated last year
shabie / docformer
Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task…
☆287 Updated 2 years ago
js0nwu / webui
☆123 Updated last year
arielnlee / Platypus
Code for fine-tuning Platypus fam LLMs using LoRA
☆629 Updated last year
LLaVA-VL / LLaVA-Plus-Codebase
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
☆760 Updated last year
kohjingyu / fromage
🧀 Code and models for the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs".
☆482 Updated 2 years ago
allenai / mmc4
MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text.
☆942 Updated 7 months ago
ContextualAI / lens
This is the official repository for the LENS (Large Language Models Enhanced to See) system.
☆354 Updated 3 months ago
mlfoundations / datacomp
DataComp: In search of the next generation of multimodal datasets
☆745 Updated 6 months ago
vis-nlp / ChartQA
☆224 Updated 6 months ago
salesforce / xgen
Salesforce open-source LLMs with 8k sequence length.
☆722 Updated 9 months ago
google-research-datasets / screen_qa
ScreenQA dataset was introduced in the "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots" paper. It contains ~86K …
☆129 Updated 8 months ago
jpWang / LiLT
Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understan…
☆357 Updated 3 years ago
lucidrains / flamingo-pytorch
Implementation of 🦩 Flamingo, state-of-the-art few-shot visual question answering attention net out of Deepmind, in Pytorch
☆1,266 Updated 3 years ago
OSU-NLP-Group / Mind2Web
[NeurIPS'23 Spotlight] "Mind2Web: Towards a Generalist Agent for the Web" -- the first LLM-based web agent and benchmark for generalist w…
☆890 Updated 6 months ago
luogen1996 / LaVIN
[NeurIPS 2023] Official implementations of "Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models"
☆522 Updated last year
kyegomez / CM3Leon
An open source implementation of "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning", an all-new multi modal …
☆361 Updated last year
penghao-wu / vstar
PyTorch Implementation of "V* : Guided Visual Search as a Core Mechanism in Multimodal LLMs"
☆681 Updated last year
kyegomez / Kosmos2.5
My implementation of Kosmos2.5 from the paper: "KOSMOS-2.5: A Multimodal Literate Model"
☆72 Updated last week
lupantech / ScienceQA
Data and code for NeurIPS 2022 Paper "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering".
☆698 Updated last year
conceptofmind / toolformer
☆372 Updated 2 years ago
google-research-datasets / hiertext
The HierText dataset contains ~12k images from the Open Images dataset v6 with large amount of text entities. We provide word, line and p…
☆299 Updated 11 months ago
kohjingyu / gill
🐟 Code and models for the NeurIPS 2023 paper "Generating Images with Multimodal Language Models".
☆467 Updated last year
LukeForeverYoung / UReader
☆141 Updated last year