nttmdlab-nlp / InstructDocLinks

InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions (AAAI2024)

☆161

Alternatives and similar repositories for InstructDoc

Users that are interested in InstructDoc are comparing it to the libraries listed below

Sorting:

kyegomez / Kosmos2.5
My implementation of Kosmos2.5 from the paper: "KOSMOS-2.5: A Multimodal Literate Model"
☆73Updated 2 weeks ago
DS3Lab / WordScape
The WordScape repository contains code for the WordScape pipeline to create datasets to train document understanding models.
☆37Updated last year
Ucas-HaoranWei / Vary-tiny-600k
Vary-tiny codebase upon LAVIS （for training from scratch）and a PDF image-text pairs data (about 600k including English/Chinese)
☆84Updated 10 months ago
ucaslcl / Fox
official code for "Fox: Focus Anywhere for Fine-grained Multi-page Document Understanding"
☆152Updated last year
LukeForeverYoung / UReader
☆137Updated last year
LayTextLLM / LayTextLLM
☆96Updated 7 months ago
rubenpt91 / MP-DocVQA-Framework
☆66Updated last year
microsoft / CompHRDoc
Datasets and Evaluation Scripts for CompHRDoc
☆47Updated 5 months ago
harrytea / Awesome-Document-Understanding
Document Artifical Intelligence
☆184Updated 3 months ago
google-research-datasets / vrdu
We identify the desiderata for a comprehensive benchmark and propose Visually Rich Document Understanding (VRDU). VRDU contains two datas…
☆80Updated 2 years ago
guoxy25 / Ocean-OCR
☆37Updated 5 months ago
SCUT-DLVCLab / GPT-4V_OCR
Evaluation of the Optical Character Recognition (OCR) capabilities of GPT-4V(ision)
☆125Updated last year
wjbmattingly / qwen2-vl-finetune-huggingface
This project is a collection of fine-tuning scripts to help researchers fine-tune Qwen 2 VL on HuggingFace datasets.
☆73Updated 3 weeks ago
opendatalab / OHR-Bench
(ICCV 2025) OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
☆85Updated last month
jfma-USTC / HRDoc
Dataset and scripts for HRDoc
☆39Updated 2 years ago
XiaoduoAILab / XmodelVLM
☆69Updated last year
SALT-NLP / LLaVAR
Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding"
☆269Updated last year
HCIILAB / M6Doc
☆142Updated 2 months ago
mayubo2333 / MMLongBench-Doc
Official Repository of MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations
☆90Updated last year
naver-ai / cream
Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models, EMNLP 2023
☆46Updated last year
SpursGoZmy / Table-LLaVA
Dataset and Code for our ACL 2024 paper: "Multimodal Table Understanding". We propose the first large-scale Multimodal IFT and Pre-Train …
☆211Updated last month
kongds / E5-V
E5-V: Universal Embeddings with Multimodal Large Language Models
☆262Updated 7 months ago
j-rausch / DSG
☆32Updated last year
andreagemelli / doc2graph
Doc2Graph transforms documents into graphs and exploit a GNN to solve several tasks.
☆129Updated 2 years ago
bytedance / MTVQA
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering. A comprehensive evaluation of multimodal large model multilingua…
☆62Updated 2 months ago
FuxiaoLiu / MMC
[NAACL 2024] MMC: Advancing Multimodal Chart Understanding with LLM Instruction Tuning
☆98Updated 6 months ago
allenai / pixmo-docs
ACL 2025: Synthetic data generation pipelines for text-rich images.
☆126Updated 5 months ago
hewei2001 / ReachQA
Code & Dataset for Paper: "Distill Visual Chart Reasoning Ability from LLMs to MLLMs"
☆53Updated 9 months ago
neulab / Pangea
This is the repo for the paper "PANGEA: A FULLY OPEN MULTILINGUAL MULTIMODAL LLM FOR 39 LANGUAGES"
☆110Updated last month
yuyq96 / TextHawk
Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models
☆62Updated 9 months ago