harrytea / TGDoc
arXiv 23 "Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs"
☆13 Updated 9 months ago
Related projects
Alternatives and complementary repositories for TGDoc
- The official repo for “TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding”. ☆32 Updated last month
- PyTorch implementation of HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models ☆28 Updated 7 months ago
- Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models ☆51 Updated last week
- Code for AAAI 2023 Paper: “Alignment-Enriched Tuning for Patch-Level Pre-trained Document Image Models” ☆17 Updated last year
- A huge dataset for Document Visual Question Answering ☆13 Updated 3 months ago
- ☆72 Updated 8 months ago
- [PR 2024] A large Cross-Modal Video Retrieval Dataset with Reading Comprehension ☆22 Updated 10 months ago
- [EMNLP 2024] Official code for "Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models" ☆14 Updated 3 weeks ago
- [ECCV 2024] Parrot Captions Teach CLIP to Spot Text ☆60 Updated 2 months ago
- The codebase for our EMNLP24 paper: Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Mo… ☆52 Updated last month
- The largest VQA dataset for Vietnamese, focused on text content in images. ☆16 Updated 6 months ago
- MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering. A comprehensive evaluation of multimodal large model multilingua… ☆45 Updated last month
- Datasets and Evaluation Scripts for CompHRDoc ☆21 Updated 7 months ago
- [NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect… ☆32 Updated 4 months ago
- Official Repository of MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations ☆55 Updated 3 months ago
- ☆29 Updated 3 weeks ago
- Official release of RFUND introduced in the MM'2024 paper "PEneo: Unifying Line Extraction, Line Grouping, and Entity Linking for End-to-… ☆17 Updated 2 months ago
- ☆45 Updated last year
- Code for our Paper "All in an Aggregated Image for In-Image Learning" ☆29 Updated 7 months ago
- An open-source implementation for fine-tuning Molmo-7B-D and Molmo-7B-O by allenai. ☆25 Updated 3 weeks ago
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs ☆62 Updated 2 weeks ago
- ☆64 Updated 2 months ago
- [NAACL 2024] Visually Guided Generative Text-Layout Pre-training for Document Intelligence ☆48 Updated 2 months ago
- [NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context ☆130 Updated last month
- OCR-VQGAN, a discrete image encoder (tokenizer and detokenizer) for figure images in Paper2Fig100k dataset. Implementation of OCR Percept… ☆73 Updated last year
- MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models ☆50 Updated last month
- MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment ☆31 Updated 4 months ago
- Contrast-guided Feature Adjustment Module for Visual Information Extraction ☆28 Updated last year
- [ACL 2023] PuMer: Pruning and Merging Tokens for Efficient Vision Language Models ☆28 Updated last month
- Official implementation of ECCV24 paper: POA ☆24 Updated 3 months ago