harrytea / TGDoc
arXiv 23 "Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs"
☆13 · Updated 9 months ago
Related projects
Alternatives and complementary repositories for TGDoc
- Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models ☆53 · Updated 3 weeks ago
- The official repo for “TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding”. ☆35 · Updated 2 months ago
- Code for AAAI 2023 Paper: “Alignment-Enriched Tuning for Patch-Level Pre-trained Document Image Models” ☆17 · Updated last year
- [PR 2024] A large Cross-Modal Video Retrieval Dataset with Reading Comprehension ☆22 · Updated 10 months ago
- PyTorch implementation of HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models ☆28 · Updated 8 months ago
- A huge dataset for Document Visual Question Answering ☆14 · Updated 3 months ago
- [EMNLP 2024] Official code for "Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models" ☆14 · Updated last month
- Official PyTorch Implementation of Self-emerging Token Labeling ☆30 · Updated 7 months ago
- [ECCV 2024] Parrot Captions Teach CLIP to Spot Text ☆61 · Updated 2 months ago
- Official PyTorch Implementation of MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced … ☆38 · Updated last week
- [NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect… ☆32 · Updated 5 months ago
- The codebase for our EMNLP24 paper: Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Mo… ☆55 · Updated last month
- [NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context ☆132 · Updated last month
- MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering. A comprehensive evaluation of multimodal large model multilingua… ☆45 · Updated last month
- Making LLaVA Tiny via MoE-Knowledge Distillation ☆63 · Updated last month
- Code & Dataset for Paper: "Distill Visual Chart Reasoning Ability from LLMs to MLLMs" ☆32 · Updated 3 weeks ago
- MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models ☆55 · Updated 2 months ago
- [NAACL 2024] Visually Guided Generative Text-Layout Pre-training for Document Intelligence ☆49 · Updated 2 months ago
- LAVIS - A One-stop Library for Language-Vision Intelligence ☆47 · Updated 3 months ago
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM ☆37 · Updated 6 months ago
- imagetokenizer is a Python package that helps you encode visuals and generate visual token IDs from a codebook; it supports both image and video… ☆29 · Updated 5 months ago
- Datasets and Evaluation Scripts for CompHRDoc ☆28 · Updated 7 months ago
- [ACL 2023] PuMer: Pruning and Merging Tokens for Efficient Vision Language Models ☆28 · Updated last month
- (CVPR 2024) Bridging the Gap Between End-to-End and Two-Step Text Spotting. ☆50 · Updated 5 months ago