nttmdlab-nlp / VDocRAG
[CVPR2025] VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
☆22 Updated last week
Alternatives and similar repositories for VDocRAG
Users who are interested in VDocRAG are comparing it to the repositories listed below.
- "Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs" 2023 ☆14 Updated 6 months ago
- Multimodal Open-O1 (MO1) is designed to enhance the accuracy of inference models by utilizing a novel prompt-based approach. This tool wo… ☆29 Updated 8 months ago
- This is the official repo for ByteVideoLLM/Dynamic-VLM ☆20 Updated 5 months ago
- Official PyTorch Implementation of MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced … ☆78 Updated 6 months ago
- ☆60 Updated 2 weeks ago
- Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models ☆60 Updated 7 months ago
- The official repo for "VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search" ☆25 Updated last month
- A Survey of Multimodal Retrieval-Augmented Generation ☆18 Updated last month
- [EMNLP 2024] Official code for "Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models" ☆18 Updated 7 months ago
- ☆102 Updated last month
- ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration ☆37 Updated 5 months ago
- Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types ☆18 Updated last month
- The official repo for “TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding”. ☆40 Updated 8 months ago
- Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents, CVPR 2025 ☆18 Updated 4 months ago
- [CVPR 2025] PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models ☆40 Updated 3 months ago
- [NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect… ☆35 Updated 11 months ago
- The proposed simulated dataset consisting of 9,536 charts and associated data annotations in CSV format. ☆25 Updated last year
- ECCV2024_Parrot Captions Teach CLIP to Spot Text ☆66 Updated 8 months ago
- ☆12 Updated this week
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM ☆45 Updated last year
- MLLM @ Game ☆14 Updated 3 weeks ago
- [PR 2024] A large Cross-Modal Video Retrieval Dataset with Reading Comprehension ☆26 Updated last year
- [ArXiv] V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding ☆47 Updated 5 months ago
- ☆18 Updated last month
- Repository for the NeurIPS 2024 paper "SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up… ☆24 Updated 5 months ago
- GIFT: Generative Interpretable Fine-Tuning ☆20 Updated 7 months ago
- The official implementation of the paper "MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding". … ☆53 Updated 7 months ago
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024 ☆59 Updated 3 months ago
- Fast-Slow Thinking for Large Vision-Language Model Reasoning ☆14 Updated last month
- The code for "VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by VIdeo SpatioTemporal Augmentation" [CVPR2025] ☆15 Updated 3 months ago