Token-family / TokenFD
A Token-level Text Image Foundation Model for Document Understanding
☆89 Updated 3 weeks ago
Alternatives and similar repositories for TokenFD:
Users who are interested in TokenFD are comparing it to the libraries listed below
- Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models ☆60 Updated 5 months ago
- ☆56 Updated last year
- The official code for the NeurIPS 2024 paper: Harmonizing Visual Text Comprehension and Generation ☆119 Updated 5 months ago
- MMR1: Advancing the Frontiers of Multimodal Reasoning ☆154 Updated last month
- Official code implementation of Slow Perception: Let's Perceive Geometric Figures Step-by-step ☆125 Updated 2 months ago
- ☆173 Updated last year
- [ArXiv] PDF-Wukong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling ☆116 Updated 6 months ago
- ☆73 Updated last year
- The Next Step Forward in Multimodal LLM Alignment ☆145 Updated last month
- Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources ☆169 Updated 2 weeks ago
- Our 2nd-gen LMM ☆33 Updated 11 months ago
- ☆29 Updated 8 months ago
- Vary-tiny codebase built on LAVIS (for training from scratch) and a PDF image-text pair dataset (about 600k, including English and Chinese) ☆79 Updated 7 months ago
- Research Code for Multimodal-Cognition Team in Ant Group ☆142 Updated 9 months ago
- ☆86 Updated 4 months ago
- Official repository of MMDU dataset ☆89 Updated 6 months ago
- Official PyTorch Implementation of MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced … ☆66 Updated 5 months ago
- Video dataset dedicated to portrait-mode video recognition. ☆48 Updated 4 months ago
- Multimodal Open-O1 (MO1) is designed to enhance the accuracy of inference models by utilizing a novel prompt-based approach. This tool wo… ☆29 Updated 6 months ago
- ☆47 Updated this week
- Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines ☆122 Updated 5 months ago
- [ICLR 2025 Spotlight] OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text ☆338 Updated last month
- The official repo for “TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding”. ☆39 Updated 6 months ago
- ✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models ☆155 Updated 3 months ago
- Official code for "Fox: Focus Anywhere for Fine-grained Multi-page Document Understanding" ☆144 Updated 10 months ago
- This project aims to collect and collate various datasets for multimodal large model training, including but not limited to pre-training … ☆39 Updated 6 months ago
- A Simple Framework of Small-scale LMMs for Video Understanding ☆50 Updated last week
- Official code of *Virgo: A Preliminary Exploration on Reproducing o1-like MLLM* ☆99 Updated last month
- [NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of… ☆115 Updated 4 months ago
- Synthetic data generation pipelines for text-rich images. ☆60 Updated last month