yuyq96 / TextHawk
Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models
☆51 Updated 2 weeks ago
Related projects
Alternatives and complementary repositories for TextHawk
- [NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of… ☆102 Updated 3 weeks ago
- ☆73 Updated 8 months ago
- ✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models ☆137 Updated last week
- [NeurIPS'24] Official PyTorch Implementation of Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment ☆50 Updated last month
- Vary-tiny codebase built on LAVIS (for training from scratch) and a PDF image-text pair dataset (about 600k pairs, including English and Chinese) ☆68 Updated 2 months ago
- [NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context ☆132 Updated last month
- Making LLaVA Tiny via MoE-Knowledge Distillation ☆60 Updated 3 weeks ago
- This repo contains the code and data for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" ☆69 Updated last week
- The official repo for “TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding”. ☆33 Updated last month
- ☆131 Updated 10 months ago
- The official code for NeurIPS 2024 paper: Harmonizing Visual Text Comprehension and Generation ☆73 Updated this week
- ☆105 Updated 3 months ago
- Official repository of MMDU dataset ☆75 Updated last month
- ☆58 Updated 9 months ago
- LVBench: An Extreme Long Video Understanding Benchmark ☆61 Updated 2 months ago
- ECCV2024_Parrot Captions Teach CLIP to Spot Text ☆60 Updated 2 months ago
- ☆67 Updated this week
- A bug-free and improved implementation of LLaVA-UHD, based on the code from the official repo ☆31 Updated 3 months ago
- MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering. A comprehensive evaluation of multimodal large model multilingua… ☆45 Updated last month
- 【NeurIPS 2024】Dense Connector for MLLMs ☆140 Updated last month
- MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria ☆55 Updated last month
- A Framework for Decoupling and Assessing the Capabilities of VLMs ☆38 Updated 4 months ago
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want ☆61 Updated 3 weeks ago
- The official code for “DeepEraser: Deep Iterative Context Mining for Generic Text Eraser”, TMM, 2024. ☆28 Updated 2 months ago
- [MM2024, oral] "Self-Supervised Visual Preference Alignment" https://arxiv.org/abs/2404.10501 ☆41 Updated 3 months ago
- Implementation of PALI3 from the paper "PALI-3 VISION LANGUAGE MODELS: SMALLER, FASTER, STRONGER" ☆142 Updated last week
- [NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models ☆231 Updated last month
- The proposed simulated dataset consisting of 9,536 charts and associated data annotations in CSV format. ☆21 Updated 8 months ago
- Video dataset dedicated to portrait-mode video recognition. ☆36 Updated 7 months ago
- Official code for Paper "Mantis: Multi-Image Instruction Tuning" (TMLR2024) ☆184 Updated this week