buptlihang / CVLM
☆21Updated 10 months ago
Related projects
Alternatives and complementary repositories for CVLM
- ☆131Updated 11 months ago
- [NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect…☆32Updated 5 months ago
- Lion: Kindling Vision Intelligence within Large Language Models☆52Updated 10 months ago
- ☆19Updated 11 months ago
- Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning☆66Updated 5 months ago
- ☆84Updated 4 months ago
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM☆37Updated 6 months ago
- DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception☆120Updated last month
- The proposed simulated dataset consisting of 9,536 charts and associated data annotations in CSV format.☆21Updated 9 months ago
- This is the official repo for the incoming work: ByteVideoLLM☆15Updated 3 weeks ago
- PyTorch implementation of "UNIT: Unifying Image and Text Recognition in One Vision Encoder", NeurIPS 2024.☆20Updated last month
- Official repository of MMDU dataset☆75Updated last month
- Making LLaVA Tiny via MoE-Knowledge Distillation☆63Updated last month
- ☆85Updated last year
- [NeurIPS 2024] Classification Done Right for Vision-Language Pre-Training☆140Updated 2 weeks ago
- SVIT: Scaling up Visual Instruction Tuning☆163Updated 5 months ago
- [CVPR 2024] CapsFusion: Rethinking Image-Text Data at Scale☆194Updated 8 months ago
- IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model☆25Updated last month
- ☆102Updated 5 months ago
- Repository of paper: Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models☆36Updated last year
- This repo contains the code for our paper Towards Open-Ended Visual Recognition with Large Language Model☆90Updated 4 months ago
- Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models☆53Updated 3 weeks ago
- ☆109Updated 5 months ago
- A collection of visual instruction tuning datasets.☆75Updated 8 months ago
- ☆105Updated 3 months ago
- Towards Video Text Visual Question Answering: Benchmark and Baseline☆37Updated 8 months ago
- A huge dataset for Document Visual Question Answering☆14Updated 3 months ago
- Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs☆77Updated 5 months ago
- The official repo for “TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding”.☆35Updated 2 months ago