lxtGH / DenseWorld-1M

Code and dataset link for "DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World"

☆116

Alternatives and similar repositories for DenseWorld-1M

Users who are interested in DenseWorld-1M are comparing it to the repositories listed below

Haochen-Wang409 / ross
[ICLR'25] Reconstructive Visual Instruction Tuning
☆128 Updated 7 months ago
baaivision / DenseFusion
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
☆158 Updated 11 months ago
OpenGVLab / Mono-InternVL
[CVPR 2025] Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
☆96 Updated 4 months ago
hustvl / GroundingSuite
[ICCV 2025] GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
☆69 Updated 5 months ago
zhouyiks / CoLVA
☆40 Updated 4 months ago
callsys / ControlCap
[ECCV 2024] ControlCap: Controllable Region-level Captioning
☆80 Updated last year
NVlabs / QLIP
[arXiv: 2502.05178] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
☆94 Updated 9 months ago
AFeng-x / Draw-and-Understand
[ICLR2025] Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
☆91 Updated this week
marinero4972 / Open-o3-Video
Official implementation of "Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence"
☆119 Updated 3 weeks ago
Yangr116 / VST
Visual Spatial Tuning
☆149 Updated this week
mightyzau / RegionBLIP
☆58 Updated 2 years ago
penghao-wu / visual_jigsaw
☆63 Updated last month
bytedance / OmniScient-Model
This repo contains the code for our paper Towards Open-Ended Visual Recognition with Large Language Model
☆98 Updated last year
yu-rp / VisualPerceptionToken
☆130 Updated 8 months ago
showlab / VideoLISA
[NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
☆141 Updated 11 months ago
hshjerry / VideoEspresso
[CVPR 2025 Oral] VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
☆128 Updated 4 months ago
rongyaofang / PUMA
Empowering Unified MLLM with Multi-granular Visual Generation
☆131 Updated 10 months ago
x-cls / superclass
[NeurIPS 2024] Classification Done Right for Vision-Language Pre-Training
☆219 Updated 8 months ago
jiyt17 / IDA-VLM
[ICLR 2025] IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model
☆36 Updated last year
appletea233 / LLaVA-ST
[CVPR 2025] LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding
☆79 Updated 5 months ago
ProvenceStar / PartGLEE
[ECCV2024] PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
☆54 Updated last year
lizhou-cs / mglmm
☆32 Updated last year
ggjy / DeLVM
☆120 Updated last year
TencentARC / SEED-Bench-R1
☆94 Updated 5 months ago
AILab-CVC / VL-GPT
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation
☆86 Updated last year
eric-ai-lab / GRIT
Official code for NeurIPS 2025 paper "GRIT: Teaching MLLMs to Think with Images"
☆163 Updated last month
V3Det / V3Det
☆114 Updated last year
OpenGVLab / TPO
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
☆62 Updated 4 months ago
multimodal-reasoning-lab / Bagel-Zebra-CoT
https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT
☆103 Updated last month
rese1f / aurora
[ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
☆133 Updated 6 months ago