TalalWasim / Video-GroundingDINO

☆58

Related projects:

Hon-Wong / Elysium
[ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM
☆47 Updated 2 months ago
farewellthree / BT-Adapter
[CVPR 2024] Official PyTorch implementation of the paper "One For All: Video Conversation is Feasible Without Video Instruction Tuning"
☆24 Updated 7 months ago
Ahnsun / merlin
[ECCV2024] Official code implementation of Merlin: Empowering Multimodal LLMs with Foresight Minds
☆80 Updated 2 months ago
wengzejia1 / Open-VCLIP
☆100 Updated 7 months ago
Liuziyu77 / RAR
The official implementation of RAR
☆61 Updated 5 months ago
shikras / d-cube
A detection/segmentation dataset with labels characterized by intricate and flexible expressions. "Described Object Detection: Liberating…
☆104 Updated 6 months ago
CVMI-Lab / CoDet
(NeurIPS2023) CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection
☆104 Updated 4 months ago
callsys / ControlCap
[ECCV 2024] ControlCap: Controllable Region-level Captioning
☆49 Updated last month
SY-Xuan / Pink
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
☆72 Updated 3 months ago
whwu95 / FreeVA
FreeVA: Offline MLLM as Training-Free Video Assistant
☆42 Updated 3 months ago
farewellthree / STAN
Official PyTorch implementation of the paper "Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring"
☆91 Updated 7 months ago
baaivision / DenseFusion
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
☆103 Updated 3 weeks ago
DCDmllm / Momentor
☆43 Updated 2 months ago
HJYao00 / DenseConnector
Dense Connector for MLLMs
☆98 Updated last month
KangarooGroup / Kangaroo
Official implementation of Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input
☆44 Updated 3 weeks ago
yingsen1 / UniMD
UniMD: Towards Unifying Moment retrieval and temporal action Detection
☆32 Updated 2 months ago
bytedance / OmniScient-Model
This repo contains the code for our paper Towards Open-Ended Visual Recognition with Large Language Model
☆88 Updated 2 months ago
PVIT-official / PVIT
Repository of paper: Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models
☆36 Updated last year
alibaba-mmai-research / DiST
ICCV2023: Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning
☆35 Updated 11 months ago
OpenGVLab / EgoExoLearn
[CVPR 2024] Data and benchmark code for the EgoExoLearn dataset
☆43 Updated 2 weeks ago
lorebianchi98 / FG-OVD
[CVPR2024 Highlight] Official repository of the paper "The devil is in the fine-grained details: Evaluating open-vocabulary object detect…
☆39 Updated last month
V3Det / V3Det
☆93 Updated 3 months ago
Becomebright / GroundVQA
Official PyTorch code of "Grounded Question-Answering in Long Egocentric Videos", accepted by CVPR 2024.
☆49 Updated this week
showlab / cosmo
☆70 Updated 4 months ago
sail-sg / ptp
[CVPR2023] The code for "Position-guided Text Prompt for Vision-Language Pre-training"
☆148 Updated last year
HengLan / CGSTVG
[CVPR 2024] Context-Guided Spatio-Temporal Video Grounding
☆38 Updated 2 months ago
OpenGVLab / MUTR
[AAAI 2024] Referred by Multi-Modality: A Unified Temporal Transformers for Video Object Segmentation
☆62 Updated 2 months ago
gyxxyg / VTG-LLM
[Preprint] VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding
☆49 Updated last month
mightyzau / RegionBLIP
☆56 Updated last year
seanzhuh / SeqTR
SeqTR: A Simple yet Universal Network for Visual Grounding
☆128 Updated 3 months ago