z-x-yang / DoraemonGPT
Official repository of DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models
☆75Updated 2 months ago
Related projects
Alternatives and complementary repositories for DoraemonGPT
- [NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos☆52Updated this week
- Official PyTorch code of "Grounded Question-Answering in Long Egocentric Videos", accepted by CVPR 2024.☆52Updated 2 months ago
- Accepted by CVPR 2024☆28Updated 6 months ago
- MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning☆99Updated 6 months ago
- [NeurIPS2024] Repo for the paper "ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models"☆98Updated last week
- ☆43Updated 4 months ago
- FreeVA: Offline MLLM as Training-Free Video Assistant☆49Updated 5 months ago
- A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability☆35Updated 2 weeks ago
- 👾 E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding (NeurIPS 2024)☆34Updated 2 weeks ago
- [CVPR 2024] The repository contains the official implementation of "Open-Vocabulary Segmentation with Semantic-Assisted Calibration"☆61Updated 2 months ago
- ☆76Updated last month
- [ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM☆58Updated 3 weeks ago
- Implementation of "VL-Mamba: Exploring State Space Models for Multimodal Learning"☆78Updated 8 months ago
- A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models!☆118Updated 10 months ago
- [ECCV 2024 Best Paper Candidate] Implementation of "Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Vi…☆41Updated last month
- [ECCV2024] Learning Video Context as Interleaved Multimodal Sequences☆30Updated last month
- Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision☆24Updated last month
- Official implementation of HawkEye: Training Video-Text LLMs for Grounding Text in Videos☆34Updated 6 months ago
- ☆57Updated last year
- Open-vocabulary Video Instance Segmentation Codebase built upon Detectron2, which is really easy to use.☆17Updated 8 months ago
- The official implementation of RAR☆75Updated 7 months ago
- ☆33Updated last month
- The official code of the paper "PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction".☆45Updated 3 weeks ago
- ☆15Updated 2 months ago
- [ECCV2024] Official code implementation of Merlin: Empowering Multimodal LLMs with Foresight Minds☆82Updated 4 months ago
- ☆54Updated 4 months ago
- Official implementation of "Why are Visually-Grounded Language Models Bad at Image Classification?" (NeurIPS 2024)☆52Updated last month
- ☆24Updated 4 months ago
- (ICCV 2023) Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation☆45Updated 4 months ago
- [CVPR 2024] Context-Guided Spatio-Temporal Video Grounding☆42Updated 4 months ago