om-ai-lab / GroundVLP
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection (AAAI 2024)
☆58
Related projects
Alternatives and complementary repositories for GroundVLP
- The official implementation of RAR (☆74)
- [TMM 2023] Self-paced Curriculum Adapting of CLIP for Visual Grounding (☆109)
- Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs (☆77)
- DynRefer: Delving into Region-level Multi-modality Tasks via Dynamic Resolution (☆39)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want (☆61)
- [CVPR 2024] Official code for the paper "Compositional Chain-of-Thought Prompting for Large Multimodal Models" (☆80)
- Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision (☆24)
- Implementation of "VL-Mamba: Exploring State Space Models for Multimodal Learning" (☆78)
- [IEEE TCSVT] Official PyTorch implementation of CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation (☆35)
- Official implementation of the paper "OvarNet: Towards Open-vocabulary Object Attribute Recognition" (☆98)
- [NeurIPS 2024] Repo for the paper "ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models" (☆96)
- [CVPR 2024] LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge (☆121)
- [ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM (☆58)
- [ECCV 2024] SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding (☆40)
- FreeVA: Offline MLLM as Training-Free Video Assistant (☆49)
- [ICCV 2023] Code for the paper "SuS-X: Training-Free Name-Only Transfer of Vision-Language Models" (☆94)
- Official code for "What Makes for Good Visual Tokenizers for Large Language Models?" (☆56)
- [ICML 2024] Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning (☆44)
- Official code and data for "Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentati…" (☆64)
- [NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context (☆132)
- MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? (☆78)
- Official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models" (☆82)
- Repository of the paper "Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models" (☆36)
- [CVPR 2024] PixelLM: an effective and efficient LMM for pixel-level reasoning and understanding (☆182)