[ACL 2024] GroundingGPT: Language-Enhanced Multi-modal Grounding Model
★343 · Nov 4, 2024 · Updated last year
Alternatives and similar repositories for GroundingGPT
Users who are interested in GroundingGPT are comparing it to the libraries listed below.
- UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model ★22 · Aug 5, 2024 · Updated last year
- [CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses tha… ★956 · Aug 5, 2025 · Updated 9 months ago
- [CVPR'2024 Highlight] Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments". ★297 · Jun 13, 2024 · Updated last year
- The code of the paper "NExT-Chat: An LMM for Chat, Detection and Segmentation". ★254 · Feb 5, 2024 · Updated 2 years ago
- [AAAI 2025] VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding ★128 · Dec 10, 2024 · Updated last year
- Official implementation of HawkEye: Training Video-Text LLMs for Grounding Text in Videos ★47 · Apr 29, 2024 · Updated 2 years ago
- [CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding ★421 · May 8, 2025 · Updated last year
- ★134 · Dec 22, 2023 · Updated 2 years ago
- The official repository of "Video assistant towards large language model makes everything easy" ★232 · Dec 24, 2024 · Updated last year
- [ECCV 22] LocVTP: Video-Text Pre-training for Temporal Localization ★39 · Jul 29, 2022 · Updated 3 years ago
- ★808 · Jul 8, 2024 · Updated last year
- LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024) ★861 · Jul 29, 2024 · Updated last year
- Official code for Goldfish model for long video understanding and MiniGPT4-video for short video understanding ★640 · Dec 10, 2024 · Updated last year
- This repository contains the dataset, codebase, and benchmarks for our paper: <CNVid-3.5M: Build, Filter, and Pre-train the Large-scale P… ★25 · Nov 28, 2023 · Updated 2 years ago
- [ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition&Understanding and General Relation Comprehension of … ★507 · Aug 9, 2024 · Updated last year
- [CVPR 2024] OneLLM: One Framework to Align All Modalities with Language ★667 · Oct 22, 2024 · Updated last year
- [ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the cap… ★1,498 · Aug 5, 2025 · Updated 9 months ago
- The official code of Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval (AAAI2024) ★32 · Mar 29, 2024 · Updated 2 years ago
- Harnessing 1.4M GPT4V-synthesized Data for A Lite Vision-Language Model ★281 · Jun 25, 2024 · Updated last year
- [ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization ★587 · Jun 7, 2024 · Updated last year
- 【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection ★3,485 · Dec 3, 2024 · Updated last year
- Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos ★29 · Jun 24, 2024 · Updated last year
- ★4,658 · Apr 15, 2026 · Updated last month
- PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models ★264 · Aug 5, 2025 · Updated 9 months ago
- 【TMM 2025🔥】 Mixture-of-Experts for Large Vision-Language Models ★2,314 · Jul 15, 2025 · Updated 10 months ago
- [ICLR 2025 Spotlight] OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text ★419 · May 5, 2025 · Updated last year
- 【NeurIPS 2024】Dense Connector for MLLMs ★183 · Oct 14, 2024 · Updated last year
- [CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding ★694 · Jan 29, 2025 · Updated last year
- [CVPR 2024] Context-Guided Spatio-Temporal Video Grounding ★67 · Jun 28, 2024 · Updated last year
- InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions ★2,925 · May 26, 2025 · Updated 11 months ago
- The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM", IJCV2025 ★279 · May 26, 2025 · Updated 11 months ago
- Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning ★43 · Mar 2, 2026 · Updated 2 months ago
- ★12 · Jul 23, 2024 · Updated last year
- Long Context Transfer from Language to Vision ★403 · Mar 18, 2025 · Updated last year
- ★19 · Nov 6, 2023 · Updated 2 years ago
- ★157 · Oct 31, 2024 · Updated last year
- [ECCV2024] Video Foundation Models & Data for Multimodal Understanding ★2,264 · Mar 25, 2026 · Updated last month
- [ICLR2026] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling ★523 · Nov 18, 2025 · Updated 6 months ago
- [IJCAI-2024] The official code of Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition ★10 · Aug 10, 2025 · Updated 9 months ago