[ACL 2024] GroundingGPT: Language-Enhanced Multi-modal Grounding Model
☆343 · Nov 4, 2024 · Updated last year
Alternatives and similar repositories for GroundingGPT
Users that are interested in GroundingGPT are comparing it to the libraries listed below.
- UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model · ☆22 · Aug 5, 2024 · Updated last year
- [CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses tha… · ☆951 · Aug 5, 2025 · Updated 8 months ago
- [CVPR'2024 Highlight] Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments". · ☆295 · Jun 13, 2024 · Updated last year
- The code of the paper "NExT-Chat: An LMM for Chat, Detection and Segmentation". · ☆254 · Feb 5, 2024 · Updated 2 years ago
- [AAAI 2025] VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding · ☆126 · Dec 10, 2024 · Updated last year
- ☆425 · Jul 29, 2024 · Updated last year
- Official implementation of HawkEye: Training Video-Text LLMs for Grounding Text in Videos · ☆47 · Apr 29, 2024 · Updated last year
- [CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding · ☆418 · May 8, 2025 · Updated 11 months ago
- ☆134 · Dec 22, 2023 · Updated 2 years ago
- The official repository of "Video assistant towards large language model makes everything easy" · ☆232 · Dec 24, 2024 · Updated last year
- [ECCV 22] LocVTP: Video-Text Pre-training for Temporal Localization · ☆39 · Jul 29, 2022 · Updated 3 years ago
- ☆807 · Jul 8, 2024 · Updated last year
- LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024) · ☆862 · Jul 29, 2024 · Updated last year
- Official code for Goldfish model for long video understanding and MiniGPT4-video for short video understanding · ☆639 · Dec 10, 2024 · Updated last year
- This repository contains the dataset, codebase, and benchmarks for our paper: <CNVid-3.5M: Build, Filter, and Pre-train the Large-scale P… · ☆25 · Nov 28, 2023 · Updated 2 years ago
- [ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition&Understanding and General Relation Comprehension of … · ☆506 · Aug 9, 2024 · Updated last year
- [CVPR 2024] OneLLM: One Framework to Align All Modalities with Language · ☆665 · Oct 22, 2024 · Updated last year
- [ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the cap… · ☆1,497 · Aug 5, 2025 · Updated 8 months ago
- The official code of Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval (AAAI2024) · ☆32 · Mar 29, 2024 · Updated 2 years ago
- Harnessing 1.4M GPT4V-synthesized Data for A Lite Vision-Language Model · ☆282 · Jun 25, 2024 · Updated last year
- [ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization · ☆586 · Jun 7, 2024 · Updated last year
- 【EMNLP 2024 🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection · ☆3,471 · Dec 3, 2024 · Updated last year
- Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos · ☆28 · Jun 24, 2024 · Updated last year
- ☆4,638 · Updated this week
- PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models · ☆263 · Aug 5, 2025 · Updated 8 months ago
- 【TMM 2025 🔥】 Mixture-of-Experts for Large Vision-Language Models · ☆2,314 · Jul 15, 2025 · Updated 9 months ago
- [ICLR 2025 Spotlight] OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text · ☆417 · May 5, 2025 · Updated 11 months ago
- 【NeurIPS 2024】Dense Connector for MLLMs · ☆183 · Oct 14, 2024 · Updated last year
- [CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding · ☆691 · Jan 29, 2025 · Updated last year
- [CVPR 2024] Context-Guided Spatio-Temporal Video Grounding · ☆67 · Jun 28, 2024 · Updated last year
- InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions · ☆2,922 · May 26, 2025 · Updated 10 months ago
- The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM", IJCV2025 · ☆278 · May 26, 2025 · Updated 10 months ago
- Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning · ☆43 · Mar 2, 2026 · Updated last month
- ☆12 · Jul 23, 2024 · Updated last year
- Long Context Transfer from Language to Vision · ☆403 · Mar 18, 2025 · Updated last year
- ☆19 · Nov 6, 2023 · Updated 2 years ago
- ☆157 · Oct 31, 2024 · Updated last year
- [ECCV2024] Video Foundation Models & Data for Multimodal Understanding · ☆2,241 · Mar 25, 2026 · Updated 3 weeks ago
- [ICLR2026] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling · ☆518 · Nov 18, 2025 · Updated 5 months ago