[ACL 2024] GroundingGPT: Language-Enhanced Multi-modal Grounding Model
☆343 · Nov 4, 2024 · Updated last year
Alternatives and similar repositories for GroundingGPT
Users that are interested in GroundingGPT are comparing it to the libraries listed below.
- UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model ☆22 · Aug 5, 2024 · Updated last year
- [CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses tha… ☆951 · Aug 5, 2025 · Updated 7 months ago
- [CVPR'2024 Highlight] Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments". ☆295 · Jun 13, 2024 · Updated last year
- The code of the paper "NExT-Chat: An LMM for Chat, Detection and Segmentation". ☆253 · Feb 5, 2024 · Updated 2 years ago
- [AAAI 2025] VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding ☆126 · Dec 10, 2024 · Updated last year
- ☆425 · Jul 29, 2024 · Updated last year
- Official implementation of HawkEye: Training Video-Text LLMs for Grounding Text in Videos ☆46 · Apr 29, 2024 · Updated last year
- [CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding ☆415 · May 8, 2025 · Updated 10 months ago
- ☆134 · Dec 22, 2023 · Updated 2 years ago
- The official repository of "Video assistant towards large language model makes everything easy" ☆232 · Dec 24, 2024 · Updated last year
- [ECCV 22] LocVTP: Video-Text Pre-training for Temporal Localization ☆39 · Jul 29, 2022 · Updated 3 years ago
- ☆807 · Jul 8, 2024 · Updated last year
- LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024) ☆860 · Jul 29, 2024 · Updated last year
- Official code for Goldfish model for long video understanding and MiniGPT4-video for short video understanding ☆641 · Dec 10, 2024 · Updated last year
- This repository contains the dataset, codebase, and benchmarks for our paper: <CNVid-3.5M: Build, Filter, and Pre-train the Large-scale P… ☆25 · Nov 28, 2023 · Updated 2 years ago
- [ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition&Understanding and General Relation Comprehension of … ☆506 · Aug 9, 2024 · Updated last year
- [CVPR 2024] OneLLM: One Framework to Align All Modalities with Language ☆665 · Oct 22, 2024 · Updated last year
- [ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the cap… ☆1,498 · Aug 5, 2025 · Updated 7 months ago
- The official code of Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval (AAAI2024) ☆32 · Mar 29, 2024 · Updated 2 years ago
- Harnessing 1.4M GPT4V-synthesized Data for A Lite Vision-Language Model ☆281 · Jun 25, 2024 · Updated last year
- [ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization ☆584 · Jun 7, 2024 · Updated last year
- 【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection ☆3,466 · Dec 3, 2024 · Updated last year
- Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos ☆28 · Jun 24, 2024 · Updated last year
- PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models ☆262 · Aug 5, 2025 · Updated 7 months ago
- ☆4,615 · Sep 14, 2025 · Updated 6 months ago
- 【TMM 2025🔥】 Mixture-of-Experts for Large Vision-Language Models ☆2,310 · Jul 15, 2025 · Updated 8 months ago
- [ICLR 2025 Spotlight] OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text ☆415 · May 5, 2025 · Updated 10 months ago
- 【NeurIPS 2024】Dense Connector for MLLMs ☆182 · Oct 14, 2024 · Updated last year
- [CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding ☆689 · Jan 29, 2025 · Updated last year
- [CVPR 2024] Context-Guided Spatio-Temporal Video Grounding ☆66 · Jun 28, 2024 · Updated last year
- InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions ☆2,923 · May 26, 2025 · Updated 10 months ago
- The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM", IJCV2025 ☆278 · May 26, 2025 · Updated 10 months ago
- ☆12 · Jul 23, 2024 · Updated last year
- Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning ☆43 · Mar 2, 2026 · Updated 3 weeks ago
- Long Context Transfer from Language to Vision ☆403 · Mar 18, 2025 · Updated last year
- ☆19 · Nov 6, 2023 · Updated 2 years ago
- ☆157 · Oct 31, 2024 · Updated last year
- [ECCV2024] Video Foundation Models & Data for Multimodal Understanding ☆2,223 · Dec 15, 2025 · Updated 3 months ago
- [ICLR2026] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling ☆511 · Nov 18, 2025 · Updated 4 months ago