[ACL 2024] GroundingGPT: Language-Enhanced Multi-modal Grounding Model
★343 · Nov 4, 2024 · Updated last year
Alternatives and similar repositories for GroundingGPT
Users that are interested in GroundingGPT are comparing it to the libraries listed below.
- UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model ★22 · Aug 5, 2024 · Updated last year
- [CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses tha… ★962 · Aug 5, 2025 · Updated 10 months ago
- [CVPR'2024 Highlight] Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments". ★296 · Jun 13, 2024 · Updated 2 years ago
- The code of the paper "NExT-Chat: An LMM for Chat, Detection and Segmentation". ★254 · Feb 5, 2024 · Updated 2 years ago
- [AAAI 2025] VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding ★129 · Dec 10, 2024 · Updated last year
- Official implementation of HawkEye: Training Video-Text LLMs for Grounding Text in Videos ★47 · Apr 29, 2024 · Updated 2 years ago
- [CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding ★424 · May 8, 2025 · Updated last year
- ★134 · Dec 22, 2023 · Updated 2 years ago
- The official repository of "Video assistant towards large language model makes everything easy" ★231 · Dec 24, 2024 · Updated last year
- [ECCV 22] LocVTP: Video-Text Pre-training for Temporal Localization ★39 · Jul 29, 2022 · Updated 3 years ago
- ★813 · Jul 8, 2024 · Updated last year
- LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024) ★861 · Jul 29, 2024 · Updated last year
- Official code for Goldfish model for long video understanding and MiniGPT4-video for short video understanding ★638 · Dec 10, 2024 · Updated last year
- This repository contains the dataset, codebase, and benchmarks for our paper: <CNVid-3.5M: Build, Filter, and Pre-train the Large-scale P… ★26 · Nov 28, 2023 · Updated 2 years ago
- [ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition&Understanding and General Relation Comprehension of … ★509 · Aug 9, 2024 · Updated last year
- [CVPR 2024] OneLLM: One Framework to Align All Modalities with Language ★665 · Oct 22, 2024 · Updated last year
- [ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the cap… ★1,503 · Aug 5, 2025 · Updated 10 months ago
- The official code of Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval (AAAI2024) ★32 · Mar 29, 2024 · Updated 2 years ago
- Harnessing 1.4M GPT4V-synthesized Data for A Lite Vision-Language Model ★281 · Jun 25, 2024 · Updated 2 years ago
- [ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization ★587 · Jun 7, 2024 · Updated 2 years ago
- 【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection ★3,496 · Dec 3, 2024 · Updated last year
- Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos ★30 · Jun 24, 2024 · Updated 2 years ago
- ★4,695 · Jun 15, 2026 · Updated 2 weeks ago
- PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models ★264 · Aug 5, 2025 · Updated 10 months ago
- 【TMM 2025🔥】 Mixture-of-Experts for Large Vision-Language Models ★2,322 · Jul 15, 2025 · Updated 11 months ago
- [ICLR 2025 Spotlight] OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text ★424 · May 5, 2025 · Updated last year
- 【NeurIPS 2024】Dense Connector for MLLMs ★183 · Oct 14, 2024 · Updated last year
- [CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding ★703 · Jan 29, 2025 · Updated last year
- [CVPR 2024] Context-Guided Spatio-Temporal Video Grounding ★67 · Jun 28, 2024 · Updated 2 years ago
- InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions ★2,924 · May 26, 2025 · Updated last year
- The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM", IJCV2025 ★280 · May 26, 2025 · Updated last year
- Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning ★43 · Mar 2, 2026 · Updated 4 months ago
- ★12 · Jul 23, 2024 · Updated last year
- Long Context Transfer from Language to Vision ★408 · Mar 18, 2025 · Updated last year
- ★19 · Nov 6, 2023 · Updated 2 years ago
- ★158 · Oct 31, 2024 · Updated last year
- [ECCV2024] Video Foundation Models & Data for Multimodal Understanding ★2,316 · Updated this week
- [IJCAI-2024] The official code of Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition ★10 · Aug 10, 2025 · Updated 10 months ago
- [ICLR2026] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling ★526 · Nov 18, 2025 · Updated 7 months ago