[ACL 2024] GroundingGPT: Language-Enhanced Multi-modal Grounding Model
☆343 · Nov 4, 2024 · Updated last year
Alternatives and similar repositories for GroundingGPT
Users that are interested in GroundingGPT are comparing it to the libraries listed below.
- UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model ☆22 · Aug 5, 2024 · Updated last year
- [CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses tha… ☆959 · Aug 5, 2025 · Updated 10 months ago
- [CVPR'2024 Highlight] Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments". ☆296 · Jun 13, 2024 · Updated last year
- The code of the paper "NExT-Chat: An LMM for Chat, Detection and Segmentation". ☆254 · Feb 5, 2024 · Updated 2 years ago
- [AAAI 2025] VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding ☆128 · Dec 10, 2024 · Updated last year
- ☆413 · Jul 29, 2024 · Updated last year
- Official implementation of HawkEye: Training Video-Text LLMs for Grounding Text in Videos ☆47 · Apr 29, 2024 · Updated 2 years ago
- [CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding ☆420 · May 8, 2025 · Updated last year
- ☆134 · Dec 22, 2023 · Updated 2 years ago
- The official repository of "Video assistant towards large language model makes everything easy" ☆230 · Dec 24, 2024 · Updated last year
- [ECCV 22] LocVTP: Video-Text Pre-training for Temporal Localization ☆39 · Jul 29, 2022 · Updated 3 years ago
- ☆812 · Jul 8, 2024 · Updated last year
- LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024) ☆861 · Jul 29, 2024 · Updated last year
- Official code for Goldfish model for long video understanding and MiniGPT4-video for short video understanding ☆639 · Dec 10, 2024 · Updated last year
- This repository contains the dataset, codebase, and benchmarks for our paper: <CNVid-3.5M: Build, Filter, and Pre-train the Large-scale P… ☆26 · Nov 28, 2023 · Updated 2 years ago
- [ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition&Understanding and General Relation Comprehension of … ☆507 · Aug 9, 2024 · Updated last year
- [CVPR 2024] OneLLM: One Framework to Align All Modalities with Language ☆667 · Oct 22, 2024 · Updated last year
- [ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the cap… ☆1,502 · Aug 5, 2025 · Updated 10 months ago
- The official code of Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval (AAAI2024) ☆32 · Mar 29, 2024 · Updated 2 years ago
- Harnessing 1.4M GPT4V-synthesized Data for A Lite Vision-Language Model ☆281 · Jun 25, 2024 · Updated last year
- [ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization ☆586 · Jun 7, 2024 · Updated 2 years ago
- 【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection ☆3,493 · Dec 3, 2024 · Updated last year
- Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos ☆30 · Jun 24, 2024 · Updated last year
- ☆4,687 · Apr 15, 2026 · Updated last month
- PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models ☆264 · Aug 5, 2025 · Updated 10 months ago
- 【TMM 2025🔥】 Mixture-of-Experts for Large Vision-Language Models ☆2,319 · Jul 15, 2025 · Updated 10 months ago
- [ICLR 2025 Spotlight] OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text ☆423 · May 5, 2025 · Updated last year
- 【NeurIPS 2024】Dense Connector for MLLMs ☆183 · Oct 14, 2024 · Updated last year
- [CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding ☆699 · Jan 29, 2025 · Updated last year
- [CVPR 2024] Context-Guided Spatio-Temporal Video Grounding ☆67 · Jun 28, 2024 · Updated last year
- InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions ☆2,926 · May 26, 2025 · Updated last year
- The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM", IJCV2025 ☆279 · May 26, 2025 · Updated last year
- Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning ☆43 · Mar 2, 2026 · Updated 3 months ago
- ☆12 · Jul 23, 2024 · Updated last year
- Long Context Transfer from Language to Vision ☆405 · Mar 18, 2025 · Updated last year
- ☆19 · Nov 6, 2023 · Updated 2 years ago
- ☆157 · Oct 31, 2024 · Updated last year
- [ECCV2024] Video Foundation Models & Data for Multimodal Understanding ☆2,276 · May 26, 2026 · Updated 2 weeks ago
- [IJCAI-2024] The official code of Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition ☆10 · Aug 10, 2025 · Updated 9 months ago