Official code of the paper "VideoMolmo: Spatio-Temporal Grounding meets Pointing"
☆53 · Jul 5, 2025 · Updated 8 months ago
Alternatives and similar repositories for VideoMolmo
Users who are interested in VideoMolmo are comparing it to the repositories listed below.
- [NAACL'25] Contains code and documentation for our VANE-Bench paper. ☆23 · Aug 19, 2025 · Updated 6 months ago
- [CVPR 2025 🔥] A Large Multimodal Model for Pixel-Level Visual Grounding in Videos ☆97 · Apr 14, 2025 · Updated 10 months ago
- ☆11 · Jan 18, 2025 · Updated last year
- Visual Generation Tuning ☆99 · Jan 27, 2026 · Updated last month
- A new multi-task learning framework using Vision Transformers ☆11 · Jun 19, 2024 · Updated last year
- A Novel Semantic Segmentation Network using Enhanced Boundaries in Cluttered Scenes (WACV 2025) ☆11 · Aug 11, 2025 · Updated 6 months ago
- [MICCAI 2024] Official code for the paper "MedContext: Learning Contextual Cues for Efficient Volumetric Medical Segmentation" ☆14 · Nov 1, 2024 · Updated last year
- [MICCAI 2025] Hierarchical Self-Supervised Adversarial Training for Robust Vision Models in Histopathology ☆12 · Jun 17, 2025 · Updated 8 months ago
- Official code repository of paper titled "Test-Time Low Rank Adaptation via Confidence Maximization for Zero-Shot Generalization of Visio… ☆32 · May 11, 2025 · Updated 9 months ago
- [CVPR 2025] Official PyTorch Implementation of GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmenta… ☆66 · Jun 23, 2025 · Updated 8 months ago
- [ECCV 2024] ControlCap: Controllable Region-level Captioning ☆80 · Oct 25, 2024 · Updated last year
- ☆13 · Jun 26, 2023 · Updated 2 years ago
- [ICCV 2025] GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding ☆73 · Jun 26, 2025 · Updated 8 months ago
- [BMVC 2024] On Evaluating Adversarial Robustness of Volumetric Medical Segmentation Models ☆15 · Nov 1, 2024 · Updated last year
- [CVPR-2023] Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation ☆18 · Jul 2, 2023 · Updated 2 years ago
- Vision-Language based Visual Object Tracking ☆27 · Oct 10, 2025 · Updated 4 months ago
- Code for "AVG-LLaVA: A Multimodal Large Model with Adaptive Visual Granularity" ☆33 · Oct 12, 2024 · Updated last year
- [ECCVW 2024 -- ORAL] Official repository of paper titled "Makeup-Guided Facial Privacy Protection via Untrained Neural Network Priors". ☆12 · Oct 11, 2024 · Updated last year
- SAM 2++: Tracking Anything at Any Granularity ☆56 · Dec 15, 2025 · Updated 2 months ago
- We introduce OpenStory++, a large-scale open-domain dataset focusing on enabling MLLMs to perform storytelling generation tasks. ☆16 · Aug 30, 2024 · Updated last year
- The application of large pre-trained vision model DINOv2 from MetaAI for feature points matching, and a ViT decoder used for Auto Encoder ☆17 · Apr 27, 2023 · Updated 2 years ago
- VideoMathQA is a benchmark designed to evaluate mathematical reasoning in real-world educational videos ☆22 · Jan 26, 2026 · Updated last month
- ☆19 · Jul 23, 2024 · Updated last year
- [CVPR2025] Official Repository for IMMUNE: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment ☆27 · Jun 11, 2025 · Updated 8 months ago
- AIN - The First Arabic Inclusive Large Multimodal Model. It is a versatile bilingual LMM excelling in visual and contextual understanding… ☆51 · Mar 13, 2025 · Updated 11 months ago
- [ECCV 2024] OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models ☆49 · Jan 8, 2025 · Updated last year
- ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation (CVPR'25) ☆18 · Apr 2, 2025 · Updated 11 months ago
- WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning (CVPR 2026) ☆55 · Dec 30, 2025 · Updated 2 months ago
- Simulation-Ready Garment Optimization with Differentiable Simulation ☆49 · Jun 4, 2024 · Updated last year
- Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision ☆42 · Oct 19, 2025 · Updated 4 months ago
- Code release for "Understanding Bias in Large-Scale Visual Datasets" ☆22 · Dec 4, 2024 · Updated last year
- [ICLR2026] "Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models" ☆30 · Feb 4, 2026 · Updated last month
- Disentangled Pre-training for Human-Object Interaction Detection ☆27 · Sep 17, 2025 · Updated 5 months ago
- [ECCV 2024 Oral] Audio-Synchronized Visual Animation ☆57 · Sep 12, 2024 · Updated last year
- Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning ☆24 · Sep 9, 2024 · Updated last year
- A code ☆29 · Jan 23, 2025 · Updated last year
- ☆46 · Jun 24, 2025 · Updated 8 months ago
- [MICCAI 2024] Official code repository of paper titled "BAPLe: Backdoor Attacks on Medical Foundation Models using Prompt Learning" accep… ☆56 · Oct 22, 2024 · Updated last year
- Visual Instruction-guided Explainable Metric. Code for "Towards Explainable Metrics for Conditional Image Synthesis Evaluation" (ACL 2024… ☆68 · Nov 19, 2024 · Updated last year