mbzuai-oryx/VideoMolmo

Official code of the paper "VideoMolmo: Spatio-Temporal Grounding meets Pointing"

☆56

Alternatives and similar repositories for VideoMolmo

Users who are interested in VideoMolmo are comparing it to the repositories listed below.

mbzuai-oryx / VideoMathQA
VideoMathQA is a benchmark designed to evaluate mathematical reasoning in real-world educational videos
☆24 • May 7, 2026 • Updated 2 months ago
hananshafi / MTL-ViT
A new multi-task learning framework using Vision Transformers
☆11 • Jun 19, 2024 • Updated 2 years ago
HashmatShadab / HSAT
[MICCAI 2025] Hierarchical Self-Supervised Adversarial Training for Robust Vision Models in Histopathology
☆12 • Jun 17, 2025 • Updated last year
umair1221 / WorldCache
WorldCache: Content-Aware Caching for Accelerated Video World Models
☆21 • Jun 28, 2026 • Updated 3 weeks ago
rohit901 / VANE-Bench
[NAACL'25] Contains code and documentation for our VANE-Bench paper.
☆24 • Aug 19, 2025 • Updated 11 months ago
HashmatShadab / Robustness-of-Volumetric-Medical-Segmentation-Models
[BMVC 2024] On Evaluating Adversarial Robustness of Volumetric Medical Segmentation Models
☆15 • Nov 1, 2024 • Updated last year
hananshafi / MedContext
[MICCAI 2024] Official code for the paper "MedContext: Learning Contextual Cues for Efficient Volumetric Medical Segmentation"
☆14 • Nov 1, 2024 • Updated last year
mbzuai-oryx / VideoGLaMM
[CVPR 2025 🔥] A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
☆104 • Apr 14, 2025 • Updated last year
akhtarvision / weather-regional
☆11 • Oct 29, 2024 • Updated last year
HashmatShadab / MambaRobustness
[CVPRW 2025] Official repository of paper titled "Towards Evaluating the Robustness of Visual State Space Models"
☆26 • Jun 8, 2025 • Updated last year
sheng-eatamath / S3A
Repo for paper titled "Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment" (AAAI'24 Oral)
☆25 • May 16, 2024 • Updated 2 years ago
fahadshamshad / deep-facial-privacy-prior
[ECCVW 2024 -- ORAL] Official repository of paper titled "Makeup-Guided Facial Privacy Protection via Untrained Neural Network Priors".
☆12 • Oct 11, 2024 • Updated last year
techmn / cosnet
A Novel Semantic Segmentation Network using Enhanced Boundaries in Cluttered Scenes (WACV 2025)
☆12 • Aug 11, 2025 • Updated 11 months ago
mbzuai-oryx / CVRR-Evaluation-Suite
[CVPRW-25 MMFM] Official repository of paper titled "How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite fo…
☆50 • Aug 23, 2024 • Updated last year
Razaimam45 / TTL-Test-Time-Low-Rank-Adaptation
Official code repository of paper titled "Test-Time Low Rank Adaptation via Confidence Maximization for Zero-Shot Generalization of Visio…
☆34 • May 11, 2025 • Updated last year
Jayce1kk / SpaceVLLM
SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability
☆17 • May 8, 2025 • Updated last year
mbzuai-oryx / AIN
AIN - The First Arabic Inclusive Large Multimodal Model. It is a versatile bilingual LMM excelling in visual and contextual understanding…
☆55 • Mar 13, 2025 • Updated last year
mbzuai-oryx / Camel-Bench
[NAACL 2025 🔥] CAMEL-Bench is an Arabic benchmark for evaluating multimodal models across eight domains with 29,000 questions.
☆38 • Apr 17, 2025 • Updated last year
ShahinaKK / LWI-VMS
Learnable Weight Initialization for Volumetric Medical Image Segmentation [Elsevier AIM2024]
☆22 • Oct 27, 2024 • Updated last year
umair1221 / AgriCLIP
A code
☆29 • Jan 23, 2025 • Updated last year
TimeBlindness / time-blindness
[CVPR 2026 🔥] Time Blindness: Why Video-Language Models Can't See What Humans Can?
☆67 • Jan 28, 2026 • Updated 5 months ago
Hasindri / HLSS
[MICCAI 2024 🔥] HLSS, the first study to explore hierarchical information inherent in histopathology images and their language descripti…
☆27 • Aug 5, 2024 • Updated last year
Muhammad-Huzaifaa / ObjectCompose
[ACCV 2024] ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes 🚀🚀🚀
☆37 • Jan 21, 2025 • Updated last year
akhtarvision / cal-detr
☆42 • Nov 9, 2023 • Updated 2 years ago
mbzuai-oryx / KITAB-Bench
[ACL 2025 🔥] A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
☆76 • May 24, 2025 • Updated last year
hustvl / GroundingSuite
[ICCV 2025] GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
☆77 • Jun 26, 2025 • Updated last year
Muzammal-Naseer / DCViT-AT
Official repository for "Boosting Adversarial Transferability using Dynamic Cues" (ICLR 2023)
☆20 • Aug 24, 2023 • Updated 2 years ago
ShahinaKK / LG_SDG
Language Grounded Single Source Domain Generalization in Medical Image Segmentation [ISBI2024]
☆33 • Oct 27, 2024 • Updated last year
mbzuai-oryx / EvoLMM
Self Evolving Large Multimodal Models with Continuous Rewards
☆25 • Jun 9, 2026 • Updated last month
Ali2500 / ViCaS
ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation (CVPR'25)
☆21 • Apr 2, 2025 • Updated last year
mzeeshankaramat / SafeAgents
☆20 • Jun 4, 2026 • Updated last month
mbzuai-oryx / Video-R2
Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models
☆19 • Jan 21, 2026 • Updated 6 months ago
OpenGVLab / VRBench
[ICCV 2025] A Benchmark for Multi-Step Reasoning in Long Narrative Videos
☆28 • Jun 4, 2026 • Updated last month
jiaangli / VILA
[TACL/EMNLP'24] Do Vision and Language Models Share Concepts? A Vector Space Alignment Study
☆16 • Nov 22, 2024 • Updated last year
asif-hanif / baple
[MICCAI 2024] Official code repository of paper titled "BAPLe: Backdoor Attacks on Medical Foundation Models using Prompt Learning" accep…
☆56 • Oct 22, 2024 • Updated last year
eslambakr / HRS_benchmark
☆60 • Oct 13, 2023 • Updated 2 years ago
aminebdj / 3D-OWIS
[NeurIPS2023] 3D-OWIS is capable of detecting unknown instances in inference, and progressively learning novel classes in the process of …
☆68 • Dec 3, 2023 • Updated 2 years ago
renytek13 / Soft-Prompt-Generation
[ECCV 2024] Soft Prompt Generation for Domain Generalization
☆33 • Oct 1, 2024 • Updated last year
SalesforceAIResearch / strefer
Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data
☆19 • Jun 2, 2026 • Updated last month