mbzuai-oryx / VideoGLaMMLinks

[CVPR 2025 🔥]A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

☆75

Alternatives and similar repositories for VideoGLaMM

Users that are interested in VideoGLaMM are comparing it to the libraries listed below

Sorting:

showlab / VideoLISA
[NeurlPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
☆123Updated 7 months ago
wusize / F-LMM
[CVPR2025] Code Release of F-LMM: Grounding Frozen Large Multimodal Models
☆100Updated 2 months ago
wuw2019 / LoTLIP
[NeurIPS 2024] Official PyTorch implementation of LoTLIP: Improving Language-Image Pre-training for Long Text Understanding
☆43Updated 6 months ago
callsys / ControlCap
[ECCV 2024] ControlCap: Controllable Region-level Captioning
☆78Updated 9 months ago
Rubics-Xuan / MRES
This repo holds the official code and data for "Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentati…
☆70Updated last year
congvvc / InstructSeg
[ICCV 2025] Official implementation of "InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models"
☆42Updated 5 months ago
ExplainableML / flair
[CVPR 2025] FLAIR: VLM with Fine-grained Language-informed Image Representations
☆89Updated last month
SuleBai / SC-CLIP
Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation
☆52Updated 2 months ago
mbzuai-oryx / CVRR-Evaluation-Suite
[CVPRW-25 MMFM] Official repository of paper titled "How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite fo…
☆49Updated 11 months ago
lizhou-cs / mglmm
☆31Updated 10 months ago
ant-research / DreamLIP
[ECCV 2024] Official PyTorch implementation of DreamLIP: Language-Image Pre-training with Long Captions
☆134Updated 2 months ago
cilinyan / VISA
[ECCV24] VISA: Reasoning Video Object Segmentation via Large Language Model
☆182Updated 11 months ago
LeapLabTHU / GSVA
[CVPR2024] GSVA: Generalized Segmentation via Multimodal Large Language Models
☆137Updated 10 months ago
Hon-Wong / Elysium
[ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM
☆80Updated 9 months ago
yunlong10 / CAT-V
Offical repo for CAT-V - Caption Anything in Video: Object-centric Dense Video Captioning with Spatiotemporal Multimodal Prompting
☆46Updated 3 weeks ago
ncTimTang / AKS
[CVPR 2025] Adaptive Keyframe Sampling for Long Video Understanding
☆87Updated 3 months ago
meetdavidwan / crg
PyTorch code for "Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training"
☆35Updated last year
mrwu-mac / ControlMLLM
[NeurIPS2024] Repo for the paper `ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models'
☆186Updated 2 weeks ago
hshjerry / VideoEspresso
[CVPR 2025 Oral] VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
☆103Updated last week
yongliu20 / SCAN
[CVPR 2024] The repository contains the official implementation of "Open-Vocabulary Segmentation with Semantic-Assisted Calibration"
☆72Updated 10 months ago
FeipengMa6 / VLoRA
[NeurIPS 2024] Visual Perception by Large Language Model’s Weights
☆45Updated 4 months ago
mc-lan / ProxyCLIP
[ECCV2024] ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation
☆98Updated 4 months ago
AFeng-x / Draw-and-Understand
[ICLR2025] Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
☆84Updated last month
cilinyan / ReVOS-api
[ECCV24] VISA: Reasoning Video Object Segmentation via Large Language Model
☆17Updated last year
zhouyiks / CoLVA
☆33Updated 3 weeks ago
buxiangzhiren / VD-IT
☆44Updated 10 months ago
TempleX98 / MoVA
[NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context
☆165Updated 10 months ago
XMUDeepLIT / AVG-LLaVA
Code for "AVG-LLaVA: A Multimodal Large Model with Adaptive Visual Granularity"
☆30Updated 9 months ago
Zi-hao-Wei / Efficient-Vision-Language-Pre-training-by-Cluster-Masking
[CVPR 2024] Improving language-visual pretraining efficiency by perform cluster-based masking on images.
☆28Updated last year
Shengcao-Cao / groundLMM
Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision
☆41Updated 4 months ago