UMass-Foundation-Model / CoVLM

Official implementation for CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

☆44

Alternatives and similar repositories for CoVLM:

Users that are interested in CoVLM are comparing it to the libraries listed below

PVIT-official / PVIT
Repository of paper: Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models
☆37Updated last year
Shengcao-Cao / groundLMM
Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision
☆30Updated 4 months ago
keshik6 / HourVideo
[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding
☆62Updated last month
HenryHZY / VL-PET
[ICCV2023] Official code for "VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control"
☆53Updated last year
eric-ai-lab / MMWorld
Official repo of the ICLR 2025 paper "MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos"
☆25Updated 5 months ago
Share14 / ShareGemini
☆26Updated 6 months ago
ChenYi99 / EgoPlan
☆66Updated 2 months ago
HAWLYQ / ET-Cap
☆25Updated last year
yonseivnl / vlm-rlaif
ACL'24 (Oral) Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback
☆61Updated 5 months ago
MikeWangWZHL / Paxion
Repo for paper: "Paxion: Patching Action Knowledge in Video-Language Foundation Models" Neurips 23 Spotlight
☆37Updated last year
kkahatapitiya / LangRepo
Language Repository for Long Video Understanding
☆31Updated 8 months ago
yale-nlp / TOMATO
☆19Updated 3 months ago
ChocoWu / SeTok
Codes for Paper: Towards Semantic Equivalence of Tokenization in Multimodal LLM
☆49Updated 4 months ago
sIncerass / MVLPT
code for "Multitask Vision-Language Prompt Tuning" https://arxiv.org/abs/2211.11720
☆55Updated 8 months ago
joez17 / VideoNIAH
VideoNIAH: A Flexible Synthetic Method for Benchmarking Video MLLMs
☆37Updated 4 months ago
TencentARC / FLM
Accelerating Vision-Language Pretraining with Free Language Modeling (CVPR 2023)
☆31Updated last year
TencentARC / GVT
Official code for "What Makes for Good Visual Tokenizers for Large Language Models?".
☆58Updated last year
alanaai / EVUD
Egocentric Video Understanding Dataset (EVUD)
☆26Updated 7 months ago
PolyU-ChenLab / ETBench
👾 E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding (NeurIPS 2024)
☆53Updated last month
showlab / MovieSeq
[ECCV2024] Learning Video Context as Interleaved Multimodal Sequences
☆35Updated last month
mu-cai / TemporalBench
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
☆27Updated 3 months ago
VincentDENGP / 3D-LR
Can 3D Vision-Language Models Truly Understand Natural Language?
☆21Updated 10 months ago
Yui010206 / CREMA
[ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
☆40Updated 3 weeks ago
Becomebright / GroundVQA
Official PyTorch code of "Grounded Question-Answering in Long Egocentric Videos", accepted by CVPR 2024.
☆56Updated 5 months ago
CVMI-Lab / clip-beyond-tail
(NeurIPS 2024) What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights
☆24Updated 3 months ago
orrzohar / Video-STaR
[ICLR 2025] Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
☆59Updated 7 months ago
amitakamath / whatsup_vlms
Code and datasets for "What’s “up” with vision-language models? Investigating their struggle with spatial reasoning".
☆40Updated 11 months ago
yuhui-zh15 / VLMClassifier
Official implementation of "Why are Visually-Grounded Language Models Bad at Image Classification?" (NeurIPS 2024)
☆70Updated 4 months ago
whwu95 / FreeVA
FreeVA: Offline MLLM as Training-Free Video Assistant
☆55Updated 8 months ago