Yui010206 / CREMALinks

[ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion

☆54

Alternatives and similar repositories for CREMA

Users that are interested in CREMA are comparing it to the libraries listed below

Sorting:

AV-Odyssey / AV-Odyssey
This repo contains evaluation code for the paper "AV-Odyssey: Can Your Multimodal LLMs Really Understand Audio-Visual Information?"
☆30Updated 10 months ago
OpenGVLab / TPO
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
☆60Updated 3 months ago
MCR-PEFT / Ex-MCR
☆45Updated 5 months ago
aszala / VPEval
VPEval Codebase from Visual Programming for Text-to-Image Generation and Evaluation (NeurIPS 2023)
☆44Updated last year
UMass-Embodied-AGI / CoVLM
[ICLR 2023] CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding
☆45Updated 4 months ago
BriansIDP / video-SALMONN-o1
☆36Updated 2 months ago
showlab / MovieSeq
[ECCV 2024] Learning Video Context as Interleaved Multimodal Sequences
☆40Updated 7 months ago
EvolvingLMMs-Lab / MGPO
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
☆51Updated 3 months ago
Hritikbansal / videocon
☆57Updated last year
md-mohaiminul / BIMBA
☆27Updated 3 months ago
TencentARC / Video-Holmes
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
☆76Updated 3 months ago
j-min / VPGen
Visual Programming for Text-to-Image Generation and Evaluation (NeurIPS 2023)
☆56Updated 2 years ago
sterzhang / PVIT
Official Repository of Personalized Visual Instruct Tuning
☆32Updated 8 months ago
TIGER-AI-Lab / Vamba
Code for the paper "Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers" [ICCV 2025]
☆91Updated 3 months ago
JaaackHongggg / WorldSense
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
☆32Updated last month
XMUDeepLIT / AVG-LLaVA
Code for "AVG-LLaVA: A Multimodal Large Model with Adaptive Visual Granularity"
☆32Updated last year
TencentARC / GRPO-CARE
☆76Updated 4 months ago
jh-yi / Video-Panda
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models [CVPR 2025]
☆74Updated 4 months ago
JiuTian-VL / MoME
[NeurIPS 2024] MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models
☆72Updated 6 months ago
orrzohar / Video-STaR
[ICLR 2025] Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
☆70Updated last year
CeeZh / SILVR
Official Implementation for "SiLVR : A Simple Language-based Video Reasoning Framework"
☆19Updated 2 months ago
rxtan2 / Koala-video-llm
☆36Updated last year
ytaek-oh / fsc-clip
[EMNLP 2024] Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality
☆19Updated last year
Dongping-Chen / ISG
(ICLR 2025 Spotlight) Official code repository for Interleaved Scene Graph.
☆29Updated 3 months ago
fansunqi / AKeyS
Agentic Keyframe Search for Video Question Answering
☆13Updated 7 months ago
zeyofu / BLINK_Benchmark
This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.or…
☆147Updated last month
yonseivnl / vlm-rlaif
ACL'24 (Oral) Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback
☆76Updated last year
xjtupanda / Sparrow
Repo for paper "T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs"
☆48Updated 2 months ago
yunlong10 / CAT-V
[AAAI 26 Demo] Offical repo for CAT-V - Caption Anything in Video: Object-centric Dense Video Captioning with Spatiotemporal Multimodal P…
☆58Updated last week
TIGER-AI-Lab / QuickVideo
Quick Long Video Understanding
☆68Updated last week