kyegomez / MC-ViTLinks

Implementation of the model: "(MC-ViT)" from the paper: "Memory Consolidation Enables Long-Context Video Understanding"

☆25

Alternatives and similar repositories for MC-ViT

Users that are interested in MC-ViT are comparing it to the libraries listed below

Sorting:

RenShuhuai-Andy / TESTA
[EMNLP 2023] TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
☆49Updated last year
SHI-Labs / VisPer-LM
[NeurIPS 2025] Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation, arXiv 2024
☆66Updated last month
kkahatapitiya / LangRepo
Code for our ACL 2025 paper "Language Repository for Long Video Understanding"
☆33Updated last year
kyegomez / LIMoE
Implementation of the "the first large-scale multimodal mixture of experts models." from the paper: "Multimodal Contrastive Learning with…
☆36Updated last month
sunsmarterjie / ChatterBox
[AAAI2025] ChatterBox: Multi-round Multimodal Referring and Grounding, Multimodal, Multi-round dialogues
☆58Updated 7 months ago
kyegomez / Mirasol
Pytorch Implementation of the Model from "MIRASOL3B: A MULTIMODAL AUTOREGRESSIVE MODEL FOR TIME-ALIGNED AND CONTEXTUAL MODALITIES"
☆26Updated 10 months ago
joslefaure / HERMES
[ICCV'25] HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics
☆37Updated 2 months ago
mlvlab / RALF
Official implementation of CVPR 2024 paper "Retrieval-Augmented Open-Vocabulary Object Detection".
☆44Updated last year
bigai-nlco / VideoLLaMB
[ICCV 2025] Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges
☆79Updated 9 months ago
m1k2zoo / negbench
Evaluation and dataset construction code for the CVPR 2025 paper "Vision-Language Models Do Not Understand Negation"
☆41Updated 7 months ago
callsys / TextVR
[PR 2024] A large Cross-Modal Video Retrieval Dataset with Reading Comprehension
☆28Updated last year
artemisp / LAVIS-XInstructBLIP
LAVIS - A One-stop Library for Language-Vision Intelligence
☆48Updated last year
ByungKwanLee / Phantom
[Technical Report] Official PyTorch implementation code for realizing the technical part of Phantom of Latent representing equipped with …
☆61Updated last year
MCR-PEFT / Ex-MCR
☆45Updated 6 months ago
FactoDeepLearning / MultitaskVLFM
☆26Updated 2 years ago
mbzuai-oryx / VideoMolmo
Official code of the paper "VideoMolmo: Spatio-Temporal Grounding meets Pointing"
☆54Updated 5 months ago
OpenGVLab / TPO
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
☆62Updated 4 months ago
top-yun / SPARK
A benchmark dataset and simple code examples for measuring the perception and reasoning of multi-sensor Vision Language models.
☆19Updated 11 months ago
NVlabs / STL
Official Pytorch Implementation of Self-emerging Token Labeling
☆35Updated last year
eric-ai-lab / ComCLIP
Official implementation and dataset for the NAACL 2024 paper "ComCLIP: Training-Free Compositional Image and Text Matching"
☆37Updated last year
om-ai-lab / ZoomEye
[EMNLP-2025 Oral] ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
☆62Updated 2 weeks ago
jihaonew / MM-Instruct
MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment
☆35Updated last year
csarron / PuMer
[ACL 2023] PuMer: Pruning and Merging Tokens for Efficient Vision Language Models
☆36Updated last year
klauscc / DAM
Official code for DAM: Dynamic Adapter Merging for Continual Video QA Learning
☆14Updated last year
Hritikbansal / videocon
☆58Updated last year
Hao840 / ADEM-VL
PyTorch code for "ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning"
☆21Updated last year
showlab / MovieSeq
[ECCV 2024] Learning Video Context as Interleaved Multimodal Sequences
☆40Updated 8 months ago
yuecao0119 / MMFuser
The official implementation of the paper "MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding". …
☆60Updated last year
whwu95 / FreeVA
FreeVA: Offline MLLM as Training-Free Video Assistant
☆65Updated last year
kahnchana / mvu
🤖 [ICLR'25] Multimodal Video Understanding Framework (MVU)
☆51Updated 10 months ago