RenShuhuai-Andy / TESTALinks

[EMNLP 2023] TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

☆49

Alternatives and similar repositories for TESTA

Users that are interested in TESTA are comparing it to the libraries listed below

Sorting:

jihaonew / MM-Instruct
MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment
☆35Updated last year
patrick-tssn / VideoHallucer
VideoHallucer, The first comprehensive benchmark for hallucination detection in large video-language models (LVLMs)
☆38Updated 3 weeks ago
Hritikbansal / videocon
☆58Updated last year
showlab / MovieSeq
[ECCV 2024] Learning Video Context as Interleaved Multimodal Sequences
☆40Updated 8 months ago
cliangyu / Cola
[NeurIPS2023] Official implementation of the paper "Large Language Models are Visual Reasoning Coordinators"
☆103Updated 2 years ago
whwu95 / FreeVA
FreeVA: Offline MLLM as Training-Free Video Assistant
☆65Updated last year
isekai-portal / Link-Context-Learning
☆100Updated last year
kkahatapitiya / LangRepo
Code for our ACL 2025 paper "Language Repository for Long Video Understanding"
☆32Updated last year
TencentARC / GVT
Official code for "What Makes for Good Visual Tokenizers for Large Language Models?".
☆58Updated 2 years ago
icoz69 / StableLLAVA
Official repo for StableLLAVA
☆94Updated last year
bigai-nlco / VideoLLaMB
[ICCV 2025] Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges
☆78Updated 8 months ago
Victorwz / MLM_Filter
Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters".
☆68Updated 7 months ago
mlfoundations / VisIT-Bench
☆50Updated 2 years ago
orrzohar / Video-STaR
[ICLR 2025] Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
☆71Updated last year
xjtupanda / Sparrow
Repo for paper "T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs"
☆48Updated 2 months ago
joez17 / VideoNIAH
VideoNIAH: A Flexible Synthetic Method for Benchmarking Video MLLMs
☆50Updated 8 months ago
DCDmllm / MorphTokens
☆43Updated last year
artemisp / LAVIS-XInstructBLIP
LAVIS - A One-stop Library for Language-Vision Intelligence
☆48Updated last year
imagegridworth / IG-VLM
☆139Updated last year
Hon-Wong / ByteVideoLLM
[ICCV 2025] Dynamic-VLM
☆26Updated 11 months ago
muirbench / MuirBench
A Comprehensive Benchmark for Robust Multi-image Understanding
☆15Updated last year
MengLcool / DeepStack-VL
[NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect…
☆66Updated last year
Yui010206 / CREMA
[ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
☆54Updated 4 months ago
Yangyi-Chen / SOLO
[TMLR] Public code repo for paper "A Single Transformer for Scalable Vision-Language Modeling"
☆148Updated last year
llyx97 / TempCompass
[ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, …
☆125Updated 7 months ago
Ahnsun / merlin
[ECCV2024] Official code implementation of Merlin: Empowering Multimodal LLMs with Foresight Minds
☆95Updated last year
AGI-Edgerunners / IIL
Code for our Paper "All in an Aggregated Image for In-Image Learning"
☆29Updated last year
rxtan2 / Koala-video-llm
☆36Updated last year
OpenGVLab / TPO
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
☆61Updated 4 months ago
thunlp / Muffin
☆66Updated last year