nguyentthong / video-language-understanding
[ACL’24 Findings] Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
☆33 Updated 2 months ago
Related projects
Alternatives and complementary repositories for video-language-understanding
- Source code for EMNLP 2022 paper “PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models” ☆47 Updated 2 years ago
- [CVPR 2022] A large-scale public benchmark dataset for video question-answering, especially about evidence and commonsense reasoning. The… ☆51 Updated 4 months ago
- Official implementation of HawkEye: Training Video-Text LLMs for Grounding Text in Videos ☆34 Updated 6 months ago
- ☆11 Updated 10 months ago
- ☆66 Updated 3 weeks ago
- ☆79 Updated 2 years ago
- MMICL, a state-of-the-art VLM with in-context learning ability, from ICL, PKU ☆40 Updated last year
- NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR'21) ☆129 Updated 3 months ago
- Official repository for the A-OKVQA dataset ☆63 Updated 6 months ago
- Can I Trust Your Answer? Visually Grounded Video Question Answering (CVPR'24, Highlight) ☆58 Updated 4 months ago
- [ICML 2024] Official implementation for "HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding" ☆70 Updated 5 months ago
- [Preprint] TRACE: Temporal Grounding Video LLM via Causal Event Modeling ☆37 Updated this week
- A Prompted Visual Hallucination Evaluation Dataset, featuring over 100,000 data points and four advanced evaluation modes. The dataset in… ☆11 Updated last week
- Official implementation of paper 'Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal … ☆27 Updated last week
- [EMNLP’24 Main] Encoding and Controlling Global Semantics for Long-form Video Question Answering ☆13 Updated last month
- PyTorch code for Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners ☆112 Updated 2 years ago
- Official code for our paper "Model Composition for Multimodal Large Language Models" ☆17 Updated 6 months ago
- ☆68 Updated last year
- Official Code for the ICCV23 Paper: "LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Sparse Retrieval… ☆41 Updated last year
- This repo contains code for Invariant Grounding for Video Question Answering ☆26 Updated last year
- ☆33 Updated 10 months ago
- A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability ☆18 Updated this week
- ☆63 Updated 5 years ago
- NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media, EMNLP 2021 ☆33 Updated 2 months ago
- ☆27 Updated last year
- VQACL: A Novel Visual Question Answering Continual Learning Setting (CVPR'23) ☆31 Updated 7 months ago
- Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos ☆19 Updated 4 months ago
- The official GitHub page for "Evaluating Object Hallucination in Large Vision-Language Models" ☆178 Updated 7 months ago
- [CVPR 2024] Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension ☆34 Updated 7 months ago
- ☆23 Updated 6 months ago