nguyentthong / video-language-understanding
[ACL’24 Findings] Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
☆38 Updated 7 months ago
Alternatives and similar repositories for video-language-understanding:
Users who are interested in video-language-understanding are comparing it to the repositories listed below
- [EMNLP’24 Main] Encoding and Controlling Global Semantics for Long-form Video Question Answering ☆17 Updated 5 months ago
- Can I Trust Your Answer? Visually Grounded Video Question Answering (CVPR'24, Highlight) ☆66 Updated 8 months ago
- Official implementation of HawkEye: Training Video-Text LLMs for Grounding Text in Videos ☆40 Updated 11 months ago
- 【ICLR 2024, Spotlight】Sentence-level Prompts Benefit Composed Image Retrieval ☆80 Updated 11 months ago
- (ICML 2024) Improve Context Understanding in Multimodal Large Language Models via Multimodal Composition Learning ☆27 Updated 6 months ago
- ☆16 Updated 4 months ago
- [CVPR 2024] Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension ☆48 Updated 11 months ago
- ☆69 Updated 4 months ago
- MMICL, a state-of-the-art VLM with the in context learning ability from ICL, PKU ☆47 Updated last year
- This repo contains code for Invariant Grounding for Video Question Answering ☆27 Updated 2 years ago
- NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR'21) ☆28 Updated last year
- ICCV 2023 (Oral) Open-domain Visual Entity Recognition Towards Recognizing Millions of Wikipedia Entities ☆38 Updated 6 months ago
- HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data (Accepted by CVPR 2024) ☆44 Updated 8 months ago
- NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR'21) ☆147 Updated 8 months ago
- Official github repo for ICCV2023 paper 'Multi-event Video-Text Retrieval' ☆18 Updated last year
- VideoNIAH: A Flexible Synthetic Method for Benchmarking Video MLLMs ☆46 Updated 3 weeks ago
- Source code for EMNLP 2022 paper “PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models” ☆48 Updated 2 years ago
- [ICLR 2025] TRACE: Temporal Grounding Video LLM via Causal Event Modeling ☆76 Updated 2 months ago
- Official PyTorch code of GroundVQA (CVPR'24) ☆58 Updated 6 months ago
- This is the official repository for the paper "Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World"… ☆47 Updated last year
- Latest Advances on (RL based) Multimodal Reasoning and Generation in Multimodal Large Language Models ☆17 Updated this week
- [CVPR 2024] Official PyTorch implementation of the paper "One For All: Video Conversation is Feasible Without Video Instruction Tuning" ☆32 Updated last year
- Video Graph Transformer for Video Question Answering (ECCV'22) ☆47 Updated last year
- [CVPR 2022] A large-scale public benchmark dataset for video question-answering, especially about evidence and commonsense reasoning. The… ☆54 Updated 8 months ago
- [ACM MM 2024] Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives ☆30 Updated 5 months ago
- Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos ☆23 Updated 9 months ago
- A comprehensive survey of Composed Multi-modal Retrieval (CMR), including Composed Image Retrieval (CIR) and Composed Video Retrieval (CV… ☆22 Updated 3 weeks ago
- An official implementation for MS-DETR in ACL'23 ☆16 Updated last year
- Benchmark data for "Rethinking Benchmarks for Cross-modal Image-text Retrieval" (SIGIR 2023) ☆25 Updated last year
- [2023 ACL] CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding ☆30 Updated last year