Skyline-9 / Visionary-Vids
Multi-modal transformer approach for joint video summarization and highlight detection based on natural language queries
☆11Updated 6 months ago
Related projects
Alternatives and complementary repositories for Visionary-Vids
- The code repo for ICASSP 2023 Paper "MMCosine: Multi-Modal Cosine Loss Towards Balanced Audio-Visual Fine-Grained Learning"☆18Updated last year
- [PR 2024] A large Cross-Modal Video Retrieval Dataset with Reading Comprehension☆22Updated 10 months ago
- Codes and Models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model☆39Updated last year
- [ICCV 2023] Accurate and Fast Compressed Video Captioning☆34Updated 9 months ago
- [ECCV’24] Official Implementation for CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenario…☆41Updated 2 months ago
- [ICLR2024] The official implementation of paper "UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling", by …☆70Updated 9 months ago
- The repo for "Class-aware Sounding Objects Localization", TPAMI 2021.☆29Updated 2 years ago
- MUSIC-AVQA, CVPR2022 (ORAL)☆67Updated last year
- (ACL'2023) MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning☆35Updated 3 months ago
- Official code for WACV 2024 paper, "Annotation-free Audio-Visual Segmentation"☆27Updated last month
- ☆18Updated last month
- Narrative movie understanding benchmark☆59Updated 6 months ago
- [CVPR 2024] Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection☆75Updated 4 months ago
- Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports☆30Updated 10 months ago
- Official repository for "Boosting Audio Visual Question Answering via Key Semantic-Aware Cues" in ACM MM 2024.☆14Updated 3 weeks ago
- Official PyTorch implementation of the paper "Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring"☆98Updated 9 months ago
- [2023 TPAMI] Contrastive Positive Sample Propagation along the Audio-Visual Event Line☆26Updated last year
- Unified Audio-Visual Perception for Multi-Task Video Localization☆22Updated 7 months ago
- [ACL2023] VSTAR is a multimodal dialogue dataset with scene and topic transition information☆12Updated 3 weeks ago
- ACM Multimedia 2023 (Oral) - RTQ: Rethinking Video-language Understanding Based on Image-text Model☆15Updated 9 months ago
- NeurIPS'2023 official implementation code☆59Updated last year
- LMM which strictly superset LLM embedded☆30Updated 2 weeks ago
- Vision Transformers are Parameter-Efficient Audio-Visual Learners☆89Updated last year
- A repository of video-text retrieval work from the top computer vision conferences of recent years, synced with my blog: https://blog.csdn.net/AAliuxiaolei/article/details/121433833 ☆14Updated 2 years ago
- Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos☆20Updated 4 months ago
- 🦩 Visual Instruction Tuning with Polite Flamingo - training multi-modal LLMs to be both clever and polite! (AAAI-24 Oral)☆63Updated 11 months ago
- Codebase for the paper: "TIM: A Time Interval Machine for Audio-Visual Action Recognition"☆37Updated 2 weeks ago
- Source code of our MM'22 paper Partially Relevant Video Retrieval☆51Updated 2 weeks ago
- ☆17Updated 7 months ago
- LAVIS - A One-stop Library for Language-Vision Intelligence☆47Updated 3 months ago