alibaba / alimama-video-narrator
Research code for ACL2024 paper: "Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline"
☆33 Updated 6 months ago
Alternatives and similar repositories for alimama-video-narrator
Users who are interested in alimama-video-narrator are comparing it to the libraries listed below
- ☆188 Updated last year
- ☆154 Updated 6 months ago
- A new multi-shot video understanding benchmark Shot2Story with comprehensive video summaries and detailed shot-level captions. ☆145 Updated 5 months ago
- This is the official implementation of ICCV 2025 "Flash-VStream: Efficient Real-Time Understanding for Long Video Streams" ☆213 Updated last week
- [ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions ☆227 Updated last year
- Repository for 23'MM accepted paper "Curriculum-Listener: Consistency- and Complementarity-Aware Audio-Enhanced Temporal Sentence Groundi… ☆49 Updated last year
- The official repo for "Vidi: Large Multimodal Models for Video Understanding and Editing" ☆121 Updated last month
- ☆75 Updated 4 months ago
- [ICCV 2025] LVBench: An Extreme Long Video Understanding Benchmark ☆95 Updated last week
- Tarsier -- a family of large-scale video-language models, which is designed to generate high-quality video descriptions, together with g… ☆428 Updated 2 months ago
- What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness ☆16 Updated 2 months ago
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, … ☆118 Updated 3 months ago
- AutoShot: A Short Video Dataset and State-of-the-Art Shot Boundary Detection - CVPR NAS 2023 ☆172 Updated 2 years ago
- A lightweight flexible Video-MLLM developed by TencentQQ Multimedia Research Team. ☆72 Updated 9 months ago
- Long Context Transfer from Language to Vision ☆384 Updated 4 months ago
- A Large-scale Dataset for training and evaluating model's ability on Dense Text Image Generation ☆71 Updated 4 months ago
- FunQA benchmarks funny, creative, and magic videos for challenging tasks including timestamp localization, video description, reasoning, … ☆102 Updated 7 months ago
- 🌀 R2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding (ECCV 2024) ☆84 Updated last year
- Narrative movie understanding benchmark ☆73 Updated last month
- Precision Search through Multi-Style Inputs ☆71 Updated 3 months ago
- 🔥🔥First-ever hour scale video understanding models ☆500 Updated last week
- [ECCV2024] Official code implementation of Merlin: Empowering Multimodal LLMs with Foresight Minds ☆94 Updated last year
- [CVPR 2025] Online Video Understanding: OVBench and VideoChat-Online ☆49 Updated last week
- [NeurIPS 2024] VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models ☆157 Updated 9 months ago
- LinVT: Empower Your Image-level Large Language Model to Understand Videos ☆81 Updated 6 months ago
- ☆152 Updated 8 months ago
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer ☆230 Updated last year
- Official repository of MMDU dataset ☆93 Updated 9 months ago
- Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks ☆298 Updated last year
- [ICCV'25] Explore the Limits of Omni-modal Pretraining at Scale ☆105 Updated 10 months ago