bytedance / Shot2Story
Shot2Story is a new multi-shot video understanding benchmark with comprehensive video summaries and detailed shot-level captions.
☆117 Updated 3 weeks ago
Alternatives and similar repositories for Shot2Story:
Users who are interested in Shot2Story are comparing it to the repositories listed below:
- ☆174 Updated 7 months ago
- ☆134 Updated last month
- ☆173 Updated 7 months ago
- This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams" ☆163 Updated last month
- ☆68 Updated 2 months ago
- A simple script that reads a directory of videos, grabs a random frame, and automatically discovers a prompt for it ☆133 Updated last year
- Long Context Transfer from Language to Vision ☆360 Updated 3 months ago
- Official GPU implementation of the paper "PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance" ☆124 Updated 3 months ago
- Multimodal Models in Real World ☆437 Updated 3 months ago
- [AAAI 2025] Official pytorch implementation of "VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion … ☆157 Updated 10 months ago
- Official code for 'Paragraph-to-Image Generation with Information-Enriched Diffusion Model' ☆102 Updated 2 months ago
- Code repository for T2V-Turbo and T2V-Turbo-v2 ☆285 Updated 3 weeks ago
- EILeV: Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties ☆118 Updated 3 months ago
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, … ☆101 Updated 2 weeks ago
- InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions ☆129 Updated last year
- (CVPR 2024) Official code for paper "Towards Language-Driven Video Inpainting via Multimodal Large Language Models" ☆90 Updated 10 months ago
- ☆351 Updated 3 months ago
- Official Repository of paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding ☆258 Updated 6 months ago
- VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling ☆310 Updated this week
- [ICLR 2025] OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation ☆243 Updated last week
- Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models (ICLR 2024) ☆136 Updated 9 months ago
- [IJCV'24] AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort ☆146 Updated 2 months ago
- [ICLR 2024] LLM-grounded Video Diffusion Models (LVD): official implementation for the LVD paper ☆145 Updated 9 months ago
- ☆160 Updated 4 months ago
- The HD-VG-130M Dataset ☆116 Updated 10 months ago
- ☆72 Updated 9 months ago
- Official repo for paper "MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions" ☆405 Updated 5 months ago
- PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models ☆250 Updated last year
- [NeurIPS 2024] VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models ☆133 Updated 4 months ago
- 🔥🔥First-ever hour scale video understanding models ☆234 Updated last month