bytedance / Shot2Story

Shot2Story, a new multi-shot video understanding benchmark with comprehensive video summaries and detailed shot-level captions.

☆159

Alternatives and similar repositories for Shot2Story

Users who are interested in Shot2Story are comparing it to the repositories listed below.

md-mohaiminul / VideoRecap
☆200 Updated last year
hyc2026 / StoryTeller
☆80 Updated 8 months ago
bytedance / vidi
The official repo for "Vidi: Large Multimodal Models for Video Understanding and Editing"
☆148 Updated last week
Deaddawn / DreamFrame-code
☆184 Updated 3 months ago
WangWenhao0716 / VidProM
[NeurIPS 2024] VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models
☆166 Updated last year
EvolvingLMMs-Lab / LongVA
Long Context Transfer from Language to Vision
☆397 Updated 8 months ago
HL-hanlin / VideoDirectorGPT
Official implementation of VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning (COLM 2024)
☆176 Updated last year
Vision-CAIR / LongVU
[ICML 2025] Official PyTorch implementation of LongVU
☆412 Updated 6 months ago
mutonix / Vript
☆156 Updated 10 months ago
IVGSZ / Flash-VStream
Official implementation of the ICCV 2025 paper "Flash-VStream: Efficient Real-Time Understanding for Long Video Streams"
☆247 Updated last month
mbzuai-oryx / VideoGPT-plus
Official repository of the paper "VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding"
☆291 Updated 3 months ago
yukw777 / VideoBLIP
Supercharged BLIP-2 that can handle videos
☆123 Updated last year
invictus717 / InteractiveVideo
InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions
☆131 Updated last year
bytedance / tarsier
Tarsier: a family of large-scale video-language models designed to generate high-quality video descriptions, together with g…
☆501 Updated 3 months ago
weijiawu / ParaDiffusion
[IJCV 2025] Paragraph-to-Image Generation with Information-Enriched Diffusion Model
☆105 Updated 8 months ago
mbzuai-oryx / Video-LLaVA
PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
☆259 Updated 3 months ago
farewellthree / PPLLaVA
Official GPU implementation of the paper "PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance"
☆130 Updated last year
AILab-CVC / SEED-X
Multimodal Models in Real World
☆549 Updated 9 months ago
yukw777 / EILEV
EILeV: Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties
☆131 Updated last year
gpt4video / GPT4Video
Official code for GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation
☆144 Updated last year
NVlabs / LITA
☆189 Updated last year
sterzhang / image-textualization
Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions (NeurIPS 2024)
☆169 Updated last year
SHI-Labs / VCoder
[CVPR 2024] VCoder: Versatile Vision Encoders for Multimodal Large Language Models
☆280 Updated last year
JourneyDB / JourneyDB
☆179 Updated last week
zhenyuw16 / GenArtist
Code release for our NeurIPS 2024 Spotlight paper "GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing"
☆156 Updated last year
magic-research / PLLaVA
Official repository for the paper PLLaVA
☆672 Updated last year
aim-uofa / AutoStory
[IJCV'24] AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort
☆151 Updated last year
Yuanshi9815 / Video-Infinity
Video-Infinity generates long videos quickly using multiple GPUs without extra training.
☆187 Updated last year
rese1f / aurora
[ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
☆132 Updated 5 months ago
imagegridworth / IG-VLM
☆139 Updated last year