KastanDay / video-pretrained-transformer

Multi-model video-to-text by combining embeddings from Flan-T5 + CLIP + Whisper + SceneGraph. The 'backbone LLM' is pre-trained from scratch on YouTube (YT-1B dataset).
52 · Updated last year

Alternatives and similar repositories for video-pretrained-transformer:

Users who are interested in video-pretrained-transformer compare it to the libraries listed below.