ttharden / Keyframe-Extraction-for-video-summarization
☆23Updated 4 months ago
Alternatives and similar repositories for Keyframe-Extraction-for-video-summarization:
Users who are interested in Keyframe-Extraction-for-video-summarization are comparing it to the libraries listed below
- Repository for 23'MM accepted paper "Curriculum-Listener: Consistency- and Complementarity-Aware Audio-Enhanced Temporal Sentence Groundi…☆47Updated last year
- Incredibly descriptive audiovisual summaries for videos☆40Updated 5 months ago
- ☆173Updated 6 months ago
- SkyScript-100M: 1,000,000,000 Pairs of Scripts and Shooting Scripts for Short Drama: https://arxiv.org/abs/2408.09333v2☆110Updated 2 months ago
- ☆170Updated 6 months ago
- ☆27Updated 5 months ago
- Official Code for GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation☆134Updated 3 months ago
- A Comprehensive Framework for Developing and Evaluating Multimodal Role-Playing Agents☆41Updated 2 months ago
- Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions (NeurIPS 2024)☆152Updated 6 months ago
- A new multi-shot video understanding benchmark Shot2Story with comprehensive video summaries and detailed shot-level captions.☆107Updated 4 months ago
- ☆28Updated last week
- Evaluate the Opinion Leadership of LLMs in the Werewolf Game☆9Updated 5 months ago
- ☆66Updated 2 months ago
- A Framework for Decoupling and Assessing the Capabilities of VLMs☆40Updated 7 months ago
- PyTorch Implementation of the Model from "MIRASOL3B: A MULTIMODAL AUTOREGRESSIVE MODEL FOR TIME-ALIGNED AND CONTEXTUAL MODALITIES"☆26Updated this week
- SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems☆80Updated last year
- Reproduction of LLaVA-v1.5 based on Llama-3-8b LLM backbone.☆61Updated 3 months ago
- ☆32Updated 8 months ago
- The Hugging Face implementation of Fine-grained Late-interaction Multi-modal Retriever.☆80Updated this week
- ☆64Updated 4 months ago
- ☆15Updated last year
- ☆66Updated last year
- 🔥🔥First-ever hour-scale video understanding models☆226Updated last month
- Narrative movie understanding benchmark☆63Updated 8 months ago
- ☆73Updated 10 months ago
- Repository for the NeurIPS 2024 paper "SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up…☆20Updated last month
- Long Context Transfer from Language to Vision☆359Updated 2 months ago
- A collection of strong multimodal models for building multimodal AGI agents☆38Updated 6 months ago
- Valley is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, images, and video data.☆197Updated last week
- Official PyTorch Implementation of MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced …☆53Updated 2 months ago