farewellthree / PPLLaVA
Official GPU implementation of the paper "PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance"
★130 · Updated 11 months ago
Alternatives and similar repositories for PPLLaVA
Users that are interested in PPLLaVA are comparing it to the libraries listed below
- VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning · ★270 · Updated last week
- The official repo for "Vidi: Large Multimodal Models for Video Understanding and Editing" · ★141 · Updated last month
- [ICML 2025] Official PyTorch implementation of LongVU · ★403 · Updated 5 months ago
- A new multi-shot video understanding benchmark Shot2Story with comprehensive video summaries and detailed shot-level captions. · ★157 · Updated 8 months ago
- ★78 · Updated 7 months ago
- [ACL2025 Findings] Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models · ★78 · Updated 5 months ago
- LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale (CVPR 2025) · ★290 · Updated last week
- ★183 · Updated 2 months ago
- Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions (NeurIPS 2024) · ★167 · Updated last year
- Official Code for GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation · ★142 · Updated 11 months ago
- Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines · ★126 · Updated 11 months ago
- MovieAgent: Automated Movie Generation via Multi-Agent CoT Planning · ★255 · Updated 6 months ago
- This is the official implementation of ICCV 2025 "Flash-VStream: Efficient Real-Time Understanding for Long Video Streams" · ★238 · Updated last week
- [CVPR 2025] Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction · ★139 · Updated 7 months ago
- [ACL2025 Oral & Award] Evaluate Image/Video Generation like Humans - Fast, Explainable, Flexible · ★104 · Updated 2 months ago
- ★130 · Updated 2 months ago
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models · ★277 · Updated last year
- Valley is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, images, and video data. · ★252 · Updated 2 months ago
- ★194 · Updated last year
- ✨✨ [NeurIPS 2025] This is the official implementation of our paper "Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehensi… · ★321 · Updated last month
- Official repository for "VideoPrism: A Foundational Visual Encoder for Video Understanding" (ICML 2024) · ★312 · Updated 3 weeks ago
- Repository for 23'MM accepted paper "Curriculum-Listener: Consistency- and Complementarity-Aware Audio-Enhanced Temporal Sentence Groundi… · ★51 · Updated last year
- Long Context Transfer from Language to Vision · ★395 · Updated 7 months ago
- Multimodal Models in Real World · ★548 · Updated 8 months ago
- Structured Video Comprehension of Real-World Shorts · ★208 · Updated last month
- [ICLR 2025] VideoGrain: This repo is the official implementation of "VideoGrain: Modulating Space-Time Attention for Multi-Grained Video … · ★154 · Updated 7 months ago
- [CVPR 2024] VCoder: Versatile Vision Encoders for Multimodal Large Language Models · ★277 · Updated last year
- Official Repository of paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding · ★288 · Updated 2 months ago
- Tarsier -- a family of large-scale video-language models, which is designed to generate high-quality video descriptions, together with g… · ★494 · Updated 2 months ago
- ICML 2025 - Impossible Videos · ★77 · Updated 3 months ago