farewellthree / PPLLaVA
Official GPU implementation of the paper "PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance"
☆130 · Updated last year
Alternatives and similar repositories for PPLLaVA
Users interested in PPLLaVA are comparing it to the repositories listed below.
- VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning ☆284 · Updated last month
- [ICML 2025] Official PyTorch implementation of LongVU ☆412 · Updated 6 months ago
- [ACL 2025 Findings] Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models ☆81 · Updated 6 months ago
- The official repo for "Vidi: Large Multimodal Models for Video Understanding and Editing" ☆393 · Updated this week
- A new multi-shot video understanding benchmark, Shot2Story, with comprehensive video summaries and detailed shot-level captions. ☆159 · Updated 10 months ago
- ☆184 · Updated 4 months ago
- Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions (NeurIPS 2024) ☆169 · Updated last year
- ☆80 · Updated 8 months ago
- ☆141 · Updated 4 months ago
- Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines ☆128 · Updated last year
- Official code for GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation ☆144 · Updated last year
- MovieAgent: Automated Movie Generation via Multi-Agent CoT Planning ☆263 · Updated 8 months ago
- LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale (CVPR 2025) ☆312 · Updated last month
- The official implementation of ICCV 2025 "Flash-VStream: Efficient Real-Time Understanding for Long Video Streams" ☆251 · Updated last month
- [ACL 2025 Oral & Award] Evaluate Image/Video Generation like Humans - Fast, Explainable, Flexible ☆109 · Updated 3 months ago
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models ☆285 · Updated last year
- Valley is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, image, and video data. ☆263 · Updated this week
- [CVPR 2025] Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction ☆146 · Updated 8 months ago
- Official implementation of "Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence" ☆119 · Updated 3 weeks ago
- [CVPR 2024] VCoder: Versatile Vision Encoders for Multimodal Large Language Models ☆280 · Updated last year
- ☆201 · Updated last year
- [ICLR 2025] The official implementation of "VideoGrain: Modulating Space-Time Attention for Multi-Grained Video …" ☆157 · Updated 8 months ago
- Repository for the MM'23 accepted paper "Curriculum-Listener: Consistency- and Complementarity-Aware Audio-Enhanced Temporal Sentence Groundi…" ☆52 · Updated last year
- Code release for our NeurIPS 2024 Spotlight paper "GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing" ☆156 · Updated last year
- [CVPR 2025] EgoLife: Towards Egocentric Life Assistant ☆351 · Updated 8 months ago
- [NeurIPS 2025] The official implementation of our paper "Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehensi…" ☆351 · Updated last month
- [IJCV'24] AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort ☆151 · Updated last year
- Long Context Transfer from Language to Vision ☆398 · Updated 8 months ago
- Tarsier -- a family of large-scale video-language models designed to generate high-quality video descriptions, together with g… ☆502 · Updated 3 months ago
- ☆35 · Updated 10 months ago