farewellthree / PPLLaVA
Official GPU implementation of the paper "PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance"
☆131 · Updated last year
Alternatives and similar repositories for PPLLaVA
Users interested in PPLLaVA are comparing it to the libraries listed below.
- VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning (ICLR 2026) ☆305 · Updated 2 weeks ago
- [ICML 2025] Official PyTorch implementation of LongVU ☆421 · Updated 9 months ago
- A new multi-shot video understanding benchmark Shot2Story with comprehensive video summaries and detailed shot-level captions. ☆167 · Updated last year
- ☆186 · Updated 6 months ago
- Official code for GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation ☆144 · Updated last year
- ☆148 · Updated 6 months ago
- [ACL 2025 Findings] Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models ☆90 · Updated 8 months ago
- Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions (NeurIPS 2024) ☆171 · Updated last year
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models ☆286 · Updated last year
- ☆82 · Updated 11 months ago
- The official implementation of ICCV 2025 "Flash-VStream: Efficient Real-Time Understanding for Long Video Streams" ☆267 · Updated 3 months ago
- Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines ☆130 · Updated last year
- Valley is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, images, and video data. ☆270 · Updated 3 weeks ago
- [ACL 2025 Oral & Award] Evaluate Image/Video Generation like Humans - Fast, Explainable, Flexible ☆119 · Updated 6 months ago
- LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale (CVPR 2025) ☆418 · Updated 3 months ago
- The official repo for "Vidi: Large Multimodal Models for Video Understanding and Editing" ☆578 · Updated 2 weeks ago
- ☆203 · Updated last year
- Long Context Transfer from Language to Vision ☆398 · Updated 10 months ago
- [CVPR 2025] Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction ☆168 · Updated 10 months ago
- Repository for 23'MM accepted paper "Curriculum-Listener: Consistency- and Complementarity-Aware Audio-Enhanced Temporal Sentence Groundi… ☆52 · Updated 2 years ago
- MovieAgent: Automated Movie Generation via Multi-Agent CoT Planning ☆288 · Updated 10 months ago
- [CVPR 2024] VCoder: Versatile Vision Encoders for Multimodal Large Language Models ☆279 · Updated last year
- [ICLR 2025] VideoGrain: The official implementation of "VideoGrain: Modulating Space-Time Attention for Multi-Grained Video … ☆160 · Updated 10 months ago
- Implementation for the paper "ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems". ☆201 · Updated last month
- [NeurIPS 2025] The official implementation of the paper "Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehensi… ☆395 · Updated 3 weeks ago
- Multimodal Models in Real World ☆555 · Updated 11 months ago
- ☆187 · Updated last year
- [AAAI 2025] StoryWeaver: A Unified World Model for Knowledge-Enhanced Story Character Customization ☆227 · Updated 3 weeks ago
- Code release for the NeurIPS 2024 Spotlight paper "GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing" ☆160 · Updated last year
- Official repository of the paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding ☆292 · Updated 6 months ago