farewellthree / PPLLaVA
Official GPU implementation of the paper "PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance"
☆124 Updated 2 months ago
Alternatives and similar repositories for PPLLaVA:
Users who are interested in PPLLaVA are comparing it to the repositories listed below:
- ☆348 Updated 3 months ago
- ☆172 Updated 7 months ago
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models ☆202 Updated 4 months ago
- Multimodal Models in Real World ☆435 Updated 3 months ago
- Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction ☆79 Updated last month
- This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams" ☆159 Updated last month
- Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines ☆114 Updated 3 months ago
- Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions (NeurIPS 2024) ☆155 Updated 6 months ago
- A new multi-shot video understanding benchmark Shot2Story with comprehensive video summaries and detailed shot-level captions. ☆117 Updated 2 weeks ago
- 🔥🔥 First-ever hour-scale video understanding models ☆231 Updated last month
- ☆66 Updated 2 months ago
- Official Code for GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation ☆137 Updated 3 months ago
- Implementation for the paper "ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems". ☆134 Updated 3 weeks ago
- Long Context Transfer from Language to Vision ☆360 Updated 2 months ago
- ☆174 Updated 7 months ago
- Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models ☆128 Updated last month
- 🔥 CharacterFactory: Sampling Consistent Characters with GANs for Diffusion Models ☆203 Updated 7 months ago
- [AAAI 2025] StoryWeaver: A Unified World Model for Knowledge-Enhanced Story Character Customization ☆185 Updated this week
- Tarsier -- a family of large-scale video-language models, which is designed to generate high-quality video descriptions, together with g… ☆269 Updated this week
- Official repository for the paper PLLaVA ☆637 Updated 6 months ago
- MuLan: Adapting Multilingual Diffusion Models for 110+ Languages (adds multilingual support to any diffusion model without additional training) ☆131 Updated 3 weeks ago
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs ☆73 Updated 3 months ago
- [IJCV'24] AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort ☆146 Updated 2 months ago
- [ECCV 2024] Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models ☆70 Updated 3 months ago
- Code release for our NeurIPS 2024 Spotlight paper "GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing" ☆105 Updated 3 months ago
- NeurIPS 2024 ☆352 Updated 4 months ago
- ☆61 Updated last week