Vision-CAIR / LongVU
☆259 · Updated last week
Related projects
Alternatives and complementary repositories for LongVU
- Multimodal Models in Real World ☆400 · Updated 2 weeks ago
- Long Context Transfer from Language to Vision ☆328 · Updated 2 weeks ago
- Official repository for the paper PLLaVA ☆584 · Updated 3 months ago
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models ☆166 · Updated last month
- VCoder: Versatile Vision Encoders for Multimodal Large Language Models, arXiv 2023 / CVPR 2024 ☆261 · Updated 6 months ago
- Official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams" ☆127 · Updated 3 months ago
- LLaVA-HR: High-Resolution Large Language-Vision Assistant ☆213 · Updated 2 months ago
- 🔥🔥 First-ever hour-scale video understanding models ☆152 · Updated 2 weeks ago
- Official code for Goldfish model for long video understanding and MiniGPT4-video for short video understanding ☆553 · Updated last month
- Implementation of PALI3 from the paper "PALI-3 Vision Language Models: Smaller, Faster, Stronger" ☆144 · Updated last week
- Official code of VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding (ECCV 2024) ☆128 · Updated 2 months ago
- PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models ☆242 · Updated 10 months ago
- A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision,… ☆171 · Updated 3 weeks ago
- Official Implementation of "Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraini… ☆495 · Updated 2 months ago
- Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions (NeurIPS 2024) ☆142 · Updated 3 months ago
- Official code for GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation ☆132 · Updated 2 weeks ago
- Official repo for the paper "MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions" ☆366 · Updated 2 months ago
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images ☆318 · Updated last month
- Movie Gen Bench: two media generation evaluation benchmarks released with Meta Movie Gen ☆326 · Updated 3 weeks ago
- [ACL 2024] GroundingGPT: Language-Enhanced Multi-modal Grounding Model ☆302 · Updated last week
- Official implementation of SEED-LLaMA (ICLR 2024) ☆574 · Updated last month
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture ☆178 · Updated last month
- HPT: Open Multimodal LLMs from HyperGAI ☆312 · Updated 5 months ago
- Official repository of the paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding ☆212 · Updated 3 months ago
- A family of highly capable yet efficient large multimodal models ☆161 · Updated 2 months ago
- Data release for the ImageInWords (IIW) paper ☆200 · Updated 5 months ago