cnzzx / VSA
Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines
☆117 · Updated 4 months ago
Alternatives and similar repositories for VSA:
Users interested in VSA are comparing it to the repositories listed below.
- ☆80 · Updated 10 months ago
- Valley is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, images, and video data. ☆219 · Updated 2 weeks ago
- This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams" ☆169 · Updated 2 months ago
- ☆172 · Updated last month
- [CVPR2025] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models ☆146 · Updated 2 weeks ago
- Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions (NeurIPS 2024) ☆155 · Updated 7 months ago
- Official code of *Virgo: A Preliminary Exploration on Reproducing o1-like MLLM* ☆95 · Updated 2 weeks ago
- 🔥🔥 First-ever hour-scale video understanding models ☆247 · Updated last week
- Official code for GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation ☆138 · Updated 4 months ago
- Official GPU implementation of the paper "PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance" ☆126 · Updated 3 months ago
- The official repository for "2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining" ☆144 · Updated last month
- Explore the Limits of Omni-modal Pretraining at Scale ☆96 · Updated 6 months ago
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer ☆216 · Updated 11 months ago
- This is the official implementation of our paper "Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension" ☆152 · Updated 2 weeks ago
- Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models ☆59 · Updated 4 months ago
- MuLan: Adapting Multilingual Diffusion Models for 110+ Languages (adds multilingual support to any diffusion model without additional training) ☆132 · Updated last month
- [ICLR2025] Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want ☆66 · Updated last month
- Precision Search through Multi-Style Inputs ☆64 · Updated 7 months ago
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs ☆74 · Updated 4 months ago
- [ICLR2025] LLaVA-HR: High-Resolution Large Language-Vision Assistant ☆235 · Updated 7 months ago
- Official code implementation of Slow Perception: Let's Perceive Geometric Figures Step-by-step ☆116 · Updated 3 weeks ago
- Official repository for the paper MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning (https://arxiv.org/abs/2406.17770). ☆153 · Updated 5 months ago
- ☆73 · Updated last year
- ☆164 · Updated 8 months ago
- LVBench: An Extreme Long Video Understanding Benchmark ☆84 · Updated 6 months ago
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture ☆198 · Updated 2 months ago