Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines
☆130 · Nov 6, 2024 · Updated last year
Alternatives and similar repositories for VSA
Users that are interested in VSA are comparing it to the libraries listed below.
- Parsing-free RAG supported by VLMs ☆939 · Dec 7, 2025 · Updated 3 months ago
- ☆13 · Feb 2, 2025 · Updated last year
- Large Multimodal Model ☆15 · Apr 8, 2024 · Updated last year
- ☆99 · Jun 23, 2025 · Updated 9 months ago
- Introduce a novel Video Trimming (VT) task and proposes an agent-based approach (AVT) for detecting wasted footage, selecting valuable se… ☆23 · Jan 20, 2025 · Updated last year
- [ICCV 2025] Explore the Limits of Omni-modal Pretraining at Scale ☆124 · Sep 2, 2024 · Updated last year
- [NAACL 2025] Representing Rule-based Chatbots with Transformers ☆23 · Feb 9, 2025 · Updated last year
- Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning ☆45 · Mar 6, 2026 · Updated 2 weeks ago
- Repo for Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent ☆416 · Apr 22, 2025 · Updated 11 months ago
- ☆13 · Apr 23, 2025 · Updated 11 months ago
- Code and data of "Controllable Unsupervised Event-based Video Generation" (accepted as ICIP oral and invited by WACV workshop) ☆19 · Nov 5, 2024 · Updated last year
- [NeurIPS VLM workshop 2024] In-Context Ensemble Learning from Pseudo Labels Improves Video-Language Models for Low-Level Workflow Underst… ☆23 · Mar 16, 2025 · Updated last year
- What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness ☆26 · May 16, 2025 · Updated 10 months ago
- A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings. ☆1,439 · Feb 11, 2026 · Updated last month
- Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities ☆1,168 · Jul 15, 2025 · Updated 8 months ago
- ☆19 · Jun 10, 2025 · Updated 9 months ago
- NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing ☆579 · Oct 20, 2024 · Updated last year
- Implementation of "DIME-FM: DIstilling Multimodal and Efficient Foundation Models" ☆15 · Oct 12, 2023 · Updated 2 years ago
- InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions ☆132 · Feb 7, 2024 · Updated 2 years ago
- Related papers about Referring Image Segmentation (RIS) ☆16 · Dec 26, 2023 · Updated 2 years ago
- [CVPR 2024] OneLLM: One Framework to Align All Modalities with Language ☆665 · Oct 22, 2024 · Updated last year
- [ACL 2024] Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding ☆15 · Nov 10, 2025 · Updated 4 months ago
- [ICCV 2025] Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges ☆84 · Feb 27, 2025 · Updated last year
- EgoToM is an egocentric theory-of-mind benchmark built on Ego4D videos, containing multi-choice questions that evaluate multimodal large … ☆13 · Apr 1, 2025 · Updated 11 months ago
- ☆34 · Oct 9, 2025 · Updated 5 months ago
- [NeurIPS 2025] Official code implementation of Perception R1: Pioneering Perception Policy with Reinforcement Learning ☆287 · Jul 15, 2025 · Updated 8 months ago
- ☆93 · Feb 23, 2026 · Updated last month
- ☆41 · Jan 10, 2025 · Updated last year
- ☆32 · Mar 25, 2024 · Updated 2 years ago
- [ICCV 2025 Highlight] The official repository for "2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining" ☆196 · Mar 17, 2025 · Updated last year
- ☆24 · Jun 18, 2025 · Updated 9 months ago
- 🔥 [NeurIPS 2025] Official implementation of "Generate, but Verify: Reducing Visual Hallucination in Vision-Language Models with Retrospe… ☆56 · Jan 22, 2026 · Updated 2 months ago
- SW components and demos for visual kinship recognition. An emphasis is put on the FIW dataset-- data loaders, benchmarks, results in summ… ☆17 · Mar 13, 2023 · Updated 3 years ago
- [ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning ☆296 · Mar 13, 2024 · Updated 2 years ago
- The simplest reproduction of R1-style results on small models, illustrating the most important shared essence of O1-like models and DeepSeek R1: "Think is all you need." Experiments support the claim that, for strong reasoning ability, the content of the think process is the core of AGI/ASI. ☆45 · Feb 8, 2025 · Updated last year
- The repo for "On-the-fly Modulation for Balanced Multimodal Learning", T-PAMI 2024 ☆19 · Sep 29, 2024 · Updated last year
- Structured Video Comprehension of Real-World Shorts ☆233 · Sep 21, 2025 · Updated 6 months ago
- [NeurIPS 2024] RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models ☆31 · Nov 12, 2024 · Updated last year
- Implementation and evaluation of multimodal RAG with text and image inputs for industrial applications ☆70 · Nov 6, 2024 · Updated last year