Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines
☆129 Nov 6, 2024 Updated last year
Alternatives and similar repositories for VSA
Users that are interested in VSA are comparing it to the libraries listed below.
- Parsing-free RAG supported by VLMs ☆949 Dec 7, 2025 Updated 4 months ago
- ☆13 Feb 2, 2025 Updated last year
- Large Multimodal Model ☆15 Apr 8, 2024 Updated 2 years ago
- ☆100 Jun 23, 2025 Updated 10 months ago
- Introduces a novel Video Trimming (VT) task and proposes an agent-based approach (AVT) for detecting wasted footage, selecting valuable se… ☆25 Jan 20, 2025 Updated last year
- [ICCV 2025] Explore the Limits of Omni-modal Pretraining at Scale ☆124 Sep 2, 2024 Updated last year
- [NAACL 2025] Representing Rule-based Chatbots with Transformers ☆23 Feb 9, 2025 Updated last year
- IROS ☆18 Aug 10, 2025 Updated 8 months ago
- Repo for Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent ☆423 Apr 22, 2025 Updated last year
- Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning ☆49 Mar 25, 2026 Updated last month
- ☆13 Apr 23, 2025 Updated last year
- Normalization Matters in Weakly Supervised Object Localization (ICCV 2021) ☆11 Oct 24, 2021 Updated 4 years ago
- Code and data of "Controllable Unsupervised Event-based Video Generation" (accepted as ICIP oral and invited by WACV workshop) ☆19 Nov 5, 2024 Updated last year
- [NeurIPS VLM workshop 2024] In-Context Ensemble Learning from Pseudo Labels Improves Video-Language Models for Low-Level Workflow Underst… ☆23 Mar 16, 2025 Updated last year
- A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings. ☆1,447 Feb 11, 2026 Updated 2 months ago
- Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities ☆1,184 Jul 15, 2025 Updated 9 months ago
- What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness ☆27 May 16, 2025 Updated 11 months ago
- NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing ☆576 Oct 20, 2024 Updated last year
- [CVPR 2024] OneLLM: One Framework to Align All Modalities with Language ☆665 Oct 22, 2024 Updated last year
- The official repository of MM-R5 ☆29 Jun 22, 2025 Updated 10 months ago
- [ICCV 2025] Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges ☆85 Feb 27, 2025 Updated last year
- EgoToM is an egocentric theory-of-mind benchmark built on Ego4D videos, containing multi-choice questions that evaluate multimodal large … ☆14 Apr 1, 2025 Updated last year
- ☆35 Oct 9, 2025 Updated 6 months ago
- [NeurIPS 2025] Official code implementation of Perception R1: Pioneering Perception Policy with Reinforcement Learning ☆291 Jul 15, 2025 Updated 9 months ago
- ☆41 Jan 10, 2025 Updated last year
- [ACL 2024] Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding ☆17 Nov 10, 2025 Updated 5 months ago
- [ICCV 2025 Highlight] The official repository for "2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining" ☆198 Mar 17, 2025 Updated last year
- ☆17 Nov 17, 2023 Updated 2 years ago
- ☆24 Jun 18, 2025 Updated 10 months ago
- 🔥 [NeurIPS 2025] Official implementation of "Generate, but Verify: Reducing Visual Hallucination in Vision-Language Models with Retrospe… ☆57 Jan 22, 2026 Updated 3 months ago
- SW components and demos for visual kinship recognition. An emphasis is put on the FIW dataset-- data loaders, benchmarks, results in summ… ☆17 Mar 13, 2023 Updated 3 years ago
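- The simplest reproduction of R1 results on small models, explaining the most important essence of O1-like models and DeepSeek R1. Think is all your need. Backed by experiments: for strong reasoning ability, the thinking-process content is the core of AGI/ASI. ☆45 Feb 8, 2025 Updated last year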
- [ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning ☆296 Mar 13, 2024 Updated 2 years ago
- The repo for "On-the-fly Modulation for Balanced Multimodal Learning", T-PAMI 2024 ☆19 Sep 29, 2024 Updated last year
- Structured Video Comprehension of Real-World Shorts ☆237 Sep 21, 2025 Updated 7 months ago
- [NeurIPS 2024] RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models ☆31 Nov 12, 2024 Updated last year
- ☆20 Jun 10, 2025 Updated 10 months ago
- Implementation and evaluation of multimodal RAG with text and image inputs for industrial applications ☆70 Nov 6, 2024 Updated last year
- Implementation of "VL-Mamba: Exploring State Space Models for Multimodal Learning" ☆86 Mar 21, 2024 Updated 2 years ago