Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines
☆128Nov 6, 2024Updated last year
Alternatives and similar repositories for VSA
Users that are interested in VSA are comparing it to the libraries listed below.
- Parsing-free RAG supported by VLMs☆956Dec 7, 2025Updated 5 months ago
- ☆13Feb 2, 2025Updated last year
- Large Multimodal Model☆15Apr 8, 2024Updated 2 years ago
- ☆101Jun 23, 2025Updated 11 months ago
- [ICCV 2025] Explore the Limits of Omni-modal Pretraining at Scale☆125Sep 2, 2024Updated last year
- OPSTL: Self-supervised Skeleton-based Action Recognition in Occluded Environments☆14Oct 25, 2023Updated 2 years ago
- [NAACL 2025] Representing Rule-based Chatbots with Transformers☆23Feb 9, 2025Updated last year
- IROS☆17Aug 10, 2025Updated 9 months ago
- Repo for Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent☆424Apr 22, 2025Updated last year
- ☆13Apr 23, 2025Updated last year
- Normalization Matters in Weakly Supervised Object Localization (ICCV 2021)☆11Oct 24, 2021Updated 4 years ago
- Code and data of "Controllable Unsupervised Event-based Video Generation" (accepted as ICIP oral and invited by WACV workshop)☆19Nov 5, 2024Updated last year
- [NeurIPS VLM workshop 2024] In-Context Ensemble Learning from Pseudo Labels Improves Video-Language Models for Low-Level Workflow Underst…☆23Mar 16, 2025Updated last year
- A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.☆1,452Feb 11, 2026Updated 3 months ago
- Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities☆1,192Jul 15, 2025Updated 10 months ago
- What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness☆27May 16, 2025Updated last year
- NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing☆575Oct 20, 2024Updated last year
- InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions☆133Feb 7, 2024Updated 2 years ago
- Related papers about Referring Image Segmentation (RIS)☆16Dec 26, 2023Updated 2 years ago
- The official repository of MM-R5☆29Jun 22, 2025Updated 11 months ago
- [ICCV 2025] Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges☆85Feb 27, 2025Updated last year
- EgoToM is an egocentric theory-of-mind benchmark built on Ego4D videos, containing multi-choice questions that evaluate multimodal large …☆15Apr 1, 2025Updated last year
- [NeurIPS 2025] Official code implementation of Perception R1: Pioneering Perception Policy with Reinforcement Learning☆290Jul 15, 2025Updated 10 months ago
- ☆41Jan 10, 2025Updated last year
- [ACL 2024] Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding☆17Nov 10, 2025Updated 6 months ago
- [ICCV 2025 Highlight] The official repository for "2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining"☆198Mar 17, 2025Updated last year
- ☆17Nov 17, 2023Updated 2 years ago
- ☆24Jun 18, 2025Updated 11 months ago
- 🔥 [NeurIPS 2025] Official implementation of "Generate, but Verify: Reducing Visual Hallucination in Vision-Language Models with Retrospe…☆57Jan 22, 2026Updated 4 months ago
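- The simplest reproduction of R1 results on small models, explaining the most important essence of O1-like models and DeepSeek R1. Think is all your need. Experiments support the claim that, for strong reasoning ability, the content of the thinking process is the core of AGI/ASI.☆45Feb 8, 2025Updated last year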
- [ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning☆297Mar 13, 2024Updated 2 years ago
- ☆35Oct 9, 2025Updated 7 months ago
- Structured Video Comprehension of Real-World Shorts☆237Sep 21, 2025Updated 8 months ago
- ☆31Mar 25, 2024Updated 2 years ago
- ☆20Jun 10, 2025Updated 11 months ago
- Implementation and evaluation of multimodal RAG with text and image inputs for industrial applications☆71Nov 6, 2024Updated last year
- [CVPR2025] Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models☆21Apr 30, 2025Updated last year
- Implementation of "VL-Mamba: Exploring State Space Models for Multimodal Learning"☆86Mar 21, 2024Updated 2 years ago
- ☆12Nov 13, 2024Updated last year