Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines
☆128Nov 6, 2024Updated last year
Alternatives and similar repositories for VSA
Users that are interested in VSA are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.
Sorting:
- Parsing-free RAG supported by VLMs☆962Dec 7, 2025Updated 6 months ago
- ☆13Feb 2, 2025Updated last year
- Large Multimodal Model☆15Apr 8, 2024Updated 2 years ago
- ☆101Jun 23, 2025Updated 11 months ago
- Introduce a novel Video Trimming (VT) task and proposes an agent-based approach (AVT) for detecting wasted footage, selecting valuable se…☆26Jan 20, 2025Updated last year
- Serverless GPU API endpoints on Runpod - Get Bonus Credits • AdSkip the infrastructure headaches. Auto-scaling, pay-as-you-go, no-ops approach lets you focus on innovating your application.
- [ICCV 2025] Explore the Limits of Omni-modal Pretraining at Scale☆124Sep 2, 2024Updated last year
- [NAACL 2025] Representing Rule-based Chatbots with Transformers☆23Feb 9, 2025Updated last year
- Repo for Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent☆428Apr 22, 2025Updated last year
- Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning☆49Mar 25, 2026Updated 2 months ago
- ☆14Apr 23, 2025Updated last year
- Normalization Matters in Weakly Supervised Object Localization (ICCV 2021)☆11Oct 24, 2021Updated 4 years ago
- Code and data of "Controllable Unsupervised Event-based Video Generation" (accepted as ICIP oral and invited by WACV workshop)☆19Nov 5, 2024Updated last year
- [NeurIPS VLM workshop 2024] In-Context Ensemble Learning from Pseudo Labels Improves Video-Language Models for Low-Level Workflow Underst…☆23Mar 16, 2025Updated last year
- A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.☆1,453Feb 11, 2026Updated 4 months ago
- Serverless GPU API endpoints on Runpod - Get Bonus Credits • AdSkip the infrastructure headaches. Auto-scaling, pay-as-you-go, no-ops approach lets you focus on innovating your application.
- Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities☆1,198Jul 15, 2025Updated 10 months ago
- What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness☆27May 16, 2025Updated last year
- NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing☆578Oct 20, 2024Updated last year
- Implementation of "DIME-FM: DIstilling Multimodal and Efficient Foundation Models"☆15Oct 12, 2023Updated 2 years ago
- InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions☆133Feb 7, 2024Updated 2 years ago
- Related papers about Referring Image Segmentation (RIS)☆16Dec 26, 2023Updated 2 years ago
- [CVPR 2024] OneLLM: One Framework to Align All Modalities with Language☆667Oct 22, 2024Updated last year
- The official repository of MM-R5☆29Jun 22, 2025Updated 11 months ago
- [ICCV 2025] Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges☆85Feb 27, 2025Updated last year
- End-to-end encrypted email - Proton Mail • AdSpecial offer: 40% Off Yearly / 80% Off First Month. All Proton services are open source and independently audited for security.
- [NeurIPS 2025] Official code implementation of Perception R1: Pioneering Perception Policy with Reinforcement Learning☆289Jul 15, 2025Updated 10 months ago
- ☆41Jan 10, 2025Updated last year
- [ACL 2024] Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding☆17Nov 10, 2025Updated 7 months ago
- [ICCV 2025 Highlight] The official repository for "2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining"☆197Mar 17, 2025Updated last year
- ☆17Nov 17, 2023Updated 2 years ago
- ☆24Jun 18, 2025Updated 11 months ago
- 🔥 [NeurIPS 2025] Official implementation of "Generate, but Verify: Reducing Visual Hallucination in Vision-Language Models with Retrospe…☆57Jan 22, 2026Updated 4 months ago
- SW components and demos for visual kinship recognition. An emphasis is put on the FIW dataset-- data loaders, benchmarks, results in summ…☆17Mar 13, 2023Updated 3 years ago
- 最简易的R1结果在小模型上的复现,阐述类O1与DeepSeek R1最重要的本质。Think is all your need。利用实验佐证,对于强推理能力,think思考过程性内容是AGI/ASI的核心。☆45Feb 8, 2025Updated last year
- Deploy to Railway using AI coding agents - Free Credits Offer • AdUse Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
- [ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning☆297Mar 13, 2024Updated 2 years ago
- ☆35Oct 9, 2025Updated 8 months ago
- The repo for "On-the-fly Modulation for Balanced Multimodal Learning", T-PAMI 2024☆19Sep 29, 2024Updated last year
- Structured Video Comprehension of Real-World Shorts☆238Sep 21, 2025Updated 8 months ago
- EgoToM is an egocentric theory-of-mind benchmark built on Ego4D videos, containing multi-choice questions that evaluate multimodal large …☆16Apr 1, 2025Updated last year
- ☆31Mar 25, 2024Updated 2 years ago
- [NeurIPS 2024] RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models☆32Nov 12, 2024Updated last year