abinthomasonline / clip-faiss
Image Search Application with OpenAI CLIP Model and Faiss Library
☆25Updated last year
Alternatives and similar repositories for clip-faiss
Users who are interested in clip-faiss are comparing it to the libraries listed below.
- Florence-2☆67Updated 3 months ago
- New generation of CLIP with fine-grained discrimination capability, ICML2025☆180Updated 2 weeks ago
- An open-source implementation for fine-tuning SmolVLM.☆34Updated last month
- This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams"☆181Updated 5 months ago
- [ICCV2023] Segment Every Reference Object in Spatial and Temporal Spaces☆240Updated 3 months ago
- Implementation of PALI3 from the paper "PALI-3 VISION LANGUAGE MODELS: SMALLER, FASTER, STRONGER"☆145Updated 2 months ago
- PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models☆256Updated last year
- The code of the paper "NExT-Chat: An LMM for Chat, Detection and Segmentation".☆244Updated last year
- ☆187Updated 10 months ago
- Reproduction of LLaVA-v1.5 based on Llama-3-8b LLM backbone.☆65Updated 7 months ago
- Research Code for Multimodal-Cognition Team in Ant Group☆147Updated 2 weeks ago
- Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs☆90Updated 4 months ago
- Official Pytorch implementation of LinCIR: Language-only Training of Zero-shot Composed Image Retrieval (CVPR 2024)☆134Updated 10 months ago
- A family of highly capable yet efficient large multimodal models☆183Updated 9 months ago
- [CVPR 24] The repository provides code for running inference and training for "Segment and Caption Anything" (SCA), links for downloadin…☆224Updated 8 months ago
- [PR 2024] A large Cross-Modal Video Retrieval Dataset with Reading Comprehension☆26Updated last year
- [CVPR 2024] LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge☆143Updated 10 months ago
- Chinese CLIP models with SOTA performance.☆55Updated last year
- Grounded Segment Anything: From Objects to Parts☆411Updated 2 years ago
- ☆87Updated 11 months ago
- This is the official implementation of our paper "Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension"☆192Updated 3 months ago
- Combining "segment-anything" with MOT creates the era of "MOTS"☆155Updated 2 years ago
- [NeurIPS 2023] Customize spatial layouts for conditional image synthesis models, e.g., ControlNet, using GPT☆136Updated last year
- [ACL2025 Findings] Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models☆63Updated 2 weeks ago
- GRiT: A Generative Region-to-text Transformer for Object Understanding (ECCV2024)☆326Updated last year
- [ICCV2023] TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance☆96Updated 10 months ago
- CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts☆149Updated last year
- GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection (AAAI 2024)☆67Updated last year
- The official implementation for BLIP4CIR with bi-directional training | Bi-directional Training for Composed Image Retrieval via Text Pro…☆31Updated last year
- ☆133Updated last year