enrico310786 / image_text_retrieval_BLIP_BLIP2
Experiments with the LAVIS library to perform image-to-text and text-to-image retrieval with BLIP and BLIP-2 models
☆13 · Updated last year
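At its core, both retrieval directions reduce to the same step: embed the query and the candidates in a shared space, then rank candidates by cosine similarity. In LAVIS the embeddings would come from a model loaded via `load_model_and_preprocess` (e.g. a BLIP feature extractor) and its `extract_features` method; the sketch below covers only the ranking step, with a hypothetical helper `rank_by_cosine` and hand-made toy embeddings standing in for real model outputs.

```python
import numpy as np

def rank_by_cosine(query_emb: np.ndarray, candidate_embs: np.ndarray) -> np.ndarray:
    """Return candidate indices sorted from best to worst match.

    query_emb:      (d,) embedding of the query (a caption for text->image
                    retrieval, an image for image->text retrieval).
    candidate_embs: (n, d) embeddings of the candidates.
    """
    # L2-normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = c @ q                 # (n,) cosine similarities
    return np.argsort(-sims)     # indices of best matches first

# Toy 3-d embeddings; in practice these come from the model's feature extractor.
query = np.array([1.0, 0.0, 0.0])
candidates = np.array([
    [0.9, 0.1, 0.0],   # nearly aligned with the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [0.5, 0.5, 0.0],
])
print(rank_by_cosine(query, candidates))  # -> [0 2 1]
```

The same ranking function serves both directions; only which side supplies the single query and which supplies the candidate batch changes.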
Related projects
Alternatives and complementary repositories for image_text_retrieval_BLIP_BLIP2
- Research code for the Multimodal-Cognition Team at Ant Group ☆122 · Updated 4 months ago
- ☆156 · Updated 8 months ago
- Vary-tiny codebase built upon LAVIS (for training from scratch) and a PDF image-text pair dataset (about 600k pairs, English/Chinese) ☆68 · Updated last month
- Implementation of PALI3 from the paper "PaLI-3 Vision Language Models: Smaller, Faster, Stronger" ☆143 · Updated this week
- ☆85 · Updated 4 months ago
- A Chinese OFA model in the transformers architecture ☆123 · Updated last year
- [CVPR 2024] LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge ☆121 · Updated 3 months ago
- Lion: Kindling Vision Intelligence within Large Language Models ☆53 · Updated 9 months ago
- An open-source implementation for fine-tuning the Qwen2-VL series by Alibaba Cloud ☆106 · Updated last week
- The Hugging Face implementation of the Fine-grained Late-interaction Multi-modal Retriever ☆68 · Updated 2 months ago
- The official code for the NeurIPS 2024 paper: Harmonizing Visual Text Comprehension and Generation ☆65 · Updated last month
- Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks ☆283 · Updated 10 months ago
- Multimodal chatbot with integrated computer vision capabilities ☆98 · Updated 5 months ago
- [CVPR'24] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback ☆230 · Updated 2 months ago
- [CVPR 2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts ☆294 · Updated 3 months ago
- [CVPR 2023] Official repository of the paper "Fine-tuned CLIP models are efficient video learners" ☆248 · Updated 7 months ago
- [CVPR 2024] Official implementation of "ViTamin: Designing Scalable Vision Models in the Vision-language Era" ☆174 · Updated 5 months ago
- ☆77 · Updated 6 months ago
- ☆112 · Updated 8 months ago
- The code of the paper "NExT-Chat: An LMM for Chat, Detection and Segmentation" ☆217 · Updated 9 months ago
- Official repository for the paper MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning (https://arxiv.org/abs/2406.17770) ☆147 · Updated last month
- The official repo for "TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding" ☆32 · Updated last month
- The official repository for Retrieval-Augmented Visual Question Answering ☆181 · Updated 2 months ago
- TaiSu (太素): a large-scale Chinese multimodal dataset (a hundred-million-scale Chinese vision-language pre-training dataset) ☆175 · Updated 11 months ago
- Harnessing 1.4M GPT4V-synthesized Data for A Lite Vision-Language Model ☆244 · Updated 4 months ago
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections (EMNLP 2022) ☆83 · Updated last year
- InstructionGPT-4 ☆37 · Updated 10 months ago
- [CVPR 2023 Workshop] Code reproducing the results of our solutions on both tracks of the Meta AI Video Similarity Challenge ☆47 · Updated last year
- ☆57 · Updated 9 months ago
- A collection of visual instruction tuning datasets ☆76 · Updated 7 months ago
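One entry in the list, the Fine-grained Late-interaction Multi-modal Retriever, scores a query against a document with ColBERT-style late interaction instead of a single-vector dot product: every query token embedding takes the maximum similarity over all document token embeddings, and those maxima are summed. A minimal sketch of that MaxSim scoring, assuming unit-normalized token embeddings; the function name and toy arrays are illustrative, not the library's API:

```python
import numpy as np

def late_interaction_score(query_toks: np.ndarray, doc_toks: np.ndarray) -> float:
    """ColBERT-style MaxSim: sum over query tokens of the maximum cosine
    similarity to any document token.

    query_toks: (m, d) unit-normalized query token embeddings.
    doc_toks:   (n, d) unit-normalized document token embeddings.
    """
    sims = query_toks @ doc_toks.T        # (m, n) pairwise cosine similarities
    return float(sims.max(axis=1).sum())  # best match per query token, summed

# Toy 2-d token embeddings (unit vectors).
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0]])   # covers both query tokens
doc_b = np.array([[0.0, 1.0]])               # covers only one query token
print(late_interaction_score(query, doc_a))  # -> 2.0
print(late_interaction_score(query, doc_b))  # -> 1.0
```

Compared with the single-vector cosine ranking used in plain BLIP-style retrieval, late interaction keeps per-token detail, which is why it is described as fine-grained.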