enrico310786 / image_text_retrieval_BLIP_BLIP2
Experiments with LAVIS library to perform image2text and text2image retrieval with BLIP and BLIP2 models
☆15 · Updated 2 years ago
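For reference, a minimal sketch of the kind of retrieval this repo experiments with, using LAVIS's BLIP feature extractor (the image path and captions below are illustrative assumptions; swap `name="blip2_feature_extractor"` in for BLIP2; the repo's own scripts may differ):

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load BLIP as a feature extractor along with its image/text preprocessors.
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_feature_extractor", model_type="base", is_eval=True, device=device
)

raw_image = Image.open("query.jpg").convert("RGB")  # hypothetical image path
captions = ["a dog on the beach", "a city skyline at night"]  # illustrative texts

image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text_inputs = [txt_processors["eval"](c) for c in captions]

# Project each modality into the shared low-dimensional embedding space.
img_feat = model.extract_features({"image": image}, mode="image")
txt_feat = model.extract_features({"text_input": text_inputs}, mode="text")

img_emb = img_feat.image_embeds_proj[:, 0, :]  # (1, dim), CLS-token projection
txt_emb = txt_feat.text_embeds_proj[:, 0, :]   # (n_texts, dim), CLS-token projection

# LAVIS returns unit-norm projected features; normalize defensively so the
# dot products below are cosine similarities either way.
sims = torch.nn.functional.normalize(img_emb) @ torch.nn.functional.normalize(txt_emb).t()
print(sims)  # (1, n_texts) similarity scores
```

Ranking the row of `sims` over a caption gallery gives image2text retrieval; ranking the transposed matrix over an image gallery gives text2image retrieval.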
Alternatives and similar repositories for image_text_retrieval_BLIP_BLIP2
Users interested in image_text_retrieval_BLIP_BLIP2 are comparing it to the repositories listed below
- The code of the paper "NExT-Chat: An LMM for Chat, Detection and Segmentation" ☆255 · Updated last year
- Research code for the Multimodal-Cognition Team in Ant Group ☆169 · Updated last month
- ☆30 · Updated last year
- [CVPR 2024] LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge ☆152 · Updated 2 months ago
- Implementation of PaLI-3 from the paper "PaLI-3 Vision Language Models: Smaller, Faster, Stronger" ☆145 · Updated 3 weeks ago
- ☆186 · Updated last year
- Fine-tuning Qwen2.5-VL for vision-language tasks | Optimized for vision understanding | LoRA & PEFT support ☆141 · Updated 9 months ago
- ☆87 · Updated last year
- The Hugging Face implementation of the Fine-grained Late-interaction Multi-modal Retriever ☆101 · Updated 5 months ago
- ☆57 · Updated last year
- [AAAI 2024] Visual Instruction Generation and Correction ☆93 · Updated last year
- ☆79 · Updated last year
- Official repository for the paper "MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning" (https://arxiv.org/abs/2406.17770) ☆158 · Updated last year
- Vary-tiny codebase built on LAVIS (for training from scratch) and a PDF image-text pair dataset (about 600k pairs, English/Chinese) ☆86 · Updated last year
- Multimodal chatbot with integrated computer vision capabilities, our 1st-gen LMM ☆101 · Updated last year
- ☆142 · Updated last year
- [CVPR 2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts ☆331 · Updated last year
- Lion: Kindling Vision Intelligence within Large Language Models ☆51 · Updated last year
- Code for "VPGTrans: Transfer Visual Prompt Generator across LLMs"; VL-LLaMA, VL-Vicuna ☆271 · Updated 2 years ago
- InstructionGPT-4 ☆42 · Updated last year
- [CVPR 2024] Official implementation of "ViTamin: Designing Scalable Vision Models in the Vision-Language Era" ☆210 · Updated last year
- mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video (ICML 2023) ☆229 · Updated 2 years ago
- Harnessing 1.4M GPT4V-synthesized Data for A Lite Vision-Language Model ☆276 · Updated last year
- [CVPR'24] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback ☆297 · Updated last year
- My implementation of "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution" ☆265 · Updated 3 weeks ago
- [NeurIPS 2023] Official implementations of "Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models" ☆524 · Updated last year
- The official repo for "TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding" ☆44 · Updated last year
- Building a VLM starting from its basic modules ☆18 · Updated last year
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections (EMNLP 2022) ☆96 · Updated 2 years ago
- ☆47 · Updated 7 months ago