dino-chiio / blip-vqa-finetune
This is an implementation of fine-tuning the BLIP model for Visual Question Answering.
☆60 · Updated last year
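The repository's one-line description covers the core idea; as a rough sketch (not this repository's actual code), fine-tuning BLIP for VQA with HuggingFace `transformers` typically pairs `BlipProcessor` with `BlipForQuestionAnswering` and trains on the loss the model returns when `labels` are supplied. The dataloader layout, batch keys, and hyperparameters below are assumptions for illustration.

```python
# Hedged sketch of a BLIP-VQA fine-tuning loop; batch keys, model choice,
# and hyperparameters are assumptions, not this repository's code.
import torch

def vqa_train_step(model, optimizer, batch):
    """Run one gradient step; BLIP returns a loss when `labels` are given."""
    outputs = model(
        pixel_values=batch["pixel_values"],  # preprocessed image tensor
        input_ids=batch["input_ids"],        # tokenized question
        labels=batch["labels"],              # tokenized answer
    )
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()

def finetune(dataloader, epochs=3, lr=5e-5):
    # Lazy import: pulling the checkpoint is a large download.
    from transformers import BlipProcessor, BlipForQuestionAnswering
    processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
    model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in dataloader:
            vqa_train_step(model, optimizer, batch)
    return model, processor
```

Batches would come from a dataset whose `__getitem__` runs the processor on an (image, question, answer) triple; the processor handles both image preprocessing and question/answer tokenization.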
Alternatives and similar repositories for blip-vqa-finetune:
Users interested in blip-vqa-finetune are comparing it to the repositories listed below.
- [CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts ☆315 · Updated 7 months ago
- InstructionGPT-4 ☆39 · Updated last year
- LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning ☆133 · Updated 10 months ago
- [Paper][AAAI2024] Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations ☆131 · Updated 8 months ago
- [NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context ☆146 · Updated 5 months ago
- GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection (AAAI 2024) ☆64 · Updated last year
- [CVPR 2024] Official Code for the Paper "Compositional Chain-of-Thought Prompting for Large Multimodal Models" ☆109 · Updated 8 months ago
- [ACM TOMM 2023] Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features ☆173 · Updated last year
- An open-source implementation for fine-tuning Molmo-7B-D and Molmo-7B-O by allenai. ☆49 · Updated last month
- Implementation of PaLI-3 from the paper "PaLI-3 Vision Language Models: Smaller, Faster, Stronger" ☆145 · Updated last month
- SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation ☆101 · Updated last year
- [CVPR'24 Highlight] Implementation of "Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models" ☆13 · Updated 5 months ago
- (ACL'2023) MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning ☆35 · Updated 6 months ago
- Official code for the paper "UniIR: Training and Benchmarking Universal Multimodal Information Retrievers" (ECCV 2024) ☆130 · Updated 5 months ago
- PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models ☆252 · Updated last year
- Democratization of "PaLI: A Jointly-Scaled Multilingual Language-Image Model" ☆88 · Updated 11 months ago
- The huggingface implementation of the Fine-grained Late-interaction Multi-modal Retriever. ☆82 · Updated last month
- All-In-One VLM: Image + Video + Transfer to Other Languages / Domains (TPAMI 2023) ☆153 · Updated 6 months ago
- Fine-tuning CLIP on a small image/text dataset using huggingface libs ☆44 · Updated 2 years ago
- Official code for the paper "Mantis: Multi-Image Instruction Tuning" [TMLR2024] ☆202 · Updated this week
- CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts ☆143 · Updated 8 months ago
- LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer ☆367 · Updated last month
- [CVPR 24] The repository provides code for running inference and training for "Segment and Caption Anything" (SCA), links for downloadin… ☆213 · Updated 5 months ago
- Contextual Object Detection with Multimodal Large Language Models ☆222 · Updated 4 months ago
- [CVPR 2024] Official implementation of "ViTamin: Designing Scalable Vision Models in the Vision-language Era" ☆197 · Updated 8 months ago
- [NeurIPS 2024] Dense Connector for MLLMs ☆156 · Updated 4 months ago
- ☆64 · Updated 7 months ago
- Code for studying OpenAI's CLIP explainability ☆29 · Updated 3 years ago
- Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization ☆81 · Updated last year
- MMICL, a state-of-the-art VLM with in-context learning (ICL) ability, from PKU ☆46 · Updated last year