aimagelab / DiCO
Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization (BMVC 2024 Oral ✨)
☆14Updated 2 months ago
Related projects
Alternatives and complementary repositories for DiCO
- Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation. CVPR 2023☆56Updated 3 weeks ago
- [ECCV2024] ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation☆57Updated 2 months ago
- [ICLR 2024] Official code for the paper "LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts"☆69Updated 6 months ago
- FreeDA: Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation (CVPR 2024)☆29Updated 2 months ago
- [CBMI2024 Best Paper] Official repository of the paper "Is CLIP the main roadblock for fine-grained open-world perception?".☆20Updated last month
- [CVPR 2024] Improving language-visual pretraining efficiency by performing cluster-based masking on images.☆22Updated 6 months ago
- Visual self-questioning for large vision-language assistant.☆32Updated last month
- Official code repo of PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs☆24Updated 5 months ago
- [ICCV 2023] - Composed Image Retrieval on Common Objects in context (CIRCO) dataset☆52Updated 3 months ago
- Composed Video Retrieval☆46Updated 6 months ago
- The official implementation for BLIP4CIR with bi-directional training | Bi-directional Training for Composed Image Retrieval via Text Pro…☆23Updated 9 months ago
- [ECCV 2024] - Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation☆48Updated 3 weeks ago
- [ICLR 2024] Official repository for "Vision-by-Language for Training-Free Compositional Image Retrieval"☆50Updated 4 months ago
- [BMVC 2023] Zero-shot Composed Text-Image Retrieval☆44Updated last year
- [Preprint] Number it: Temporal Grounding Videos like Flipping Manga☆24Updated this week
- Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision☆24Updated last month
- [ECCV2024] Learning Video Context as Interleaved Multimodal Sequences☆30Updated last month
- ☆20Updated 7 months ago
- Official implementation and dataset for the NAACL 2024 paper "ComCLIP: Training-Free Compositional Image and Text Matching"☆33Updated 3 months ago
- Code for paper: VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning☆29Updated 7 months ago
- The official implementation for Candidate Set Re-ranking for Composed Image Retrieval (TMLR) 01/2024☆13Updated 9 months ago
- A Large Multimodal Model for Pixel-Level Visual Grounding in Videos☆34Updated 2 weeks ago
- ☆16Updated last year
- Official repository of paper "Subobject-level Image Tokenization"☆62Updated 7 months ago
- Official implementation of the paper "STEREO: Towards Adversarially Robust Concept Erasing from Text-to-Image Generation Models"☆15Updated 2 months ago
- Multimodal Video Understanding Framework (MVU)☆24Updated 6 months ago
- [NeurIPS 2024] Official PyTorch implementation of LoTLIP: Improving Language-Image Pre-training for Long Text Understanding☆32Updated last week
- [NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos☆52Updated this week
- Code and data for the paper "Emergent Visual-Semantic Hierarchies in Image-Text Representations" (ECCV 2024)☆22Updated 3 months ago
- Code and Models for "GeneCIS: A Benchmark for General Conditional Image Similarity"☆54Updated last year