DRSY / MoTIS
[NAACL 2022] Mobile Text-to-Image search powered by multimodal semantic representation models (e.g., OpenAI's CLIP)
☆120 · Updated last year
Related projects:
- ☆19 · Updated last year
- Easily compute CLIP embeddings from video frames ☆133 · Updated 10 months ago
- Using pretrained encoder and language models to generate captions from multimedia inputs. ☆94 · Updated last year
- VideoCC is a dataset containing (video-URL, caption) pairs for training video-text machine learning models. It is created using an automa… ☆76 · Updated last year
- Efficiently read embeddings in streaming from any filesystem ☆94 · Updated 4 months ago
- Use CLIP to represent video for Retrieval Task ☆67 · Updated 3 years ago
- ☆84 · Updated 8 months ago
- CapDec: SOTA Zero Shot Image Captioning Using CLIP and GPT2, EMNLP 2022 (Findings) ☆181 · Updated 7 months ago
- ☆64 · Updated 11 months ago
- Big-Interleaved-Dataset ☆57 · Updated last year
- ALIGN trained on COYO-dataset ☆28 · Updated 4 months ago
- [BMVC22] Official Implementation of ViCHA: "Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment" ☆52 · Updated last year
- Implementation of the DeepMind Flamingo vision-language model, based on Hugging Face language models and ready for training ☆163 · Updated last year
- Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding" ☆252 · Updated 3 months ago
- This is the official repository for CookGAN: Meal Image Synthesis from Ingredients ☆23 · Updated last year
- PyTorch code for "Fine-grained Image Captioning with CLIP Reward" (Findings of NAACL 2022) ☆233 · Updated 2 years ago
- ECCV 2020 paper: Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards. Code and Data. ☆81 · Updated last year
- Repository for the data in the paper "Explain Me the Painting: Multi-Topic Knowledgeable Art Description Generation". ☆17 · Updated 3 years ago
- Let's make a video clip ☆90 · Updated 2 years ago
- ☆98 · Updated 7 months ago
- M4 experiment logbook ☆56 · Updated last year
- The official PyTorch implementation for arXiv'23 paper 'LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer' ☆68 · Updated 11 months ago
- MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning ☆131 · Updated last year
- The ScreenQA dataset was introduced in the "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots" paper. It contains ~86K … ☆87 · Updated 2 months ago
- Official code for infimm-hd ☆14 · Updated 2 weeks ago
- [ACM TOMM 2023] - Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features ☆156 · Updated last year
- Diffusion-based markup-to-image generation ☆78 · Updated last year
- ☆227 · Updated last year
- Official PyTorch implementation of "CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion" (TMLR 2024) ☆73 · Updated last month
- Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training ☆130 · Updated last year