DRSY / MoTIS
[NAACL 2022] Mobile Text-to-Image search powered by multimodal semantic representation models (e.g., OpenAI's CLIP)
(☆123, updated last year)
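At its core, CLIP-style text-to-image search embeds the text query and every image into one shared vector space and ranks images by cosine similarity. The sketch below shows only that ranking step with made-up low-dimensional vectors; in a real pipeline the query and gallery embeddings would come from a CLIP text encoder and image encoder (real CLIP vectors are 512-d or larger), and the function names here are illustrative, not from the MoTIS codebase.

```python
import numpy as np

def top_k_images(text_emb, image_embs, k=3):
    """Rank image embeddings by cosine similarity to a text embedding.

    Both inputs are assumed to live in the same embedding space
    (e.g., produced by CLIP's text and image encoders).
    """
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = image_embs @ text_emb           # cosine similarity per image
    order = np.argsort(-sims)[:k]          # indices of the k best matches
    return order, sims[order]

# Toy demo with made-up 4-d embeddings.
query = np.array([1.0, 0.0, 0.0, 0.0])
gallery = np.array([
    [0.9, 0.1, 0.0, 0.0],   # closest to the query
    [0.0, 1.0, 0.0, 0.0],   # orthogonal to the query
    [0.5, 0.5, 0.0, 0.0],
])
idx, scores = top_k_images(query, gallery, k=2)
print(idx)
```

On-device variants of this idea (as in MoTIS) typically precompute and cache the gallery image embeddings, so each search only runs the text encoder once plus this cheap dot-product ranking.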
Alternatives and similar repositories for MoTIS:
Users interested in MoTIS are comparing it to the libraries listed below.
- PyTorch code for "Fine-grained Image Captioning with CLIP Reward" (Findings of NAACL 2022). (☆241, updated 2 years ago)
- Search photos on Unsplash with OpenAI's CLIP model; supports joint image+text queries and attention visualization. (☆213, updated 3 years ago)
- ECCV 2020 paper "Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards" — code and data. (☆84, updated last year)
- Core ML version of OpenAI CLIP for iOS: text-image embeddings, image search, image clustering, and image classification. (☆17, updated last year)
- Using pretrained encoder and language models to generate captions from multimedia inputs. (☆94, updated last year)
- The official PyTorch implementation of the arXiv 2023 paper "LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer". (☆85, updated this week)
- [ECCV 2022] FashionViL: Fashion-Focused V+L Representation Learning. (☆60, updated 2 years ago)
- Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [ICCV 2021]. (☆355, updated 2 years ago)
- A CVPR 2023 paper. (☆50, updated last year)
- Release of ImageNet-Captions. (☆45, updated 2 years ago)
- ALIGN trained on the COYO dataset. (☆29, updated 9 months ago)
- CapDec: SOTA zero-shot image captioning using CLIP and GPT-2 (Findings of EMNLP 2022). (☆189, updated last year)
- Conceptual 12M: a dataset of (image-URL, caption) pairs collected for vision-and-language pre-training. (☆377, updated last year)
- Easily compute CLIP embeddings from video frames. (☆140, updated last year)
- [BMVC 2022] Official implementation of ViCHA: "Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment". (☆54, updated 2 years ago)
- Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training. (☆133, updated last year)
- L-Verse: Bidirectional Generation Between Image and Text. (☆108, updated 2 years ago)
- Language Models Can See: Plugging Visual Controls in Text Generation. (☆257, updated 2 years ago)
- M4 experiment logbook. (☆56, updated last year)
- Code and data for the paper "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding". (☆261, updated 7 months ago)
- Chinese text encoder for CLIP. (☆22, updated 2 years ago)
- Diffusion-based markup-to-image generation. (☆78, updated last year)
- Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training (ACL 2023). (☆89, updated last year)