DRSY / MoTIS
[NAACL 2022] Mobile text-to-image search powered by multimodal semantic representation models (e.g., OpenAI's CLIP)
☆121Updated last year
Related projects
Alternatives and complementary repositories for MoTIS
- Using pretrained encoder and language models to generate captions from multimedia inputs.☆95Updated last year
- Efficiently read embeddings in streaming fashion from any filesystem☆96Updated 6 months ago
- Chinese CLIP encoder☆21Updated 2 years ago
- Use CLIP to represent video for Retrieval Task☆69Updated 3 years ago
- Big-Interleaved-Dataset☆57Updated last year
- ☆18Updated last year
- PyTorch code for "Fine-grained Image Captioning with CLIP Reward" (Findings of NAACL 2022)☆235Updated 2 years ago
- Search photos on Unsplash based on OpenAI's CLIP model, with support for joint image+text queries and attention visualization.☆210Updated 3 years ago
- ECCV2020 paper: Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards. Code and Data.☆85Updated last year
- [ECCV2022] Contrastive Vision-Language Pre-training with Limited Resources☆44Updated 2 years ago
- Easily compute clip embeddings from video frames☆137Updated last year
- ☆100Updated 9 months ago
- A non-JIT implementation/replication of OpenAI's CLIP in PyTorch☆34Updated 3 years ago
- ☆87Updated 10 months ago
- ☆36Updated last year
- A huge dataset for Document Visual Question Answering☆14Updated 3 months ago
- OpenAI CLIP Core ML version for iOS: text-image embeddings, image search, image clustering, image classification☆17Updated last year
- Diffusion-based markup-to-image generation☆78Updated last year
- ☆64Updated last year
- Implementation of the DeepMind Flamingo vision-language model, based on Hugging Face language models and ready for training☆165Updated last year
- M4 experiment logbook☆56Updated last year
- Release of ImageNet-Captions☆45Updated last year
- Source code and pre-trained/fine-tuned checkpoints for the NAACL 2021 paper LightningDOT☆73Updated 2 years ago
- ☆129Updated last year
- Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding"☆258Updated 5 months ago
- ☆102Updated last year
- Aggregating embeddings over time☆31Updated last year
- The official PyTorch implementation for arXiv'23 paper 'LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer'☆73Updated last year
- A reimplementation of KOSMOS-1 from "Language Is Not All You Need: Aligning Perception with Language Models"☆27Updated last year
- MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering. A comprehensive evaluation of multimodal large model multilingua…☆45Updated last month