TheoCoombes / ClipCap
Using pretrained encoder and language models to generate captions from multimedia inputs.
☆94 · Updated 2 years ago
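The ClipCap-style recipe described above (a pretrained encoder feeding a language model) is commonly implemented with a small mapping network that projects a single CLIP image embedding into a short "prefix" of pseudo-token embeddings, which a decoder such as GPT-2 then conditions on. A minimal sketch of that mapping stage follows; the `PrefixMapper` name, layer sizes, and dimensions are illustrative assumptions, not the repository's exact code:

```python
import torch
import torch.nn as nn

class PrefixMapper(nn.Module):
    """Map one CLIP image embedding to a sequence of prefix embeddings
    that a language model (e.g. GPT-2, via `inputs_embeds`) can attend to.
    Dimensions are illustrative, not ClipCap's exact configuration."""

    def __init__(self, clip_dim=512, lm_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        hidden = lm_dim * prefix_len // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, lm_dim * prefix_len),
        )

    def forward(self, clip_embedding):
        # (batch, clip_dim) -> (batch, prefix_len, lm_dim)
        out = self.mlp(clip_embedding)
        return out.view(-1, self.prefix_len, self.lm_dim)

mapper = PrefixMapper()
image_embedding = torch.randn(4, 512)  # stand-in for a CLIP image encoder output
prefix = mapper(image_embedding)
print(prefix.shape)  # torch.Size([4, 10, 768])
```

In the full pipeline, this prefix would be concatenated with the caption's token embeddings and fed to the decoder, with only the mapper (and optionally the decoder) trained while CLIP stays frozen.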
Alternatives and similar repositories for ClipCap:
Users interested in ClipCap are comparing it to the libraries listed below.
- L-Verse: Bidirectional Generation Between Image and Text ☆108 · Updated 2 years ago
- Command-line tool for downloading and extending the RedCaps dataset. ☆46 · Updated last year
- Multimodal video-audio-text generation and retrieval between every pair of modalities on the MUGEN dataset. The repo contains the traini… ☆39 · Updated last year
- ☆98 · Updated 4 months ago
- ☆47 · Updated 4 years ago
- ☆157 · Updated 2 years ago
- [BMVC22] Official implementation of ViCHA: "Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment" ☆54 · Updated 2 years ago
- A task-agnostic vision-language architecture as a step towards General Purpose Vision ☆92 · Updated 3 years ago
- CapDec: SOTA zero-shot image captioning using CLIP and GPT-2, EMNLP 2022 (Findings) ☆191 · Updated last year
- DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models (ICCV 2023) ☆140 · Updated last year
- Use CLIP to represent video for the retrieval task ☆69 · Updated 4 years ago
- ECCV 2020 paper: Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards. Code and data. ☆84 · Updated last year
- Implementation of Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic ☆272 · Updated 2 years ago
- ☆50 · Updated 2 years ago
- Easily compute CLIP embeddings from video frames ☆143 · Updated last year
- Let's make a video clip ☆93 · Updated 2 years ago
- PyTorch code for "Fine-grained Image Captioning with CLIP Reward" (Findings of NAACL 2022) ☆242 · Updated 2 years ago
- Un-*** 50 billion multimodality dataset ☆24 · Updated 2 years ago
- Script and models for clustering LAION-400M CLIP embeddings. ☆25 · Updated 3 years ago
- Language Models Can See: Plugging Visual Controls in Text Generation ☆256 · Updated 2 years ago
- Aggregating embeddings over time ☆31 · Updated 2 years ago
- Simple script to compute CLIP-based scores given a DALL-E trained model. ☆30 · Updated 3 years ago
- ☆76 · Updated 2 years ago
- CLOOB training (JAX) and inference (JAX and PyTorch) ☆70 · Updated 2 years ago
- ☆64 · Updated last year
- Release of ImageNet-Captions ☆45 · Updated 2 years ago
- Generate text captions for images from their embeddings. ☆105 · Updated last year
- PyTorch code for "TVLT: Textless Vision-Language Transformer" (NeurIPS 2022 Oral) ☆123 · Updated 2 years ago
- Refactoring dalle-pytorch and taming-transformers for TPU VM ☆60 · Updated 3 years ago
- Training simple models to predict CLIP image embeddings from text embeddings, and vice versa. ☆60 · Updated 2 years ago