jmisilo / clip-gpt-captioning
CLIPxGPT Captioner is an image captioning model based on OpenAI's CLIP and GPT-2.
☆111 · Updated 11 months ago
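For orientation, below is a minimal sketch of the prefix-projection pattern commonly used to couple CLIP with GPT-2 for captioning. It assumes Hugging Face `transformers`; the `ClipGPTCaptioner` class, prefix length, and linear projection are illustrative assumptions, not this repository's actual architecture.

```python
# Minimal sketch: project a CLIP image embedding into GPT-2's embedding
# space as a "prefix", then let GPT-2 decode a caption conditioned on it.
# Illustrative only; not the jmisilo/clip-gpt-captioning implementation.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, GPT2LMHeadModel

class ClipGPTCaptioner(nn.Module):  # hypothetical name
    def __init__(self, prefix_len: int = 10):
        super().__init__()
        self.clip = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
        self.prefix_len = prefix_len
        # Map one pooled CLIP vector to `prefix_len` GPT-2-sized embeddings.
        self.project = nn.Linear(self.clip.config.hidden_size,
                                 prefix_len * self.gpt2.config.n_embd)

    def forward(self, pixel_values: torch.Tensor, input_ids: torch.Tensor):
        pooled = self.clip(pixel_values=pixel_values).pooler_output     # (B, 768)
        prefix = self.project(pooled).view(-1, self.prefix_len,
                                           self.gpt2.config.n_embd)     # (B, P, 768)
        tokens = self.gpt2.transformer.wte(input_ids)                   # (B, T, 768)
        embeds = torch.cat([prefix, tokens], dim=1)
        return self.gpt2(inputs_embeds=embeds).logits                   # (B, P+T, vocab)
```

Training would then apply a cross-entropy loss on the caption tokens, offset by the prefix length; several of the projects below (e.g. CapDec, SmallCap) follow broadly similar CLIP-to-language-model designs.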
Related projects
Alternatives and complementary repositories for clip-gpt-captioning
- PyTorch implementation of image captioning using a transformer-based model. ☆61 · Updated last year
- Implementation code of the work "Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning". ☆84 · Updated 6 months ago
- PyTorch code for "Fine-grained Image Captioning with CLIP Reward" (Findings of NAACL 2022). ☆235 · Updated 2 years ago
- CapDec: SOTA zero-shot image captioning using CLIP and GPT-2, EMNLP 2022 (Findings). ☆186 · Updated 9 months ago
- Fine-tuning OpenAI's CLIP model on an Indian fashion dataset. ☆50 · Updated last year
- Image Captioning Vision Transformers (ViTs): transformer models that generate descriptive captions for images by combining the power o… ☆27 · Updated last month
- Implementation of "Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic". ☆269 · Updated 2 years ago
- GRIT: Faster and Better Image-captioning Transformer (ECCV 2022). ☆185 · Updated last year
- [TMM 2023] VideoXum: Cross-modal Visual and Textural Summarization of Videos. ☆34 · Updated 7 months ago
- A simple script that reads a directory of videos, grabs a random frame, and automatically discovers a prompt for it. ☆131 · Updated 10 months ago
- Using pretrained encoder and language models to generate captions from multimedia inputs. ☆95 · Updated last year
- Exploring multimodal fusion-type transformer models for visual question answering (on the DAQUAR dataset). ☆34 · Updated 2 years ago
- Implementation of DeepMind's Flamingo vision-language model, based on Hugging Face language models and ready for training. ☆165 · Updated last year
- Generate text captions for images from their embeddings. ☆101 · Updated last year
- Easy-to-use and efficient code for extracting OpenAI CLIP (global/grid) features from images and text. ☆111 · Updated 2 years ago
- PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models. ☆245 · Updated 10 months ago
- Implementation of the CPTR model described in https://arxiv.org/pdf/2101.10804.pdf ☆10 · Updated 2 years ago
- This repository provides a comprehensive collection of research papers focused on multimodal representation learning, all of which have b… ☆68 · Updated last year
- CaMEL: Mean Teacher Learning for Image Captioning (ICPR 2022). ☆29 · Updated last year
- Using an LSTM or Transformer to solve image captioning in PyTorch. ☆75 · Updated 3 years ago
- (WACV 2025) Vision-language conversation in 10 languages including English, Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi, B… ☆81 · Updated 2 months ago
- [ACM TOMM 2023] Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features. ☆163 · Updated last year
- LLaVA-MORE: Enhancing Visual Instruction Tuning with LLaMA 3.1. ☆86 · Updated last month
- Fine-tuning CLIP on a small image/text dataset using Hugging Face libraries. ☆41 · Updated last year
- [CVPR 2023] Official repository of the paper "Fine-tuned CLIP models are efficient video learners". ☆249 · Updated 7 months ago
- SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation. ☆95 · Updated 9 months ago
- [ICLR 2024] Official code for the paper "LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts". ☆69 · Updated 6 months ago
- A summary of video-to-text datasets. This repository is part of the review paper *Bridging Vision and Language from the Video-to-Text Pe… ☆116 · Updated last year
- CLIPScore EMNLP code; a minimal sketch of the metric follows this list. ☆194 · Updated last year
- ☆73 · Updated 2 years ago
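Since CLIPScore appears in the list above, here is a minimal sketch of the metric: CLIPScore(i, c) = w · max(cos(E_I(i), E_T(c)), 0), with w = 2.5. This uses Hugging Face's CLIP rather than the official EMNLP codebase, so treat the details as an assumption rather than the reference implementation.

```python
# Hedged sketch of CLIPScore: a reference-free caption metric computed as
# a rescaled, clipped cosine similarity between CLIP image and text embeddings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str, w: float = 2.5) -> float:
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    # Unit-normalize the projected embeddings (a no-op if already normalized).
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cosine = (img * txt).sum(dim=-1).item()
    return w * max(cosine, 0.0)
```

The ViT-B/32 checkpoint above matches the backbone used in the original CLIPScore paper; scores are not comparable across different CLIP backbones.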