inuwamobarak / Image-captioning-ViT

Image-captioning Vision Transformers (ViTs) are transformer models that generate descriptive captions for images by combining Transformers with computer vision. This project leverages state-of-the-art pre-trained ViT models and employs technique
27 · Updated last month

Related projects

Alternatives and complementary repositories for Image-captioning-ViT