awsaf49 / flickr-dataset
Download the Flickr8k and Flickr30k image caption datasets
☆29Updated last year
Alternatives and similar repositories for flickr-dataset
Users that are interested in flickr-dataset are comparing it to the libraries listed below
- An efficient multi-modal instruction-following data synthesis tool and the official implementation of Oasis https://arxiv.org/abs/2503.08…☆31Updated 3 months ago
- Code for AAAI 2023 Paper : “Alignment-Enriched Tuning for Patch-Level Pre-trained Document Image Models”☆18Updated 2 years ago
- 1st Place Solution in Google Universal Image Embedding☆67Updated 2 years ago
- Code for experiments for "ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy"☆101Updated last year
- ☆35Updated last year
- [CVPR2025] VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents☆40Updated 4 months ago
- Official Pytorch Implementation of Self-emerging Token Labeling☆35Updated last year
- EfficientViT is a new family of vision models for efficient high-resolution vision.☆27Updated 2 years ago
- "Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs" 2023☆15Updated 9 months ago
- [ICCV2023] TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance☆106Updated last year
- Reproduction of LLaVA-v1.5 based on Llama-3-8b LLM backbone.☆65Updated 11 months ago
- A minimal implementation of LLaVA-style VLM with interleaved image & text & video processing ability.☆96Updated 9 months ago
- OCR-VQGAN, a discrete image encoder (tokenizer and detokenizer) for figure images in Paper2Fig100k dataset. Implementation of OCR Percept…☆81Updated 2 years ago
- [PR 2024] A large Cross-Modal Video Retrieval Dataset with Reading Comprehension☆27Updated last year
- Fine-tuning Qwen2.5-VL for vision-language tasks | Optimized for Vision understanding | LoRA & PEFT support.☆125Updated 7 months ago
- Official Training and Inference Code of Amodal Expander, Proposed in Tracking Any Object Amodally☆18Updated last year
- Multi-label classification based on timm, with SimCLR added to timm.☆38Updated 4 years ago
- ECCV2024_Parrot Captions Teach CLIP to Spot Text☆66Updated last year
- TensorFlow implementation of "TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?"☆35Updated 3 years ago
- Implementation for the CVPR 2023 paper "Improving Selective Visual Question Answering by Learning from Your Peers" (https://arxiv.org/abs…☆25Updated 2 years ago
- Timm model explorer☆41Updated last year
- ViT trained on COYO-Labeled-300M dataset☆32Updated 2 years ago
- Finetuning CLIP on a small image/text dataset using huggingface libs☆52Updated 2 years ago
- The official repo for “TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding”.☆41Updated last year
- Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types☆30Updated 2 months ago
- Deploy Swin Transformer using TorchServe☆27Updated 4 years ago
- 🔥MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer [Official, ICLR 2023]☆21Updated last year
- 4th place solution for the Google Universal Image Embedding Kaggle Challenge. Instance-Level Recognition workshop at ECCV 2022☆42Updated 2 years ago
- An open-source implementation for fine-tuning SmolVLM.☆48Updated 2 weeks ago
- A handwritten chemical structure image dataset named EDU-CHEMC, which consists of 52,987 handwritten molecular structure images in total …☆12Updated 4 months ago