awsaf49 / flickr-dataset
Download the Flickr8k and Flickr30k image caption datasets
☆23 Updated last year
Alternatives and similar repositories for flickr-dataset
Users who are interested in flickr-dataset are comparing it to the repositories listed below
- EfficientViT is a new family of vision models for efficient high-resolution vision. ☆26 Updated last year
- The official repo for “TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding”. ☆40 Updated 8 months ago
- [NeurIPS2022] This is the official implementation of the paper "Expediting Large-Scale Vision Transformer for Dense Prediction without Fi… ☆84 Updated last year
- Estimate dataset difficulty and detect label mistakes using reconstruction error ratios! ☆25 Updated 4 months ago
- An open-source implementation for fine-tuning SmolVLM. ☆34 Updated last month
- "Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs" 2023 ☆14 Updated 6 months ago
- Implementation of ViTAR: Vision Transformer with Any Resolution in PyTorch ☆35 Updated 6 months ago
- Code for experiments for "ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy" ☆101 Updated 8 months ago
- Clipora is a powerful toolkit for fine-tuning OpenCLIP models using Low Rank Adapters (LoRA). ☆22 Updated 9 months ago
- Fine-tuning Qwen2.5-VL for vision-language tasks | Optimized for Vision understanding | LoRA & PEFT support. ☆78 Updated 4 months ago
- [NIPS2023] Implementation of Foundation Model is Efficient Multimodal Multitask Model Selector ☆36 Updated last year
- Official PyTorch Implementation of Self-emerging Token Labeling ☆33 Updated last year
- [CVPR 2025] DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception ☆55 Updated 2 weeks ago
- 4th place solution for the Google Universal Image Embedding Kaggle Challenge. Instance-Level Recognition workshop at ECCV 2022 ☆42 Updated last year
- ☆34 Updated last year
- [NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context ☆156 Updated 8 months ago
- [ICCV2023] TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance ☆96 Updated 10 months ago
- Our public repo ranked 1st 🏆🏆 at MMSports2023 challenge on segmentation task ☆17 Updated last year
- Deploy Swin Transformer using TorchServe ☆27 Updated 3 years ago
- We introduce a new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their… ☆13 Updated 5 months ago
- ViT trained on COYO-Labeled-300M dataset ☆32 Updated 2 years ago
- The official implementation of the paper "MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding". … ☆53 Updated 7 months ago
- Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types ☆18 Updated last month
- Encourage Medical LLM to engage in deep thinking similar to DeepSeek-R1. ☆25 Updated last month
- [ACL 2023] PuMer: Pruning and Merging Tokens for Efficient Vision Language Models ☆29 Updated 8 months ago
- Official Training and Inference Code of Amodal Expander, Proposed in Tracking Any Object Amodally ☆18 Updated 10 months ago
- ☆13 Updated 4 months ago
- The official code of "Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs" ☆73 Updated 2 weeks ago
- Masked Vision-Language Transformer in Fashion ☆33 Updated last year
- ☆40 Updated last year