ssppp / Click4Caption
A visual LLM for image region description or QA.
☆16Updated 2 years ago
Alternatives and similar repositories for Click4Caption
Users who are interested in Click4Caption are comparing it to the repositories listed below.
- Test-Time Training on Video Streams☆66Updated 2 years ago
- [ICCV2023] EgoObjects: A Large-Scale Egocentric Dataset for Fine-Grained Object Understanding☆78Updated 2 years ago
- Official PyTorch implementation of the paper "CoVR: Learning Composed Video Retrieval from Web Video Captions".☆118Updated 3 months ago
- Code base of SynthCLIP: CLIP training with purely synthetic text-image pairs from LLMs and TTIs.☆102Updated 10 months ago
- Implementation of paper 'Helping Hands: An Object-Aware Ego-Centric Video Recognition Model'☆33Updated 2 years ago
- CVPR2025: Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning☆38Updated 10 months ago
- CycleReward is a reward model trained on cycle consistency preferences to measure image-text alignment.☆53Updated 3 months ago
- Benchmarking and Analyzing Generative Data for Visual Recognition☆26Updated 2 years ago
- Official PyTorch codebase for Open-Vocabulary Instance Segmentation without Manual Mask Annotations [CVPR 2023]☆52Updated 3 months ago
- [IJCV 2025] VLPrompt-PSG: Vision-Language Prompting for Panoptic Scene Graph Generation☆28Updated last year
- Code and data for the paper "Emergent Visual-Semantic Hierarchies in Image-Text Representations" (ECCV 2024)☆33Updated last year
- ICLR'24 Official Implementation of Composed Image Retrieval with Text Feedback via Multi-grained Uncertainty Regularization☆74Updated 2 years ago
- [NeurIPS 2024] Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning☆72Updated 11 months ago
- Official code for the paper, "TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter".☆16Updated 2 years ago
- [ICLR 2025] IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model☆37Updated last year
- Unifying Specialized Visual Encoders for Video Language Models☆25Updated 2 months ago
- [CVPR 2024 Highlight] ImageNet-D☆46Updated last year
- FunQA benchmarks funny, creative, and magic videos for challenging tasks including timestamp localization, video description, reasoning, …☆104Updated last month
- [ECCV2024] PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects☆57Updated last year
- This repo contains the official implementation of ICLR 2024 paper "Is ImageNet worth 1 video? Learning strong image encoders from 1 long …☆95Updated last year
- The 1st place solution of 2022 Ego4d Natural Language Queries.☆32Updated 3 years ago
- [ICLR 2022] RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning☆63Updated 3 years ago
- [NeurIPS 2023 Datasets and Benchmarks] "FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation", Yuanxin L…☆57Updated last year
- [ECCV2022] New benchmark for evaluating pre-trained model; New supervised contrastive learning framework.☆110Updated 2 years ago