ssppp / Click4Caption
A visual LLM for image region description or QA.
☆15 Updated last year
Alternatives and similar repositories for Click4Caption
Users who are interested in Click4Caption are comparing it to the repositories listed below.
- ☆61 Updated last year
- [NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect… ☆35 Updated 11 months ago
- ☆39 Updated last year
- [ICCV2023] EgoObjects: A Large-Scale Egocentric Dataset for Fine-Grained Object Understanding ☆76 Updated last year
- [arXiv: 2502.05178] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation ☆73 Updated 2 months ago
- Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization ☆20 Updated last month
- FleVRS: Towards Flexible Visual Relationship Segmentation, NeurIPS 2024 ☆20 Updated 5 months ago
- ☆58 Updated last year
- A curated list of papers and resources for text-to-image evaluation. ☆29 Updated last year
- [ICLR 2025] IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model ☆30 Updated 5 months ago
- FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax ☆18 Updated last year
- ReNeg: Learning Negative Embedding with Reward Guidance ☆31 Updated 4 months ago
- ☆19 Updated last year
- ☆12 Updated 9 months ago
- [ECCV 2024] This is the official implementation of "Stitched ViTs are Flexible Vision Backbones". ☆27 Updated last year
- Official repo of the ICLR 2025 paper "MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos" ☆27 Updated 7 months ago
- T2VScore: Towards A Better Metric for Text-to-Video Generation ☆80 Updated last year
- IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks ☆58 Updated 7 months ago
- VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation ☆21 Updated 2 months ago
- Unifying Specialized Visual Encoders for Video Language Models ☆18 Updated last week
- ☆28 Updated 4 months ago
- [ICLR 2022] RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning ☆63 Updated 2 years ago
- SOIT: Segmenting Objects with Instance-Aware Transformers ☆14 Updated 2 years ago
- ROOT: VLM based System for Indoor Scene Understanding and Beyond ☆27 Updated 3 months ago
- ☆19 Updated 2 years ago
- ☆59 Updated last year
- [NeurIPS 2024] EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models. ☆47 Updated 7 months ago
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model ☆42 Updated 9 months ago
- [WACV2025 Oral] DeepMIM: Deep Supervision for Masked Image Modeling ☆53 Updated last week
- ☆23 Updated 7 months ago