hanghuacs / FineCaption
☆37Updated 5 months ago
Alternatives and similar repositories for FineCaption
Users who are interested in FineCaption are comparing it to the repositories listed below.
- [NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos☆143Updated 11 months ago
- [ICCV 2025] Official implementation of "InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models"☆51Updated 10 months ago
- ICML2025☆61Updated 3 months ago
- The official code of "Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning"☆68Updated last month
- [AAAI 26 Demo] Official repo for CAT-V - Caption Anything in Video: Object-centric Dense Video Captioning with Spatiotemporal Multimodal P…☆60Updated last month
- Transactions on Multimedia (TMM25)☆18Updated 8 months ago
- [CVPR 2025 🔥] A Large Multimodal Model for Pixel-Level Visual Grounding in Videos☆91Updated 7 months ago
- Official repository of 'ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing'☆58Updated 5 months ago
- [NeurIPS25] Official Implementation (Pytorch) of "DeepVideo-R1"☆30Updated 3 weeks ago
- Code for the paper "Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation", ECCV 2024☆45Updated last year
- [CVPR2025] SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories☆83Updated 4 months ago
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?☆80Updated 4 months ago
- TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation☆234Updated 3 months ago
- [ICCV2025] Code Release of Harmonizing Visual Representations for Unified Multimodal Understanding and Generation☆178Updated 6 months ago
- Code and dataset link for "DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World"☆116Updated 2 months ago
- [NeurIPS 2024] Official PyTorch implementation of LoTLIP: Improving Language-Image Pre-training for Long Text Understanding☆46Updated 10 months ago
- Official repository for ReasonGen-R1☆73Updated 5 months ago
- ☆32Updated last year
- (ICLR 2025 Spotlight) Official code repository for Interleaved Scene Graph.☆31Updated 4 months ago
- [ICLR'25] Reconstructive Visual Instruction Tuning☆129Updated 8 months ago
- [ECCV 2024] ControlCap: Controllable Region-level Captioning☆80Updated last year
- [NIPS 2025 DB Oral] Official Repository of paper: Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing☆120Updated 2 weeks ago
- ☆40Updated 5 months ago
- [CVPR 2025 Oral] VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection☆129Updated 4 months ago
- Official code for NeurIPS 2025 paper "GRIT: Teaching MLLMs to Think with Images"☆164Updated last week
- Video Reasoning Segmentation☆28Updated last year
- Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward☆49Updated last week
- [IEEE TIP 2025] Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation☆57Updated last week
- ☆21Updated 10 months ago
- GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning☆100Updated 6 months ago