geshang777 / pix2cap
[arXiv'25] Official Implementation of "Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning"
☆18Updated 6 months ago
Alternatives and similar repositories for pix2cap
Users who are interested in pix2cap are comparing it to the repositories listed below
- [ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models☆28Updated last month
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment☆55Updated 3 weeks ago
- Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model☆105Updated last week
- [CVPR 2025] Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction☆126Updated 4 months ago
- ☆27Updated 5 months ago
- Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models☆119Updated 4 months ago
- Official repository of "Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning"☆36Updated 5 months ago
- 🔥 Official implementation of "Generate, but Verify: Reducing Visual Hallucination in Vision-Language Models with Retrospective Resamplin…☆40Updated 3 weeks ago
- Make Your Training Flexible: Towards Deployment-Efficient Video Models☆30Updated 2 months ago
- ☆87Updated last month
- [ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM☆80Updated 9 months ago
- [CVPR 2025] LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding☆57Updated last month
- [EMNLP 2023] TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding☆51Updated last year
- [NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos☆131Updated 7 months ago
- [ICLR 2025] Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding☆29Updated 4 months ago
- ☆35Updated last month
- Official repository of "CoMP: Continual Multimodal Pre-training for Vision Foundation Models"☆30Updated 4 months ago
- Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning☆27Updated last week
- Code for "AVG-LLaVA: A Multimodal Large Model with Adaptive Visual Granularity"☆30Updated 10 months ago
- Official implementation of paper VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interact…☆34Updated 6 months ago
- (ICCV2025) Official repository of paper "ViSpeak: Visual Instruction Feedback in Streaming Videos"☆37Updated last month
- Official repo for CAT-V - Caption Anything in Video: Object-centric Dense Video Captioning with Spatiotemporal Multimodal Prompting☆48Updated last month
- [CVPR'2025] VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models".☆180Updated last month
- [CVPR 2025] OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?☆78Updated 2 weeks ago
- [ICLR2025] Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want☆85Updated 2 months ago
- VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning☆169Updated 2 months ago
- [CVPR 2025 🔥] A Large Multimodal Model for Pixel-Level Visual Grounding in Videos☆76Updated 3 months ago
- [ECCV 2024] Learning Video Context as Interleaved Multimodal Sequences☆40Updated 5 months ago
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?☆63Updated last month
- TStar is a unified temporal search framework for long-form video question answering☆60Updated 4 months ago