geshang777 / pix2cap
[arXiv'25] Official Implementation of "Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning"
☆17Updated 4 months ago
Alternatives and similar repositories for pix2cap
Users who are interested in pix2cap are comparing it to the repositories listed below
- [ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models☆27Updated last month
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment☆51Updated 5 months ago
- Official repository of "Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning"☆29Updated 3 months ago
- [NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos☆122Updated 5 months ago
- VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning☆140Updated 3 weeks ago
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?☆49Updated this week
- Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model☆66Updated 4 months ago
- ☆25Updated 2 months ago
- FreeVA: Offline MLLM as Training-Free Video Assistant☆60Updated 11 months ago
- ☆81Updated 2 months ago
- [NeurIPS 2024] Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective☆69Updated 7 months ago
- ☆17Updated last month
- Official Implementation of "Open-Vocabulary Audio-Visual Semantic Segmentation" [ACM MM 2024 Oral].☆29Updated 7 months ago
- [CVPR 2024] The official implementation of the paper "Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding"☆43Updated 3 months ago
- ☆32Updated 2 months ago
- [CVPR 2025] Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction☆110Updated 2 months ago
- ☆43Updated 8 months ago
- PyTorch code for "ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning"☆20Updated 7 months ago
- Transactions on Multimedia (TMM25)☆14Updated 2 months ago
- [ECCV24] VISA: Reasoning Video Object Segmentation via Large Language Model☆16Updated 10 months ago
- Official repo for "Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge" ICLR2025☆53Updated 2 months ago
- ☆30Updated 4 months ago
- SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding☆41Updated this week
- Project for "LaSagnA: Language-based Segmentation Assistant for Complex Queries".☆56Updated last year
- ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration☆37Updated 5 months ago
- Official implementation of "TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models"☆23Updated this week
- This repo contains the code for our TMLR paper: A Simple Video Segmenter by Tracking Objects Along Axial Trajectories☆27Updated 2 months ago
- [ICLR 2025] Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding☆17Updated 2 months ago
- [ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion☆45Updated 4 months ago
- Official PyTorch Code of ReKV (ICLR'25)☆26Updated 2 months ago