geshang777 / pix2cap
[arXiv'25] Official Implementation of "Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning"
☆16 · Updated 3 months ago
Alternatives and similar repositories for pix2cap
Users interested in pix2cap are comparing it to the repositories listed below.
- FreeVA: Offline MLLM as Training-Free Video Assistant ☆61 · Updated 11 months ago
- Code for "CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning" ☆15 · Updated last month
- [EMNLP 2023] TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding ☆50 · Updated last year
- PyTorch code for "ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning" ☆20 · Updated 6 months ago
- ☆80 · Updated last month
- [ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models ☆23 · Updated 2 weeks ago
- Code for CVPR 2025 paper "VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos" ☆111 · Updated 2 months ago
- VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning ☆121 · Updated last week
- Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model ☆60 · Updated 4 months ago
- [NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos ☆118 · Updated 4 months ago
- ☆32 · Updated 3 months ago
- ☆31 · Updated 2 months ago
- Official repo for CAT-V - Caption Anything in Video: Object-centric Dense Video Captioning with Spatiotemporal Multimodal Prompting ☆37 · Updated 3 weeks ago
- This repo contains the code for our TMLR paper "A Simple Video Segmenter by Tracking Objects Along Axial Trajectories" ☆27 · Updated last month
- Official repository of "Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning" ☆30 · Updated 2 months ago
- [CVPR 2025] Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction ☆106 · Updated last month
- Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model ☆92 · Updated last month
- [ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion ☆45 · Updated 3 months ago
- [CVPR 2025 🔥] A Large Multimodal Model for Pixel-Level Visual Grounding in Videos ☆64 · Updated last month
- Official PyTorch code of GroundVQA (CVPR'24) ☆61 · Updated 8 months ago
- PyTorch code for "Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training" ☆34 · Updated last year
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆50 · Updated 4 months ago
- [AAAI 2025] Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos ☆24 · Updated last month
- 🤖 [ICLR'25] Multimodal Video Understanding Framework (MVU) ☆40 · Updated 3 months ago
- [ICLR 2025] Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision ☆62 · Updated 10 months ago
- ☆15 · Updated last month
- [ECCV 2024] Learning Video Context as Interleaved Multimodal Sequences ☆39 · Updated 2 months ago
- VisRL: Intention-Driven Visual Perception via Reinforced Reasoning ☆28 · Updated 2 months ago
- Language Repository for Long Video Understanding ☆31 · Updated 11 months ago
- [ICLR 2025] TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning ☆33 · Updated last month