geshang777 / pix2cap
Official Implementation of "Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning"
☆21 Updated last week
Alternatives and similar repositories for pix2cap
Users who are interested in pix2cap are comparing it to the libraries listed below.
- [CVPR 2025] Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction ☆141 Updated 7 months ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆60 Updated 3 months ago
- Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model ☆122 Updated 2 months ago
- [ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models ☆34 Updated 4 months ago
- [EMNLP 2025 Findings] Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models ☆131 Updated 2 months ago
- [NIPS2025] VideoChat-R1 & R1.5: Enhancing Spatio-Temporal Perception and Reasoning via Reinforcement Fine-Tuning ☆218 Updated 2 weeks ago
- ☆91 Updated 4 months ago
- [NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos ☆139 Updated 10 months ago
- [CVPR 2025 🔥] A Large Multimodal Model for Pixel-Level Visual Grounding in Videos ☆89 Updated 6 months ago
- [ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM ☆86 Updated last year
- Official PyTorch Code of ReKV (ICLR'25) ☆62 Updated 7 months ago
- ☆30 Updated 7 months ago
- [ECCV24] VISA: Reasoning Video Object Segmentation via Large Language Model ☆193 Updated last year
- ☆37 Updated 4 months ago
- ☆104 Updated last year
- [CVPR 2025] LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding ☆75 Updated 3 months ago
- ICML2025 ☆59 Updated 2 months ago
- [ACL 2025] PruneVid: Visual Token Pruning for Efficient Video Large Language Models ☆55 Updated 5 months ago
- [ICCV 2025 Oral] Official implementation of Learning Streaming Video Representation via Multitask Training. ☆62 Updated last month
- Make Your Training Flexible: Towards Deployment-Efficient Video Models ☆30 Updated 4 months ago
- Official Implementation for "SiLVR: A Simple Language-based Video Reasoning Framework" ☆19 Updated last month
- [CVPR 2025 Oral] VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection ☆125 Updated 3 months ago
- Official code for NeurIPS 2025 paper "GRIT: Teaching MLLMs to Think with Images" ☆159 Updated 2 weeks ago
- [CVPR'2025] VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models". ☆192 Updated 4 months ago
- [AAAI 2025] Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos ☆27 Updated 5 months ago
- Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning ☆29 Updated 2 months ago
- An open source implementation of CLIP (With TULIP Support) ☆163 Updated 5 months ago
- ☆38 Updated last month
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? ☆76 Updated 3 months ago
- ☆24 Updated 6 months ago