geshang777 / pix2cap
[arXiv'25] Official Implementation of "Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning"
☆16 Updated 2 months ago
Alternatives and similar repositories for pix2cap:
Users who are interested in pix2cap are comparing it to the repositories listed below.
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆45 Updated 2 months ago
- [CVPR 2025] LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding ☆35 Updated last month
- Harnessing CLIP, DINO and SAM for Open Vocabulary Segmentation ☆44 Updated 3 weeks ago
- Project for "LaSagnA: Language-based Segmentation Assistant for Complex Queries". ☆53 Updated 11 months ago
- 💡 VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning ☆37 Updated this week
- [CVPR 2025] OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? ☆40 Updated last week
- 🤖 [ICLR'25] Multimodal Video Understanding Framework (MVU) ☆31 Updated last month
- Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model ☆49 Updated 2 months ago
- [EMNLP 2023] TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding ☆49 Updated last year
- An open source implementation of CLIP (With TULIP Support) ☆113 Updated last week
- Official repository of "Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning" ☆29 Updated last month
- This repo contains the code for our TMLR paper: A Simple Video Segmenter by Tracking Objects Along Axial Trajectories ☆27 Updated last week
- [NeurIPS 2024] Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective ☆66 Updated 4 months ago
- PyTorch code for "ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning" ☆19 Updated 5 months ago
- ☆29 Updated 2 weeks ago
- LLMBind: A Unified Modality-Task Integration Framework ☆18 Updated 9 months ago
- Official Implementation of "Open-Vocabulary Audio-Visual Semantic Segmentation" [ACM MM 2024 Oral]. ☆23 Updated 4 months ago
- FreeVA: Offline MLLM as Training-Free Video Assistant ☆57 Updated 9 months ago
- Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models ☆96 Updated last week
- ☆27 Updated 2 months ago
- [NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos ☆110 Updated 3 months ago
- The official repository for paper "PruneVid: Visual Token Pruning for Efficient Video Large Language Models". ☆35 Updated last month
- Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning ☆16 Updated last week
- ☆16 Updated this week
- Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision ☆36 Updated last week
- Code for paper "VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos" ☆102 Updated last month
- VisRL: Intention-Driven Visual Perception via Reinforced Reasoning ☆20 Updated last week
- ☆13 Updated 6 months ago
- [AAAI-2025] The official code of Densely Connected Parameter-Efficient Tuning for Referring Image Segmentation ☆28 Updated 2 weeks ago
- Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models ☆75 Updated 6 months ago