geshang777 / pix2cap

Official Implementation of "Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning"

☆21

Alternatives and similar repositories for pix2cap

Users who are interested in pix2cap are comparing it to the repositories listed below.

Mark12Ding / Dispider
[CVPR 2025] Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
☆141Updated 7 months ago
OpenGVLab / TPO
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
☆60Updated 3 months ago
Amshaker / Mobile-VideoGPT
Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model
☆122Updated 2 months ago
HaroldChen19 / VistaDPO
[ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
☆34Updated 4 months ago
WHB139426 / Grounded-Video-LLM
[EMNLP 2025 Findings] Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
☆131Updated 2 months ago
OpenGVLab / VideoChat-R1
[NeurIPS 2025] VideoChat-R1 & R1.5: Enhancing Spatio-Temporal Perception and Reasoning via Reinforcement Fine-Tuning
☆218Updated 2 weeks ago
TencentARC / SEED-Bench-R1
☆91Updated 4 months ago
showlab / VideoLISA
[NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
☆139Updated 10 months ago
mbzuai-oryx / VideoGLaMM
[CVPR 2025 🔥] A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
☆89Updated 6 months ago
Hon-Wong / Elysium
[ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM
☆86Updated last year
Becomebright / ReKV
Official PyTorch Code of ReKV (ICLR'25)
☆62Updated 7 months ago
MacavityT / REF-VLM
☆30Updated 7 months ago
cilinyan / VISA
[ECCV24] VISA: Reasoning Video Object Segmentation via Large Language Model
☆193Updated last year
hanghuacs / FineCaption
☆37Updated 4 months ago
ziplab / LongVLM
☆104Updated last year
appletea233 / LLaVA-ST
[CVPR 2025] LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding
☆75Updated 3 months ago
Tencent / HaploVLM
ICML2025
☆59Updated 2 months ago
Visual-AI / PruneVid
[ACL 2025] PruneVid: Visual Token Pruning for Efficient Video Large Language Models
☆55Updated 5 months ago
Go2Heart / StreamFormer
[ICCV 2025 Oral] Official implementation of Learning Streaming Video Representation via Multitask Training.
☆62Updated last month
OpenGVLab / FluxViT
Make Your Training Flexible: Towards Deployment-Efficient Video Models
☆30Updated 4 months ago
CeeZh / SILVR
Official Implementation for "SiLVR: A Simple Language-based Video Reasoning Framework"
☆19Updated last month
hshjerry / VideoEspresso
[CVPR 2025 Oral] VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
☆125Updated 3 months ago
eric-ai-lab / GRIT
Official code for NeurIPS 2025 paper "GRIT: Teaching MLLMs to Think with Images"
☆159Updated 2 weeks ago
Yxxxb / VoCo-LLaMA
[CVPR'2025] VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models".
☆192Updated 4 months ago
qirui-chen / MultiHop-EgoQA
[AAAI 2025] Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos
☆27Updated 5 months ago
V-STaR-Bench / V-STaR
Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
☆29Updated 2 months ago
tulip-berkeley / open_clip
An open source implementation of CLIP (With TULIP Support)
☆163Updated 5 months ago
ruili33 / TPO
☆38Updated last month
TencentARC / Video-Holmes
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
☆76Updated 3 months ago
SHI-Labs / Slow-Fast-Video-Multimodal-LLM
☆24Updated 6 months ago