Official Implementation of "Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning"
☆25 · Dec 16, 2025 · Updated 2 months ago
Alternatives and similar repositories for pix2cap
Users who are interested in pix2cap are comparing it to the repositories listed below.
- ☆30 · Jan 18, 2026 · Updated last month
- FNIN: A Fourier Neural Operator-based Numerical Integration Network for Surface-form-gradients ☆13 · Jan 22, 2025 · Updated last year
- [AAAI 2025] Official Implementation of "FOCUS: Towards Universal Foreground Segmentation" ☆56 · Jul 8, 2025 · Updated 7 months ago
- Problem-Oriented Segmentation and Retrieval EMNLP 2024 Findings ☆34 · Nov 12, 2024 · Updated last year
- SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability ☆16 · May 8, 2025 · Updated 9 months ago
- [NeurIPS 2025] The official repository of "Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tun… ☆40 · Feb 20, 2025 · Updated last year
- ☆16 · Apr 4, 2025 · Updated 11 months ago
- [NeurIPS-W 2025] Official Implementation of "Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning" ☆60 · Jul 1, 2025 · Updated 8 months ago
- code for downloading videos from HowTo100M dataset ☆17 · May 13, 2021 · Updated 4 years ago
- ☆21 · Jan 17, 2025 · Updated last year
- Improving Your Model Ranking on Chatbot Arena by Vote Rigging (ICML 2025) ☆26 · Feb 25, 2025 · Updated last year
- The official implementation of Cross-Task Experience Sharing (COPS) ☆29 · Oct 23, 2024 · Updated last year
- ☆24 · Dec 26, 2024 · Updated last year
- This repository contains the resource introduced in the paper: "Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis"… ☆25 · Oct 15, 2025 · Updated 4 months ago
- This repo contains the code for our TMLR paper: A Simple Video Segmenter by Tracking Objects Along Axial Trajectories ☆27 · Mar 20, 2025 · Updated 11 months ago
- Universal Video Temporal Grounding with Generative Multi-modal Large Language Models ☆46 · Nov 25, 2025 · Updated 3 months ago
- OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models ☆56 · Feb 1, 2026 · Updated last month
- Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning ☆41 · Aug 4, 2025 · Updated 7 months ago
- This is an official implementation for "SAM-Swin: SAM-Driven Dual-Swin Transformers with Adaptive Lesion Enhancement for Laryngo-Pharynge… ☆31 · Oct 4, 2025 · Updated 5 months ago
- [ECCV'24 Workshops Oral] DALDA: Data Augmentation Leveraging Diffusion Model and LLM with Adaptive Guidance Scaling ☆31 · Feb 6, 2026 · Updated last month
- [IJCV 2025] MIM4D: Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning ☆77 · May 30, 2025 · Updated 9 months ago
- The official implementation of our work Hawkeye: Discovering and Grounding Implicit Anomalous Sentiment in Recon-videos via Scene-enhanc… ☆12 · Oct 14, 2024 · Updated last year
- [NeurIPS 2024] TALoS: Enhancing Semantic Scene Completion via Test-time Adaptation on the Line of Sight ☆35 · Feb 27, 2026 · Updated last week
- [ECCV 2024] Beyond MOT: Semantic Multi-Object Tracking ☆29 · Sep 12, 2024 · Updated last year
- Codes for ICML 2023 Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation ☆37 · Sep 12, 2023 · Updated 2 years ago
- XmodelLM ☆38 · Nov 19, 2024 · Updated last year
- WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs ☆41 · Jan 26, 2026 · Updated last month
- [ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models ☆39 · Jun 14, 2025 · Updated 8 months ago
- (NeurIPS 2024) Official repository of paper "Frozen-DETR: Enhancing DETR with Image Understanding from Frozen Foundation Models" ☆35 · Mar 22, 2025 · Updated 11 months ago
- Finetuning & extending DiffusionDet to video & pedestrian multi-object-tracking ☆13 · Apr 12, 2023 · Updated 2 years ago
- Breaking the SSL-AL Barrier: A Synergistic Semi-Supervised Active Learning Framework for 3D Object Detection ☆13 · Mar 23, 2025 · Updated 11 months ago
- The repository of VG-Refiner paper ☆17 · Dec 9, 2025 · Updated 2 months ago
- This is the pytorch implement of our paper "CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware… ☆37 · Nov 20, 2024 · Updated last year
- Official repository of the IEEE SLT 2024 paper "Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT" ☆45 · Updated this week
- PyTorch Implementation of "ASTRA: An Action Spotting TRAnsformer for Soccer Videos", ACM MMSports 2023. | 3rd place solution for SoccerNe… ☆42 · May 20, 2024 · Updated last year
- ☆10 · Apr 24, 2024 · Updated last year
- ☆11 · Jan 18, 2025 · Updated last year
- DisTime: Distribution-based Time Representation for Video Large Language Models. ☆19 · Jul 10, 2025 · Updated 7 months ago
- ☆10 · Apr 7, 2025 · Updated 11 months ago