geshang777 / pix2cap
Official Implementation of "Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning"
☆25 · Dec 16, 2025 · Updated last month
Alternatives and similar repositories for pix2cap
Users who are interested in pix2cap are comparing it to the libraries listed below
- ☆30 · Jan 18, 2026 · Updated 3 weeks ago
- FNIN: A Fourier Neural Operator-based Numerical Integration Network for Surface-form-gradients ☆13 · Jan 22, 2025 · Updated last year
- SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability ☆16 · May 8, 2025 · Updated 9 months ago
- [NeurIPS 2025] The official repository of "Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tun… ☆40 · Feb 20, 2025 · Updated 11 months ago
- ☆16 · Apr 4, 2025 · Updated 10 months ago
- [NeurIPS-W 2025] Official Implementation of "Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning" ☆59 · Jul 1, 2025 · Updated 7 months ago
- code for downloading videos from HowTo100M dataset ☆16 · May 13, 2021 · Updated 4 years ago
- ☆21 · Jan 17, 2025 · Updated last year
- Improving Your Model Ranking on Chatbot Arena by Vote Rigging (ICML 2025) ☆26 · Feb 25, 2025 · Updated 11 months ago
- (CVPR 2025) Official implementation to DELT: A Simple Diversity-driven EarlyLate Training for Dataset Distillation which outperforms SOTA… ☆26 · Aug 23, 2025 · Updated 5 months ago
- This repository contains the resource introduced in the paper: "Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis"… ☆25 · Oct 15, 2025 · Updated 3 months ago
- ☆24 · Dec 26, 2024 · Updated last year
- OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models ☆54 · Feb 1, 2026 · Updated last week
- Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning ☆41 · Aug 4, 2025 · Updated 6 months ago
- This is an official implementation for "SAM-Swin: SAM-Driven Dual-Swin Transformers with Adaptive Lesion Enhancement for Laryngo-Pharynge… ☆31 · Oct 4, 2025 · Updated 4 months ago
- [ECCV'24 Workshops Oral] DALDA: Data Augmentation Leveraging Diffusion Model and LLM with Adaptive Guidance Scaling ☆31 · Feb 6, 2026 · Updated last week
- [IJCV 2025] MIM4D: Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning ☆76 · May 30, 2025 · Updated 8 months ago
- The official implementation of our work Hawkeye: Discovering and Grounding Implicit Anomalous Sentiment in Recon-videos via Scene-enhanc… ☆12 · Oct 14, 2024 · Updated last year
- WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs ☆38 · Jan 26, 2026 · Updated 2 weeks ago
- [NeurIPS 2024] TALoS: Enhancing Semantic Scene Completion via Test-time Adaptation on the Line of Sight ☆34 · Dec 3, 2025 · Updated 2 months ago
- Codes for ICML 2023 Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation ☆37 · Sep 12, 2023 · Updated 2 years ago
- XmodelLM ☆38 · Nov 19, 2024 · Updated last year
- [ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models ☆39 · Jun 14, 2025 · Updated 8 months ago
- This is the pytorch implement of our paper "CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware… ☆37 · Nov 20, 2024 · Updated last year
- DisTime: Distribution-based Time Representation for Video Large Language Models. ☆18 · Jul 10, 2025 · Updated 7 months ago
- (NeurIPS 2024) Official repository of paper "Frozen-DETR: Enhancing DETR with Image Understanding from Frozen Foundation Models" ☆35 · Mar 22, 2025 · Updated 10 months ago
- Finetuning & extending DiffusionDet to video & pedestrian multi-object-tracking ☆13 · Apr 12, 2023 · Updated 2 years ago
- PyTorch Implementation of "ASTRA: An Action Spotting TRAnsformer for Soccer Videos", ACM MMSports 2023. | 3rd place solution for SoccerNe… ☆41 · May 20, 2024 · Updated last year
- ☆10 · Apr 7, 2025 · Updated 10 months ago
- ☆11 · Jan 18, 2025 · Updated last year
- [CVPR 2024] LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation ☆13 · Jun 17, 2024 · Updated last year
- ☆10 · Apr 24, 2024 · Updated last year
- AutoTrackAnything is a universal, flexible and interactive tool for insane automatic object tracking over thousands of frames. It is deve… ☆92 · Apr 8, 2024 · Updated last year
- Code for paper: Reinforced Vision Perception with Tools ☆69 · Oct 3, 2025 · Updated 4 months ago
- [ICCV'25 Oral] The official implementation of Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion ☆63 · Jul 24, 2025 · Updated 6 months ago
- [WACV 2025] Official implementation of "Online-LoRA: Task-free Online Continual Learning via Low Rank Adaptation" by Xiwen Wei, Guihong L… ☆55 · Aug 26, 2025 · Updated 5 months ago
- ☆13 · Jan 21, 2025 · Updated last year
- [ICLR 2026] Official repo for "FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting" ☆37 · Oct 9, 2025 · Updated 4 months ago
- Official pytorch implementation of "Tool-R1: Sample-Efficient Reinforcement Learning for Agentic Tool Use" ☆20 · Sep 16, 2025 · Updated 4 months ago