zzzhhzzz / Ground-R1

☆15

Alternatives and similar repositories for Ground-R1

Users who are interested in Ground-R1 are comparing it to the repositories listed below.

shilinyan99 / CrossLMM
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
☆23 Updated last month
Haochen-Wang409 / ross
[ICLR'25] Reconstructive Visual Instruction Tuning
☆98 Updated 3 months ago
baoxiaoyi / CoReS
code for the paper "CoReS: Orchestrating the Dance of Reasoning and Segmentation"
☆18 Updated 4 months ago
showlab / VideoLISA
[NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
☆123 Updated 6 months ago
zhousheng97 / EgoTextVQA
[CVPR'25] 🌟🌟 EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering
☆34 Updated 3 weeks ago
Haochen-Wang409 / ross3d
Official implementation of "Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness".
☆44 Updated 3 weeks ago
OuyangKun10 / SpaceR
SpaceR: The first MLLM empowered by SG-RLVR for video spatial reasoning
☆69 Updated last week
yu-rp / VisualPerceptionToken
☆89 Updated 3 months ago
songw-zju / PixelThink
The official implementation of "PixelThink: Towards Efficient Chain-of-Pixel Reasoning" (arXiv 2025)
☆35 Updated last month
VCG-team / DiffSegmenter
Official implementation for "Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter"
☆42 Updated last year
Becomebright / ReKV
Official PyTorch Code of ReKV (ICLR'25)
☆35 Updated 4 months ago
ch3cook-fdu / Vote2Cap-DETR
[CVPR 2023] Vote2Cap-DETR and [T-PAMI 2024] Vote2Cap-DETR++; A set-to-set perspective towards 3D Dense Captioning; State-of-the-Art 3D De…
☆96 Updated 11 months ago
Visual-AI / PruneVid
The official repository for ACL2025 paper "PruneVid: Visual Token Pruning for Efficient Video Large Language Models".
☆49 Updated 2 months ago
clownrat6 / OpenVIS
[AAAI 2025] Open-vocabulary Video Instance Segmentation Codebase built upon Detectron2, which is really easy to use.
☆23 Updated 6 months ago
appletea233 / AL-Ref-SAM2
[AAAI 2025] AL-Ref-SAM 2: Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video…
☆84 Updated 6 months ago
Hui-design / Open-LLaVA-Video-R1
[LLaVA-Video-R1]✨First Adaptation of R1 to LLaVA-Video (2025-03-18)
☆29 Updated 2 months ago
congvvc / InstructSeg
[ICCV 2025] Official implementation of "InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models"
☆43 Updated 5 months ago
ProvenceStar / PartGLEE
[ECCV2024] PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
☆50 Updated 10 months ago
TungChintao / FlowCut
Official repository for “FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models”
☆19 Updated 2 weeks ago
JoeLeelyf / OVO-Bench
[CVPR 2025] OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
☆75 Updated 3 months ago
shxie2020 / Awesome-UGVFM
A collection of vision foundation models unifying understanding and generation.
☆57 Updated 6 months ago
franciszzj / OpenPSG
[ECCV 2024] OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models
☆47 Updated 6 months ago
sosppxo / MDIN
[MM2024 Oral] 3D-GRES: Generalized 3D Referring Expression Segmentation
☆37 Updated 7 months ago
ncTimTang / AKS
[CVPR 2025] Adaptive Keyframe Sampling for Long Video Understanding
☆80 Updated 2 months ago
hmxiong / StreamChat
Official repo for "Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge" ICLR2025
☆59 Updated 4 months ago
z-x-yang / DoraemonGPT
Official repository of DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models
☆85 Updated 10 months ago
Wang-Xiaodong1899 / CVPR25-MLLM-Paper-List
🔥CVPR 2025 Multimodal Large Language Models Paper List
☆147 Updated 4 months ago
Cooperx521 / ScaleCap
Official repository of "ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing"
☆50 Updated 3 weeks ago
PhoenixZ810 / RISEBench
Official Repository of paper: Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
☆69 Updated last week
Leon1207 / 3DRefTR
This is a PyTorch implementation of 3DRefTR proposed by our paper "A Unified Framework for 3D Point Cloud Visual Grounding"
☆24 Updated last year