appletea233/LLaVA-ST

[CVPR 2025] LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding

☆84

Alternatives and similar repositories for LLaVA-ST

Users who are interested in LLaVA-ST are comparing it to the repositories listed below.

WHB139426 / Grounded-Video-LLM
[EMNLP 2025 Findings] Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
☆149 · Aug 21, 2025 · Updated 11 months ago
minjoong507 / Consistency-of-Video-LLM
[CVPR 2025] Official Repository of the paper "On the Consistency of Video Large Language Models in Temporal Comprehension"
☆16 · Oct 13, 2025 · Updated 9 months ago
V-STaR-Bench / V-STaR
Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
☆45 · Mar 2, 2026 · Updated 4 months ago
HengLan / TA-STVG
[ICLR 2025] Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding
☆44 · Mar 18, 2025 · Updated last year
ziqipang / MR-Video
MR. Video: MapReduce is the Principle for Long Video Understanding
☆31 · Jun 18, 2026 · Updated last month
Lzq5 / UniTime
Universal Video Temporal Grounding with Generative Multi-modal Large Language Models
☆56 · May 20, 2026 · Updated 2 months ago
appletea233 / Temporal-R1
Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency
☆62 · Jun 6, 2025 · Updated last year
PolyU-ChenLab / ETBench
👾 E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding (NeurIPS 2024)
☆74 · Jan 20, 2025 · Updated last year
TimeMarker-LLM / TimeMarker
A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
☆107 · Nov 28, 2024 · Updated last year
qirui-chen / RGA3-release
[ICCV 2025] Object-centric Video Question Answering with Visual Grounding and Referring
☆24 · Aug 8, 2025 · Updated 11 months ago
HuiGuanLab / RaTSG
This repository contains the implementation of our NeurIPS'24 paper "Temporal Sentence Grounding with Relevance Feedback in Videos"
☆13 · Aug 22, 2025 · Updated 11 months ago
lntzm / MESM
The official code of Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval (AAAI2024)
☆32 · Mar 29, 2024 · Updated 2 years ago
Tanveer81 / ReVisionLLM
This is the official implementation of ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos
☆47 · Nov 5, 2025 · Updated 8 months ago
SalesforceAIResearch / strefer
Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data
☆19 · Jun 2, 2026 · Updated last month
Hon-Wong / ByteVideoLLM
[ICCV 2025] Dynamic-VLM
☆28 · Dec 16, 2024 · Updated last year
mbzuai-oryx / VideoGLaMM
[CVPR 2025 🔥] A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
☆104 · Apr 14, 2025 · Updated last year
zhang9302002 / ThinkingWithVideos
The official code of "Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning"
☆102 · Oct 15, 2025 · Updated 9 months ago
yingsen1 / UniMD
UniMD: Towards Unifying Moment retrieval and temporal action Detection
☆57 · Jul 5, 2024 · Updated 2 years ago
DCDmllm / Momentor
☆81 · Nov 24, 2024 · Updated last year
OpenGVLab / VideoChat-R1
[NIPS2025] VideoChat-R1 & R1.5: Enhancing Spatio-Temporal Perception and Reasoning via Reinforcement Fine-Tuning
☆268 · Oct 18, 2025 · Updated 9 months ago
marinero4972 / Open-o3-Video
[ICML 2026] Official implementation of "Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence"
☆157 · May 1, 2026 · Updated 2 months ago
hshjerry / VideoEspresso
[CVPR 2025 Oral] VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
☆140 · Jul 28, 2025 · Updated 11 months ago
josephzpng / DisTime
DisTime: Distribution-based Time Representation for Video Large Language Models.
☆21 · Jul 10, 2025 · Updated last year
ttgeng233 / LongVALE
LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos (CVPR 2025)
☆61 · Jun 9, 2025 · Updated last year
minghangz / TFVTG
☆57 · Sep 13, 2024 · Updated last year
yellow-binary-tree / HawkEye
Official implementation of HawkEye: Training Video-Text LLMs for Grounding Text in Videos
☆47 · Apr 29, 2024 · Updated 2 years ago
HengLan / CGSTVG
[CVPR 2024] Context-Guided Spatio-Temporal Video Grounding
☆66 · Jun 28, 2024 · Updated 2 years ago
THUNLP-MT / MUSEG
Repo for paper "MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding".
☆40 · Jun 9, 2025 · Updated last year
TencentARC / TimeLens
[CVPR 2026] TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
☆161 · Updated this week
jbistanbul / universalvtg
Official Code for the paper "UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding"
☆15 · Apr 15, 2026 · Updated 3 months ago
houzhijian / CONE
[2023 ACL] CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding
☆31 · Aug 5, 2023 · Updated 2 years ago
mlvlab / ST-VLM
☆13 · Mar 28, 2025 · Updated last year
Andy-Cheng / TEMPURA
TEMPURA enables video-language models to reason about causal event relationships and generate fine-grained, timestamped descriptions of u…
☆27 · Jun 4, 2025 · Updated last year
mbzuai-oryx / VideoMolmo
Official code of the paper "VideoMolmo: Spatio-Temporal Grounding meets Pointing"
☆56 · Jul 5, 2025 · Updated last year
hwjiang1510 / VQLoC
(NeurIPS 2023) Open-set visual object query search & localization in long-form videos
☆26 · Feb 1, 2024 · Updated 2 years ago
ttgeng233 / UniAV
Unified Audio-Visual Perception for Multi-Task Video Localization
☆33 · Apr 19, 2024 · Updated 2 years ago
ludc506 / InternVL-X
☆16 · Mar 26, 2025 · Updated last year
xiaoqian-shen / Vgent
[NeurIPS 2025 Spotlight] Official PyTorch implementation of Vgent
☆48 · Nov 30, 2025 · Updated 7 months ago
gyxxyg / TRACE
[ICLR 2025] TRACE: Temporal Grounding Video LLM via Causal Event Modeling
☆156 · Aug 22, 2025 · Updated 11 months ago