quangminhdinh / TrafficVLM

[CVPRW 2024] TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning. Official code for the 3rd place solution of the AI City Challenge 2024 Track 2.

☆48

Alternatives and similar repositories for TrafficVLM

Users who are interested in TrafficVLM are comparing it to the repositories listed below.

woven-visionai / wts-dataset
☆46Updated 5 months ago
wxh1996 / VideoAgent
☆124Updated 7 months ago
alibaba / AICITY2024_Track2_AliOpenTrek_CityLLaVA
☆54Updated last year
XLiu443 / Tem-adapter
[ICCV2023] Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
☆37Updated 2 years ago
jeffreychou777 / LOTVS-MM-AU
[CVPR2024 Highlight] The official repo for paper "Abductive Ego-View Accident Video Understanding for Safe Driving Perception"
☆62Updated 7 months ago
ZhangXJ199 / TinyLLaVA-Video-R1
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
☆107Updated 6 months ago
Hon-Wong / Elysium
[ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM
☆86Updated last year
Xuchen-Li / cv-arxiv-daily
Automatically update arXiv papers about SOT & VLT, Multi-modal Learning, LLM and Video Understanding using Github Actions.
☆39Updated this week
appletea233 / AL-Ref-SAM2
[AAAI 2025] AL-Ref-SAM 2: Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video…
☆90Updated 11 months ago
ziplab / LongVLM
☆107Updated last year
zhangce01 / HiKER-SGG
[CVPR 2024] Code for HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation
☆75Updated last year
appletea233 / LLaVA-ST
[CVPR 2025] LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding
☆77Updated 4 months ago
cyc-gh / TADS
☆10Updated last year
Xuange923 / Surveillance-Video-Understanding
Official project page of the paper "Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges" (Accep…
☆64Updated last year
om-ai-lab / GroundVLP
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection (AAAI 2024)
☆72Updated last year
Meituan-AutoML / Lenna
☆86Updated last year
TempleX98 / MoVA
[NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context
☆168Updated last year
lorebianchi98 / FG-OVD
[CVPR 2024 Highlight] Official repository of the paper "The devil is in the fine-grained details: Evaluating open-vocabulary object detec…
☆61Updated 7 months ago
hotfinda / VideoMambaPro
Improving Mamba performance on video understanding tasks
☆39Updated last year
whwu95 / FreeVA
FreeVA: Offline MLLM as Training-Free Video Assistant
☆65Updated last year
kahnchana / mvu
🤖 [ICLR'25] Multimodal Video Understanding Framework (MVU)
☆49Updated 9 months ago
mlvlab / SpeaQ
Official PyTorch implementation of "Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relati…
☆39Updated last year
LilyDaytoy / OpenPVSG
Benchmarking Panoptic Video Scene Graph Generation (PVSG), CVPR'23
☆99Updated last year
ZhengYu518 / VL-Mamba
Implementation of "VL-Mamba: Exploring State Space Models for Multimodal Learning"
☆84Updated last year
yunncheng / MMRL
[CVPR 2025] Official PyTorch Code for "MMRL: Multi-Modal Representation Learning for Vision-Language Models" and its extension "MMRL++: P…
☆83Updated 4 months ago
tychen-SJTU / MECD-Benchmark
[NeurIPS'24 spotlight] MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning. [TPAMI'25] MECD+
☆41Updated 3 weeks ago
ZhangXJ199 / TinyLLaVA-Video
A Simple Framework of Small-scale LMMs for Video Understanding
☆96Updated 5 months ago
OpenGVLab / MUTR
「AAAI 2024」 Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation
☆82Updated 5 months ago
eshoyuan / TrackGPT
TrackGPT: Track What You Need in Videos via Text Prompts
☆25Updated 2 years ago
mlvlab / RALF
Official implementation of CVPR 2024 paper "Retrieval-Augmented Open-Vocabulary Object Detection".
☆44Updated last year