zhengrongz / AoTDLinks

[CVPR 2025] Official PyTorch code of "Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation".

☆51

Alternatives and similar repositories for AoTD

Users that are interested in AoTD are comparing it to the libraries listed below

Sorting:

Becomebright / GroundVQA
Official PyTorch code of GroundVQA (CVPR'24)
☆64Updated last year
qirui-chen / MultiHop-EgoQA
[AAAI 2025] Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos
☆28Updated 5 months ago
appletea233 / Temporal-R1
Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency
☆58Updated 5 months ago
Becomebright / ReKV
Official PyTorch Code of ReKV (ICLR'25)
☆66Updated 2 weeks ago
yellow-binary-tree / HawkEye
Official implementation of HawkEye: Training Video-Text LLMs for Grounding Text in Videos
☆44Updated last year
doc-doc / NExT-GQA
Can I Trust Your Answer? Visually Grounded Video Question Answering (CVPR'24, Highlight)
☆83Updated last year
Tanveer81 / ReVisionLLM
This is the official implementation of ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos
☆35Updated 2 weeks ago
hshjerry / VideoEspresso
[CVPR 2025 Oral] VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
☆127Updated 3 months ago
minghangz / TFVTG
☆42Updated last year
OpenGVLab / TimeSuite
[ICLR 2025] TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
☆48Updated 7 months ago
HengLan / CGSTVG
[CVPR 2024] Context-Guided Spatio-Temporal Video Grounding
☆61Updated last year
Ziyang412 / VideoTree
Code for CVPR25 paper "VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos"
☆146Updated 4 months ago
egoschema / EgoSchema
☆103Updated 10 months ago
Visual-AI / PruneVid
[ACL 2025] PruneVid: Visual Token Pruning for Efficient Video Large Language Models
☆55Updated 6 months ago
ExplainableML / EgoCVR
[ECCV 2024] EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval
☆41Updated 7 months ago
Hui-design / Open-LLaVA-Video-R1
[LLaVA-Video-R1]✨First Adaptation of R1 to LLaVA-Video (2025-03-18)
☆35Updated 6 months ago
JoeLeelyf / OVO-Bench
[CVPR 2025] OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
☆103Updated 3 months ago
gyxxyg / TRACE
[ICLR 2025] TRACE: Temporal Grounding Video LLM via Casual Event Modeling
☆136Updated 2 months ago
icq-benchmark / icq-benchmark
☆20Updated 3 months ago
xiaomi-research / time-r1
[NeurIPS'25] Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
☆61Updated last month
showlab / VideoLISA
[NeurlPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
☆140Updated 10 months ago
Lzq5 / Video-Text-Alignment
☆25Updated 4 months ago
yongliang-wu / NumPro
[CVPR2025] Number it: Temporal Grounding Videos like Flipping Manga
☆126Updated last month
yaolinli / DeCo
Code for DeCo: Decoupling token compression from semanchc abstraction in multimodal large language models
☆74Updated 4 months ago
aiming-lab / ReAgent-V
[NeurIPS'25] ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding
☆42Updated 2 months ago
zhang9302002 / ThinkingWithVideos
The official code of "Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning"
☆64Updated last month
z-x-yang / DoraemonGPT
Official repository of DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models
☆88Updated last year
Jayce1kk / SpaceVLLM
SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability
☆15Updated 6 months ago
FeipengMa6 / VLoRA
[NeurIPS 2024] Visual Perception by Large Language Model’s Weights
☆54Updated 7 months ago
mll-lab-nu / TStar
TStar is a unified temporal search framework for long-form video question answering
☆71Updated 2 months ago