yaolinli/TimeChat-Captioner


[ICML 2026] Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

☆47

Alternatives and similar repositories for TimeChat-Captioner

Users interested in TimeChat-Captioner are comparing it to the repositories listed below.

WPR001 / UGC_VideoCaptioner
☆15 · Jun 23, 2026 · Updated 3 weeks ago
Lliar-liar / Daily-Omni
This is the official repository of Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities
☆42 · Apr 28, 2026 · Updated 2 months ago
AVoCaDO-Captioner / AVoCaDO
https://avocado-captioner.github.io/
☆37 · Oct 16, 2025 · Updated 9 months ago
WeChatCV / D-ORCA
D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning
☆15 · Feb 11, 2026 · Updated 5 months ago
InternLM / CapRL
[ICLR 2026] An official implementation of "CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning"
☆225 · Jun 23, 2026 · Updated 3 weeks ago
HVision-NKU / ASID-Caption
ASID-Caption: Attribute-Structured and Quality-Verified Audiovisual Instruction Dataset and Training Pipeline for Fine-Grained Video Unde…
☆67 · Mar 3, 2026 · Updated 4 months ago
multimodal-art-projection / OmniBench
A project for tri-modal LLM benchmarking and instruction tuning.
☆61 · Mar 27, 2025 · Updated last year
bytedance / video-SALMONN-2
video-SALMONN 2 is a powerful audio-visual large language model (LLM) that generates high-quality audio-visual video captions, which is d…
☆204 · Feb 23, 2026 · Updated 4 months ago
markywg / transagent
[NeurIPS 2024] TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration
☆25 · Oct 17, 2024 · Updated last year
dingyue772 / OmniSIFT
[ICML2026] OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models
☆25 · May 21, 2026 · Updated last month
TencentARC / Video-Holmes
[ECCV 2026] Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
☆95 · Jul 13, 2025 · Updated last year
NJU-LINK / T2AV-Compass
The Source Code for T2AV-Compass @ ICML 2026
☆20 · Jun 21, 2026 · Updated last month
OmniMMI / OmniMMI
[CVPR 2025] OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
☆23 · Updated this week
yaolinli / GenS
[ACL 2025 Findings] GenS: Generative Frame Sampler for Long Video Understanding
☆22 · Aug 21, 2025 · Updated 11 months ago
mbzuai-oryx / TrackingMeetsLMM
☆10 · Apr 7, 2025 · Updated last year
HumanMLLM / HumanOmniV2
☆161 · Jul 31, 2025 · Updated 11 months ago
yaolinli / TimeChat-Online
[ACM MM 2025] TimeChat-online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
☆132 · Jun 29, 2026 · Updated 3 weeks ago
yellow-binary-tree / MMDuet2
[ICLR 2026] MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning
☆40 · Jan 14, 2026 · Updated 6 months ago
BRZ911 / ViTCoT
[ACM MM 2025] ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models
☆18 · Jul 15, 2025 · Updated last year
JaaackHongggg / WorldSense
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
☆50 · Jul 12, 2026 · Updated last week
UCSB-AI / MMWorld
Official repo of the ICLR 2025 paper "MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos"
☆28 · Jul 15, 2025 · Updated last year
KlingAIResearch / MemFlow
Official Implementation of "MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives"
☆214 · Dec 29, 2025 · Updated 6 months ago
JavisVerse / Awesome-AVI
Awesome Audio-Visual Intelligence, Survey of Audio-Visual Intelligence
☆83 · May 8, 2026 · Updated 2 months ago
ali-vilab / DreamRelation
[ICCV2025] The official code of "DreamRelation: Relation-Centric Video Customization"
☆27 · Feb 4, 2026 · Updated 5 months ago
apple / ml-streambridge
☆40 · Nov 5, 2025 · Updated 8 months ago
FudanCVL / AVI-Bench
[ICML'26] Toward Human-like Audio-Visual Intelligence of Omni-MLLMs
☆15 · Jun 20, 2026 · Updated last month
Vchitect / Cut2Next
Cut2Next: Generating Next Shot via In-Context Tuning
☆33 · Aug 21, 2025 · Updated 11 months ago
www-Ye / Time-R1
R1-like Video-LLM for Temporal Grounding
☆138 · Jun 20, 2025 · Updated last year
TerryPei / CSP
Cross-Self KV Cache Pruning for Efficient Vision-Language Inference
☆10 · Dec 15, 2024 · Updated last year
marinero4972 / CyberV
☆20 · Jun 10, 2025 · Updated last year
byhuang123 / PoCo
[CVPR2026] Official implementation of our paper “Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot…
☆19 · Apr 8, 2026 · Updated 3 months ago
ttgeng233 / LongVALE
LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos. (CVPR 2025)
☆61 · Jun 9, 2025 · Updated last year
jiyt17 / IDA-VLM
[ICLR 2025] IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model
☆37 · Nov 27, 2024 · Updated last year
JPShi12 / VideoLoom
[ICML 2026] VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding
☆27 · Jul 3, 2026 · Updated 2 weeks ago
bronyayang / CaptionQA
[CVPR '26] CaptionQA: Is Your Caption as Useful as the Image Itself?
☆38 · Mar 3, 2026 · Updated 4 months ago
HumanMLLM / ViSpeak
(ICCV2025) Official repository of paper "ViSpeak: Visual Instruction Feedback in Streaming Videos"
☆52 · Jul 1, 2025 · Updated last year
yehonathanlitman / EditCtrl
[CVPR 2026] EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing
☆46 · Jul 13, 2026 · Updated last week
leoisufa / ICVE
[Preprint 2025] ICVE: In-Context Learning with Unpaired Clips for Instruction-based Video Editing
☆25 · Jun 2, 2026 · Updated last month
snap-research / VIMI
☆13 · Jul 10, 2024 · Updated 2 years ago