bytedance / video-SALMONN-2
video-SALMONN 2 is an audio-visual large language model (LLM) that generates high-quality audio-visual video captions. It is developed by the Department of Electronic Engineering at Tsinghua University and ByteDance.
☆150 · Jan 28, 2026 · Updated 2 weeks ago
Alternatives and similar repositories for video-SALMONN-2
Users interested in video-SALMONN-2 are comparing it to the libraries listed below.
- https://avocado-captioner.github.io/ ☆29 · Oct 16, 2025 · Updated 4 months ago
- OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models ☆54 · Feb 1, 2026 · Updated 2 weeks ago
- WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs ☆38 · Jan 26, 2026 · Updated 3 weeks ago
- FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection ☆24 · Updated this week
- Vox-Profile Benchmark ☆67 · Sep 12, 2025 · Updated 5 months ago
- ☆11 · Mar 11, 2025 · Updated 11 months ago
- Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation ☆28 · Dec 10, 2025 · Updated 2 months ago
- The benchmark for "Video Object Segmentation in Panoptic Wild Scenes". ☆12 · Oct 17, 2023 · Updated 2 years ago
- The official repository of Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities ☆36 · Jul 4, 2025 · Updated 7 months ago
- A new multi-shot video understanding benchmark, Shot2Story, with comprehensive video summaries and detailed shot-level captions. ☆168 · Jan 30, 2025 · Updated last year
- Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries ☆34 · Nov 19, 2025 · Updated 2 months ago
- SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability ☆16 · May 8, 2025 · Updated 9 months ago
- TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs ☆103 · Feb 2, 2026 · Updated 2 weeks ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆64 · Jul 22, 2025 · Updated 6 months ago
- ☆22 · Jan 29, 2026 · Updated 2 weeks ago
- ☆17 · Apr 7, 2025 · Updated 10 months ago
- Official implementation of the paper ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding ☆39 · Mar 16, 2025 · Updated 11 months ago
- WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning ☆49 · Dec 30, 2025 · Updated last month
- ☆16 · Apr 4, 2025 · Updated 10 months ago
- Noise reduction ☆17 · Jul 3, 2024 · Updated last year
- The official repository of our paper "Reinforcing Video Reasoning with Focused Thinking" ☆34 · Jun 12, 2025 · Updated 8 months ago
- [TCSVT 2024] Temporally Consistent Referring Video Object Segmentation with Hybrid Memory ☆19 · Apr 9, 2025 · Updated 10 months ago
- ☆36 · Jul 9, 2025 · Updated 7 months ago
- The official code of "Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning" ☆81 · Oct 15, 2025 · Updated 4 months ago
- [AAAI 26 Demo] Official repo for CAT-V - Caption Anything in Video: Object-centric Dense Video Captioning with Spatiotemporal Multimodal P… ☆64 · Jan 27, 2026 · Updated 2 weeks ago
- Quick Long Video Understanding [TMLR 2025] ☆75 · Oct 27, 2025 · Updated 3 months ago
- [EMNLP 2025 Industry] Datasets and Recipes for Video Temporal Grounding via Reinforcement Learning ☆35 · Oct 22, 2025 · Updated 3 months ago
- LEO: A powerful Hybrid Multimodal LLM ☆19 · Jan 18, 2025 · Updated last year
- ☆36 · Jun 25, 2025 · Updated 7 months ago
- The official repo for the technical report "Scalable Mask Annotation for Video Text Spotting" ☆16 · May 3, 2023 · Updated 2 years ago
- ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models ☆91 · Sep 12, 2025 · Updated 5 months ago
- Taming Self-Training for Open-Vocabulary Object Detection, CVPR 2024 ☆21 · Dec 30, 2023 · Updated 2 years ago
- ☆19 · Jul 25, 2024 · Updated last year
- Proteus is an experimental platform that combines the power of Large Language Models with the Genesis physics engine ☆25 · Dec 20, 2024 · Updated last year
- Code for the CVPR 2025 paper "VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos" ☆154 · Jun 23, 2025 · Updated 7 months ago
- ☆148 · Jul 31, 2025 · Updated 6 months ago
- ☆23 · Oct 17, 2024 · Updated last year
- Official Implementation of "Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning" ☆25 · Dec 16, 2025 · Updated 2 months ago
- [ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion ☆55 · Jul 1, 2025 · Updated 7 months ago