bytedance / video-SALMONN-2
video-SALMONN 2 is an audio-visual large language model (LLM) that generates high-quality audio-visual video captions. It is developed by the Department of Electronic Engineering at Tsinghua University and ByteDance.
☆150 · Jan 28, 2026 · Updated 2 weeks ago
Alternatives and similar repositories for video-SALMONN-2
Users interested in video-SALMONN-2 are comparing it to the libraries listed below.
- https://avocado-captioner.github.io/ ☆29 · Oct 16, 2025 · Updated 4 months ago
- OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models ☆54 · Feb 1, 2026 · Updated 2 weeks ago
- WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs ☆38 · Jan 26, 2026 · Updated 3 weeks ago
- FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection ☆24 · Updated this week
- Vox-Profile Benchmark ☆67 · Sep 12, 2025 · Updated 5 months ago
- ☆11 · Mar 11, 2025 · Updated 11 months ago
- Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation ☆28 · Dec 10, 2025 · Updated 2 months ago
- The benchmark for "Video Object Segmentation in Panoptic Wild Scenes". ☆12 · Oct 17, 2023 · Updated 2 years ago
- The official repository of Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities ☆36 · Jul 4, 2025 · Updated 7 months ago
- A new multi-shot video understanding benchmark, Shot2Story, with comprehensive video summaries and detailed shot-level captions. ☆168 · Jan 30, 2025 · Updated last year
- Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries ☆34 · Nov 19, 2025 · Updated 2 months ago
- SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability ☆16 · May 8, 2025 · Updated 9 months ago
- TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs ☆103 · Feb 2, 2026 · Updated 2 weeks ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆64 · Jul 22, 2025 · Updated 6 months ago
- ☆22 · Jan 29, 2026 · Updated 2 weeks ago
- ☆17 · Apr 7, 2025 · Updated 10 months ago
- Official implementation of the paper ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding ☆39 · Mar 16, 2025 · Updated 11 months ago
- WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning ☆49 · Dec 30, 2025 · Updated last month
- ☆16 · Apr 4, 2025 · Updated 10 months ago
- Noise reduction ☆17 · Jul 3, 2024 · Updated last year
- The official repository of our paper "Reinforcing Video Reasoning with Focused Thinking" ☆34 · Jun 12, 2025 · Updated 8 months ago
- [TCSVT 2024] Temporally Consistent Referring Video Object Segmentation with Hybrid Memory ☆19 · Apr 9, 2025 · Updated 10 months ago
- ☆36 · Jul 9, 2025 · Updated 7 months ago
- The official code of "Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning" ☆81 · Oct 15, 2025 · Updated 4 months ago
- [AAAI 26 Demo] Official repo for CAT-V - Caption Anything in Video: Object-centric Dense Video Captioning with Spatiotemporal Multimodal P… ☆64 · Jan 27, 2026 · Updated 2 weeks ago
- Quick Long Video Understanding [TMLR 2025] ☆75 · Oct 27, 2025 · Updated 3 months ago
- [EMNLP 2025 Industry] Datasets and Recipes for Video Temporal Grounding via Reinforcement Learning ☆35 · Oct 22, 2025 · Updated 3 months ago
- LEO: A powerful Hybrid Multimodal LLM ☆19 · Jan 18, 2025 · Updated last year
- ☆36 · Jun 25, 2025 · Updated 7 months ago
- The official repo for the technical report "Scalable Mask Annotation for Video Text Spotting" ☆16 · May 3, 2023 · Updated 2 years ago
- ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models ☆91 · Sep 12, 2025 · Updated 5 months ago
- Taming Self-Training for Open-Vocabulary Object Detection, CVPR 2024 ☆21 · Dec 30, 2023 · Updated 2 years ago
- ☆19 · Jul 25, 2024 · Updated last year
- Proteus is an experimental platform that combines the power of Large Language Models with the Genesis physics engine ☆25 · Dec 20, 2024 · Updated last year
- Code for the CVPR 2025 paper "VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos" ☆154 · Jun 23, 2025 · Updated 7 months ago
- ☆148 · Jul 31, 2025 · Updated 6 months ago
- ☆23 · Oct 17, 2024 · Updated last year
- Official Implementation of "Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning" ☆25 · Dec 16, 2025 · Updated 2 months ago
- [ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion ☆55 · Jul 1, 2025 · Updated 7 months ago