bytedance / tarsier
Tarsier: a family of large-scale video-language models designed to generate high-quality video descriptions, with strong general video understanding capability.
⭐439 · Updated 3 months ago
Alternatives and similar repositories for tarsier
Users interested in tarsier are comparing it to the repositories listed below.
- [ICML 2025] Official PyTorch implementation of LongVU · ⭐392 · Updated 2 months ago
- 🔥🔥 First-ever hour-scale video understanding models · ⭐517 · Updated 3 weeks ago
- VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling · ⭐451 · Updated last month
- Long Context Transfer from Language to Vision · ⭐388 · Updated 4 months ago
- Official implementation of [ICCV 2025] "Flash-VStream: Efficient Real-Time Understanding for Long Video Streams" · ⭐216 · Updated 3 weeks ago
- Official repository for the paper PLLaVA · ⭐663 · Updated last year
- ✨✨ [CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis · ⭐611 · Updated 2 months ago
- Multimodal Models in Real World · ⭐530 · Updated 5 months ago
- Video-R1: Reinforcing Video Reasoning in MLLMs [🔥 the first paper to explore R1 for video] · ⭐646 · Updated last week
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models · ⭐235 · Updated 10 months ago
- Official code for the Goldfish model for long-video understanding and MiniGPT4-video for short-video understanding · ⭐628 · Updated 7 months ago
- [CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding · ⭐637 · Updated 6 months ago
- 💡 VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning