hyc2026 / StoryTeller
☆66Updated 2 months ago
Alternatives and similar repositories for StoryTeller:
Users that are interested in StoryTeller are comparing it to the libraries listed below
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model☆41Updated 5 months ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment☆39Updated 3 weeks ago
- ☆173Updated 6 months ago
- [NeurIPS 2024 D&B Track] Official Repo for "LVD-2M: A Long-take Video Dataset with Temporally Dense Captions"☆45Updated 3 months ago
- Video dataset dedicated to portrait-mode video recognition.☆43Updated last month
- [NeurIPS 2024] VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models☆128Updated 4 months ago
- Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions (NeurIPS 2024)☆152Updated 6 months ago
- ☆133Updated 2 weeks ago
- ECCV2024_Parrot Captions Teach CLIP to Spot Text☆63Updated 4 months ago
- Repository for 23'MM accepted paper "Curriculum-Listener: Consistency- and Complementarity-Aware Audio-Enhanced Temporal Sentence Groundi…☆47Updated last year
- [ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark☆66Updated last week
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, …☆97Updated 2 months ago
- ☆47Updated last month
- Official code for 'Paragraph-to-Image Generation with Information-Enriched Diffusion Model'☆102Updated 2 months ago
- Official repo for StableLLAVA☆94Updated last year
- Code release for our NeurIPS 2024 Spotlight paper "GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing"☆99Updated 3 months ago
- Official implementation of MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis☆83Updated 6 months ago
- Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models☆75Updated last month
- A new multi-shot video understanding benchmark Shot2Story with comprehensive video summaries and detailed shot-level captions.☆107Updated 4 months ago
- ☆38Updated last month
- Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models☆127Updated last month
- VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection☆46Updated 2 weeks ago
- A lightweight flexible Video-MLLM developed by TencentQQ Multimedia Research Team.☆68Updated 3 months ago
- The official implementation of PAR: Parallelized Autoregressive Visual Generation. https://epiphqny.github.io/PAR-project/☆108Updated 3 weeks ago
- [ICLR2025] Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want☆64Updated this week
- A Framework for Decoupling and Assessing the Capabilities of VLMs☆40Updated 7 months ago
- This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams"☆153Updated last month
- [NeurIPS 2024] Efficient Multi-modal Models via Stage-wise Visual Context Compression☆51Updated 5 months ago
- T2VScore: Towards A Better Metric for Text-to-Video Generation☆78Updated 9 months ago
- ☆20Updated last month