bytedance / Valley
Valley is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, images, and video data.
☆211 Updated last week
Alternatives and similar repositories for Valley:
Users interested in Valley are comparing it to the repositories listed below.
- ☆170 Updated last week
- MuLan: Adapting Multilingual Diffusion Models for 110+ Languages (adds multilingual support to any diffusion model without additional training) ☆131 Updated 3 weeks ago
- Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines ☆114 Updated 3 months ago
- 🔥🔥 First-ever hour-scale video understanding models ☆232 Updated last month
- ☆78 Updated 9 months ago
- Repo for Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent ☆221 Updated 2 weeks ago
- Multimodal Models in Real World ☆435 Updated 3 months ago
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text ☆306 Updated 3 months ago
- ☆206 Updated this week
- This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams" ☆159 Updated last month
- Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions (NeurIPS 2024) ☆155 Updated 6 months ago
- Research Code for Multimodal-Cognition Team in Ant Group ☆136 Updated 7 months ago
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer ☆210 Updated 10 months ago
- ☆172 Updated 7 months ago
- ☆348 Updated 3 months ago
- SkyScript-100M: 1,000,000,000 Pairs of Scripts and Shooting Scripts for Short Drama: https://arxiv.org/abs/2408.09333v2 ☆111 Updated 3 months ago
- Implementation for the paper "ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems". ☆134 Updated 3 weeks ago
- The code of the paper "NExT-Chat: An LMM for Chat, Detection and Segmentation". ☆231 Updated last year
- FlexRAG: A RAG Framework for Information Retrieval and Generation. ☆120 Updated this week
- ☆104 Updated last year
- Baichuan-Omni: Towards Capable Open-source Omni-modal LLM 🌊 ☆262 Updated 3 weeks ago
- Tarsier -- a family of large-scale video-language models, which is designed to generate high-quality video descriptions, together with g… ☆282 Updated this week
- DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought ☆205 Updated last month
- Long Context Transfer from Language to Vision ☆360 Updated 2 months ago
- ☆316 Updated last week
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models ☆202 Updated 5 months ago
- The official code for NeurIPS 2024 paper: Harmonizing Visual Text Comprehension and Generation ☆111 Updated 3 months ago
- ☆308 Updated 2 months ago