bytedance / Valley
Valley is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, images, and video data.
☆228 Updated last month
Alternatives and similar repositories for Valley:
Users who are interested in Valley are comparing it to the repositories listed below:
- ☆173 Updated 2 months ago
- Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines ☆120 Updated 5 months ago
- MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval ☆142 Updated this week
- MuLan: Adapting Multilingual Diffusion Models for 110+ Languages (adds multilingual support to any diffusion model without additional training) ☆135 Updated 2 months ago
- ☆78 Updated 11 months ago
- 🔥🔥First-ever hour scale video understanding models ☆281 Updated last week
- The official code for NeurIPS 2024 paper: Harmonizing Visual Text Comprehension and Generation ☆118 Updated 5 months ago
- Collect every awesome work about r1! ☆334 Updated 2 weeks ago
- [ICLR 2025 Spotlight] OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text ☆338 Updated 3 weeks ago
- Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions (NeurIPS 2024) ☆161 Updated 8 months ago
- Repo for Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent ☆297 Updated 3 weeks ago
- ☆221 Updated last month
- 💡 VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning ☆166 Updated last week
- Baichuan-Omni: Towards Capable Open-source Omni-modal LLM 🌊 ☆266 Updated 2 months ago
- ☆145 Updated 2 months ago
- Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models ☆50 Updated 2 weeks ago
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer ☆221 Updated last year
- This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams" ☆177 Updated 3 months ago
- Research Code for Multimodal-Cognition Team in Ant Group ☆141 Updated 9 months ago
- Multimodal Models in Real World ☆492 Updated last month
- ☆175 Updated 9 months ago
- mllm-npu: training multimodal large language models on Ascend NPUs ☆91 Updated 7 months ago
- ☆368 Updated last month
- GLM Series Edge Models ☆134 Updated last month
- MMR1: Advancing the Frontiers of Multimodal Reasoning ☆153 Updated last month
- Official GPU implementation of the paper "PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance" ☆127 Updated 4 months ago
- ☆349 Updated 2 months ago
- Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources ☆155 Updated 2 weeks ago
- Awesome-RAG-VIsion: a curated list of advanced retrieval augmented generation (RAG) for Computer Vision ☆123 Updated last week
- Official code for "Fox: Focus Anywhere for Fine-grained Multi-page Document Understanding" ☆142 Updated 10 months ago