xmu-xiaoma666 / Multimodal-Open-O1
Multimodal Open-O1 (MO1) aims to improve the reasoning accuracy of models through a novel prompt-based approach. The tool runs entirely on local hardware and constructs reasoning chains akin to those used by OpenAI-o1, but with local processing power.
☆26 · Updated last month
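The repository description above can be illustrated with a minimal sketch of prompt-based reasoning-chain generation. This is not MO1's actual implementation; `local_generate` is a hypothetical stand-in for a call to a locally hosted (multimodal) language model, and the `FINAL:` convention and step budget are assumptions for illustration.

```python
# Minimal sketch of an o1-style reasoning-chain loop driven by prompts.
# `local_generate` is a placeholder: a real setup would call a local model
# backend here (e.g. a transformers pipeline or an Ollama server).

STEP_PROMPT = (
    "You are solving a problem step by step.\n"
    "Question: {question}\n"
    "Previous steps:\n{steps}\n"
    "Produce the next reasoning step, or 'FINAL: <answer>' if done."
)

def local_generate(prompt: str) -> str:
    # Placeholder response; replace with a real local-model call.
    return "FINAL: 42"

def reasoning_chain(question: str, max_steps: int = 8) -> tuple[list[str], str]:
    """Iteratively prompt the model for one reasoning step at a time,
    accumulating the chain until the model emits a final answer."""
    steps: list[str] = []
    for _ in range(max_steps):
        prompt = STEP_PROMPT.format(
            question=question,
            steps="\n".join(steps) or "(none yet)",
        )
        out = local_generate(prompt).strip()
        if out.startswith("FINAL:"):
            return steps, out[len("FINAL:"):].strip()
        steps.append(out)
    return steps, "(no final answer within step budget)"
```

Because each iteration feeds the accumulated steps back into the prompt, the model can revise its trajectory at every step, which is the core idea behind this family of inference-time reasoning tools.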
Related projects
Alternatives and complementary repositories for Multimodal-Open-O1
- Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines ☆46 · Updated this week
- Official implementation of MIA-DPO ☆32 · Updated last week
- The official repo for the upcoming work ByteVideoLLM ☆12 · Updated last week
- Video dataset dedicated to portrait-mode video recognition. ☆35 · Updated 7 months ago
- A Framework for Decoupling and Assessing the Capabilities of VLMs ☆38 · Updated 4 months ago
- ☆67 · Updated 6 months ago
- [NeurIPS 2024] TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration ☆17 · Updated 3 weeks ago
- 🔥 Aurora Series: A more efficient multimodal large language model series for video. ☆41 · Updated 2 weeks ago
- Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges ☆48 · Updated last month
- ✨✨ MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? ☆77 · Updated last month
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want ☆60 · Updated 3 weeks ago
- minisora-DiT, a DiT reproduction based on XTuner from the open-source community MiniSora ☆37 · Updated 7 months ago
- Making LLaVA Tiny via MoE-Knowledge Distillation ☆55 · Updated 2 weeks ago
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation ☆84 · Updated 2 months ago
- Explore the Limits of Omni-modal Pretraining at Scale ☆89 · Updated 2 months ago
- ☆35 · Updated 4 months ago
- ✨✨ Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models ☆137 · Updated this week
- The official code of the paper "PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction" ☆42 · Updated last week
- [NeurIPS 2024] Efficient Multi-modal Models via Stage-wise Visual Context Compression ☆38 · Updated 3 months ago
- ☆103 · Updated 3 months ago
- An LMM that is a strict superset of its embedded LLM ☆31 · Updated last week
- [NeurIPS'24] Official PyTorch Implementation of Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment ☆48 · Updated last month
- [NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of… ☆98 · Updated 3 weeks ago
- DynRefer: Delving into Region-level Multi-modality Tasks via Dynamic Resolution ☆39 · Updated this week
- The official repo for “TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding” ☆32 · Updated last month
- ☆35 · Updated last month
- Official repo for StableLLAVA ☆90 · Updated 10 months ago
- ☆30 · Updated 5 months ago
- This repository provides an improved LLamaGen model, fine-tuned on 500,000 high-quality images, each accompanied by an over 300 token prompt… ☆21 · Updated 3 weeks ago
- VoCo-LLaMA: The official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models" ☆81 · Updated 4 months ago