lucasjinreal / MLLM_Factory
A dead-simple, modularized multi-modal training and fine-tuning framework. Compatible with LLaVA, Flamingo, QwenVL, MiniGemini, and similar model series.
☆17 Updated 6 months ago
Related projects
Alternatives and complementary repositories for MLLM_Factory
- ☆17 Updated last year
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM ☆36 Updated 5 months ago
- ☆19 Updated 11 months ago
- Multimodal Open-O1 (MO1) is designed to enhance the accuracy of inference models by utilizing a novel prompt-based approach. This tool wo… ☆26 Updated last month
- Video dataset dedicated to portrait-mode video recognition. ☆38 Updated 7 months ago
- minisora-DiT, a DiT reproduction based on XTuner from the open source community MiniSora ☆38 Updated 7 months ago
- ☆35 Updated 5 months ago
- ECCV2024_Parrot Captions Teach CLIP to Spot Text ☆60 Updated 2 months ago
- [PR 2024] A large Cross-Modal Video Retrieval Dataset with Reading Comprehension ☆22 Updated 10 months ago
- A Framework for Decoupling and Assessing the Capabilities of VLMs ☆38 Updated 4 months ago
- ☆21 Updated 10 months ago
- 🔥 Aurora Series: A more efficient multimodal large language model series for video. ☆47 Updated this week
- Chinese CLIP models with SOTA performance. ☆48 Updated last year
- The official repo for “TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding”. ☆35 Updated last month
- [CVPR2023] This is an official implementation of paper "DETRs with Hybrid Matching". ☆14 Updated 2 years ago
- IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model ☆25 Updated last month
- A multimodal large-scale model, which performs close to the closed-source Qwen-VL-PLUS on many datasets and significantly surpasses the p… ☆14 Updated 9 months ago
- Train InternViT-6B in MMSegmentation and MMDetection with DeepSpeed ☆57 Updated 3 weeks ago
- VimTS: A Unified Video and Image Text Spotter ☆72 Updated last week
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model ☆39 Updated 3 months ago
- Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models ☆51 Updated 3 weeks ago
- Official implementation of High Fidelity Scene Text Synthesis. ☆36 Updated this week
- Lion: Kindling Vision Intelligence within Large Language Models ☆52 Updated 9 months ago
- Repository for 23'MM accepted paper "Curriculum-Listener: Consistency- and Complementarity-Aware Audio-Enhanced Temporal Sentence Groundi… ☆42 Updated 10 months ago
- ☆74 Updated 8 months ago
- BTS: A Bi-lingual Benchmark for Text Segmentation in the Wild ☆27 Updated 7 months ago
- ☆131 Updated 11 months ago
- [NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect… ☆32 Updated 5 months ago
- ☆68 Updated last week