OpenGVLab / InternLMM
☆17 · Updated last year
Related projects
Alternatives and complementary repositories for InternLMM
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM ☆37 · Updated 6 months ago
- 🔥 Aurora Series: A more efficient multimodal large language model series for video. ☆47 · Updated last week
- [NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect… ☆32 · Updated 5 months ago
- ☆85 · Updated last year
- ☆19 · Updated 11 months ago
- Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning ☆66 · Updated 5 months ago
- Code for the ICML 2023 paper "Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation" ☆35 · Updated last year
- Video dataset dedicated to portrait-mode video recognition. ☆38 · Updated 7 months ago
- Repository for the paper "Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models" ☆36 · Updated last year
- Making LLaVA Tiny via MoE-Knowledge Distillation ☆63 · Updated last month
- This repo contains the code for our paper "Towards Open-Ended Visual Recognition with Large Language Model" ☆90 · Updated 4 months ago
- [PR 2024] A large Cross-Modal Video Retrieval Dataset with Reading Comprehension ☆22 · Updated 10 months ago
- The official implementation of the paper "MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity". Th… ☆33 · Updated 2 weeks ago
- [ECCV 2024] Parrot Captions Teach CLIP to Spot Text ☆61 · Updated 2 months ago
- A Dead Simple and Modularized Multi-Modal Training and Finetune Framework. Compatible with any LLaVA/Flamingo/QwenVL/MiniGemini etc series … ☆17 · Updated 7 months ago
- ☆21 · Updated 3 months ago
- This is the official repo for the upcoming work: ByteVideoLLM ☆15 · Updated 3 weeks ago
- Code and models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model ☆39 · Updated last year
- Multimodal Open-O1 (MO1) is designed to enhance the accuracy of inference models by utilizing a novel prompt-based approach. This tool wo… ☆26 · Updated 2 months ago
- The official repo for "TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding". ☆35 · Updated 2 months ago
- The official GitHub page for "What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Ins… ☆18 · Updated last year
- ☆131 · Updated 11 months ago
- VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs) ☆22 · Updated 5 months ago
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want ☆61 · Updated last month
- DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception ☆120 · Updated last month
- ☆45 · Updated last year
- Official PyTorch implementation of the paper "DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training". ☆53 · Updated last year
- An LMM that is a strict superset of its embedded LLM ☆31 · Updated 2 weeks ago
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation ☆84 · Updated 2 months ago