mbzuai-oryx / groundingLMM
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
☆912 · Updated last month
Alternatives and similar repositories for groundingLMM
Users that are interested in groundingLMM are comparing it to the libraries listed below
- [CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want ☆841 · Updated last month
- VisionLLM Series ☆1,105 · Updated 6 months ago
- [ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition & Understanding and General Relation Comprehension of … ☆495 · Updated last year
- GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest ☆544 · Updated 3 months ago
- Recent LLM-based CV and related works. Welcome to comment/contribute! ☆872 · Updated 6 months ago
- When do we not need larger vision models? ☆406 · Updated 7 months ago
- [CVPR 2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts ☆331 · Updated last year
- [ECCV 2024] Official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP" ☆850 · Updated last year
- [CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding ☆644 · Updated 7 months ago
- LLM2CLIP makes SOTA pretrained CLIP models even more SOTA. ☆541 · Updated 2 months ago
- ☆795 · Updated last year
- LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024) ☆837 · Updated last year
- [ICLR 2024 🔥] Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment ☆825 · Updated last year
- PyTorch Implementation of "V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs" ☆673 · Updated last year
- [NeurIPS 2024] A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, and Editing ☆567 · Updated 10 months ago
- LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills ☆759 · Updated last year
- Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing imag… ☆535 · Updated last year
- [CVPR 2024] Official implementation of the paper "Visual In-context Learning" ☆497 · Updated last year
- [Pattern Recognition 2025] CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks ☆440 · Updated 6 months ago
- Grounding DINO 1.5: IDEA Research's Most Capable Open-World Object Detection Model Series ☆1,029 · Updated 7 months ago
- LLaVA-UHD v2: An MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer ☆385 · Updated 4 months ago
- [ECCV 2024] Tokenize Anything via Prompting ☆594 · Updated 9 months ago
- Project page for "Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement" ☆507 · Updated last month
- [ICCV 2023] Official implementation of the paper "A Simple Framework for Open-Vocabulary Segmentation and Detection" ☆730 · Updated last year
- A Framework of Small-scale Large Multimodal Models ☆897 · Updated 4 months ago
- LLaVA-Interactive-Demo ☆379 · Updated last year
- ☆350 · Updated last year
- ☆529 · Updated 10 months ago
- A family of lightweight multimodal models. ☆1,041 · Updated 9 months ago
- [CVPR 2024] PixelLM is an effective and efficient LMM for pixel-level reasoning and understanding. ☆236 · Updated 7 months ago