mbzuai-oryx / groundingLMM

[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.

☆930

Alternatives and similar repositories for groundingLMM

Users who are interested in groundingLMM are comparing it to the repositories listed below.

OpenGVLab / VisionLLM
VisionLLM Series
☆1,127 Updated 9 months ago
SunzeY / AlphaCLIP
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
☆855 Updated 4 months ago
OpenGVLab / all-seeing
[ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition&Understanding and General Relation Comprehension of …
☆501 Updated last year
jshilong / GPT4RoI
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
☆550 Updated 5 months ago
rese1f / MovieChat
[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
☆666 Updated 10 months ago
DirtyHarryLYL / LLM-in-Vision
Recent LLM-based CV and related works. Welcome to comment/contribute!
☆873 Updated 8 months ago
beichenzbc / Long-CLIP
[ECCV 2024] official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP"
☆873 Updated last year
microsoft / LLM2CLIP
LLM2CLIP makes SOTA pretrained CLIP model more SOTA ever.
☆566 Updated 4 months ago
bfshi / scaling_on_scales
When do we not need larger vision models?
☆412 Updated 9 months ago
WisconsinAIVision / ViP-LLaVA
[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
☆331 Updated last year
dvlab-research / LLaMA-VID
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024)
☆849 Updated last year
LLaVA-VL / LLaVA-Plus-Codebase
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
☆762 Updated last year
shikras / shikra
☆799 Updated last year
PKU-YuanGroup / LanguageBind
【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
☆852 Updated last year
penghao-wu / vstar
PyTorch Implementation of "V* : Guided Visual Search as a Core Mechanism in Multimodal LLMs"
☆681 Updated last year
dvlab-research / Seg-Zero
Project Page For "Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement"
☆563 Updated 4 months ago
thunlp / LLaVA-UHD
LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs
☆397 Updated this week
IDEA-Research / Grounding-DINO-1.5-API
Grounding DINO 1.5: IDEA Research's Most Capable Open-World Object Detection Model Series
☆1,061 Updated 10 months ago
awaisrauf / Awesome-CV-Foundational-Models
☆538 Updated last year
SkyworkAI / Vitron
NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
☆577 Updated last year
baaivision / tokenize-anything
[ECCV 2024] Tokenize Anything via Prompting
☆599 Updated 11 months ago
OpenGVLab / Multi-Modality-Arena
Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing imag…
☆549 Updated last year
tsb0601 / MMVP
☆355 Updated last year
xmed-lab / CLIP_Surgery
[Pattern Recognition 25] CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks
☆447 Updated 8 months ago
RenShuhuai-Andy / TimeChat
[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
☆400 Updated 6 months ago
UX-Decoder / DINOv
[CVPR 2024] Official implementation of the paper "Visual In-context Learning"
☆516 Updated last year
TinyLLaVA / TinyLLaVA_Factory
A Framework of Small-scale Large Multimodal Models
☆924 Updated 7 months ago
LLaVA-VL / LLaVA-Interactive-Demo
LLaVA-Interactive-Demo
☆379 Updated last year
NVlabs / RADIO
Official repository for "AM-RADIO: Reduce All Domains Into One"
☆1,403 Updated this week
mbzuai-oryx / VideoGPT-plus
Official Repository of paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
☆291 Updated 3 months ago