microsoft / SoM
[arXiv 2023] Set-of-Mark Prompting for GPT-4V and LMMs
☆1,373 Updated 8 months ago
Alternatives and similar repositories for SoM:
Users who are interested in SoM are comparing it to the repositories listed below.
- AI agent using GPT-4V(ision) capable of using a mouse/keyboard to interact with web UI ☆1,036 Updated 4 months ago
- LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills ☆739 Updated last year
- LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024) ☆805 Updated 9 months ago
- [CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses tha… ☆867 Updated 5 months ago
- VisionLLM Series ☆1,054 Updated 2 months ago
- [ICML'24] SeeAct is a system for generalist web agents that autonomously carry out tasks on any given website, with a focus on large mult… ☆742 Updated 3 months ago
- 【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection ☆3,238 Updated 5 months ago
- Grounding DINO 1.5: IDEA Research's Most Capable Open-World Object Detection Model Series ☆941 Updated 3 months ago
- Official repo for MM-REACT ☆949 Updated last year
- [NeurIPS'23 Spotlight] "Mind2Web: Towards a Generalist Agent for the Web" ☆819 Updated last month
- [ECCV 2024] Official implementation of the paper "Semantic-SAM: Segment and Recognize Anything at Any Granularity" ☆2,604 Updated 9 months ago
- ☆778 Updated 9 months ago
- PyTorch Implementation of "V* : Guided Visual Search as a Core Mechanism in Multimodal LLMs" ☆602 Updated last year
- LLaVA-Interactive-Demo ☆369 Updated 9 months ago
- Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents" ☆981 Updated 3 months ago
- Emu Series: Generative Multimodal Models from BAAI ☆1,716 Updated 7 months ago
- [Image 2 Text Para] Transform Image into Unique Paragraph with ChatGPT, BLIP2, OFA, GRIT, Segment Anything, ControlNet. ☆807 Updated 2 years ago
- Project Page for "LISA: Reasoning Segmentation via Large Language Model" ☆2,185 Updated 2 months ago
- ICLR2024 Spotlight: curation/training code, metadata, distribution and pre-trained models for MetaCLIP; CVPR 2024: MoDE: CLIP Data Expert… ☆1,428 Updated last month
- Compose multimodal datasets 🎹 ☆360 Updated 2 weeks ago
- [ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the cap… ☆1,352 Updated last month
- Strong and Open Vision Language Assistant for Mobile Devices ☆1,206 Updated last year
- BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs ☆508 Updated last year
- Mixture-of-Experts for Large Vision-Language Models ☆2,153 Updated 5 months ago
- Caption-Anything is a versatile tool combining image segmentation, visual captioning, and ChatGPT, generating tailored captions with dive… ☆1,738 Updated last year
- Recent LLM-based CV and related works. Welcome to comment/contribute! ☆861 Updated last month
- ☆773 Updated 9 months ago
- An Open-source Toolkit for LLM Development ☆2,776 Updated 3 months ago
- A family of lightweight multimodal models. ☆1,015 Updated 5 months ago
- Grounded SAM 2: Ground and Track Anything in Videos with Grounding DINO, Florence-2 and SAM 2 ☆2,072 Updated 2 weeks ago