LiWentomng / gradio-osprey-demo
Gradio demo used in our Osprey: Pixel Understanding with Visual Instruction Tuning.
☆15 Updated last year
Alternatives and similar repositories for gradio-osprey-demo:
Users who are interested in gradio-osprey-demo are comparing it to the repositories listed below.
- ☆26 Updated 9 months ago
- ☆18 Updated last year
- ECCV2024_Parrot Captions Teach CLIP to Spot Text ☆66 Updated 7 months ago
- Multimodal Open-O1 (MO1) is designed to enhance the accuracy of inference models by utilizing a novel prompt-based approach. This tool wo… ☆29 Updated 7 months ago
- Detectron2 is a platform for object detection, segmentation and other visual recognition tasks. ☆18 Updated 2 years ago
- minisora-DiT, a DiT reproduction based on XTuner from the open source community MiniSora ☆41 Updated last year
- [ICLR2025] Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want ☆71 Updated 3 months ago
- This repo contains the code for our paper Towards Open-Ended Visual Recognition with Large Language Model ☆95 Updated 9 months ago
- ☆58 Updated last year
- Precision Search through Multi-Style Inputs ☆68 Updated this week
- ☆43 Updated 4 months ago
- [CVPR2025] SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories ☆37 Updated last month
- The official repo for “TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding”. ☆39 Updated 7 months ago
- OpenMMLab Detection Toolbox and Benchmark for V3Det ☆15 Updated last year
- [CVPR2025] Official implementation of High Fidelity Scene Text Synthesis. ☆60 Updated last month
- Implementation code of the paper MIGE: A Unified Framework for Multimodal Instruction-Based Image Generation and Editing ☆55 Updated last month
- Train InternViT-6B in MMSegmentation and MMDetection with DeepSpeed ☆89 Updated 6 months ago
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model ☆42 Updated 8 months ago
- ☆34 Updated last year
- 🔥 [CVPR 2024] Official implementation of "See, Say, and Segment: Teaching LMMs to Overcome False Premises (SESAME)" ☆37 Updated 10 months ago
- This is the official repo for ByteVideoLLM/Dynamic-VLM ☆20 Updated 4 months ago
- Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines ☆123 Updated 5 months ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆50 Updated 3 months ago
- [CVPR 2025] DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution ☆46 Updated last month
- A Dead Simple and Modularized Multi-Modal Training and Finetune Framework. Compatible to any LLaVA/Flamingo/QwenVL/MiniGemini etc series … ☆19 Updated last year
- ☆11 Updated 3 months ago
- ☆40 Updated 3 months ago
- A Simple Framework of Small-scale LMMs for Video Understanding ☆56 Updated last week
- ☆91 Updated 9 months ago
- Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs ☆90 Updated 3 months ago