CircleRadon / Osprey
[CVPR2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning"
☆748 · Updated last month
Related projects:
- [CVPR2024 Highlight] GLEE: General Object Foundation Model for Images and Videos at Scale ☆1,027 · Updated last month
- [ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization ☆537 · Updated 3 months ago
- [ECCV 2024] The official code of paper "Open-Vocabulary SAM". ☆902 · Updated last month
- LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024) ☆688 · Updated last month
- Official PyTorch implementation of "EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM" ☆904 · Updated last month
- [CVPR 2024] Aligning and Prompting Everything All at Once for Universal Visual Perception ☆474 · Updated 4 months ago
- [ECCV 2024] Tokenize Anything via Prompting ☆502 · Updated 2 months ago
- Official repository for the paper PLLaVA ☆551 · Updated last month
- Project Page for "LISA: Reasoning Segmentation via Large Language Model" ☆1,754 · Updated 2 months ago
- [ACL 2024] GroundingGPT: Language-Enhanced Multi-modal Grounding Model ☆283 · Updated last month
- Official code for Goldfish model for long video understanding and MiniGPT4-video for short video understanding ☆535 · Updated last month
- ☆356 · Updated 4 months ago
- We introduce a novel approach for parameter generation, named neural network parameter diffusion (p-diff), which employs a standard laten… ☆822 · Updated 3 months ago
- An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions ☆1,220 · Updated last month
- A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing ☆282 · Updated 2 months ago
- [ECCV 2024] official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP" ☆591 · Updated last month
- [CVPR 2024] 🎬💭 chat with over 10K frames of video! ☆488 · Updated last week
- Multimodal Models in Real World ☆372 · Updated 2 months ago
- Accelerating the development of large multimodal models (LMMs) with lmms-eval ☆1,334 · Updated this week
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs ☆719 · Updated this week
- Controllable video and image Generation, SVD, Animate Anyone, ControlNet, ControlNeXt, LoRA ☆1,255 · Updated last week
- API for Grounding DINO 1.5: IDEA Research's Most Capable Open-World Object Detection Model Series ☆707 · Updated last month
- VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks ☆354 · Updated 2 months ago
- [ICLR 2024 Spotlight] DreamLLM: Synergistic Multimodal Comprehension and Creation ☆378 · Updated 5 months ago
- (TPAMI 2024) A Survey on Open Vocabulary Learning ☆794 · Updated 3 weeks ago
- Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models ☆295 · Updated this week
- ☆732 · Updated 2 months ago
- [CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want ☆639 · Updated last month
- 【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment ☆682 · Updated 5 months ago
- GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest ☆496 · Updated 3 months ago