alibaba / AICITY2024_Track2_AliOpenTrek_CityLLaVA
☆33 · Updated 4 months ago
Related projects
Alternatives and complementary repositories for AICITY2024_Track2_AliOpenTrek_CityLLaVA
- A systematic survey of multi-modal and multi-task visual understanding foundation models for driving scenarios ☆47 · Updated 5 months ago
- DynRefer: Delving into Region-level Multi-modality Tasks via Dynamic Resolution ☆39 · Updated last week
- [NeurIPS 24] MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks ☆31 · Updated this week
- Code for the ICML 2023 paper "Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation" ☆35 · Updated last year
- Open-vocabulary Semantic Segmentation ☆34 · Updated 9 months ago
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM ☆36 · Updated 5 months ago
- ✨✨ MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? ☆78 · Updated last week
- ☆29 · Updated 3 months ago
- Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning ☆66 · Updated 5 months ago
- [CVPR2023] Official implementation of the paper "DETRs with Hybrid Matching" ☆27 · Updated 2 years ago
- ☆45 · Updated 2 weeks ago
- Official repository for NuScenes-MQA, accepted to the LLVM-AD Workshop at WACV 2024 ☆24 · Updated 11 months ago
- Code for the paper "Towards Open-Ended Visual Recognition with Large Language Model" ☆90 · Updated 4 months ago
- Making LLaVA Tiny via MoE-Knowledge Distillation ☆63 · Updated 3 weeks ago
- ☆19 · Updated 11 months ago
- Open-Vocabulary Panoptic Segmentation ☆18 · Updated 2 months ago
- ☆78 · Updated 9 months ago
- ☆16 · Updated 2 years ago
- Official implementation of the CVPR paper "Open-TransMind: A New Baseline and Benchmark for 1st Foundation Model Challenge of Intelligent …" ☆25 · Updated last year
- A collection of visual instruction tuning datasets ☆75 · Updated 8 months ago
- A curated list of papers on Video LLMs ☆19 · Updated 4 months ago
- Distilling Large Vision-Language Model with Out-of-Distribution Generalizability (ICCV 2023) ☆54 · Updated 7 months ago
- ☆19 · Updated 6 months ago
- Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs ☆77 · Updated 5 months ago
- Official PyTorch implementation of CODA-LM (https://arxiv.org/abs/2404.10595) ☆68 · Updated 2 weeks ago
- ☆15 · Updated 5 months ago
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want ☆61 · Updated last month
- Code for "Open3DTrack: Towards Open-Vocabulary 3D Multi-Object Tracking" ☆20 · Updated last month
- [ECCV 2024] Official code for "Dolphins: Multimodal Language Model for Driving" ☆47 · Updated 4 months ago
- ☆57 · Updated last year