mrseanryan / finetune_LLaVA
Fine-tune LLaVA 1.5, based on an article by wandb
☆12 Updated last year
Alternatives and similar repositories for finetune_LLaVA
Users who are interested in finetune_LLaVA are comparing it to the libraries listed below.
- [CVPRW 2024] TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning. Official code for the 3rd place solution of t… ☆38 Updated 3 months ago
- Implementation of the model "(MC-ViT)" from the paper "Memory Consolidation Enables Long-Context Video Understanding" ☆20 Updated 2 months ago
- ☆36 Updated this week
- Implementation of "the first large-scale multimodal mixture of experts models." from the paper "Multimodal Contrastive Learning with… ☆28 Updated 2 months ago
- arxiv-daily ☆80 Updated 4 years ago
- ☆47 Updated 11 months ago
- A bank of interview questions for data-related roles ☆10 Updated last year
- Official implementation of CVPR 2024 paper "Retrieval-Augmented Open-Vocabulary Object Detection". ☆41 Updated 8 months ago
- Code for CVPR2025 "MMRL: Multi-Modal Representation Learning for Vision-Language Models" and its extension "MMRL++: Parameter-Efficient a… ☆42 Updated 2 weeks ago
- Evaluate the performance of computer vision models and prompts for zero-shot models (Grounding DINO, CLIP, BLIP, DINOv2, ImageBind, model… ☆36 Updated last year
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024 ☆59 Updated 3 months ago
- The official implementation of the paper "MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding". … ☆53 Updated 7 months ago
- [ICCV2023] Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer ☆36 Updated last year
- Taming Self-Training for Open-Vocabulary Object Detection, CVPR 2024 ☆21 Updated last year
- [CVPR 2025] Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval ☆15 Updated 2 months ago
- Using image captions with LLM for zero-shot VQA ☆18 Updated last year
- CLIP-Mamba: CLIP Pretrained Mamba Models with OOD and Hessian Evaluation ☆72 Updated 9 months ago
- [CVPR 2025] LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding ☆43 Updated 3 months ago
- ☆19 Updated last year
- [ECAI 2023] MonoSKD: General Distillation Framework for Monocular 3D Object Detection via Spearman Correlation Coefficient ☆30 Updated last year
- GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection (AAAI 2024) ☆67 Updated last year
- (ECCV 2024) Can OOD Object Detectors Learn from Foundation Models? ☆25 Updated 6 months ago
- LP-OVOD: Open-Vocabulary Object Detection by Linear Probing (WACV 2024) ☆24 Updated 10 months ago
- Official implementation of the CVPR paper Open-TransMind: A New Baseline and Benchmark for 1st Foundation Model Challenge of Intelligent … ☆26 Updated 2 years ago
- ☆39 Updated last year
- [NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context ☆156 Updated 8 months ago
- Official PyTorch implementation of “MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation” ☆16 Updated 6 months ago
- 🚀【AAAI 2025】Cross-View Referring Multi-Object Tracking ☆55 Updated last week
- iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models ☆19 Updated 4 months ago
- ☆12 Updated 4 months ago