mrseanryan / finetune_LLaVA
Fine tune LLaVA 1.5 - based on article by wandb
☆13 Updated last year
Alternatives and similar repositories for finetune_LLaVA
Users who are interested in finetune_LLaVA are comparing it to the repositories listed below.
- [CVPRW 2024] TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning. Official code for the 3rd place solution of t… ☆44 Updated 7 months ago
- ☆114 Updated 5 months ago
- [IJCV 2024] ☆16 Updated 10 months ago
- Official implementation of CVPR 2024 paper "Retrieval-Augmented Open-Vocabulary Object Detection". ☆43 Updated last year
- [NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context ☆166 Updated 11 months ago
- [EMNLP-2025 Oral] ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration ☆53 Updated 3 weeks ago
- The official repo of our work "Pensieve: Retrospect-then-Compare mitigates Visual Hallucination" ☆16 Updated last year
- Official PyTorch Implementation for "Stereo3DMOT: Stereo Vision Based 3D Multi-Object Tracking with Multimodal ReID, PRCV2023" ☆22 Updated last year
- Official Repo for PosSAM: Panoptic Open-vocabulary Segment Anything ☆67 Updated last year
- Official PyTorch implementation Source code for LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation, accepted at … ☆109 Updated last year
- Benchmarking Panoptic Video Scene Graph Generation (PVSG), CVPR'23 ☆96 Updated last year
- ☆53 Updated last year
- Scaffold Prompting to promote LMMs ☆44 Updated 9 months ago
- Implementation of the model: "(MC-ViT)" from the paper: "Memory Consolidation Enables Long-Context Video Understanding" ☆23 Updated 2 weeks ago
- [CVPR 2025] LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding ☆67 Updated 2 months ago
- Awesome paper for multi-modal llm with grounding ability ☆19 Updated last year
- Detectron2 Toolbox and Benchmark for V3Det ☆18 Updated last year
- The source code for "UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All" ☆46 Updated last year
- [CVPR 2025] Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval ☆23 Updated last week
- The official implementation of the paper "MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding". … ☆58 Updated 10 months ago
- GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection (AAAI 2024) ☆71 Updated last year
- Repo for paper "T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs" ☆49 Updated 2 weeks ago