mrseanryan / finetune_LLaVALinks
Fine tune LLaVA 1.5 - based on article by wandb
☆13Updated last year
Alternatives and similar repositories for finetune_LLaVA
Users that are interested in finetune_LLaVA are comparing it to the libraries listed below
Sorting:
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024☆60Updated 5 months ago
- [IJCV 2024]☆16Updated 9 months ago
- [NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context☆165Updated 10 months ago
- Official implementation of CVPR 2024 paper "Retrieval-Augmented Open-Vocabulary Object Detection".☆42Updated 11 months ago
- [CVPRW 2024] TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning. Official code for the 3rd place solution of t…☆42Updated 6 months ago
- Implementation of the model: "(MC-ViT)" from the paper: "Memory Consolidation Enables Long-Context Video Understanding"☆23Updated 3 weeks ago
- Benchmarking Panoptic Video Scene Graph Generation (PVSG), CVPR'23☆96Updated last year
- The official implementation of the paper "MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding". …☆57Updated 9 months ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment☆55Updated last month
- iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models☆19Updated 6 months ago
- ☆11Updated 9 months ago
- [ICCV 2025] Dynamic-VLM☆23Updated 8 months ago
- [CVPR 2025] Few-shot Recognition via Stage-Wise Retrieval-Augmented Finetuning☆23Updated 2 months ago
- [CVPR 2025] Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval☆20Updated 4 months ago
- (ACL-2025 main conference) Dolphin: Moving Towards Closed-loop Auto-research through Thinking, Practice, and Feedback☆26Updated last month
- ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration☆48Updated 7 months ago
- [EMNLP 2023] TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding☆51Updated last year
- ☆112Updated 4 months ago
- GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection (AAAI 2024)☆70Updated last year
- [CVPR 2025] Official PyTorch Code for "MMRL: Multi-Modal Representation Learning for Vision-Language Models" and its extension "MMRL++: P…☆66Updated last month
- Repo for paper "T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs"☆49Updated 5 months ago
- ☆41Updated 2 months ago
- [CVPR'24 Highlight] PyTorch Implementation of Object Recognition as Next Token Prediction☆180Updated 3 months ago
- [ICCV2025] Harnessing CLIP, DINO and SAM for Open Vocabulary Segmentation☆77Updated last month
- [CVPR 2025] LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding☆58Updated last month
- A detection/segmentation dataset with labels characterized by intricate and flexible expressions. "Described Object Detection: Liberating…☆130Updated last year
- ☆30Updated 7 months ago
- [ICCV'25] "Harnessing Uncertainty-aware Bounding Boxes for Unsupervised 3D Object Detection".☆20Updated 10 months ago
- Official code of the paper "VideoMolmo: Spatio-Temporal Grounding meets Pointing"☆48Updated last month
- Scaffold Prompting to promote LMMs☆43Updated 8 months ago