andimarafioti / florence2-finetuning
Quick exploration into fine-tuning Florence-2
☆267Updated last month
Related projects
Alternatives and complementary repositories for florence2-finetuning
- Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model"☆306Updated 3 weeks ago
- ☆178Updated last week
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images☆318Updated last month
- Implementation of PALI3 from the paper "PALI-3 VISION LANGUAGE MODELS: SMALLER, FASTER, STRONGER"☆144Updated this week
- From scratch implementation of a vision language model in pure PyTorch☆161Updated 6 months ago
- LLaVA-Interactive-Demo☆352Updated 3 months ago
- VCoder: Versatile Vision Encoders for Multimodal Large Language Models, arXiv 2023 / CVPR 2024☆261Updated 6 months ago
- A family of highly capable yet efficient large multimodal models☆161Updated 2 months ago
- Data release for the ImageInWords (IIW) paper.☆200Updated 5 months ago
- An open-source implementation for fine-tuning the Qwen2-VL series by Alibaba Cloud.☆96Updated this week
- [CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts☆294Updated 3 months ago
- Famous Vision Language Models and Their Architectures☆401Updated 2 months ago
- Official repository for the paper PLLaVA☆581Updated 3 months ago
- HPT - Open Multimodal LLMs from HyperGAI☆312Updated 5 months ago
- LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills☆703Updated 9 months ago
- LLaVA-HR: High-Resolution Large Language-Vision Assistant☆213Updated 2 months ago
- When do we not need larger vision models?☆333Updated 2 months ago
- ☆258Updated this week
- PyTorch Implementation of "V* : Guided Visual Search as a Core Mechanism in Multimodal LLMs"☆523Updated 10 months ago
- Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding"☆258Updated 4 months ago
- Parameter-efficient finetuning script for Phi-3-vision, the strong multimodal language model by Microsoft.☆53Updated 4 months ago
- Testing and evaluating the capabilities of Vision-Language models (PaliGemma) in performing computer vision tasks such as object detectio…☆77Updated 5 months ago
- Long Context Transfer from Language to Vision☆328Updated 2 weeks ago
- Official Repository of paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding☆212Updated 2 months ago
- A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision,…☆170Updated 2 weeks ago
- API for Grounding DINO 1.5: IDEA Research's Most Capable Open-World Object Detection Model Series☆770Updated 3 months ago
- ☆145Updated 3 weeks ago
- Official code for Goldfish model for long video understanding and MiniGPT4-video for short video understanding☆553Updated last month
- The official repo for the paper "VeCLIP: Improving CLIP Training via Visual-enriched Captions"☆228Updated 2 months ago
- The code of the paper "NExT-Chat: An LMM for Chat, Detection and Segmentation".☆217Updated 9 months ago