ANYANTUDRE / Florence-2-Vision-Language-Model
Florence-2 is a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks.
☆33Updated 8 months ago
Alternatives and similar repositories for Florence-2-Vision-Language-Model:
Users who are interested in Florence-2-Vision-Language-Model are comparing it to the libraries listed below:
- LLaVA combined with the Magvit image tokenizer, training an MLLM without a vision encoder. Unifies image understanding and generation.☆35Updated 9 months ago
- ☆49Updated 3 weeks ago
- Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines☆117Updated 4 months ago
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024☆57Updated last month
- ☆31Updated 2 months ago
- imagetokenizer is a Python package that helps you encode visuals and generate visual token IDs from a codebook; supports both image and video…☆30Updated 9 months ago
- Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.☆18Updated 2 years ago
- ☆33Updated last month
- Our 2nd-gen LMM☆33Updated 10 months ago
- Multimodal Open-O1 (MO1) is designed to enhance the accuracy of inference models by utilizing a novel prompt-based approach. This tool wo…☆29Updated 6 months ago
- A Simple Framework of Small-scale Large Multimodal Models for Video Understanding Based on TinyLLaVA_Factory.☆46Updated this week
- ☆49Updated 3 months ago
- Referring to any person or object given a natural language description. Codebase for RexSeek and the HumanRef benchmark☆84Updated this week
- minisora-DiT, a DiT reproduction based on XTuner from the open source community MiniSora☆40Updated last year
- Implementation of ViTAR: Vision Transformer with Any Resolution in PyTorch☆32Updated 4 months ago
- EfficientViT is a new family of vision models for efficient high-resolution vision.☆24Updated last year
- Official PyTorch implementation of Self-emerging Token Labeling☆32Updated 11 months ago
- Train InternViT-6B in MMSegmentation and MMDetection with DeepSpeed☆82Updated 5 months ago
- Precision Search through Multi-Style Inputs☆65Updated 8 months ago
- ☆33Updated last year
- MuLan: Adapting Multilingual Diffusion Models for 110+ Languages (adds multilingual capability to any diffusion model without additional training)☆133Updated 2 months ago
- GIFT: Generative Interpretable Fine-Tuning☆20Updated 5 months ago
- ECCV2024_Parrot Captions Teach CLIP to Spot Text☆64Updated 6 months ago
- arXiv 23 "Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs"☆14Updated 3 months ago
- Florence-2☆60Updated last month
- ☆12Updated 2 months ago
- Use Florence 2 to auto-label data for use in training fine-tuned object detection models.☆62Updated 7 months ago
- A Gradio component that can be used to annotate images with bounding boxes.☆45Updated 3 weeks ago
- [PR 2024] A large Cross-Modal Video Retrieval Dataset with Reading Comprehension☆25Updated last year
- EfficientSAM + YOLO World base model for use with Autodistill.☆10Updated last year