JiuhaiChen / Florence-VL

☆216

Alternatives and similar repositories for Florence-VL:

Users who are interested in Florence-VL are comparing it to the repositories listed below.

Oryx-mllm / Oryx
[ICLR 2025] MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
☆289 Updated this week
microsoft / VidTok
A family of versatile and state-of-the-art video tokenizers.
☆337 Updated last month
DAMO-NLP-SG / VideoRefer
The code for "VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM"
☆145 Updated this week
tencent-ailab / Leopard
The repository for the paper titled "Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks"
☆152 Updated last month
SunzeY / X-Prompt
Official implementation of X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
☆154 Updated 2 months ago
haoosz / BiGR
[ICLR 2025] BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities
☆138 Updated 3 weeks ago
sjtuplayer / SaRA
SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation
☆107 Updated 4 months ago
ShihaoZhaoZSH / LaVi-Bridge
[ECCV 2024] Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation
☆290 Updated 7 months ago
ZichengDuan / EZIGen
Official code base for paper EZIGen: Enhancing zero-shot personalized image generation with precise subject encoding and decoupled guidan…
☆102 Updated last week
FoundationVision / OmniTokenizer
[NeurIPS 2024] OmniTokenizer: one model and one weight for image-video joint tokenization.
☆279 Updated 7 months ago
hqhQAQ / MIP-Adapter
[AAAI 2025] Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation
☆118 Updated 2 months ago
FoundationVision / Groma
[ECCV 2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
☆541 Updated 8 months ago
dvlab-research / Lyra
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
☆276 Updated last month
linzhiqiu / t2v_metrics
Evaluating text-to-image/video/3D models with VQAScore
☆253 Updated last week
dle666 / R-CoT
Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models
☆170 Updated 3 months ago
zhuyiche / llava-phi
☆372 Updated 2 months ago
zhaohengyuan1 / Genixer
(ECCV 2024) Empowering Multimodal Large Language Model as a Powerful Data Generator
☆104 Updated 4 months ago
showlab / LOVA3
(NeurIPS 2024) Learning to Visual Question Answering, Asking and Assessment
☆73 Updated this week
Mark12Ding / SAM2Long
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
☆453 Updated 2 months ago
ZrrSkywalker / MathVerse
[ECCV 2024] Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
☆153 Updated 4 months ago
guoqincode / Focus-on-Your-Instruction
[CVPR 2024] Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation
☆109 Updated 11 months ago
dongyh20 / Chain-of-Spot
Chain-of-Spot: Interactive Reasoning Improves Large Vision-language Models
☆90 Updated 10 months ago
OPPOMKLab / u-LLaVA
u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model
☆129 Updated 7 months ago
DCDmllm / WorldGPT
WorldGPT: Empowering LLM as Multimodal World Model
☆114 Updated 6 months ago
mlpc-ucsd / BLIVA
(AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
☆253 Updated 10 months ago
guoqincode / DiT-Visualization
Visualization of DiT self attention features
☆167 Updated 6 months ago
ZrrSkywalker / MAVIS
[ICLR 2025] Mathematical Visual Instruction Tuning for Multi-modal Large Language Models
☆122 Updated 2 months ago
SunzeY / Bootstrap3D
Official implementation of Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data
☆83 Updated 3 months ago