JiuhaiChen / Florence-VL
☆206Updated last month
Alternatives and similar repositories for Florence-VL:
Users who are interested in Florence-VL are comparing it to the repositories listed below.
- MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution☆272Updated 3 weeks ago
- The repository for the paper titled "Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks"☆144Updated 3 weeks ago
- A family of versatile and state-of-the-art video tokenizers.☆300Updated this week
- SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation☆103Updated 3 months ago
- The code for "VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM"☆133Updated last week
- Official implementation of X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models☆142Updated last month
- The official code for "BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities"☆168Updated 2 months ago
- Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition☆263Updated last week
- ☆369Updated last month
- ☆192Updated this week
- [NeurIPS 2024] OmniTokenizer: one model and one weight for image-video joint tokenization.☆269Updated 6 months ago
- Official code base for paper EZIGen: Enhancing zero-shot personalized image generation with precise subject encoding and decoupled guidan…☆94Updated 2 weeks ago
- [ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization☆524Updated 7 months ago
- u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model☆126Updated 6 months ago
- [ECCV 2024] Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation☆287Updated 6 months ago
- Official Code for GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation☆134Updated 2 months ago
- WorldGPT: Empowering LLM as Multimodal World Model☆106Updated 5 months ago
- Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models☆602Updated 4 months ago
- SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree☆531Updated last month
- EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders☆528Updated 3 months ago
- [AAAI 2025] Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation☆101Updated last month
- Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models☆157Updated 2 months ago
- Official Repository of ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning☆206Updated 3 months ago
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-language Models☆84Updated 9 months ago
- ☆40Updated 5 months ago
- This is the official reproduction of Qihoo-T2X.☆254Updated 2 months ago
- (NeurIPS 2024) Learning to Visual Question Answering, Asking and Assessment☆63Updated 2 months ago
- This is the official reproduction of FancyVideo.☆650Updated 2 months ago
- Evaluating text-to-image/video/3D models with VQAScore☆232Updated last week
- [ECCV 2024] Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?☆150Updated 3 months ago