zhangfaen / finetune-Qwen2-VL
☆281Updated 2 weeks ago
Alternatives and similar repositories for finetune-Qwen2-VL:
Users who are interested in finetune-Qwen2-VL are comparing it to the libraries listed below
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text☆302Updated 2 months ago
- A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision,…☆223Updated 3 weeks ago
- LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer☆348Updated this week
- An open-source implementation for fine-tuning Qwen2-VL series by Alibaba Cloud.☆167Updated this week
- A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.☆595Updated last month
- RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness☆273Updated last month
- ☆162Updated last month
- Long Context Transfer from Language to Vision☆356Updated last month
- Quick exploration into fine-tuning Florence-2☆289Updated 3 months ago
- official code for "Fox: Focus Anywhere for Fine-grained Multi-page Document Understanding"☆135Updated 7 months ago
- Dataset and Code for our ACL 2024 paper: "Multimodal Table Understanding". We propose the first large-scale Multimodal IFT and Pre-Train …☆178Updated 3 months ago
- 📖 This is a repository for organizing papers, codes and other resources related to unified multimodal models.☆328Updated 3 weeks ago
- Baichuan-Omni: Towards Capable Open-source Omni-modal LLM 🌊☆258Updated 2 months ago
- A Framework of Small-scale Large Multimodal Models☆709Updated 3 weeks ago
- LLM2CLIP makes SOTA pretrained CLIP models more SOTA than ever.☆442Updated this week
- ☆159Updated 6 months ago
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer☆210Updated 9 months ago
- Harnessing 1.4M GPT4V-synthesized Data for A Lite Vision-Language Model☆251Updated 6 months ago
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models☆195Updated 4 months ago
- [CVPR'24] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback☆257Updated 4 months ago
- This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for E…☆378Updated this week
- ☆338Updated 2 months ago
- Document Artificial Intelligence☆138Updated last month
- The code of the paper "NExT-Chat: An LMM for Chat, Detection and Segmentation".☆230Updated 11 months ago
- Repo for Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent☆205Updated this week
- Aligning LMMs with Factually Augmented RLHF☆339Updated last year
- A family of lightweight multimodal models.☆972Updated last month
- Official repository for the paper PLLaVA☆630Updated 5 months ago
- [ECCV 2024 Oral] Code for paper: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Langua…☆335Updated last week
- LLaVA-HR: High-Resolution Large Language-Vision Assistant☆223Updated 5 months ago