2U1 / Molmo-Finetune
An open-source implementation for fine-tuning Molmo-7B-D and Molmo-7B-O by allenai.
☆28 Updated 3 weeks ago
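As quick orientation for what this repo fine-tunes, the sketch below shows how Molmo-7B-D is typically loaded and run through Hugging Face `transformers` with `trust_remote_code`; the `processor.process` and `model.generate_from_batch` helpers come from the remote code shipped with the `allenai/Molmo-7B-D-0924` checkpoint, not from this repo, and the image URL and prompt are placeholders. This is a minimal sketch, not the repo's training entry point.

```python
# Minimal sketch: load Molmo-7B-D via Hugging Face transformers and run one
# image+text generation. Assumes the allenai/Molmo-7B-D-0924 checkpoint and
# its custom remote code (processor.process, model.generate_from_batch).
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

MODEL_ID = "allenai/Molmo-7B-D-0924"

processor = AutoProcessor.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Placeholder image and prompt; swap in your own data.
image = Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)
inputs = processor.process(images=[image], text="Describe this image.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
generated = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(generated, skip_special_tokens=True))
```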
Related projects
Alternatives and complementary repositories for Molmo-Finetune
- Pytorch implementation of HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models ☆28 Updated 7 months ago
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs ☆62 Updated 2 weeks ago
- ☆73 Updated 8 months ago
- This repo contains the code and data for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" ☆62 Updated this week
- Exploration of the multi modal fuyu-8b model of Adept. 🤓 🔍 ☆28 Updated last year
- Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models ☆51 Updated last week
- Public code repo for paper "A Single Transformer for Scalable Vision-Language Modeling" ☆113 Updated last month
- A Framework for Decoupling and Assessing the Capabilities of VLMs ☆38 Updated 4 months ago
- The official repo for “TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding”. ☆32 Updated last month
- Code for our Paper "All in an Aggregated Image for In-Image Learning" ☆29 Updated 7 months ago
- arXiv 23 "Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs" ☆13 Updated 9 months ago
- ✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models ☆137 Updated this week
- Democratization of "PaLI: A Jointly-Scaled Multilingual Language-Image Model" ☆86 Updated 7 months ago
- ☆57 Updated 9 months ago
- An open-source implementation for fine-tuning the Qwen2-VL series by Alibaba Cloud. ☆106 Updated last week
- A family of highly capable yet efficient large multimodal models ☆161 Updated 2 months ago
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture ☆178 Updated last month
- imagetokenizer is a Python package that helps you encode visuals and generate visual token IDs from a codebook; supports both image and video… ☆27 Updated 4 months ago
- An open-source implementation for fine-tuning the Llama3.2-Vision series by Meta. ☆69 Updated last week
- A Dead Simple and Modularized Multi-Modal Training and Finetune Framework. Compatible with any LLaVA/Flamingo/QwenVL/MiniGemini etc. series … ☆17 Updated 6 months ago
- Enable Next-sentence Prediction for Large Language Models with Faster Speed, Higher Accuracy and Longer Context ☆16 Updated 2 months ago
- The huggingface implementation of Fine-grained Late-interaction Multi-modal Retriever. ☆68 Updated 2 months ago
- ☆29 Updated 2 months ago
- Reproduction of LLaVA-v1.5 based on Llama-3-8b LLM backbone. ☆58 Updated 2 weeks ago
- Multimodal Open-O1 (MO1) is designed to enhance the accuracy of inference models by utilizing a novel prompt-based approach. This tool wo… ☆26 Updated last month
- A Comprehensive Framework for Developing and Evaluating Multimodal Role-Playing Agents ☆30 Updated 3 weeks ago
- Empirical Study Towards Building An Effective Multi-Modal Large Language Model ☆23 Updated last year
- Source code for paper "A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image … ☆54 Updated 3 weeks ago
- Artistic Vision-Language Understanding with Adapter-enhanced MiniGPT-4 ☆25 Updated last year
- Making LLaVA Tiny via MoE-Knowledge Distillation ☆55 Updated 2 weeks ago