PKU-YuanGroup / LLaVA-o1
☆55 · Updated 3 months ago
Alternatives and similar repositories for LLaVA-o1:
Users interested in LLaVA-o1 are comparing it to the repositories listed below.
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024 ☆57 · Updated 2 weeks ago
- Code for the paper "Harnessing Webpage UIs for Text-Rich Visual Understanding" ☆48 · Updated 3 months ago
- A minimal implementation of a LLaVA-style VLM with interleaved image, text, and video processing ability. ☆89 · Updated 2 months ago
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs ☆74 · Updated 4 months ago
- Rethinking Step-by-step Visual Reasoning in LLMs ☆268 · Updated last month
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture ☆198 · Updated 2 months ago
- ☆30 · Updated last month
- This is the repo for the paper "Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages" ☆104 · Updated 3 months ago
- The official repository for "2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining" ☆144 · Updated last month
- ☆60 · Updated last month
- Video-LLaVA fine-tune for CinePile evaluation ☆49 · Updated 7 months ago
- ☆36 · Updated last year
- An open-source implementation for fine-tuning Molmo-7B-D and Molmo-7B-O by allenai. ☆52 · Updated last month
- Parameter-efficient finetuning script for Phi-3-vision, the strong multimodal language model by Microsoft. ☆58 · Updated 8 months ago
- ✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models ☆154 · Updated 2 months ago
- This project is a collection of fine-tuning scripts to help researchers fine-tune Qwen2-VL on HuggingFace datasets. ☆64 · Updated 5 months ago
- Official repository for the paper "NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks". This rep… ☆54 · Updated 4 months ago
- ☆13 · Updated 3 months ago
- Resources for our paper: "EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms" ☆84 · Updated 4 months ago
- TinyClick: Single-Turn Agent for Empowering GUI Automation ☆30 · Updated 4 months ago
- A Framework for Decoupling and Assessing the Capabilities of VLMs ☆40 · Updated 8 months ago
- A list of language models with permissive licenses such as MIT or Apache 2.0 ☆24 · Updated last week
- A novel multi-modality (vision) RAG architecture ☆23 · Updated 5 months ago
- Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines ☆117 · Updated 4 months ago
- FuseAI Project ☆83 · Updated last month
- Here we will track the latest AI Multimodal Models, including Multimodal Foundation Models, LLM, Agent, Audio, Image, Video, Music and 3D… ☆34 · Updated last month
- Chat with Phi 3.5/3 Vision LLMs. Phi-3.5-vision is a lightweight, state-of-the-art open multimodal model built upon datasets which includ… ☆33 · Updated 2 months ago
- Official code of *Virgo: A Preliminary Exploration on Reproducing o1-like MLLM* ☆95 · Updated 2 weeks ago