dongyh20 / Insight-V
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
⭐21 · Updated this week
Related projects
Alternatives and complementary repositories for Insight-V
- 🔥 Aurora Series: A more efficient multimodal large language model series for video. ⭐47 · Updated last week
- Official repo for StableLLAVA ⭐91 · Updated 11 months ago
- Official implementation of the paper "MMInA: Benchmarking Multihop Multimodal Internet Agents" ⭐38 · Updated 7 months ago
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model ⭐39 · Updated 3 months ago
- This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.or… ⭐107 · Updated 4 months ago
- Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges ⭐49 · Updated 2 months ago
- Source code for the paper "A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image …" ⭐55 · Updated last month
- A Framework for Decoupling and Assessing the Capabilities of VLMs ⭐38 · Updated 4 months ago
- Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters". ⭐42 · Updated 3 weeks ago
- Official implementation of MIA-DPO ⭐41 · Updated 3 weeks ago
- ⭐76 · Updated this week
- ⭐90 · Updated 6 months ago
- ⭐147 · Updated last month
- VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models". ⭐83 · Updated 4 months ago
- Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision ⭐47 · Updated 4 months ago
- The official code of the paper "PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction". ⭐45 · Updated 3 weeks ago
- Multimodal Video Understanding Framework (MVU) ⭐24 · Updated 6 months ago
- [TMLR] Public code repo for the paper "A Single Transformer for Scalable Vision-Language Modeling" ⭐116 · Updated last week
- ⭐35 · Updated 3 months ago
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs ⭐62 · Updated last month
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation ⭐84 · Updated 2 months ago
- [NeurIPS2023] Official implementation of the paper "Large Language Models are Visual Reasoning Coordinators" ⭐103 · Updated last year
- ✨✨ Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models ⭐140 · Updated 2 weeks ago
- This repo contains the code and data for "MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks" ⭐43 · Updated 2 weeks ago
- Matryoshka Multimodal Models ⭐84 · Updated this week
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture ⭐179 · Updated last month
- [NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect…" ⭐32 · Updated 5 months ago
- MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models ⭐55 · Updated 2 months ago
- A big_vision-inspired repo that implements a generic Auto-Encoder class capable of representation learning and generative modeling. ⭐30 · Updated 4 months ago