[NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs".
☆79Jun 17, 2024Updated last year
Alternatives and similar repositories for DeepStack-VL
Users who are interested in DeepStack-VL are comparing it to the libraries listed below
- ☆24Dec 26, 2024Updated last year
- Official code for "What Makes for Good Visual Tokenizers for Large Language Models?".☆58Jun 27, 2023Updated 2 years ago
- ☆21Jan 17, 2025Updated last year
- ☆134Dec 22, 2023Updated 2 years ago
- [NeurIPS 2025] The official repository of "Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tun…☆40Feb 20, 2025Updated last year
- [AAAI-25] Official repository of "Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object De…☆20Dec 27, 2024Updated last year
- VideoNSA: Native Sparse Attention Scales Video Understanding☆81Nov 16, 2025Updated 3 months ago
- [SCIS] MULTI-Benchmark: Multimodal Understanding Leaderboard with Text and Images☆44Nov 19, 2025Updated 3 months ago
- Implementation and dataset for paper "Can MLLMs Perform Text-to-Image In-Context Learning?"☆42Jun 2, 2025Updated 8 months ago
- Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks☆36Nov 27, 2025Updated 3 months ago
- [COLM'25] Official implementation of the Law of Vision Representation in MLLMs☆175Oct 6, 2025Updated 4 months ago
- Harnessing 1.4M GPT4V-synthesized Data for A Lite Vision-Language Model☆281Jun 25, 2024Updated last year
- ☆27Apr 11, 2025Updated 10 months ago
- Official implementation of "Describing Sets of Images with Textual-PCA".☆16Feb 13, 2023Updated 3 years ago
- This repository contains the code and data for the paper "VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception o…☆28Jul 9, 2025Updated 7 months ago
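- 2025.01: A multimodal large model implemented from scratch and named Reyes (睿视; R: 睿, "wise"; eyes: 眼, "eye"). Reyes has 8B parameters, uses InternViT-300M-448px-V2_5 as its visual encoder and Qwen2.5-7B-Instruct as its language model; Reyes also uses a two…☆30Feb 10, 2026Updated 2 weeks ago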
- Multimodal RewardBench☆61Feb 21, 2025Updated last year
- [ICML'25] "Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding" by Jiajun Zhu, Peihao Wang, Ruisi…☆14Jun 6, 2025Updated 8 months ago
- Elastic Workplace Search Official Python Client☆10Aug 8, 2024Updated last year
- Deep Learning for Video Retrieval by Natural Language☆11Oct 20, 2019Updated 6 years ago
- ☆12Jan 25, 2024Updated 2 years ago
- The OBMO module embedded in PatchNet☆10Feb 21, 2024Updated 2 years ago
- ☆14Apr 25, 2025Updated 10 months ago
- Official repository of "CoMP: Continual Multimodal Pre-training for Vision Foundation Models"☆45Apr 3, 2025Updated 10 months ago
- Official code of *Virgo: A Preliminary Exploration on Reproducing o1-like MLLM*☆109May 27, 2025Updated 9 months ago
- ACTIVE-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO☆78Nov 17, 2025Updated 3 months ago
- [CVPR 2024] LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge☆153Sep 3, 2025Updated 5 months ago
- Open-source red teaming framework for MLLMs with 37+ attack methods☆226Jan 16, 2026Updated last month
- Pre-trained V+L Data Preparation☆46Jun 2, 2020Updated 5 years ago
- [ICLR 2025 Spotlight] OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text☆412May 5, 2025Updated 9 months ago
- [ICLR 2025] Mathematical Visual Instruction Tuning for Multi-modal Large Language Models☆152Dec 5, 2024Updated last year
- ✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models☆164Dec 26, 2024Updated last year
- ☆15Sep 22, 2025Updated 5 months ago
- Augmentation scripts for the bAbI Dialog Tasks dataset☆13Oct 16, 2018Updated 7 years ago
- ☆219Jul 5, 2024Updated last year
- A Survey on Leveraging Pre-trained Generative Adversarial Networks for Image Editing and Restoration☆17Jul 22, 2022Updated 3 years ago
- LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba (Official Implementation)☆17Oct 24, 2024Updated last year
- Repository for ACL2020 paper "Refer360° A Referring Expression Recognition Dataset in 360°Images"☆13Jun 26, 2021Updated 4 years ago
- Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models☆65Nov 1, 2024Updated last year