DAMO-NLP-SG/multimodal_textbook

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/DAMO-NLP-SG/multimodal_textbook)

DAMO-NLP-SG / multimodal_textbook

[ICCV 2025 Highlight] The official repository for "2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining"

☆196

Alternatives and similar repositories for multimodal_textbook

Users that are interested in multimodal_textbook are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

DAMO-NLP-SG / CMM
View on GitHub
✨✨The Curse of Multi-Modalities (CMM): Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
☆54Jul 11, 2025Updated last year
LengSicong / MMR1
View on GitHub
[CVPR 2026] MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
☆217Sep 26, 2025Updated 10 months ago
alibaba-damo-academy / PixelRefer
View on GitHub
The code for PixelRefer & VideoRefer
☆352Nov 16, 2025Updated 8 months ago
yale-nlp / MMVU
View on GitHub
Data and Code for CVPR 2025 paper "MMVU: Measuring Expert-Level Multi-Discipline Video Understanding"
☆76Feb 28, 2025Updated last year
DAMO-NLP-SG / MT-LLaMA
View on GitHub
Multi-Task instruction-tuned LLaMA
☆14May 5, 2023Updated 3 years ago
Deploy open-source AI quickly and easily - Special Bonus Offer • Ad
Runpod Hub is built for open source. One-click deployment and autoscaling endpoints without provisioning your own infrastructure.
RUCAIBox / Virgo
View on GitHub
Official code of *Virgo: A Preliminary Exploration on Reproducing o1-like MLLM*
☆110May 27, 2025Updated last year
DAMO-NLP-SG / Auto-Arena-LLMs
View on GitHub
☆44Oct 7, 2024Updated last year
RUCAIBox / Event-Bench
View on GitHub
Official code of *Towards Event-oriented Long Video Understanding*
☆12Jul 26, 2024Updated 2 years ago
VisualWebBench / VisualWebBench
View on GitHub
Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?"
☆68Oct 19, 2024Updated last year
TIGER-AI-Lab / VisualWebInstruct
View on GitHub
The official repo for "VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search" [EMNLP25]
☆39Feb 1, 2026Updated 5 months ago
zwq2018 / Multi-modal-Self-instruct
View on GitHub
The codebase for our EMNLP24 paper: Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Mo…
☆85Jan 27, 2025Updated last year
zwq2018 / Agent-Pro
View on GitHub
The Code Repo for Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization
☆129Sep 2, 2024Updated last year
DAMO-NLP-SG / Inf-CLIP
View on GitHub
[CVPR 2025 Highlight] The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for C…
☆287Jan 16, 2025Updated last year
OpenGVLab / OmniCorpus
View on GitHub
[ICLR 2025 Spotlight] OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
☆425May 5, 2025Updated last year
Wordpress hosting with auto-scaling - Free Trial Offer • Ad
Fully Managed hosting for WordPress and WooCommerce businesses that need reliable, auto-scalable performance. Cloudways SafeUpdates now available.
DAMO-NLP-SG / VideoLLaMA3
View on GitHub
Frontier Multimodal Foundation Models for Image and Video Understanding
☆1,173Aug 14, 2025Updated 11 months ago
Rh-Dang / ECBench
View on GitHub
A Holistic Embodied Cognition Benchmark
☆18Apr 3, 2025Updated last year
CircleRadon / TokenPacker
View on GitHub
The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM", IJCV2025
☆280May 26, 2025Updated last year
baaivision / DenseFusion
View on GitHub
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
☆159Dec 6, 2024Updated last year
Share14 / ShareGemini
View on GitHub
☆32Jul 29, 2024Updated 2 years ago
OpenGVLab / VideoChat-Flash
View on GitHub
[ICLR2026] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
☆526Jul 19, 2026Updated last week
DAMO-NLP-SG / LongPO
View on GitHub
[ICLR 2025] LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization
☆43Feb 27, 2025Updated last year
alibaba-damo-academy / EOCBench
View on GitHub
[NeurIPS 2025] EOC-Bench, an innovative benchmark designed to systematically evaluate object-centric embodied cognition in dynamic egocen…
☆22Jun 17, 2025Updated last year
yuweihao / MM-Vet
View on GitHub
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities (ICML 2024)
☆331Jan 20, 2025Updated last year
Proton VPN Special Offer - Get 70% off • Ad
Special partner offer. Trusted by over 100 million users worldwide. Tested, Approved and Recommended by Experts.
alibaba-damo-academy / RynnScale
View on GitHub
RynnScale: Scalable VLM and VLA Development Kits
☆25Jul 17, 2026Updated last week
baaivision / EVE
View on GitHub
EVE Series: Encoder-Free Vision-Language Models from BAAI
☆376Jul 24, 2025Updated last year
ByungKwanLee / Phantom
View on GitHub
[Technical Report] Official PyTorch implementation code for realizing the technical part of Phantom of Latent representing equipped with …
☆63Oct 9, 2024Updated last year
XiPotatonium / pnr
View on GitHub
Accepted at IJCAI-2022
☆11Sep 3, 2022Updated 3 years ago
Ucas-HaoranWei / Slow-Perception
View on GitHub
Official code implementation of Slow Perception:Let's Perceive Geometric Figures Step-by-step
☆163Jul 28, 2025Updated last year
Qinying-Liu / TagAlign
View on GitHub
Official implementation of TagAlign
☆37Dec 11, 2024Updated last year
ZJU-REAL / Self-Braking-Tuning
View on GitHub
[NeurIPS 2025] Let LRMs Break Free from Overthinking via Self-Braking Tuning. https://arxiv.org/abs/2505.14604
☆54Nov 4, 2025Updated 8 months ago
ModalMinds / MM-EUREKA
View on GitHub
MM-EUREKA: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning
☆771Sep 7, 2025Updated 10 months ago
FreedomIntelligence / ALLaVA
View on GitHub
Harnessing 1.4M GPT4V-synthesized Data for A Lite Vision-Language Model
☆281Jun 25, 2024Updated 2 years ago
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
AV-Odyssey / AV-Odyssey
View on GitHub
This repo contains evaluation code for the paper "AV-Odyssey: Can Your Multimodal LLMs Really Understand Audio-Visual Information?"
☆31Dec 23, 2024Updated last year
MAmmoTH-VL / MAmmoTH-VL
View on GitHub
(ACL 2025) MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
☆50Jun 4, 2025Updated last year
dongyh20 / Insight-V
View on GitHub
[CVPR2025 Highlight] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
☆240Nov 7, 2025Updated 8 months ago
SALT-NLP / LLaVAR
View on GitHub
Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding"
☆268Jun 12, 2024Updated 2 years ago
ExplainableML / fomo_in_flux
View on GitHub
Code and benchmark for the paper: "A Practitioner's Guide to Continual Multimodal Pretraining" [NeurIPS'24]
☆62Dec 10, 2024Updated last year
DAMO-NLP-SG / VideoLLaMA2
View on GitHub
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
☆1,304Jan 23, 2025Updated last year
XiPotatonium / chatbot-webui
View on GitHub
一个支持跨模态大语言模型的webui. A chatbot webui that supports various multi-modal large language models
☆11May 8, 2023Updated 3 years ago