buptlihang / CVLMLinks

☆22

Alternatives and similar repositories for CVLM

Users that are interested in CVLM are comparing it to the libraries listed below

Sorting:

scenarios / WeMM
☆87Updated last year
opendatalab / MLLM-DataEngine
MLLM-DataEngine: An Iterative Refinement Approach for MLLM
☆48Updated last year
mynameischaos / Lion
Lion: Kindling Vision Intelligence within Large Language Models
☆51Updated last year
X2FD / LVIS-INSTRUCT4V
☆133Updated last year
mightyzau / InfMLLM
☆19Updated 2 years ago
baaivision / CapsFusion
[CVPR 2024] CapsFusion: Rethinking Image-Text Data at Scale
☆212Updated last year
BAAI-DCAI / Visual-Instruction-Tuning
SVIT: Scaling up Visual Instruction Tuning
☆164Updated last year
OpenGVLab / LCL
[NeurIPS 2024] Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
☆70Updated 9 months ago
palchenli / VL-Instruction-Tuning
☆91Updated 2 years ago
huggingface / docmatix
A huge dataset for Document Visual Question Answering
☆20Updated last year
QQ-MM / PureMM
☆21Updated last year
OFA-Sys / TouchStone
Touchstone: Evaluating Vision-Language Models by Language Models
☆83Updated last year
baaivision / DenseFusion
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
☆158Updated last year
BAAI-DCAI / DataOptim
A collection of visual instruction tuning datasets.
☆76Updated last year
alibaba / conv-llava
☆124Updated last year
PCIResearch / TransCore-M
Large Multimodal Model
☆15Updated last year
MengLcool / DeepStack-VL
[NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect…
☆72Updated last year
AILab-CVC / VL-GPT
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation
☆86Updated last year
ggjy / DeLVM
☆120Updated last year
Share14 / ShareGemini
☆32Updated last year
Kwai-YuanQi / TaskGalaxy
Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types
☆32Updated 4 months ago
SliMM-X / CoMP-MM
Official repository of "CoMP: Continual Multimodal Pre-training for Vision Foundation Models"
☆39Updated 8 months ago
thunlp / Muffin
☆66Updated last year
TencentARC / GVT
Official code for "What Makes for Good Visual Tokenizers for Large Language Models?".
☆58Updated 2 years ago
yeezhu / UNIT
PyTorch implementation of "UNIT: Unifying Image and Text Recognition in One Vision Encoder", NeurlPS 2024.
☆31Updated last year
weijiawu / TransDETR
[IJCV 2024] TransDETR: End-to-end Video Text Spotting with Transformer
☆105Updated last year
InternScience / SimChart9K
The proposed simulated dataset consisting of 9,536 charts and associated data annotations in CSV format.
☆26Updated last year
SY-Xuan / Pink
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
☆95Updated 10 months ago
gaopengcuhk / Pretrained-Pix2Seq
Replication of Pix2Seq with Pretrained Model
☆59Updated 4 years ago
zhjohnchan / SK-VG
[CVPR-2023] The official dataset of Advancing Visual Grounding with Scene Knowledge: Benchmark and Method.
☆32Updated 2 years ago