deepglint / CrocLinks

Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension

☆24

Alternatives and similar repositories for Croc

Users that are interested in Croc are comparing it to the libraries listed below

Sorting:

tripletclip / TripletCLIP
[NeurIPS 2024] Official PyTorch implementation of "Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives"
☆41Updated 8 months ago
Zi-hao-Wei / Efficient-Vision-Language-Pre-training-by-Cluster-Masking
[CVPR 2024] Improving language-visual pretraining efficiency by perform cluster-based masking on images.
☆28Updated last year
JiuTian-VL / MoME
[NeurIPS 2024] MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models
☆69Updated 3 months ago
chuangchuangtan / LLaVA-NeXT-Image-Llama3-Lora
LLaVA-NeXT-Image-Llama3-Lora, Modified from https://github.com/arielnlee/LLaVA-1.6-ft
☆44Updated last year
wuw2019 / LoTLIP
[NeurIPS 2024] Official PyTorch implementation of LoTLIP: Improving Language-Image Pre-training for Long Text Understanding
☆43Updated 6 months ago
haoyu-bu / CAFe
Code for "CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning"
☆22Updated 4 months ago
MIV-XJTU / FLAME
[CVPR 2025] PyTorch implementation of paper "FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training"
☆30Updated last month
thunlp / DeepPerception
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
☆65Updated 2 months ago
adobe-research / llava-score
☆11Updated 10 months ago
UCSC-VLAA / CLIPS
An Enhanced CLIP Framework for Learning with Synthetic Captions
☆37Updated 3 months ago
ExplainableML / flair
[CVPR 2025] FLAIR: VLM with Fine-grained Language-informed Image Representations
☆94Updated last month
foundation-multimodal-models / CAL
[NeurIPS'24] Official PyTorch Implementation of Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment
☆57Updated 10 months ago
yuecao0119 / MMFuser
The official implementation of the paper "MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding". …
☆57Updated 9 months ago
LaVi-Lab / AIM
[ICCV 2025] Official code for "AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning"
☆35Updated last month
meetdavidwan / crg
PyTorch code for "Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training"
☆35Updated last year
ant-research / DreamLIP
[ECCV 2024] Official PyTorch implementation of DreamLIP: Language-Image Pre-training with Long Captions
☆135Updated 3 months ago
yaolinli / DeCo
Code for DeCo: Decoupling token compression from semanchc abstraction in multimodal large language models
☆63Updated 3 weeks ago
ys-zong / VL-ICL
[ICLR 2025] VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning
☆63Updated 6 months ago
Jiaxuan-Li / EVCap
[CVPR 2024] Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension
☆54Updated last year
amazon-science / QA-ViT
☆69Updated last year
hshjerry / VideoEspresso
[CVPR 2025 Oral] VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
☆108Updated 2 weeks ago
whwu95 / FreeVA
FreeVA: Offline MLLM as Training-Free Video Assistant
☆63Updated last year
minglllli / CLS-RL
Think or Not Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning
☆58Updated 2 months ago
Liuziyu77 / MIA-DPO
Official implement of MIA-DPO
☆63Updated 6 months ago
wusize / F-LMM
[CVPR2025] Code Release of F-LMM: Grounding Frozen Large Multimodal Models
☆100Updated 2 months ago
sterzhang / PVIT
Official Repository of Personalized Visual Instruct Tuning
☆32Updated 5 months ago
Qinying-Liu / TagAlign
Official implementation of TagAlign
☆35Updated 8 months ago
yuhui-zh15 / VLMClassifier
Official implementation of "Why are Visually-Grounded Language Models Bad at Image Classification?" (NeurIPS 2024)
☆88Updated 9 months ago
ZhengYu518 / VL-Mamba
Implementation of "VL-Mamba: Exploring State Space Models for Multimodal Learning"
☆82Updated last year
lezhang7 / SAIL
[CVPR 2025 Highlight] Official Pytorch codebase for paper: "Assessing and Learning Alignment of Unimodal Vision and Language Models"
☆48Updated last month