ytaek-oh / fsc-clipLinks

[EMNLP 2024] Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

☆19

Alternatives and similar repositories for fsc-clip

Users that are interested in fsc-clip are comparing it to the libraries listed below

Sorting:

sterzhang / PVIT
Official Repository of Personalized Visual Instruct Tuning
☆32Updated 8 months ago
haoyu-bu / CAFe
Code for "CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning"
☆25Updated 7 months ago
LunarShen / DsicoVLA
[CVPR 2025] DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval
☆20Updated 4 months ago
Aurora-slz / MM-Verify
☆15Updated 3 weeks ago
Zi-hao-Wei / Efficient-Vision-Language-Pre-training-by-Cluster-Masking
[CVPR 2024] Improving language-visual pretraining efficiency by perform cluster-based masking on images.
☆29Updated last year
kwonjunn01 / Hi-Mapper
☆15Updated 11 months ago
fansunqi / AKeyS
Agentic Keyframe Search for Video Question Answering
☆13Updated 7 months ago
tripletclip / TripletCLIP
[NeurIPS 2024] Official PyTorch implementation of "Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives"
☆45Updated 11 months ago
jiaangli / VLCA
Do Vision and Language Models Share Concepts? A Vector Space Alignment Study
☆16Updated 11 months ago
see-say-segment / sesame
🔥 [CVPR 2024] Official implementation of "See, Say, and Segment: Teaching LMMs to Overcome False Premises (SESAME)"
☆45Updated last year
adobe-research / llava-score
☆11Updated last year
LaVi-Lab / Visual-Table
[EMNLP 2024] Official code for "Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models"
☆20Updated last year
showlab / MovieSeq
[ECCV 2024] Learning Video Context as Interleaved Multimodal Sequences
☆40Updated 8 months ago
TAU-VAILab / hierarcaps
Code and data for the paper "Emergent Visual-Semantic Hierarchies in Image-Text Representations" (ECCV 2024)
☆32Updated last year
CeeZh / SILVR
Official Implementation for "SiLVR : A Simple Language-based Video Reasoning Framework"
☆19Updated 2 months ago
mlvlab / VidChain
Official Implementation (Pytorch) of the "VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Capti…
☆22Updated 9 months ago
Dongping-Chen / ISG
(ICLR 2025 Spotlight) Official code repository for Interleaved Scene Graph.
☆31Updated 3 months ago
JaaackHongggg / WorldSense
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
☆33Updated last month
ytaek-oh / vl_compo
☆10Updated last year
shubhamprshr27 / NeglectedTailsVLM
This repository houses the code for the paper - "The Neglected of VLMs"
☆29Updated 6 months ago
ChocoWu / SeTok
Codes for ICLR 2025 Paper: Towards Semantic Equivalence of Tokenization in Multimodal LLM
☆75Updated 7 months ago
rui-qian / READ
Rui Qian, Xin Yin, Dejing Dou†: Reasoning to Attend: Try to Understand How <SEG> Token Works (CVPR 2025)
☆48Updated last month
wuw2019 / LoTLIP
[NeurIPS 2024] Official PyTorch implementation of LoTLIP: Improving Language-Image Pre-training for Long Text Understanding
☆46Updated 10 months ago
gyhdog99 / RACRO2
Official PyTorch implementation of RACRO (https://www.arxiv.org/abs/2506.04559)
☆19Updated 4 months ago
locuslab / llava-token-compression
☆44Updated last year
deep-spin / Infinite-Video
\infty-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation
☆19Updated 9 months ago
daeunni / Video-Skill-CoT
Code for "Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning [EMNLP 2025 Finding]"
☆16Updated 2 months ago
Shengcao-Cao / groundLMM
Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision
☆41Updated last month
lxa9867 / QSD
[CVPR 2024] "Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition"
☆12Updated last year
aimagelab / HySAC
Hyperbolic Safety-Aware Vision-Language Models. CVPR 2025
☆25Updated 7 months ago