Progressive multi-subspace fusion for text-image matching

Bibliographic Details
Main Authors: Haoming Wang, Li Zhu, Wentao Ma, Qian’ge Guo
Format: Article
Language: English
Published: Springer 2025-06-01
Series: Complex & Intelligent Systems
Subjects:
Online Access: https://doi.org/10.1007/s40747-025-01946-1
Description
Summary: Abstract Text-image cross-modal matching is a core challenge in multimodal machine learning, aiming to enable efficient retrieval of images and texts across modalities. The difficulty stems from the inherent gap between text and image representations, which can lead to suboptimal retrieval performance. Traditional approaches attempt to learn a shared representation space in which images and texts can be compared directly. However, they often fail to account for the varying levels of semantic information captured in different layers of the encoders, resulting in inadequate alignment between the modalities. To address these limitations, we propose a novel approach for text-image matching called Progressive Multi-Subspace Fusion (PMSF). Our model narrows the modality gap through a progressive learning process, starting from shallow representations and moving to deeper layers. A dual-tower structure encodes multi-level features for both image and text, which are then mapped to corresponding auxiliary subspaces. These subspaces are fused through an adaptive GPO pooling strategy, enabling joint learning of a shared representation space. Experimental results on the Flickr30K and MSCOCO benchmarks show that PMSF significantly improves retrieval performance, achieving Rsum scores of 516.9 and 510.7, respectively, and outperforming 23 state-of-the-art methods.
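
The abstract does not give implementation details, but the following minimal sketch illustrates the kind of dual-tower, multi-subspace fusion it describes: each tower projects features taken from several encoder layers into auxiliary subspaces of a shared dimension and fuses them with a learned, GPO-style weighted pooling before cross-modal similarity is computed. All class names, dimensions, and the simplified pooling operator below are assumptions made for illustration, not the authors' code.

import torch
import torch.nn as nn


class GPOStylePool(nn.Module):
    # Simplified stand-in for the adaptive GPO pooling named in the abstract:
    # learns a score per subspace vector and returns their softmax-weighted sum.
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):                      # feats: (batch, num_levels, dim)
        weights = torch.softmax(self.score(feats), dim=1)
        return (weights * feats).sum(dim=1)        # (batch, dim)


class FusionTower(nn.Module):
    # One tower of the hypothetical dual-tower setup: projects multi-level
    # encoder features into auxiliary subspaces, then fuses them.
    def __init__(self, level_dims, shared_dim):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, shared_dim) for d in level_dims])
        self.fuse = GPOStylePool(shared_dim)

    def forward(self, level_feats):                # list of (batch, level_dim) tensors
        subspaces = [p(f) for p, f in zip(self.proj, level_feats)]
        stacked = torch.stack(subspaces, dim=1)    # (batch, num_levels, shared_dim)
        return nn.functional.normalize(self.fuse(stacked), dim=-1)


# Toy usage with made-up shallow-to-deep feature sizes for each encoder.
image_tower = FusionTower(level_dims=[256, 512, 1024], shared_dim=512)
text_tower = FusionTower(level_dims=[384, 768, 768], shared_dim=512)
img_emb = image_tower([torch.randn(4, d) for d in (256, 512, 1024)])
txt_emb = text_tower([torch.randn(4, d) for d in (384, 768, 768)])
similarity = img_emb @ txt_emb.t()                 # cosine similarities for matching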
ISSN: 2199-4536, 2198-6053