SSM: Semantic Selection and Multi-view Alignment for Image-Text Retrieval
B Yu, Z Yang, X Yi, Y Wang, Z Bao - International Conference on Advanced …, 2023 - Springer
International Conference on Advanced Data Mining and Applications, 2023 • Springer
Abstract
Image-text retrieval is a crucial and fundamental task in the multi-modal field. Because the Transformer encoder excels at modeling multimodal information, Transformer-based alignment models have become the mainstream approach to image-text retrieval. However, current Transformer-based alignment models suffer from two major limitations: (1) The redundancy of modal features and the complexity of correlations between modalities restrict the performance of the model. (2) Current research is typically limited to a single viewpoint during modal alignment. To address these issues, in this paper we propose SSM, an image-text retrieval model based on Semantic Selection and Multi-view alignment. Specifically, we introduce a gated attention unit to filter unnecessary information, and design an adaptive weighted similarity calculation method to dynamically adjust the importance of different features during the alignment process. In addition, we design a multi-view cross-modal alignment method that considers different granularities and different levels of information to provide complementary benefits in representation learning. We compare SSM with other advanced image-text retrieval models on the MS-COCO and Flickr30K datasets, and the results show that the SSM model achieves competitive performance without much interaction.
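The abstract gives no formulas, so the following is only a minimal sketch of the general idea behind gating features and weighting similarities. It is not the SSM implementation. The function names (`gate_vector`, `weighted_similarity`), the element-wise sigmoid gate, and the softmax weighting are all assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_vector(vec, gate_logits):
    """Assumed gate: scale each dimension by sigmoid(logit), a value in (0, 1).

    Strongly negative logits push a dimension toward zero (filtered out).
    """
    return [sigmoid(g) * v for v, g in zip(vec, gate_logits)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_similarity(image_feats, text_feats, weight_logits):
    """Assumed weighting: softmax-normalized weights over per-pair cosine scores.

    image_feats[i] and text_feats[i] are treated as the i-th aligned feature
    pair; weight_logits[i] stands in for a learned importance score.
    """
    weights = softmax(weight_logits)
    sims = [cosine(i, t) for i, t in zip(image_feats, text_feats)]
    return sum(w * s for w, s in zip(weights, sims))
```

In a trained model the gate and weight logits would be produced by learned layers; here they are passed in directly so the arithmetic stays visible.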