In this section, we conduct experiments on three widely used cross-modality datasets and compare HAAN with 11 state-of-the-art methods to highlight its advances. Furthermore, parameter sensitivity analysis, convergence analysis and ablation studies are presented to demonstrate the effectiveness of HAAN and the contribution of each of its components.
4.3. Evaluation Metric and Compared Methods
We perform image-text retrieval tasks on the above three datasets, and the tasks are divided into the following two types (a minimal ranking illustration is given after the list):
- (1) Searching text by an image query (image-to-text retrieval, hereafter I→T);
- (2) Searching images by a text query (text-to-image retrieval, hereafter T→I).
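Both directions reduce to ranking candidates from the other modality by a learned cross-modal similarity score. The following sketch is purely illustrative; the similarity values are hypothetical and not produced by HAAN.

```python
import numpy as np

# Toy image-text similarity matrix: rows index images, columns index texts.
sim = np.array([[0.9, 0.1, 0.4],
                [0.2, 0.8, 0.3],
                [0.5, 0.2, 0.7]])

# I→T: for each image query, rank all texts by descending similarity.
i2t_ranking = np.argsort(-sim, axis=1)

# T→I: for each text query, rank all images by descending similarity.
t2i_ranking = np.argsort(-sim.T, axis=1)

print(i2t_ranking)  # first row gives the text ranking [0, 2, 1] for image 0
print(t2i_ranking)
```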
The mean Average Precision (mAP) is widely used to evaluate the overall performance of retrieval algorithms. The first step in computing mAP is to obtain the average precision (AP) of a set of R retrieved documents by Equation (25):

$$\mathrm{AP} = \frac{1}{T}\sum_{r=1}^{R} P(r)\,\delta(r), \qquad (25)$$

where T denotes the number of relevant documents in the retrieved set and P(r) denotes the precision of the top r retrieved documents. If the r-th retrieved document is relevant (i.e., it belongs to the class of the query), then δ(r) = 1; otherwise, δ(r) = 0. The mAP is then obtained by averaging the AP values over all queries in the query set; accordingly, methods with larger mAP are more effective. Apart from mAP, the precision-recall (PR) curve is another metric for measuring the effectiveness of different methods. The PR curve shows how retrieval precision varies over all recall values and, analogously to mAP, the curve that encloses the larger area corresponds to the better model.
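For concreteness, the following sketch computes AP (Equation (25)) and mAP from binary relevance judgments of ranked result lists. It is an illustrative NumPy implementation of the standard definitions above, not the evaluation code used in the paper.

```python
import numpy as np

def average_precision(relevance, R=None):
    """AP of one query (Equation (25)).

    relevance[r-1] = 1 if the r-th retrieved document belongs to the
    query's class, else 0; R limits evaluation to the top-R results."""
    rel = np.asarray(relevance, dtype=float)[:R]
    T = rel.sum()                              # relevant documents retrieved
    if T == 0:
        return 0.0
    ranks = np.arange(1, len(rel) + 1)
    precision_at_r = np.cumsum(rel) / ranks    # P(r): precision of top-r results
    return float((precision_at_r * rel).sum() / T)

def mean_average_precision(relevance_per_query):
    """mAP: mean of AP over all queries in the query set."""
    return float(np.mean([average_precision(r) for r in relevance_per_query]))

# Toy usage: two queries, binary relevance of their top-5 retrieved documents.
print(mean_average_precision([[1, 0, 1, 0, 0], [0, 1, 1, 1, 0]]))
```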
To confirm the effectiveness of HAAN, we compare it with 11 state-of-the-art methods, including 3 traditional methods, namely JRL [23], KCCA [24] and JFSSL [25], and 8 deep learning methods, namely DCCA [26], SCAN [7], MAVA [27], SGRAF [28], SCL [29], CGMN [30], NAAF [10] and VSRN++ [31].
4.4. Comparison Results
Our HAAN method and the 11 competing methods are compared on all datasets in terms of (1) I→T mAP scores, (2) T→I mAP scores and (3) mAP(AVG) scores (i.e., the average of (1) and (2)), as shown in Table 3. We use "∘" to mark the traditional methods and "•" to mark the deep learning methods, and the best results are shown in bold. From Table 3, we can easily see that HAAN achieves the best overall retrieval performance. Furthermore, HAAN improves the mAP(AVG) scores by 1.83%, 1.20% and 1.89% over the previous best model, VSRN++, on Corel 5K, Pascal Sentence and Wiki, respectively. VSRN++ outperforms HAAN on one of the two tasks, but only by 0.57%, whereas HAAN achieves similarly high performance on both I→T and T→I, which indicates that HAAN is better suited to practical applications.
It is worth noting that the text in Pascal Sentence appears as a set of sentences, whereas in Corel 5K and Wiki it is represented as a set of tags. Looking at the mAP scores, HAAN performs better in image-text retrieval regardless of whether sentences or tags are used. We also find that the deep learning-based image-text retrieval methods outperform the traditional ones. Next, the I→T and T→I tasks are conducted on all datasets, and the PR curves are shown in Figure 5. From Figure 5, we can see that HAAN has the best overall performance, since the area enclosed by its PR curve tends to be larger than that enclosed by the PR curves of the other methods. Noticeably, VSRN++ is superior to HAAN in only one task on Pascal Sentence, as shown in Figure 5c; HAAN is superior to VSRN++ in all other respects.
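As a complement to the mAP sketch above, the points of a PR curve (and the area they enclose) can be obtained from the same binary relevance judgments. The snippet below is a minimal illustration, not the code behind Figure 5.

```python
import numpy as np

def precision_recall_points(relevance):
    """Precision and recall at every cut-off of one ranked result list."""
    rel = np.asarray(relevance, dtype=float)
    hits = np.cumsum(rel)
    ranks = np.arange(1, len(rel) + 1)
    precision = hits / ranks
    recall = hits / max(rel.sum(), 1.0)
    return recall, precision

recall, precision = precision_recall_points([1, 0, 1, 1, 0, 0, 1])
# Trapezoidal approximation of the area under the PR curve (larger is better).
area = float(np.sum(np.diff(recall) * (precision[1:] + precision[:-1]) / 2.0))
print(area)
```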
To better evaluate our method, we also compare the training time of the deep learning methods. Specifically, the source codes of all methods are run on the same machine with a single GPU. From Table 4, our findings are as follows. First, DCCA and SCAN require the shortest training time, but perform less competitively than the other deep learning methods on image-text retrieval. Second, although MAVA, SGRAF, SCL, CGMN and NAAF require nearly the same training time as HAAN, HAAN outperforms them on the image-text retrieval tasks. Finally, VSRN++ is second only to HAAN in image-text retrieval, yet it requires the longest training time.
Through a comprehensive analysis of these experimental results, conclusions can be summarized as follows:
- (1)
JRL, KCCA and JFSSL, the traditional image-text retrieval methods, are not as good as the deep learning-based methods, because deep neural networks can discover nonlinear image-text correlations.
- (2)
The attention-based models are significantly better than DCCA because they can effectively estimate image-text similarity by enabling latent matching between image patches and words. Specifically, SCAN computes the image-text similarity using visual regions and words as corresponding contexts; however, it only exploits local-level relations. Different from SCAN, MAVA measures image-text similarity at the global, local and relation levels, which enables it to achieve better performance. Besides, SGRAF outperforms MAVA by suppressing uncorrelated interactions at the global and local levels. Furthermore, VSRN++ is second only to HAAN, but the training time of HAAN is 16.04%, 16.01% and 14.29% shorter than that of VSRN++ on the three datasets, respectively, which is a significant advantage.
- (3)
SCL, CGMN and NAAF also show outstanding performance, but they are still inferior to HAAN. The reason is that these three methods do not jointly exploit global-level and local-level information. Therefore, HAAN, which considers both kinds of information and further optimizes them, easily beats these three methods at roughly the same training cost.
- (4)
The comprehensive performance of HAAN is the best on all datasets. The reason is that HAAN can mine and fuse the complementarity in multi-level data to bridge the heterogeneous gap. Specifically, HAAN can accurately describe complex nonlinear image-text relationships, which is a distinct advantage over traditional methods. Since HAAN utilizes both global-level and local-level information, it also significantly outperforms SCAN. Although MAVA and SGRAF also fully use global-level and local-level data, HAAN keeps its advantage owing to the proposed AWL loss, which accurately optimizes image-text similarity by integrating pair mining and pair weighting in a unified framework.
In conclusion, HAAN fuses global-level and local-level information and uses the proposed AWL to mine and enhance both kinds of information, so its retrieval accuracy is the best. In addition, the first stage of AWL (i.e., image-text pair sampling) selects valuable information while filtering out redundant information, which accelerates convergence and reduces the training time. HAAN thus achieves both fast training and high precision.
4.5. Parameter Sensitivity and Convergence Analyses
In this section, we conduct a sensitivity analysis for the parameters and a convergence analysis for the hierarchical alignment network. The parameters involved in the proposed method are the two balance parameters mentioned in Section 3.1 and the parameter mentioned in Section 3.4.1. The parameter sensitivity analyses are evaluated using mAP(AVG).
First, we set the parameter from Section 3.4.1 to values in {0.2, 0.4, 0.6, 0.8, 1}; the experimental results are shown in Figure 6. It can be concluded that when this parameter is 0.6, the average mAP scores of I→T and T→I on the three selected datasets are the highest. Specifically, the highest mAP scores on Corel 5K, Pascal Sentence and Wiki are 0.5751, 0.6410 and 0.5546, respectively.
Second, we set the two balance parameters from Section 3.1 to values in {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1}; the experimental results are shown in Figure 7. According to the results, when the ratio of the two parameters is 1:1, the mAP scores on the three datasets reach, or come close to, their highest values, which shows that our two networks are of roughly equal importance. When one parameter is fixed, the mAP first increases and then decreases as the other parameter grows. The closer the two values are to each other, the larger the mAP, which confirms the conclusion above; when the two values differ greatly, the mAP drops rapidly. This indicates that the complementarity of global-level and local-level information is essential for enhancing image-text retrieval performance.
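To make this sweep concrete, the sketch below grid-searches two hypothetical fusion weights for the global-level and local-level similarity scores and ranks the grid by the resulting mAP(AVG). All names here (the similarity matrices, map_avg_fn) are placeholders rather than the paper's actual variables or implementation.

```python
import itertools
import numpy as np

def sweep_fusion_weights(sim_global, sim_local, map_avg_fn):
    """Grid-search the two balance parameters over {0.1, ..., 1.0}.

    sim_global / sim_local: (num_images, num_texts) similarity matrices from
    the global and local alignment networks (hypothetical inputs).
    map_avg_fn(sim) is assumed to return the average of the I→T and T→I mAP.
    """
    grid = [round(0.1 * k, 1) for k in range(1, 11)]
    results = {}
    for w_g, w_l in itertools.product(grid, grid):
        fused = w_g * sim_global + w_l * sim_local   # weighted fusion
        results[(w_g, w_l)] = map_avg_fn(fused)
    # Return the grid sorted by mAP(AVG), best combination first.
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage with random scores and a dummy evaluation function.
rng = np.random.default_rng(0)
sg, sl = rng.random((4, 4)), rng.random((4, 4))
best = sweep_fusion_weights(sg, sl, map_avg_fn=lambda s: float(s.diagonal().mean()))
print(best[0])
```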
Finally, the results of the convergence experiment for GAN are shown in Figure 8. We can easily observe that the objective function value of GAN decreases monotonically at each iteration, which shows that our proposed AWL loss is effective. The convergence of LAN is not reported, because it behaves similarly to that of GAN.
4.6. Ablation Study
In this section, a series of ablation studies are conducted under different configurations of critical components of HAAN, in order to study the contribution of each component in the model.
As shown in Table 5, several models are provided for the ablation studies to reveal the effectiveness of GCM, LCM, AWL (Stage 1) and AWL (Stage 2). In particular, "∘" indicates that the module (or loss function) is not contained in the model, while "•" indicates that it is. To further demonstrate the effectiveness of AWL, we also compare it with the Triplet loss (TRI) [12]; in the corresponding ablation model, TRI replaces AWL. We provide 7 combinations of the above 5 components (e.g., HAAN-GCM denotes HAAN with only the GCM module). The experimental results of the ablation studies are shown in Table 6, from which the following conclusions can be drawn. Note that the best results in Table 6 are shown in bold.
The mAP value of HAAN-LCM is 1.45%, 2.09% and 1.7% higher than that of HAAN-GCM on Corel 5K, Pascal Sentence and Wiki, respectively. This is because LAN captures more details through the attention mechanism and thereby obtains more valuable information. The performance of HAAN-GCM-LCM is better than that of both HAAN-GCM and HAAN-LCM, which shows that global-level and local-level information are complementary and that better performance can be achieved by integrating the two modules (i.e., GCM and LCM).
The mAP scores of HAAN-GCM-LCM-AWL (Stage 1) and HAAN-GCM-LCM-AWL (Stage 2) are very close, indicating that the two stages of AWL are of almost equal importance in image-text similarity optimization. Furthermore, both are significantly better than HAAN-GCM-LCM-TRI: the mAP value achieved with either stage of AWL is at least 1.3%, 2.63% and 2.4% higher than that of the Triplet loss on the three datasets, respectively. This is because the two stages of AWL address two major flaws of the Triplet loss, respectively.
The mAP score of HAAN is much higher than that of HAAN-GCM-LCM-AWL (Stage 1) and HAAN-GCM-LCM-AWL (Stage 2). This is because the integration of the two stages (i.e., AWL (Stage 1) and AWL (Stage 2)) compensates for the defects of using a single stage of AWL. Specifically, (1) when only HAAN-GCM-LCM-AWL (Stage 1) is used, valuable samples are selected, but the differences among training samples cannot be fully exploited; (2) when only HAAN-GCM-LCM-AWL (Stage 2) is used, all samples are optimized with different strengths, but no redundant information is filtered out.
From the column “AVG of all datasets”, we find that there are four main factors affecting system performance in our proposed HAAN, including (1) GCM, (2) LCM, (3) AWL (Stage 1), and (4) AWL (Stage 2). Specifically, the ways in which each factor affects the performance of image-text retrieval are shown below.
- (1)
GCM: From a holistic perspective, it explores global-level alignment between the whole image and text to learn image-text similarity.
- (2)
LCM: From the perspective of detail, it explores the local-level alignment between image patches and keywords to learn image-text similarity.
- (3)
AWL (Stage 1): Selects valuable sample pairs (i.e., (1) selects samples far from the anchor to generate positive pairs; (2) selects samples close to the anchor to generate negative pairs) and filters out redundant pairs.
- (4)
AWL (Stage 2): Makes full use of discriminative training samples and assigns a suitable weight to each positive and negative pair to achieve adaptive optimization (a code sketch of the two-stage idea is given after this list).
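To make the two stages more tangible, here is a minimal PyTorch sketch of the general idea, informative-pair mining followed by adaptive pair weighting, written in the style of multi-similarity-type pair-based losses. The margin and temperature values (margin, alpha, beta) and the exact weighting form are placeholders; this is not the paper's AWL formulation.

```python
import torch

def two_stage_pair_loss(sim, labels, margin=0.1, alpha=2.0, beta=50.0):
    """Illustrative two-stage loss over an (N, N) image-text similarity matrix.

    labels: (N,) class labels, assumed to index the images and the texts
    in the same order (a simplifying assumption for this sketch)."""
    N = sim.size(0)
    loss = sim.new_zeros(())
    for i in range(N):
        pos_mask = labels == labels[i]
        neg_mask = ~pos_mask
        pos_sim, neg_sim = sim[i][pos_mask], sim[i][neg_mask]
        if pos_sim.numel() == 0 or neg_sim.numel() == 0:
            continue
        # Stage 1 (pair mining): keep positives far from the anchor and
        # negatives close to the anchor; the remaining pairs are redundant.
        hard_pos = pos_sim[pos_sim < neg_sim.max() + margin]
        hard_neg = neg_sim[neg_sim > pos_sim.min() - margin]
        if hard_pos.numel() == 0 or hard_neg.numel() == 0:
            continue
        # Stage 2 (pair weighting): soft, adaptive weights for the mined pairs
        # instead of the uniform treatment used by the triplet loss.
        loss = loss + torch.log1p(torch.exp(-alpha * (hard_pos - 0.5)).sum()) / alpha
        loss = loss + torch.log1p(torch.exp(beta * (hard_neg - 0.5)).sum()) / beta
    return loss / N

# Toy usage with random similarities and labels.
sim = torch.randn(8, 8)
labels = torch.randint(0, 3, (8,))
print(two_stage_pair_loss(sim, labels))
```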
To further verify the effectiveness of each factor, we conduct a series of ablation studies and add a new column titled "AVG of all datasets" to Table 6. First of all, the performances of HAAN-GCM and HAAN-LCM are close to each other, which proves that fine-grained data and coarse-grained data are of equal importance for image-text retrieval. The values of HAAN-GCM-LCM-AWL (Stage 1) and HAAN-GCM-LCM-AWL (Stage 2) are approximately the same, meaning that the effects of Stage 1 and Stage 2 are almost equal. Furthermore, when either stage of AWL is added, the performance improves by about 3% compared with using the global-level or local-level information alone, which shows that AWL is very effective in promoting the performance of HAAN. At the same time, the full HAAN is about 2% better than using either stage of AWL alone, confirming the benefit of aggregating the two stages.
From an overall point of view, all four modules are of great importance. GCM and LCM lay the foundation for the subsequent optimization and further improvement of the model. AWL, operating on the fully aggregated information (i.e., global-level and local-level information), can quickly improve the overall performance of the model. When the two stages are employed together, the optimization effect of AWL improves by more than about 4% compared with TRI, which confirms its remarkable optimization effect. In conclusion, (1) each component of HAAN plays a positive role in the image-text retrieval task; (2) HAAN effectively mines and fuses the complementarity in multi-granularity data, which provides essential clues for bridging the heterogeneous gap.
4.7. Qualitative Results
We provide typical examples of image-text retrieval on the Pascal Sentence dataset by two state-of-the-art image-text retrieval methods (i.e., VSRN++ and NAAF) as well as our HAAN. Figure 9 shows the top ten I→T and T→I results for specific queries. In particular, in Figure 9a, we select two I→T queries for retrieving "cow" and "dog"; in Figure 9b, we select two T→I queries for retrieving "aeroplane" and "train".
For the I→T task, HAAN shows the best performance because its query results contain the fewest errors. It is worth noting that, as in the retrieval for "cow", the erroneous texts still contain some correct words (e.g., "black", "white" and "face") that match the semantic information of the query image.
Furthermore, for the T→I task, VSRN++ and NAAF both make more mistakes, while HAAN obtains the results with the fewest mistakes; its erroneous results partially deviate from the query semantics but still contain features similar to the correct semantic information. For example, images of birds in flight appear in the retrieval for "aeroplane". In contrast, the errors of VSRN++ and NAAF deviate largely from the correct semantic information.
From this, we can conclude that HAAN significantly outperforms VSRN++ and NAAF on both the I→T and T→I tasks. It should be noted that NAAF is the worst performer among the three methods, not only because it returns many more wrong results in both retrieval tasks, but also because the semantic concepts of its wrong results are totally different from the correct semantic information. For example, when searching for "aeroplane", the results include pictures of motorcycles and trucks; when searching for "train", the results include pictures of buildings and room interiors. Such results, which seriously deviate from the correct semantics, are entirely unacceptable and show that NAAF performs the worst.
All in all, HAAN is superior to these two state-of-the-art methods and achieves the best performance.