This section describes the datasets and performance metrics used to evaluate the proposed method. It then presents the results of experiments on two segmentation tasks, lung segmentation in chest X-rays (CXR) and spinal cord segmentation in MRI, which demonstrate that a segmentation network trained with the proposed method achieves high performance and domain robustness. Finally, we report an ablation study that validates the design choices of the proposed method.
4.1. Dataset
For lung segmentation, we used two public CXR datasets: JSRT [14] and MC [15]. JSRT is a dataset jointly created by the Japanese Society of Radiological Technology and the Japanese Radiological Society, and contains 247 posterior-anterior (PA) CXR images. Among these images, 154 contain a pulmonary nodule, and the remaining 93 are normal. All the images are 2048 × 2048 pixels and are associated with labeled annotations of several anatomical structures, including the lungs. The MC dataset was jointly created by the National Library of Medicine and the Department of Health and Human Services in the U.S. It consists of 138 PA CXR images; among these, 80 are normal and 58 correspond to tuberculosis patients. The images are 4020 × 4892 or 4892 × 4020 pixels.
For the spinal cord segmentation task with MRI images, we used the dataset employed in the spinal cord gray matter challenge [17]. The dataset comprises images collected from four sites: University College London (site1), Polytechnique Montreal (site2), University of Zurich (site3), and Vanderbilt University (site4). Specifically, it consists of 80 MRI images corresponding to 20 cases from each site. The data from these sites exhibit distinct visual characteristics, mainly because the imaging equipment came from different vendors. Therefore, in evaluating the domain robustness of the segmentation networks, the images from each site were treated as a single domain.
Figure 3 shows sample images from each dataset. It can be observed that a certain degree of distributional shift (i.e., domain difference) exists among the datasets. For example, the MC and JSRT datasets have considerably different visual features, as can be observed through the annotations on the sample images in Figure 3. The JSRT dataset contains images from patients with lung nodules, which appear as small spots. In contrast, the MC dataset includes images from tuberculosis patients, whose lesions are spread widely over the lung area. As observed in the following experiments, such a domain shift is difficult to resolve via image preprocessing (e.g., histogram equalization) or data augmentation (e.g., brightness and contrast adjustment).
4.3. Lung Segmentation Result
U-Net [4] was utilized as the base segmentation network to perform comparative experiments involving existing frameworks, such as the ACNN [12] and SRM [13], which consider anatomical structures during training.
Table 1 and Table 2 summarize the detailed architectures of the segmentation network and the autoencoder used in this experiment, respectively. The rectified linear unit (ReLU) was employed as the activation function. Note that the feature maps h and z must be of the same size to compute the embedding loss. The comparison targets, ACNN and SRM, were trained with the same architectures to enable a fair comparison.
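As a rough illustration of this size constraint, the following sketch (in PyTorch, with hypothetical tensor names) computes a mean-squared-error embedding loss between the two feature maps; the actual form of the loss is defined in Equation (3) and may differ.

```python
import torch
import torch.nn.functional as F

def embedding_loss(h: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """MSE-style embedding loss between the segmentation-encoder feature map h
    and the DAE-encoder feature map z (assumed form; see Equation (3))."""
    # The two feature maps must have identical shapes for an element-wise loss.
    assert h.shape == z.shape, "feature maps h and z must be the same size"
    return F.mse_loss(h, z)

# Example: two (batch, channels, height, width) feature maps of matching size.
h = torch.randn(2, 256, 16, 16)
z = torch.randn(2, 256, 16, 16)
loss = embedding_loss(h, z)
```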
In general, to apply the pre-trained anatomical information to the segmentation network, the ACNN projects both the segmentation predictions and the ground truths into a lower-dimensional space using the pre-trained autoencoder and computes a shape regularization loss between these projections. In this experiment, we adopted the binary cross-entropy and mean squared error as the segmentation loss and shape regularization loss, respectively, as in the original study. The weight of the shape regularization loss was set to 0.01 according to the validation process. The SRM [13] is a variant of the ACNN that introduces an auxiliary loss function, namely a reconstruction loss, to ensure that the outputs from the projections obtained with the pre-trained autoencoder remain similar. Therefore, the objective function of the SRM consists of three loss functions: the segmentation, shape regularization, and reconstruction losses. As in the original study, we used the Dice loss as the segmentation and reconstruction loss, and the binary cross-entropy as the shape regularization loss. Through the validation process, the weights for the shape regularization and reconstruction losses were set to 0.01 and 0.001, respectively. For the proposed method, we set the weight of the embedding loss in Equation (3) to 1.0.
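For concreteness, the sketch below assembles the objectives described above using hypothetical helper names (ae_encode, ae_reconstruct). It is a minimal sketch under the stated loss choices and weights, not the reference implementations of ACNN or SRM, and it assumes the autoencoder's low-dimensional projections lie in [0, 1] so that a binary cross-entropy can be applied to them.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice loss; pred and target are (N, 1, H, W) tensors with values in [0, 1].
    inter = (pred * target).sum(dim=(1, 2, 3))
    total = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return 1.0 - ((2.0 * inter + eps) / (total + eps)).mean()

def acnn_objective(pred, target, ae_encode, w_shape=0.01):
    """BCE segmentation loss plus an MSE shape-regularization term between the
    low-dimensional projections of the prediction and the ground truth."""
    l_seg = F.binary_cross_entropy(pred, target)
    l_shape = F.mse_loss(ae_encode(pred), ae_encode(target))
    return l_seg + w_shape * l_shape

def srm_objective(pred, target, ae_encode, ae_reconstruct,
                  w_shape=0.01, w_rec=0.001):
    """Dice segmentation loss, BCE shape regularization between projections,
    and a Dice reconstruction loss on the autoencoder output."""
    l_seg = dice_loss(pred, target)
    l_shape = F.binary_cross_entropy(ae_encode(pred), ae_encode(target))
    l_rec = dice_loss(ae_reconstruct(target), target)
    return l_seg + w_shape * l_shape + w_rec * l_rec
```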
All the methods were trained using the Adam optimizer [29] with a learning rate of 0.0001 for 120 epochs. Histogram equalization was performed as a preprocessing step. For data augmentation, brightness and contrast were adjusted within the range of 0.8 to 1.2. The dataset was randomly split into training, validation, and test sets at a ratio of 65%, 15%, and 20%, respectively. To enable a rigorous evaluation, all the experiments were repeated five times, and the mean and standard deviation of the performance values are reported.
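A minimal sketch of this training setup is given below, assuming a PyTorch data pipeline; the dataset and model objects are placeholders, and histogram equalization can be applied with any standard implementation (e.g., OpenCV) before augmentation.

```python
import torch
from torch.utils.data import random_split
from torchvision import transforms

def build_training_setup(model: torch.nn.Module, dataset, seed: int = 0):
    """Adam optimizer (lr = 1e-4), brightness/contrast jitter in [0.8, 1.2],
    and a random 65/15/20 train/validation/test split; training runs 120 epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    augment = transforms.ColorJitter(brightness=(0.8, 1.2), contrast=(0.8, 1.2))

    n = len(dataset)
    n_train, n_val = int(0.65 * n), int(0.15 * n)
    train_set, val_set, test_set = random_split(
        dataset, [n_train, n_val, n - n_train - n_val],
        generator=torch.Generator().manual_seed(seed),
    )
    return optimizer, augment, (train_set, val_set, test_set)
```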
Table 3 presents the average and standard deviation of the performance values over five runs for the proposed method and the comparison targets. ↓ and ↑ indicate that lower and higher values are better, respectively. For each experiment, the best result is shown in boldface. As baselines, the table reports the performance of training only a segmentation network with (U-Net) and without data augmentation (U-Net w/o aug). The results on both datasets indicate that data augmentation enhances the segmentation performance.
In addition, we observed that the methods in which the anatomical information is reflected during training, ACNN and SRM, outperformed the baselines, U-Net and U-Net w/o aug. Specifically, the ACNN and SRM improved the distance metrics on the JSRT dataset and all metrics on the MC dataset. Nevertheless, the proposed method outperformed all the methods in terms of all overlap and distance metrics on both datasets. In terms of the ASD, the proposed method achieved improvements of 5.6% and 5.4% on the JSRT and MC datasets, respectively, over the second-best performing model, SRM. This result demonstrates that the proposed method can help the segmentation network learn the global anatomical structure to be segmented by producing segmentation outputs from the anatomical feature space modeled by the autoencoder.
The visualization results are presented in Figure 4. The first and second rows show the segmentation results on the JSRT and MC datasets, respectively. The red solid line represents the ground-truth label, and the green area corresponds to the prediction from the segmentation network. The left part shows the segmentation results of each method. The base U-Net tends to predict the lung regions inaccurately, whereas reflecting the anatomical information in the network helps achieve better segmentation of the lung regions. Notably, using the anatomical information through the proposed strategy yielded the most accurate segmentation results.
To gain further insight into the proposed method, the reconstruction results from the trained DAE and the segmentation results obtained by passing the segmentation encoder's features through the DAE decoder are also depicted (see the right part of Figure 4). Here, we examined the reconstruction capability of the DAE although it is not used during inference. The DAE reconstructs the input labels as expected, since the reconstruction of binary lung masks is an easy task. The results from the combination of the segmentation encoder and the DAE decoder are noteworthy: the features produced by the segmentation encoder can be successfully decoded by the DAE decoder. This implies that the segmentation encoder embeds an input image into the anatomical feature space modeled by the DAE, and thereby the decoder can produce good segmentation results from those features.
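The two visualization probes described above can be written compactly as follows (module names are hypothetical); they are used only for analysis and, as noted, the DAE plays no role at inference time.

```python
import torch

@torch.no_grad()
def visualization_probes(image, label, seg_encoder, dae_encoder, dae_decoder):
    """Return (i) the DAE reconstruction of a ground-truth mask and (ii) the
    output of the DAE decoder applied to the segmentation encoder's features."""
    dae_reconstruction = dae_decoder(dae_encoder(label))   # DAE reconstructing the label
    cross_segmentation = dae_decoder(seg_encoder(image))   # seg-encoder features decoded by the DAE decoder
    return dae_reconstruction, cross_segmentation
```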
As described in Section 3.2, the proposed method is designed to control the gradient flow through the embedding loss function in Equation (3). The gradients from the embedding loss do not propagate to the DAE encoder, and thus the DAE encoder is not disturbed while learning to build the anatomical structure features.
Table 4 illustrates the effect of the proposed strategy on the JSRT dataset. Proposed-BI denotes a variant in which the gradient update from the embedding loss is applied to the segmentation encoder and the DAE encoder simultaneously, whereas Proposed-UN denotes the proposed strategy, which updates only the segmentation encoder. From this experiment, we observed that Proposed-BI outperforms the baseline U-Net, which shows that constraining the feature space of the segmentation network with the autoencoder is effective for improving the segmentation performance. Moreover, the results of Proposed-UN demonstrate that preventing the gradient of the embedding loss from being propagated to the DAE further enhances the segmentation performance by encouraging the DAE to effectively learn the structural information.
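The difference between Proposed-BI and Proposed-UN amounts to whether the DAE feature map is detached before computing the embedding loss; a minimal PyTorch sketch (again assuming an MSE form of the loss) is shown below.

```python
import torch
import torch.nn.functional as F

def embedding_term(h: torch.Tensor, z: torch.Tensor, unidirectional: bool = True):
    """Embedding loss between segmentation-encoder features h and DAE-encoder
    features z. Proposed-UN detaches z so the gradient updates only the
    segmentation encoder; Proposed-BI lets the gradient reach both encoders."""
    if unidirectional:
        z = z.detach()  # block the gradient from propagating into the DAE encoder
    return F.mse_loss(h, z)
```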
4.4. Spinal Cord Segmentation Result
The spinal cord gray matter challenge dataset [17] was used to evaluate the spinal cord segmentation performance of the proposed method. This dataset is composed of 3D MRI volumes, which were sliced cross-sectionally so that two-dimensional images could be used in the experiment. Slices without a ground-truth label were not used. Eventually, we obtained 30, 113, 177, and 134 images from site1, site2, site3, and site4, respectively. Images from site1 were not used for training because of their small number. The dataset corresponding to each site was split as follows: 65% for training, 15% for validation, and 20% for testing. All the images were center cropped to the same size. The architectures of the segmentation network and autoencoder are similar to those in the lung segmentation experiment except for the number of layers and kernels: we used four times more kernels and removed one block from each encoder and decoder (see Table 1 and Table 2) to build a better baseline.
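A rough sketch of this data preparation is given below; the slice-filtering criterion and crop size are assumptions for illustration (the actual crop size is fixed in the experiments but not reproduced here).

```python
import numpy as np

def prepare_slices(volume: np.ndarray, masks: np.ndarray, crop: int):
    """Cut a 3D MRI volume into cross-sectional 2D slices, drop slices without
    a ground-truth annotation, and center-crop each remaining slice."""
    pairs = []
    for img, mask in zip(volume, masks):        # iterate over cross-sections
        if mask.sum() == 0:                     # assumed filter: no labeled pixels
            continue
        h, w = img.shape
        top, left = (h - crop) // 2, (w - crop) // 2
        pairs.append((img[top:top + crop, left:left + crop],
                      mask[top:top + crop, left:left + crop]))
    return pairs
```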
The hyperparameters of each method were determined through a validation process: for the ACNN, the weight of the shape regularization loss was set to 0.001; for the SRM, the weights of the shape regularization and reconstruction losses were set to 0.01 and 0.001, respectively; and the weight of the embedding loss in the proposed method was set to 1.0. All the methods were trained using the AdamP optimizer [30] for 120 epochs owing to its more stable training progress. The learning rate and weight decay were set to 0.01 and 0.0001, respectively. To compare against stronger baselines, data augmentation was adopted, including random adjustment of the brightness and contrast in the range of 0.6 to 1.4. The other experimental settings were the same as in the lung segmentation experiment. The mean and standard deviation of the performance metrics over five runs are reported in Table 5.
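Assuming the publicly available adamp package provides the optimizer, the settings above can be configured roughly as follows.

```python
import torch
from adamp import AdamP               # assumes the reference AdamP package is installed
from torchvision import transforms

def build_spinal_cord_setup(model: torch.nn.Module):
    """AdamP with lr = 0.01 and weight decay = 1e-4 (trained for 120 epochs),
    plus random brightness/contrast adjustment in the range [0.6, 1.4]."""
    optimizer = AdamP(model.parameters(), lr=0.01, weight_decay=1e-4)
    augment = transforms.ColorJitter(brightness=(0.6, 1.4), contrast=(0.6, 1.4))
    return optimizer, augment
```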
Similar to the results of the lung segmentation task, the ACNN and SRM outperformed the baseline U-Net, and the proposed method outperformed the comparison methods in terms of all metrics across the datasets from site2, site3, and site4. Notably, the distance metrics, ACD and ASD, were greatly improved, as in the previous experiment. For example, the ASD value of the proposed method on site2 was 0.372, which is 7.2% lower than the baseline and 2.6% lower than the second-best model, ACNN.
The rows in Figure 5 show the predictions of the segmentation methods for site2, site3, and site4, in order. The red solid line is the ground-truth label, and the green area represents the prediction from the segmentation network. From the left part, showing the comparison results, we can observe that the predictions from U-Net contain false positives located far from the ground truth in several cases, whereas the other methods provide better segmentation results. Among these methods, the proposed method achieves more precise segmentation, especially in the case shown in the second row. These results demonstrate that the proposed method learns the anatomical structure information more effectively than the comparison methods, resulting in better segmentation results. As in the lung segmentation task, the reconstruction results from the trained DAE and the segmentation results from the combination of the segmentation encoder and the DAE decoder are presented (see the right part of Figure 5). These visualizations confirm again that the segmentation encoder extracts anatomically informative features that can be easily decoded by the DAE decoder trained to reconstruct the ground-truth labels.
4.5. Domain Robustness
Learning the anatomical structures in medical images can enhance several aspects of segmentation models, for instance, their domain robustness. To demonstrate the domain robustness of the proposed method, we trained a segmentation network using images from a single source (i.e., domain) and tested the trained model on images from other sources. For example, in the lung segmentation task, the JSRT dataset was used for training and the trained model was evaluated on the MC dataset. In general, if a network exhibits a high performance on datasets from unseen domains, the network is considered robust to domain shifts. We conducted similar experiments using the spinal cord dataset: images from each of site2, site3, and site4 were used as the training images, and the segmentation performance was examined on images from the other domains. Images from site1 were used only for testing because the site1 dataset contains too few images for training. The models trained in the previous experiments were reused to examine their robustness to domain shifts.
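This leave-one-domain-out protocol can be summarized with the following sketch, where domain_loaders is a hypothetical mapping from site (or dataset) names to test DataLoaders and metric is any of the reported measures.

```python
import torch

@torch.no_grad()
def cross_domain_evaluation(model, domain_loaders, source_domain, metric):
    """Evaluate a model trained on source_domain on every other domain's test
    images and report the average metric per unseen domain."""
    model.eval()
    scores = {}
    for domain, loader in domain_loaders.items():
        if domain == source_domain:
            continue                              # skip the training domain
        values = [metric(model(x), y) for x, y in loader]
        scores[domain] = sum(values) / len(values)
    return scores
```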
The domain robustness (i.e., domain generalization) performance of the segmentation models in the lung segmentation task is summarized in Table 6. Two experimental settings were considered, JSRT→MC and MC→JSRT, where JSRT→MC denotes the segmentation performance on the MC dataset of a model trained on the JSRT dataset. The last row presents the average performance over the two settings. The models trained using the proposed strategy exhibit superior performance in terms of both the overlap and distance measures, which indicates that the proposed method not only enhances the segmentation performance on the source domains but also renders a segmentation model more robust to domain shifts. As can be seen in Figure 3, the visual characteristics of the two datasets are considerably different; nevertheless, the experimental results highlight that the domain generalization performance of CNN-based segmentation models can be enhanced if the training framework is carefully designed so that the model learns the anatomical structure information related to the given task.
The visualizations of several segmentation results are presented in Figure 6. The red solid line and the blue shaded area represent the ground truth and the segmentation outputs, respectively. The existing approaches, including the baseline U-Net, are sensitive to even a small degree of domain shift, and this phenomenon cannot simply be resolved by applying data augmentation techniques such as random adjustments of the brightness and contrast.
For the spinal cord segmentation task, three experimental settings were considered: site2→others, site3→others, and site4→others. For the models trained on each source domain, we evaluated the segmentation performance on the images from all other domains. The proposed method exhibited a higher generalization capability on unseen domains, as presented in Table 7. Except for the setting in which the source domain was site3, the proposed method achieved better segmentation results on the other domains, especially in terms of the distance metrics. For example, when the source domain was site4, the ACD value averaged over the other domains (site1, site2, and site3) was improved by 14.8% compared with the second-best model, SRM. According to the performance averaged over the three experimental settings, the baseline U-Net, the ACNN, and the proposed method showed similar segmentation performance in terms of the overlap measures, whereas the distance metrics of the proposed method were significantly improved. Specifically, the average ACD value of the proposed method was 0.667, corresponding to a 9.7% improvement over the baseline U-Net.
Figure 7 shows the segmentation results obtained using each model on test images from domains not used for training. Figure 7a–c show the results of the models trained with site2, site3, and site4 as the source domains, respectively. Conclusions similar to those of the previous experiment can be drawn. The comparison methods produced several false-positive predictions, whereas the proposed method accurately predicted the target area. The quantitative and qualitative results indicate that valuable structural information, such as a global shape or location shared across multiple domains, can be learned using the proposed method.