Deep Convolutional Neural Networks For Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMI.2016.2528162, IEEE
Transactions on Medical Imaging
Abstract—Remarkable progress has been made in image recognition, primarily due to the availability of large-scale annotated datasets (e.g., ImageNet) and the revival of deep convolutional neural networks (CNN). CNNs enable learning data-driven, highly representative, layered hierarchical image features from sufficient training data. However, obtaining datasets as comprehensively annotated as ImageNet in the medical imaging domain remains a challenge. There are currently three major techniques that successfully apply CNNs to medical image classification: training the CNN from scratch, using off-the-shelf pre-trained CNN features, and conducting unsupervised CNN pre-training with supervised fine-tuning. Another effective method is transfer learning, i.e., fine-tuning CNN models (supervised) pre-trained on natural image datasets for medical image tasks (although domain transfer between two medical image datasets is also possible).
In this paper, we investigate three important, but previously understudied factors of employing deep convolutional neural networks for computer-aided detection problems. We first explore and evaluate different CNN architectures. The studied models contain 5 thousand to 160 million parameters, and vary in numbers of layers. We then evaluate the influence of dataset scale and spatial image context on performance. Finally, we examine when and why transfer learning from pre-trained ImageNet (via fine-tuning) can be useful. We study two specific computer-aided detection (CADe) problems, namely thoraco-abdominal lymph node (LN) detection and interstitial lung disease (ILD) classification. We achieve state-of-the-art performance on mediastinal LN detection, with 85% sensitivity at 3 false positives per patient, and report the first five-fold cross-validation classification results on predicting axial CT slices with ILD categories. Our extensive empirical evaluation, CNN model analysis and valuable insights can be extended to the design of high performance CAD systems for other medical imaging tasks.
Hoo-Chang Shin, Holger R. Roth, Le Lu, Isabella Nogues, Jianhua Yao and Ronald M. Summers are with the Imaging Biomarkers and Computer-Aided Diagnosis Laboratory; Mingchen Gao, Ziyue Xu and Daniel Mollura are with Center for Infectious Disease Imaging, Le Lu, Jianhua Yao and Ronald M. Summers are also with Clinical Image Processing Service, Radiology and Imaging Sciences Department, National Institutes of Health Clinical Center, Bethesda, MD 20892-1182, USA. Asterisk indicates corresponding author. Holger Roth and Mingchen Gao contributed equally to this work. e-mail: {hoochang.shin, le.lu, rms}@nih.gov. Copyright (c) 2010 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].
I. INTRODUCTION
Tremendous progress has been made in image recognition, primarily due to the availability of large-scale annotated datasets (e.g., ImageNet [1], [2]) and the recent revival of deep convolutional neural networks (CNN) [3], [4]. For data-driven learning, large-scale well-annotated datasets with representative data distribution characteristics are crucial to learning more accurate or generalizable models [5], [4]. Unlike previous image datasets used in computer vision, ImageNet [1] offers a very comprehensive database of more than 1.2 million categorized natural images of 1000+ classes. The CNN models trained upon this database serve as the backbone for significantly improving many object detection and image segmentation problems using other datasets [6], [7], e.g., PASCAL [8] and medical image categorization [9], [10], [11], [12]. However, there exists no large-scale annotated medical image dataset comparable to ImageNet, as data acquisition is difficult, and quality annotation is costly.
There are currently three major techniques that successfully apply CNNs to medical image classification: 1) training the “CNN from scratch” [13], [14], [15], [16], [17]; 2) using “off-the-shelf CNN” features (without retraining the CNN) as complementary information channels to existing hand-crafted image features, for chest X-rays [10] and CT lung nodule identification [9], [12]; and 3) performing unsupervised pre-training on natural or medical images and fine-tuning on medical target images using CNN or other types of deep learning models [18], [19], [20], [21]. A decompositional 2.5D view resampling and an aggregation of random view classification scores are used to eliminate the “curse-of-dimensionality” issue in [22], in order to acquire a sufficient number of training image samples.
Previous studies have analyzed three-dimensional patch creation for LN detection [23], [24], atlas creation from chest CT [25] and the extraction of multi-level image features [26], [27]. At present, there are several extensions or variations of the decompositional view representation introduced in [22], [28], such as: using a novel vessel-aligned multi-planar image representation for pulmonary embolism detection [29], fusing unregistered multiview for mammogram analysis [16] and classifying pulmonary peri-fissural nodules via an ensemble of 2D views [12].
Although natural images and medical images differ significantly, conventional image descriptors developed for object recognition in natural images, such as the scale-invariant feature transform (SIFT) [30] and the histogram of oriented gradients (HOG) [31], have been widely used for object detection and segmentation in medical image analysis. Recently, ImageNet pre-trained CNNs have been used for chest pathology identification and detection in X-ray and CT modalities
[10], [9], [12]. They have yielded the best performance results by integrating low-level image features (e.g., GIST [32], bag of visual words (BoVW) and bag-of-frequency [12]). However, the fine-tuning of an ImageNet pre-trained CNN model on medical image datasets has not yet been exploited.
In this paper, we investigate three important, but previously under-studied factors of employing deep convolutional neural networks for computer-aided detection problems. Particularly, we explore and evaluate different CNN architectures varying in width (ranging from 5 thousand to 160 million parameters) and depth (various numbers of layers), describe the effects of varying dataset scale and spatial image context on performance, and discuss when and why transfer learning from pre-trained ImageNet CNN models can be valuable. We further verify our hypothesis by inheriting and adapting rich hierarchical image features [5], [33] from the large-scale ImageNet dataset for computer aided diagnosis (CAD). We also explore CNN architectures of the most studied seven-layered “AlexNet-CNN” [4], a shallower “Cifar-CNN” [22], and a much deeper version of “GoogLeNet-CNN” [33] (with our modifications on CNN structures). This study is partially motivated by recent studies [34], [35] in computer vision. The thorough quantitative analysis and evaluation on deep CNN [34] or sparsity image coding methods [35] elucidate the emerging techniques of the time and provide useful suggestions for their future stages of development, respectively.
Two specific computer-aided detection (CADe) problems, namely thoraco-abdominal lymph node (LN) detection and interstitial lung disease (ILD) classification, are studied in this work. On mediastinal LN detection, we surpass all currently reported results. We obtain 86% sensitivity on 3 false positives (FP) per patient, versus the prior state-of-the-art sensitivities of 78% [36] (stacked shallow learning) and 70% [22] (CNN). For the first time, ILD classification results under the patient-level five-fold cross-validation protocol (CV5) are investigated and reported. The ILD dataset [37] contains 905 annotated image slices from 120 patients and 6 ILD labels. Such sparsely annotated datasets are generally difficult for CNN learning, due to the paucity of labeled instances.
Evaluation protocols and details are critical to deriving significant empirical findings [34]. Our experimental results suggest that different CNN architectures and dataset re-sampling protocols are critical for the LN detection tasks where the amount of labeled training data is sufficient and spatial contexts are local. Since LN images are more flexible than ILD images with respect to resampling and reformatting, LN datasets may be more readily augmented by such image transformations. As a result, LN datasets contain more training and testing data instances (due to data augmentation) than ILD datasets. They nonetheless remain less comprehensive than natural image datasets, such as ImageNet. Fine-tuning ImageNet-trained models for ILD classification is clearly advantageous and yields early promising results, when the amount of labeled training data is highly insufficient and multi-class categorization is used, as opposed to the LN dataset’s binary class categorization. Another significant finding is that CNNs trained from scratch or fine-tuned from ImageNet models consistently outperform CNNs that merely use off-the-shelf CNN features, in both the LN and ILD classification problems. We further analyze, via CNN activation visualizations, when and why transfer learning from non-medical to medical images in CADe problems can be valuable.
II. DATASETS AND RELATED WORK
We apply CNNs (with the characteristics defined above) to thoraco-abdominal lymph node (LN) detection (evaluated separately on the mediastinal and abdominal regions) and interstitial lung disease (ILD) detection. For LN detection, we use randomly sampled 2.5D views in CT [22]. We use 2D CT slices [38], [39], [40] for ILD detection. We then evaluate and compare CNN performance results.
Until the detection aggregation approach [22], [41], thoracoabdominal lymph node (LN) detection via CADe mechanisms had yielded poor performance results. In [22], each 3D LN candidate produces up to 100 random 2.5D orthogonally sampled images or views which are then used to train an effective CNN model. The best performance on abdominal LN detection is achieved at 83% recall on 3FP per patient [22], using a “Cifar-10” CNN. Using the thoracoabdominal LN detection datasets [22], we aim to surpass this CADe performance level, by testing different CNN architectures, exploring various dataset re-sampling protocols, and applying transfer learning from ImageNet pre-trained CNN models.
Interstitial lung disease (ILD) comprises more than 150 lung diseases affecting the interstitium, which can severely impair the patient’s ability to breathe. Gao et al. [40] investigate the ILD classification problem in two scenarios: 1) slice-level classification: assigning a holistic two-dimensional axial CT slice image with its occurring ILD disease label(s); and 2) patch-level classification: a/ sampling patches within the 2D ROIs (Regions of Interest provided by [37]), then b/ classifying patches into seven category labels (six disease labels and one “healthy” label). Song et al. [38], [39] only address the second sub-task of patch-level classification under the “leave-one-patient-out” (LOO) criterion. By training on the moderate-to-small scale ILD dataset [37], our main objective is to explore and benchmark CNN based ILD classification performances under the CV5 metric (which is more realistic and unbiased than LOO [38], [39] and hard-split [40]), with and without transfer learning.
Thoracoabdominal Lymph Node Datasets. We use the publicly available dataset from [22], [41]. There are 388 mediastinal LNs labeled by radiologists in 90 patient CT scans, and 595 abdominal LNs in 86 patient CT scans. To facilitate comparison, we adopt the data preparation protocol of [22], where positive and negative LN candidates are sampled with the fields-of-view (FOVs) of 30mm to 45mm, surrounding the annotated and detected LN centers (obtained by a candidate generation process). More precisely, [22], [41], [36] follow a coarse-to-fine CADe scheme, partially inspired by [42], which operates with ∼ 100% detection recalls at the cost of approximately 40 false or negative LN candidates per patient scan. In this work, positive and negative LN candidates are first sampled up to 200 times with translations and rotations.
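The 2.5D view sampling referred to above can be illustrated with a minimal sketch. It assumes, as one common reading of "2.5D", that a view stacks the axial, coronal and sagittal planes through a candidate center as three channels; the function names, the nested-list volume layout `volume[z][y][x]`, and the integer-translation scheme are illustrative assumptions, and the random rotations and FOV scaling used in [22] are omitted.

```python
import random

def clamp(index, size):
    """Keep an index inside [0, size - 1] (edge replication at borders)."""
    return max(0, min(size - 1, index))

def extract_25d_view(volume, center, half_size):
    """Return [axial, coronal, sagittal] patches through `center`.

    `volume` is a nested list indexed as volume[z][y][x]; each patch is a
    (2 * half_size) x (2 * half_size) list of rows.
    """
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    cz, cy, cx = center
    offsets = range(-half_size, half_size)

    def voxel(z, y, x):
        return volume[clamp(z, nz)][clamp(y, ny)][clamp(x, nx)]

    axial = [[voxel(cz, cy + a, cx + b) for b in offsets] for a in offsets]
    coronal = [[voxel(cz + a, cy, cx + b) for b in offsets] for a in offsets]
    sagittal = [[voxel(cz + a, cy + b, cx) for b in offsets] for a in offsets]
    return [axial, coronal, sagittal]

def sample_views(volume, center, half_size, n_views, max_shift, seed=0):
    """Draw `n_views` views with random integer translations of the center.

    Hypothetical helper: rotations are not implemented in this sketch.
    """
    rng = random.Random(seed)
    views = []
    for _ in range(n_views):
        shifted = tuple(c + rng.randint(-max_shift, max_shift) for c in center)
        views.append(extract_25d_view(volume, shifted, half_size))
    return views
```

Each returned view has three channels, so it can be fed to a CNN that expects a three-channel image.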
Fig. 3. Some examples of 64 × 64 pixel CT image patches for (a) NM, (b) EM, (c) GG, (d) FB, (e) MN, (f) CD.
normal    emphysema    ground glass    fibrosis    micronodules    consolidation
30.2      20.2         85.4            96.8        63.2            39.2
TABLE I
AVERAGE NUMBER OF IMAGES IN EACH FOLD FOR DISEASE CLASSES, WHEN DIVIDING THE DATASET IN 5-FOLD PATIENT SETS.
[-1400; -950HU]. We then encode the transformed images into RGB channels (to be aligned with the input channels of CNN models [4], [33] pre-trained from natural image datasets [1]). The low-attenuation CT window is useful for visualizing certain texture patterns of lung diseases (especially emphysema). The usage of different CT attenuation channels improves classification results over the usage of a single CT windowing channel, as demonstrated in [40]. More importantly, these CT windowing processes do not depend on lung segmentation; they are instead defined directly in the CT HU space. Figure 4 shows a representative example of lung, high-attenuation, and low-attenuation CT windowing for an axial lung CT slice.
As observed in [40], lung segmentation is crucial to holistic slice-level ILD classification. We empirically compare performance in two scenarios with a rough lung segmentation (see footnote 1). There is no significant difference between the two setups. Due to the high precision of CNN based image processing, highly accurate lung segmentation is not necessary. The localization of ILD regions within the lung is simultaneously learned through selectively weighted CNN receptive fields in the deepest convolutional layers during the classification based CNN training [49], [50]. Some areas outside of the lung appear in both healthy and diseased images. CNN training learns to ignore them by setting very small filter weights around the corresponding regions (Figure 13). This observation is validated by [40].
III. METHODS
In this study, we explore, evaluate and analyze the influence of various CNN architectures, dataset characteristics (when we need more training data or better models for object detection [51]) and CNN transfer learning from non-medical to medical image domains. These three key elements of building effective deep CNN models for CADe problems are described below.
A. Convolutional Neural Network Architectures
We mainly explore three convolutional neural network architectures (CifarNet [5], [22], AlexNet [4] and GoogLeNet [33]) with different model training parameter values. The current deep learning models [22], [52], [53] in medical image tasks are at least 2 ∼ 5 orders of magnitude smaller than even AlexNet [4]. More complex CNN models [22], [52] have only about 150K or 15K parameters. Roth et al. [22] adopt the CNN architecture tailored to the Cifar-10 dataset [5] and operate on image windows of 32×32×3 pixels for lymph node detection, while the simplest CNN in [54] has only one convolutional, one pooling, and one FC layer.
We use CifarNet [5] as used in [22] as a baseline for the LN detection. AlexNet [4] and GoogLeNet [33] are also modified, in order to evaluate these state-of-the-art CNN architectures from the ImageNet classification task [2] on our CADe problems and datasets. A simplified illustration of the three CNN architectures exploited is shown in Figure 5. CifarNet always takes 32 × 32 × 3 image patches as input while AlexNet and GoogLeNet are originally designed for the fixed image dimension of 256 × 256 × 3 pixels. We also reduce the filter size, stride and pooling parameters of AlexNet and GoogLeNet to accommodate a smaller input size of 64 × 64 × 3 pixels. We do so to produce and evaluate “simplified” AlexNet and GoogLeNet versions that are better suited to the smaller scale training datasets common in CADe problems. Throughout the paper, we refer to the models as CifarNet (32x32) or CifarNet (dropping 32x32); AlexNet (256x256) or AlexNet-H (high resolution); AlexNet (64x64) or AlexNet-L (low resolution); GoogLeNet (256x256) or GoogLeNet-H and GoogLeNet (64x64) or GoogLeNet-L (dropping 3 since all image inputs are three channels).
a) CifarNet: CifarNet, introduced in [5], was the state-of-the-art model for object recognition on the Cifar10 dataset, which consists of 32 × 32 images of 10 object classes. The objects are normally centered in the images. Some example images and class categories from the Cifar10 dataset are shown in Figure 7. CifarNet has three convolution layers,
1 This can be achieved by segmenting the lung using simple label-fusion methods [48]. In the first case, we overlay the target image slice with the average lung mask among the training folds. In the second, we perform simple morphology operations to obtain the lung boundary. In order to retain information from the inside of the lung, we apply Gaussian smoothing to the regions outside of the lung boundary.
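The CT windowing step described earlier on this page (mapping HU values through several attenuation windows and stacking the results as RGB channels) can be sketched as below. Only the low-attenuation range of -1400 to -950 HU appears in this section; the other two window ranges in the example are placeholder assumptions for illustration, and the function names are hypothetical.

```python
def apply_window(hu, low, high):
    """Clip one HU value to [low, high] and rescale it to 0..255."""
    clipped = max(low, min(high, hu))
    return round(255 * (clipped - low) / (high - low))

def encode_windows(slice_hu, windows):
    """Map a 2D slice of HU values to one channel per CT window.

    Returns a list of channels, each with the same rows and columns
    as `slice_hu`.
    """
    return [
        [[apply_window(value, low, high) for value in row] for row in slice_hu]
        for (low, high) in windows
    ]

# Low-attenuation range as stated in the text.
LOW_ATTENUATION_WINDOW = (-1400, -950)
# Placeholder ranges (assumptions, not taken from this section).
LUNG_WINDOW = (-1400, -200)
HIGH_ATTENUATION_WINDOW = (-160, 240)

example_slice = [[-1400, -1000, -950], [-500, 0, 300]]
rgb_channels = encode_windows(
    example_slice,
    [LUNG_WINDOW, HIGH_ATTENUATION_WINDOW, LOW_ATTENUATION_WINDOW],
)
```

Because each window is a fixed interval in HU space, this encoding needs no lung mask, which matches the remark that windowing does not depend on lung segmentation.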
Fig. 5. A simplified illustration of the CNN architectures used. GoogLeNet [33] contains two convolution layers, three pooling layers, and nine inception layers. Each inception layer of GoogLeNet consists of six convolution layers and one pooling layer.
detection.
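As a rough aid for reading the parameter counts quoted for these architectures (from thousands to hundreds of millions), the sketch below counts the weights and biases of convolutional and fully connected layers. The layer sizes in the example are hypothetical and do not reproduce the CifarNet, AlexNet or GoogLeNet configurations.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of a convolution layer with square kernels."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + out_channels

def fc_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

# Hypothetical toy network on a three-channel input (illustration only).
toy_layers = [
    conv_params(5, 3, 32),     # 5x5 kernels, 3 -> 32 channels
    conv_params(5, 32, 64),    # 5x5 kernels, 32 -> 64 channels
    fc_params(64 * 8 * 8, 2),  # flattened 8x8x64 feature map -> 2 classes
]
toy_total = sum(toy_layers)
```

In this toy example the total is dominated by the second convolution; in larger networks the first fully connected layer often dominates instead, since its size grows with the spatial extent of the feature map it flattens.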
B. ImageNet: Large Scale Annotated Natural Image Dataset
ImageNet [1] has more than 1.2 million 256 × 256 images categorized under 1000 object class categories. There are more than 1000 training images per class. The database is organized according to the WordNet [55] hierarchy, which currently contains only nouns in 1000 object categories. The image-object labels are obtained largely through crowd-sourcing, e.g., Amazon Mechanical Turk, and human inspection. Some examples of object categories in ImageNet are “sea snake”, “sandwich”, “vase”, “leopard”, etc. ImageNet is currently the largest image dataset among the standard datasets for visual recognition. Indeed, Caltech101 and Caltech256 are far smaller, and Cifar10 contains merely 60000 32 × 32 images in 10 object classes. Furthermore, due to the large number (1000+) of object classes, the objects belonging to each ImageNet class category can be occluded, partial and small, relative to those in the previous public image datasets. This significant intra-class variation poses greater challenges to any data-driven learning system that builds a classifier to fit given data and generalize to unseen data. For comparison, some example images of the Cifar10 dataset and ImageNet images in the “tennis ball” class category are shown in Figure 7. The ImageNet dataset is publicly available, and the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has become the standard benchmark for large-scale object recognition.
C. Training Protocols and Transfer Learning
When learned from scratch, all the parameters of CNN models are initialized with random Gaussian distributions and trained for 30 epochs with a mini-batch size of 50 image instances. Training convergence can be observed within 30 epochs. The other hyperparameters are momentum: 0.9; weight decay: 0.0005; (base) learning rate: 0.01, decreased by a factor of 10 every 10 epochs. We use the Caffe framework [56] and NVidia K40 GPUs to train the CNNs.
AlexNet and GoogLeNet CNN models can be either learned from scratch or fine-tuned from pre-trained models. Girshick et al. [6] find that, by applying ImageNet pre-trained AlexNet to the PASCAL dataset [8], performances of semantic 20-class object detection and segmentation tasks significantly improve over previous methods that use no deep CNNs. AlexNet can be fine-tuned on the PASCAL dataset to surpass the performance of the ImageNet pre-trained AlexNet, although the difference is not as significant as that between the CNN and non-CNN methods. Similarly, [57], [58] also demonstrate that better performing deep models are learned via CNN transfer learning from ImageNet to other datasets of limited scales.
Our hypothesis on CNN parameter transfer learning is the following: despite the disparity between natural images and medical images, CNNs comprehensively trained on the large scale well-annotated ImageNet may still be transferred to make medical image recognition tasks more effective. Collecting and annotating large numbers of medical images still poses significant challenges. On the other hand, the mainstream deep CNN architectures (e.g., AlexNet and GoogLeNet) contain tens of millions of free parameters to train, and thus require sufficiently large numbers of labeled medical images.
For transfer learning, we follow the approach of [57], [6] where all CNN layers except the last are fine-tuned at a learning rate 10 times smaller than the default learning rate. The last fully-connected layer is randomly initialized and freshly trained, in order to accommodate the new object categories in our CADe applications. Its learning rate is kept at the original 0.01. We denote the models with random initialization or transfer learning as AlexNet-RI and AlexNet-TL, and GoogLeNet-RI and GoogLeNet-TL. We found that the transfer learning strategy yields the best performance results. Determining the optimal learning rate for different layers is challenging, especially for very deep networks such as GoogLeNet.
We also perform experiments using “off-the-shelf” CNN features of AlexNet pre-trained on ImageNet and training only the final classifier layer to complete the new CADe classification tasks. Parameters in the convolutional and fully connected layers are fixed and are used as deep image feature extractors, as in [10], [9], [12]. We refer to this model as AlexNet-ImNet in the remainder of the paper. Note that [10], [9], [12] train support vector machines and random forest classifiers using ImageNet pre-trained CNN features. Our simplified implementation is intended to determine whether fine-tuning the “end-to-end” CNN network is necessary to improve performance, as opposed to merely training the final classification layer. This is a slight modification from the method described in [10], [9], [12].
Finally, transfer learning in CNN representation, as empirically verified in previous literature [59], [60], [61], [11], [62], can be effective in various cross-modality imaging settings (RGB images to depth images [59], [60], natural images to general CT and MRI images [11], and natural images to neuroimaging [61] or ultrasound [62] data). More thorough theoretical studies on cross-modality imaging statistics and transferability will be needed for future studies.
IV. EVALUATIONS AND DISCUSSIONS
In this section, we evaluate and compare the performances of nine CNN model configurations (CifarNet, AlexNet-ImNet, AlexNet-RI-H, AlexNet-TL-H, AlexNet-RI-L, GoogLeNet-RI-H, GoogLeNet-TL-H, GoogLeNet-RI-L and combined) on two important CADe problems using publicly available datasets [22], [41], [37].
A. Thoracoabdominal Lymph Node Detection
We train and evaluate CNNs using three-fold cross-validation (folds are split into disjoint sets of patients), with the different CNN architectures described above. In testing, each LN candidate has multiple random 2.5D views tested by CNN classifiers to generate LN class probability scores. We follow the random view aggregation by averaging probabilities, as in [22].
We first sample the LN image patches at a 64 × 64 pixel resolution. We then up-sample the 64 × 64 pixel LN images via bi-linear interpolation to 256 × 256 pixels, in order to
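The learning-rate settings of Section III-C (base rate 0.01, divided by 10 every 10 epochs over 30 epochs, with fine-tuned layers running at one tenth of the rate of the freshly initialized last layer) can be written out as a small sketch. The function names are hypothetical, and applying the same step decay during fine-tuning is an assumption of this sketch, not something the text states.

```python
def step_learning_rate(epoch, base_lr=0.01, drop_factor=10, step_epochs=10):
    """Learning rate at a 0-based epoch under a step decay schedule."""
    return base_lr / (drop_factor ** (epoch // step_epochs))

def layer_learning_rate(epoch, is_new_last_layer, fine_tune_scale=0.1):
    """Per-layer rate for transfer learning.

    The re-initialized last layer uses the scheduled rate; all other
    (pre-trained) layers use a rate `fine_tune_scale` times smaller.
    """
    rate = step_learning_rate(epoch)
    return rate if is_new_last_layer else rate * fine_tune_scale

# Rates at the start of each 10-epoch stage of a 30-epoch run.
schedule = [step_learning_rate(epoch) for epoch in (0, 10, 20)]
```

In Caffe, a comparable effect is usually obtained with a step learning-rate policy in the solver and per-layer learning-rate multipliers in the network definition; the exact configuration used by the authors is not given in this section.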
Fig. 8. FROC curves averaged on three-fold CV for the abdominal (left) and mediastinal (right) lymph nodes using different CNN models.
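To make the evaluation behind such FROC curves concrete, the sketch below first averages the per-view probabilities of each candidate, as in the random view aggregation described for LN detection, and then reads off the sensitivity at a given number of false positives per patient. It is a simplified illustration: the function names are hypothetical, each true-positive candidate is assumed to correspond to a distinct lymph node, and tied scores are not handled specially.

```python
def aggregate_views(view_probabilities):
    """Average the per-view LN probabilities of one candidate."""
    return sum(view_probabilities) / len(view_probabilities)

def sensitivity_at_fp_rate(candidates, n_true_lesions, n_patients, fp_per_patient):
    """Sensitivity at the operating point allowing `fp_per_patient` FPs.

    `candidates` is a list of (score, is_true_positive) pairs pooled over
    all patients. Candidates are accepted in order of decreasing score
    until the false-positive budget would be exceeded.
    """
    max_false_positives = fp_per_patient * n_patients
    true_positives = 0
    false_positives = 0
    sensitivity = 0.0
    for _, is_true_positive in sorted(candidates, key=lambda c: c[0], reverse=True):
        if is_true_positive:
            true_positives += 1
        else:
            false_positives += 1
        if false_positives > max_false_positives:
            break
        sensitivity = true_positives / n_true_lesions
    return sensitivity
```

For example, scoring every candidate with `aggregate_views` and calling `sensitivity_at_fp_rate(..., fp_per_patient=3)` gives the kind of "sensitivity at 3 FP per patient" figure quoted in the text.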
B. Interstitial Lung Disease Classification
The CNN models evaluated in this experiment are 1) AlexNet-RI (training from scratch on the ILD dataset with random initialization); 2) AlexNet-TL (with transfer learning from [4]); 3) AlexNet-ImNet: pre-trained ImageNet-CNN model [4] with only the last cost function layer retrained from random initialization, according to the six ILD classes (similar to [9] but without using additional hand-crafted non-deep feature descriptors, such as GIST and BoVW); 4) GoogLeNet-RI (random initialization); 5) GoogLeNet-TL (GoogLeNet
also computed. F1-scores are reported on patch classification only (32×32 pixel patches extracted from manual ROIs) [38], [39], [54], as shown in Table IV. Both [38] and [39] use the evaluation protocol of “leave-one-patient-out” (LOO), which is arguably much easier and not directly comparable to 10-fold CV [54] or our Patch-CV5. In this study, we classify six ILD classes by adding a consolidation (CD) class to the five classes of healthy (normal - NM), emphysema (EM), ground glass (GG), fibrosis (FB), and micronodules (MN) in [38], [39], [54]. Patch-CV10 [54] and Patch-CV5 report similar medium to high F-scores. This implies that the ILD dataset (although one of the mainstream public medical image datasets) may not adequately represent ILD disease CT lung imaging patterns, over a population of only 120 patients. Patch-CV5 yields higher F-scores than [54] and classifies the extra consolidation (CD) class. At present, the most pressing task is to drastically expand the dataset or to explore across-dataset deep learning on the combined ILD and LTRC datasets [64].
Recently, Gao et al. [40] have argued that a new CADe protocol on holistic classification of ILD diseases directly, using axial CT slice attenuation patterns and CNN, may be more realistic for clinical applications. We refer to this as slice-level classification, as image patch sampling from manual ROIs can be completely avoided (hence, no manual ROI inputs will be provided). The experimental results in [40] are conducted with a patient-level hard split of 100 (training) and 20 (testing). The method’s testing F-scores (i.e., Slice-Test) are given in Table IV. Note that the F-scores in [40] are not directly comparable to our results, due to different evaluation criteria. Only Slice-Test is evaluated and reported in [40], and we find that F-scores can change drastically from different rounds of the five-fold CV.
While it is a more practical CADe scheme, slice-level CNN learning [40] is very challenging, as it is restricted to only 905 CT image slices with tagged ILD labels. We only benchmark the slice-level ILD classification results in this section. Even with the help of data augmentation (described in Sec. II), the classification accuracy of GoogLeNet-TL from Table III is only 0.57. However, transfer learning from an ImageNet pre-trained model is consistently beneficial, as evidenced by AlexNet-TL (0.46) versus AlexNet-RI (0.44), and GoogLeNet-TL (0.57) versus GoogLeNet-RI (0.41). It especially prevents GoogLeNet from over-fitting on the limited CADe datasets. Finally, when the cross-validation is conducted by randomly splitting the set of all 905 CT axial slices into five folds, markedly higher F-scores are obtained (Slice-Random in Table IV). This further validates the claim that the dataset poorly generalizes ILDs for different patients. Figure 10 shows examples of misclassified ILD patches (in axial view), with their ground truth labels and inaccurately classified labels.
Ground    Prediction
truth     NM      EM      GG      FB      MN      CD
NM        0.68    0.18    0.10    0.01    0.03    0.01
EM        0.03    0.91    0.00    0.02    0.03    0.01
GG        0.06    0.01    0.70    0.09    0.06    0.08
FB        0.01    0.02    0.05    0.83    0.05    0.05
MN        0.09    0.00    0.07    0.04    0.79    0.00
CD        0.02    0.01    0.10    0.18    0.01    0.68
TABLE V
CONFUSION MATRIX FOR ILD CLASSIFICATION (PATCH-LEVEL) WITH FIVE-FOLD CV USING GOOGLENET-TL.
No existing work has reached the performance requirements for a realistic clinical setting [40], in which simple ROI-guided image patch extraction and classification (which requires manual ROI selection by clinicians) is implemented. The main goal of this paper is to investigate the three factors (CNN architectures, dataset characteristics and transfer learning) that affect performance on a specific medical image analysis problem and to ultimately deliver clinically relevant results. For ILD classification, the most critical performance bottlenecks are the challenge of cross-dataset learning and the limited patient population size. We attempt to overcome these obstacles by merging the ILD [37] and LTRC datasets. Although the ILD [37] and LTRC datasets [64] (used in [19]) were generated and annotated separately, they contain many common disease labels. For instance, the ILD disease classes emphysema (EM), ground glass (GG), fibrosis (FB), and micronodules (MN) belong to both datasets, and thus can be jointly trained/tested to form a larger and unified dataset.
Adapting fully convolutional CNN or FCNN to parse every pixel location in the ILD lung CT images or slices, or adapting other methods from CNN based semantic image segmentation using PASCAL or ImageNet, may improve accuracy and efficiency. However, current FCNN approaches [65], [66] lack adequate spatial resolution in their directly output label space. A segmentation label propagation method was recently proposed [47] to provide full pixel-wise labeling of the ILD data images. In this work, we sample image patches from the slice using the ROIs for the ILD provided in the dataset, in order to be consistent with previous methods in patch-level [38], [39], [54] and slice-level classification [40].
C. Evaluation of Five CNN Models using ILD Classification
In this work, we mainly focus on AlexNet and GoogLeNet. AlexNet is the first notably successful CNN architecture on the ImageNet challenge and has rekindled significant research interest in CNNs. GoogLeNet is the state-of-the-art deep model, which has outperformed other notable models, such as AlexNet, OverFeat, and VGGNet [67], [68] in various computer vision benchmarks. Likewise, a reasonable assumption is that OverFeat and VGGNet may generate quantitative performance results ranked between AlexNet’s and GoogLeNet’s. For completeness, we include OverFeat and VGGNet in the following evaluations, to bolster our hypothesis.
d) OverFeat: OverFeat is described in [67] as an integrated framework for using CNN for classification, localization and detection. Its architecture is similar to that of AlexNet, but contains far more parameters (e.g., 1024 convolution filters in both “conv4” and “conv5” layers compared to 384 and 256 convolution kernels in the “conv4” and “conv5” layers of AlexNet), and operates more densely (e.g., smaller kernel size
Fig. 10. Visual examples of misclassified ILD 64x64 patches (in axial view), with their ground truth labels and inaccurately classified labels.
of 2 in “pool2” layer “pool5” compared to the kernel size 3 in Method ILD-Slice Method ILD-Patch
“pool2” and “pool5” of AlexNet) on the input image. Overfeat CifarNet - CifarNet 0.799
AlexNet-TL 0.867 AlexNet-TL 0.865
is the winning model of the ILSVRC 2013 in detection and
Overfeat-TL 0.877 Overfeat-TL 0.879
classification tasks. VGG-16-TL 0.90 VGG-16-TL 0.893
e) VGGNet: The VGGNet architecture is introduced in GoogLeNet-TL 0.902 GoogLeNet-TL 0.911
[68], where it is designed to significantly increase the depth TABLE VI
of the existing CNN architectures with 16 or 19 layers. Very C LASSIFICATION RESULTS ON ILD AND LN DETECTION WITH LOO.
small 3×3 size convolutional filters are used in all convolution
layers with a convolutional stride of size 1, in order to reduce
the number of parameters in deeper networks. Since VGGNet
is substantially deeper than the other CNN models, VGGNet CifarNet AlexNet Overfeat VGG-16 GoogLeNet
Time 7m16s 1h2m 1h26m 20h24m 2h49m
is more susceptible to the vanishing gradient problem [69],
Memory 2.25 GB 3.45 GB 4.22 GB 9.26 GB 5.37 GB
[70], [71]. Hence, the network may be more difficult to
TABLE VII
train. Training the network requires far more memory and T RAINING TIME AND MEMORY REQUIREMENTS OF THE FIVE CNN
computation time than AlexNet. We use the 16 layer variant ARCHITECTURES ON ILD PATCH - BASED CLASSIFICATION UP TO 90
as our default VGGNet model in our study. EPOCHS .
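The parameter saving from small filters can be checked with simple arithmetic. The sketch below is illustrative and not taken from [68]: it assumes a hypothetical layer width of 256 channels, ignores biases, and compares three stacked 3×3 stride-1 convolutions with a single 7×7 convolution that covers the same 7×7 receptive field.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in one 2-D convolution layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def stacked_3x3_weights(num_layers, channels):
    """Weights in num_layers stacked 3x3 convolutions of constant width."""
    return num_layers * conv_weights(3, channels, channels)


def receptive_field(num_layers, kernel_size=3):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


# Assumed channel width, chosen only for illustration.
channels = 256

stacked = stacked_3x3_weights(3, channels)    # three 3x3 layers
single = conv_weights(7, channels, channels)  # one 7x7 layer

print("receptive field of three 3x3 layers:", receptive_field(3))
print("weights, three 3x3 layers:", stacked)
print("weights, one 7x7 layer:  ", single)
print("ratio stacked / single:  ", round(stacked / single, 3))
```

Under these assumptions the stacked design needs 27/49, or roughly 55%, of the weights of the single large filter while covering the same input region.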
D. Training with “Equal Prior” vs. “Biased Prior”
Medical datasets are often “biased”, in that the number of healthy samples is much larger than the number of diseased instances, or that the numbers of images per class are uneven. In the ILD dataset, the number of fibrosis samples is about 3.5 times greater than the number of emphysema samples. The number of non-LNs is 3 ∼ 4 times greater than the number of LNs in lymph node detection. Different sampling or resampling rates are routinely applied to both ILD and LN detection to balance the data sample number or scale per class, as in [22]. We refer to this as “Equal Prior”. If we use the same sampling rate, that will lead to a “Biased Prior” across different classes.
Without loss of generality, after GoogLeNet is trained on the training sets under “Equal” or “Biased” priors, we compare its classification results on the balanced validation sets. Evaluating a classifier on a biased validation set will cause unfair assessment of its performance. For instance, a classifier that predicts every image patch as “non-LN” will still achieve an accuracy of roughly 78% (3.5/4.5) on a biased set with 3.5 times as many non-LN samples as LN samples. The classification accuracy results of GoogLeNet trained under the two configurations are shown in Table VIII. Overall, it achieves lower accuracy results when trained with a “biased prior” in both tasks, and the accuracy difference for ILD patch-based classification is small.

               ILD-Slice   ILD-Patch
Equal Prior    0.902       0.953
Biased Prior   0.872       0.952
TABLE VIII
CLASSIFICATION ACCURACIES FOR ILD SLICE AND LN PATCH-LEVEL DETECTION WITH “EQUAL PRIOR” AND “BIASED PRIOR”, USING GOOGLENET-TL.

V. ANALYSIS VIA CNN LEARNING TRACES & VISUALIZATION
In this section, we determine and analyze, via CNN visualization, the reasons for which transfer learning is beneficial to achieve better performance on CAD applications.
Thoracoabdominal LN Detection. In Figure 12, the first layer convolution filters from five different CNN architectures are visualized. We notice that without transfer learning [57], [6], somewhat blurry filters are learned (AlexNet-RI (256x256), AlexNet-RI (64x64), GoogLeNet-RI (256x256) and GoogLeNet-RI (64x64)). However, in AlexNet-TL (256x256), many higher orders of contrast- or edge-preserving patterns (that enable capturing image appearance details) are evidently learned through fine-tuning from ImageNet. With a smaller input resolution, AlexNet-RI (64x64) and GoogLeNet-RI (64x64) can learn image contrast filters to some degree, whereas GoogLeNet-RI (256x256) and AlexNet-RI (256x256) have over-smooth low-level filters throughout.
ILD classification. We focus on analyzing visual CNN optimization traces and activations from the ILD dataset, as its slice-level setting is most similar to ImageNet’s. Indeed, both datasets use full-size images. The traces of the training loss, validation loss and validation accuracy of AlexNet-RI and AlexNet-TL are shown in Figure 11. For AlexNet-RI in Figure 11 (a), the training loss significantly decreases as the number of training epochs increases, while the validation loss notably increases and the validation accuracy does not improve much before reaching a plateau. With transfer learning and fine-tuning, much better and more consistent traces of training loss, validation loss and validation accuracy are obtained (see Figure 11 (b)). We begin the optimization problem (that of fine-tuning the ImageNet pre-trained CNN to classify a comprehensive set of images) by initializing the parameters close to an optimal solution. One could compare this process to making adults learn to classify ILDs, as opposed to babies. During the process, the validation loss remains at lower values throughout, and the final validation accuracy is higher than on the same problem with random initialization. Meanwhile, the training losses in both cases decrease to values near zero. This indicates that both AlexNet-RI and AlexNet-TL over-fit on the ILD dataset, due to its small instance size. The quantitative results in Table III indicate that AlexNet-TL and GoogLeNet-TL have consistently better classification accuracies than AlexNet-RI and GoogLeNet-RI, respectively.
The last pooling layer (pool-5) activation maps of the ImageNet pre-trained AlexNet [4] (analogous to AlexNet-ImNet) and AlexNet-TL, obtained by processing two input images of Figure 2 (b,c), are shown in Figure 13 (a,b). The last pooling layer activation map summarizes the entire input image by highlighting which relative locations or neural receptive fields relative to the image are activated. There are a total of 256 (6x6) receptive fields in AlexNet [4]. Pooling units where the relative image location of the disease region is present in the image are highlighted with green boxes. Next, we reconstruct the original ILD images using the process of de-convolution, back-propagating with convolution and un-pooling from the activation maps of the chosen pooling units [72]. From the reconstructed images (Figure 13 bottom), we observe that with fine-tuning, AlexNet-TL detects and localizes objects of interest (ILD disease regions depicted in Figure 2 (b) and (c)) better than AlexNet-ImNet. The filters shown in Figure 13 that better localize regions on the input images (Figure 2 (b) and (c), respectively) produce relatively higher activations (in the top 5%) among all 512 receptive field responses in the fine-tuned AlexNet-TL model. As observed in [73], the final CNN classification score cannot be driven solely by a single strong activation in the receptive fields, but often by a sparse set of high activations (i.e., varying selective or sparse activations per input image).
VI. FINDINGS AND FUTURE DIRECTIONS
We summarize our findings as follows.
• Deep CNN architectures with 8, even 22 layers [4], [33], can be useful even for CADe problems where the available training datasets are limited. Previously, CNN models used in medical image analysis applications have often been 2 ∼ 5 orders of magnitude smaller.
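As an illustration of the “Equal Prior” and “Biased Prior” settings of Section IV-D, the following sketch uses made-up class counts with a 3.5:1 ratio. The function names are hypothetical, and random under-sampling is only one of several possible resampling schemes, not necessarily the one used in this study. It computes the accuracy of a trivial majority-class predictor and shows how equalizing the class sizes brings that baseline down to 50%.

```python
import random


def majority_baseline_accuracy(class_counts):
    """Accuracy of a classifier that always predicts the largest class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


def resample_equal_prior(samples_by_class, seed=0):
    """Randomly under-sample every class to the size of the smallest class."""
    rng = random.Random(seed)
    target = min(len(items) for items in samples_by_class.values())
    return {label: rng.sample(items, target)
            for label, items in samples_by_class.items()}


# Hypothetical sample IDs: 350 non-LN versus 100 LN (a 3.5:1 ratio).
biased = {"non-LN": list(range(350)), "LN": list(range(100))}
biased_counts = {label: len(items) for label, items in biased.items()}

balanced = resample_equal_prior(biased)
balanced_counts = {label: len(items) for label, items in balanced.items()}

# Majority baseline: 350/450 (about 0.78) when biased, 0.5 when balanced.
print("biased prior baseline:", majority_baseline_accuracy(biased_counts))
print("equal prior baseline: ", majority_baseline_accuracy(balanced_counts))
```

The same reasoning motivates evaluating on balanced validation sets: on a biased set, a classifier that has learned nothing about the minority class can still score well above chance.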
Fig. 11. Traces of training and validation loss (blue and green lines) and validation accuracy (orange lines) during (a) training AlexNet from random
initialization and (b) fine-tuning from ImageNet pre-trained CNN, for ILD classification.
Fig. 12. Visualization of first layer convolution filters of CNNs trained on abdominal and mediastinal LNs in RGB color, from random initialization (AlexNet-RI
(256x256), AlexNet-RI (64x64), GoogLeNet-RI (256x256) and GoogLeNet-RI (64x64)) and with transfer learning (AlexNet-TL (256x256)).
• The trade-off between using better learning models and using more training data [51] should be carefully considered when searching for an optimal solution to any CADe problem (e.g., mediastinal and abdominal LN detection).
• Limited datasets can be a bottleneck to further advancement of CADe. Building progressively growing (in scale), well annotated datasets is at least as crucial as developing new algorithms. This has been accomplished, for instance, in the field of computer vision. The well-known scene recognition problem has made tremendous progress, thanks to the steady and continuous development of the Scene-15, MIT Indoor-67, SUN-397 and Places datasets [58].
• Transfer learning from the large scale annotated natural image datasets (ImageNet) to CADe problems has been consistently beneficial in our experiments. This sheds some light on cross-dataset CNN learning in the medical image domain, e.g., the union of the ILD [37] and LTRC datasets [64], as suggested in this paper.
• Finally, applications of off-the-shelf deep CNN image features to CADe problems can be improved by either exploring the performance-complementary properties of hand-crafted features [10], [9], [12], or by training CNNs from scratch and better fine-tuning CNNs on the target medical image dataset, as evaluated in this paper.
VII. CONCLUSION
In this paper, we exploit and extensively evaluate three important, previously under-studied factors on deep convolutional neural networks (CNN): architecture, dataset characteristics, and transfer learning. We evaluate CNN performance on two different computer-aided diagnosis applications: thoraco-abdominal lymph node detection and interstitial lung disease classification. The empirical evaluation, CNN model visualization, CNN performance analysis, and conclusive insights can be generalized to the design of high performance CAD systems for other medical imaging tasks.
ACKNOWLEDGMENT
This work was supported in part by the Intramural Research Program of the National Institutes of Health Clinical Center, and in part by a grant from the KRIBB Research Initiative Program (Korean Biomedical Scientist Fellowship Program), Korea Research Institute of Bioscience and Biotechnology, Republic of Korea. This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD (http://biowulf.nih.gov). We thank NVIDIA for the K40 GPU donation.
Fig. 13. Visualization of the last pooling layer (pool-5) activations (top). Pooling units where the relative image location of the disease region is located in
the image are highlighted with green boxes. The original images reconstructed from the units are shown in the bottom [72]. The examples in (a) and (b) are
computed from the input ILD images in Figure 2 (b) and (c), respectively.
inative learning and a spatial prior,” Medical image analysis, vol. 17, no. 2, pp. 254–270, 2013.
[25] M. Feuerstein, B. Glocker, T. Kitasaka, Y. Nakamura, S. Iwano, and K. Mori, “Mediastinal atlas creation from 3-d chest computed tomography images: application to automated detection and station mapping of lymph nodes,” Medical image analysis, vol. 16, no. 1, pp. 63–74, 2012.
[26] L. Lu, P. Devarakota, S. Vikal, D. Wu, Y. Zheng, and M. Wolf, “Computer aided diagnosis using multilevel image features on large-scale evaluation,” in Medical Computer Vision. Large Data in Medical Imaging. Springer, 2014, pp. 161–174.
[27] L. Lu, J. Bi, M. Wolf, and M. Salganicoff, “Effective 3d object detection and regression using probabilistic segmentation features in ct images,” in IEEE CVPR, 2011.
[28] L. Lu, A. Barbu, M. Wolf, J. Liang, M. Salganicoff, and D. Comaniciu, “Accurate polyp segmentation for 3d ct colonography using multi-staged probabilistic binary learning and compositional model,” in IEEE CVPR, 2008.
[29] N. Tajbakhsh, M. B. Gotway, and J. Liang, “Computer-aided pulmonary embolism detection using a novel vessel-aligned multi-planar image representation and convolutional neural networks,” in MICCAI, 2015.
[30] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
[31] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in IEEE CVPR, vol. 1, 2005, pp. 886–893.
[32] A. Torralba, R. Fergus, and Y. Weiss, “Small codes and large image databases for recognition,” in IEEE CVPR, 2008, pp. 1–8.
[33] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conf. on CVPR, 2015.
[34] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” in BMVC, 2014.
[35] K. Chatfield, V. S. Lempitsky, A. Vedaldi, and A. Zisserman, “The devil is in the details: an evaluation of recent feature encoding methods.” in BMVC, 2011.
[36] A. Seff, L. Lu, A. Barbu, H. Roth, H.-C. Shin, and R. M. Summers, “Leveraging mid-level semantic boundary cues for computer-aided lymph node detection,” in MICCAI, 2015.
[37] A. Depeursinge, A. Vargas, A. Platon, A. Geissbuhler, P.-A. Poletti, and H. Müller, “Building a reference multimedia database for interstitial lung diseases,” Computerized medical imaging and graphics, vol. 36, no. 3, pp. 227–238, 2012.
[38] Y. Song, W. Cai, Y. Zhou, and D. D. Feng, “Feature-based image patch approximation for lung tissue classification,” Medical Imaging, IEEE Trans. on, vol. 32, no. 4, pp. 797–808, 2013.
[39] Y. Song, W. Cai, H. Huang, Y. Zhou, D. Feng, Y. Wang, M. Fulham, and M. Chen, “Large margin local estimate with applications to medical image classification.” IEEE Trans. on Medical Imaging, 2015.
[40] M. Gao, U. Bagci, L. Lu, A. Wu, M. Buty, H.-C. Shin, H. Roth, Z. G. Papadakis, A. Depeursinge, M. R. Summers, Z. Xu, and J. D. Mollura, “Holistic classification of ct attenuation patterns for interstitial lung diseases via deep convolutional neural networks,” in MICCAI first Workshop on Deep Learning in Medical Image Analysis, 2015.
[41] A. Seff, L. Lu, K. M. Cherry, H. R. Roth, J. Liu, S. Wang, J. Hoffman, E. B. Turkbey, and R. M. Summers, “2d view aggregation for lymph node detection using a shallow hierarchy of linear classifiers,” in MICCAI, 2014, pp. 544–552.
[42] L. Lu, M. Liu, X. Ye, S. Yu, and H. Huang, “Coarse-to-fine classification via parametric and nonparametric models for computer-aided diagnosis,” in ACM Conf. on CIKM, 2011, pp. 2509–2512.
[43] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1915–1929, 2013.
[44] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich, “Feedforward semantic segmentation with zoom-out features,” arXiv preprint arXiv:1412.0774, 2014.
[45] H. Roth, L. Lu, A. Farag, H.-C. Shin, J. Liu, E. Turkbey, and R. M. Summers, “Deeporgan: Multi-level deep convolutional networks for automated pancreas segmentation,” in MICCAI, 2015.
[46] X. Liang, C. Xu, X. Shen, J. Yang, S. Liu, J. Tang, L. Lin, and S. Yan, “Human parsing with contextualized convolutional neural network,” in IEEE ICCV, 2015, pp. 1386–1394.
[47] M. Gao, L. Lu, I. Nogues, M. R. Summers, and D. Mollura, “Segmentation label propagation using deep convolutional neural networks and dense conditional random field,” in IEEE ISBI. IEEE, 2016.
[48] H. Wang, J. W. Suh, S. R. Das, J. B. Pluta, C. Craige, P. Yushkevich et al., “Multi-atlas segmentation with joint label fusion,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 3, pp. 611–623, 2013.
[49] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Is object localization for free?–weakly-supervised learning with convolutional neural networks,” in IEEE CVPR, 2015, pp. 685–694.
[50] M. Oquab, L. Bottou, I. Laptev, and S. Josef, “Learning and transferring mid-level image representations using convolutional neural networks,” in IEEE CVPR, 2015, pp. 1717–1724.
[51] X. Zhu, C. Vondrick, D. Ramanan, and C. Fowlkes, “Do we need more training data or better models for object detection?” in BMVC, 2012.
[52] D. Ciresan, A. Giusti, L. Gambardella, and J. Schmidhuber, “Mitosis detection in breast cancer histology images with deep neural networks,” in MICCAI, 2013.
[53] W. Zhang, R. Li, H. Deng, L. Wang, W. Lin, S. Ji, and D. Shen, “Deep convolutional neural networks for multi-modality isointense infant brain image segmentation,” NeuroImage, vol. 108, pp. 214–224, 2015.
[54] Q. Li, W. Cai, X. Wang, Y. Zhou, D. D. Feng, and M. Chen, “Medical image classification with convolutional neural network,” in IEEE ICARCV, 2014, pp. 844–848.
[55] G. A. Miller, “Wordnet: a lexical database for english,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[56] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding.” in ACM Multimedia, vol. 2, 2014, p. 4.
[57] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn features off-the-shelf: an astounding baseline for recognition,” in IEEE CVPRW, 2014, pp. 512–519.
[58] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in NIPS, 2014, pp. 487–495.
[59] S. Gupta, R. Girshick, P. Arbelez, and J. Malik, “Learning rich features from rgb-d images for object detection and segmentation,” in ECCV, 2014, pp. 345–360.
[60] S. Gupta, P. Arbelez, R. Girshick, and J. Malik, “Indoor scene understanding with rgb-d images: Bottom-up segmentation, object detection and semantic segmentation,” International Journal of Computer Vision, vol. 112, no. 2, pp. 133–149, 2015.
[61] A. Gupta, M. Ayhan, and A. Maida, “Natural image bases to represent neuroimaging data,” in ICML, 2013, pp. 987–994.
[62] H. Chen, Q. Dou, D. Ni, J. Cheng, J. Qin, S. Li, and P. Heng, “Automatic fetal ultrasound standard plane detection using knowledge transferred recurrent neural networks,” in MICCAI, 2015, pp. 507–514.
[63] L. Kim, H. Roth, L. Lu, S. Wang, E. Turkbey, and S. M. Ronald, “Performance assessment of retroperitoneal lymph node computer-assisted detection using random forest and deep convolutional neural network learning algorithms in tandem,” in the 102nd Annual Meeting of Radiological Society of North America, 2014.
[64] D. Holmes III, B. Bartholmai, R. Karwoski, V. Zavaletta, and R. Robb, “The lung tissue research consortium: an extensive open database containing histological, clinical, and radiological data to study chronic lung disease,” in 2006 MICCAI Open Science Workshop, 2006.
[65] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE CVPR, 2015.
[66] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” ICLR, 2015.
[67] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” in ICLR, 2014.
[68] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” ICLR, 2014.
[69] S. Hochreiter, “The vanishing gradient problem during learning recurrent neural nets and problem solutions,” Int. J. of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, no. 02, pp. 107–116, 1998.
[70] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[71] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” Neural Networks, IEEE Transactions on, vol. 5, no. 2, pp. 157–166, 1994.
[72] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in ECCV, 2014, pp. 818–833.
[73] P. Agrawal, R. Girshick, and J. Malik, “Analyzing the performance of multilayer neural networks for object recognition,” in ECCV, 2014.