Deep Convolutional Neural Networks For Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning

This article has been accepted for publication in a future issue of this journal, but has not been
fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMI.2016.2528162, IEEE
Transactions on Medical Imaging
Hoo-Chang Shin, Member, IEEE, Holger R. Roth, Mingchen Gao, Le Lu, Senior Member, IEEE, Ziyue Xu,
Isabella Nogues, Jianhua Yao, Daniel Mollura, Ronald M. Summers*
Abstract: Remarkable progress has been made in image recognition, primarily due to the availability of large-scale annotated datasets (e.g., ImageNet) and the revival of deep convolutional neural networks (CNN). CNNs enable learning data-driven, highly representative, layered hierarchical image features from sufficient training data. However, obtaining datasets as comprehensively annotated as ImageNet in the medical imaging domain remains a challenge. There are currently three major techniques that successfully apply CNNs to medical image classification: training the CNN from scratch, using off-the-shelf pre-trained CNN features, and conducting unsupervised CNN pre-training with supervised fine-tuning. Another effective method is transfer learning, i.e., fine-tuning CNN models (supervised) pre-trained on natural image datasets for medical image tasks (although domain transfer between two medical image datasets is also possible).
In this paper, we examine three important but previously understudied factors in applying deep convolutional neural networks to computer-aided detection problems. We first explore and evaluate different CNN architectures. The studied models contain 5 thousand to 160 million parameters, and vary in numbers of layers. We then evaluate the influence of dataset scale and spatial image context on performance. Finally, we examine when and why transfer learning from pre-trained ImageNet (via fine-tuning) can be useful. We study two specific computer-aided detection (CADe) problems, namely thoraco-abdominal lymph node (LN) detection and interstitial lung disease (ILD) classification. We achieve state-of-the-art performance on mediastinal LN detection, with 85% sensitivity at 3 false positives per patient, and report the first five-fold cross-validation classification results on predicting axial CT slices with ILD categories. Our extensive empirical evaluation, CNN model analysis and valuable insights can be extended to the design of high performance CAD systems for other medical imaging tasks.

Hoo-Chang Shin, Holger R. Roth, Le Lu, Isabella Nogues, Jianhua Yao and Ronald M. Summers are with the Imaging Biomarkers and Computer-Aided Diagnosis Laboratory; Mingchen Gao, Ziyue Xu and Daniel Mollura are with the Center for Infectious Disease Imaging; Le Lu, Jianhua Yao and Ronald M. Summers are also with the Clinical Image Processing Service, Radiology and Imaging Sciences Department, National Institutes of Health Clinical Center, Bethesda, MD 20892-1182, USA. Asterisk indicates corresponding author. Holger Roth and Mingchen Gao contributed equally to this work. e-mail: {hoochang.shin, le.lu, rms}@nih.gov. Copyright (c) 2010 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].
U.S. Government work not protected by U.S. copyright.

I. INTRODUCTION
Tremendous progress has been made in image recognition, primarily due to the availability of large-scale annotated datasets (e.g., ImageNet [1], [2]) and the recent revival of deep convolutional neural networks (CNN) [3], [4]. For data-driven learning, large-scale well-annotated datasets with representative data distribution characteristics are crucial to learning more accurate or generalizable models [5], [4]. Unlike previous image datasets used in computer vision, ImageNet [1] offers a very comprehensive database of more than 1.2 million categorized natural images of 1000+ classes. The CNN models trained upon this database serve as the backbone for significantly improving many object detection and image segmentation problems using other datasets [6], [7], e.g., PASCAL [8] and medical image categorization [9], [10], [11], [12]. However, there exists no large-scale annotated medical image dataset comparable to ImageNet, as data acquisition is difficult, and quality annotation is costly.
There are currently three major techniques that successfully apply CNNs to medical image classification: 1) training the “CNN from scratch” [13], [14], [15], [16], [17]; 2) using “off-the-shelf CNN” features (without retraining the CNN) as complementary information channels to existing hand-crafted image features, for chest X-rays [10] and CT lung nodule identification [9], [12]; and 3) performing unsupervised pre-training on natural or medical images and fine-tuning on medical target images using CNN or other types of deep learning models [18], [19], [20], [21]. A decompositional 2.5D view resampling and an aggregation of random view classification scores are used to eliminate the “curse-of-dimensionality” issue in [22], in order to acquire a sufficient number of training image samples.
Previous studies have analyzed three-dimensional patch creation for LN detection [23], [24], atlas creation from chest CT [25] and the extraction of multi-level image features [26], [27]. At present, there are several extensions or variations of the decompositional view representation introduced in [22], [28], such as: using a novel vessel-aligned multi-planar image representation for pulmonary embolism detection [29], fusing unregistered multiview for mammogram analysis [16] and classifying pulmonary peri-fissural nodules via an ensemble of 2D views [12].
Although natural images and medical images differ significantly, conventional image descriptors developed for object recognition in natural images, such as the scale-invariant feature transform (SIFT) [30] and the histogram of oriented gradients (HOG) [31], have been widely used for object detection and segmentation in medical image analysis. Recently, ImageNet pre-trained CNNs have been used for chest pathology identification and detection in X-ray and CT modalities

[10], [9], [12]. They have yielded the best performance results els consistently outperform CNNs that merely use off-the-shelf
by integrating low-level image features (e.g., GIST [32], bag of CNN features, in both the LN and ILD classification problems.
visual words (BoVW) and bag-of-frequency [12]). However, We further analyze, via CNN activation visualizations, when
the fine-tuning of an ImageNet pre-trained CNN model on and why transfer learning from non-medical to medical images
medical image datasets has not yet been exploited. in CADe problems can be valuable.
In this paper, we exploit three important, but previously
under-studied factors of employing deep convolutional neural
II. DATASETS AND RELATED WORK
networks to computer-aided detection problems. Particularly,
we explore and evaluate different CNN architectures varying in We employ CNNs (with the characteristics defined above)
width (ranging from 5 thousand to 160 million parameters) and to thoraco-abdominal lymph node (LN) detection (evaluated
depth (various numbers of layers), describe the effects of vary- separately on the mediastinal and abdominal regions) and
ing dataset scale and spatial image context on performance, interstitial lung disease (ILD) detection. For LN detection, we
and discuss when and why transfer learning from pre-trained use randomly sampled 2.5D views in CT [22]. We use 2D CT
ImageNet CNN models can be valuable. We further verify slices [38], [39], [40] for ILD detection. We then evaluate and
our hypothesis by inheriting and adapting rich hierarchical compare CNN performance results.
image features [5], [33] from the large-scale ImageNet dataset Until the detection aggregation approach [22], [41], thora-
for computer aided diagnosis (CAD). We also explore CNN coabdominal lymph node (LN) detection via CADe mecha-
architectures of the most studied seven-layered “AlexNet- nisms has yielded poor performance results. In [22], each 3D
CNN” [4], a shallower “Cifar-CNN” [22], and a much deeper LN candidate produces up to 100 random 2.5D orthogonally
version of “GoogLeNet-CNN” [33] (with our modifications sampled images or views which are then used to train an
on CNN structures). This study is partially motivated by effective CNN model. The best performance on abdominal
recent studies [34], [35] in computer vision. The thorough LN detection is achieved at 83% recall on 3FP per patient
quantitative analysis and evaluation on deep CNN [34] or [22], using a “Cifar-10” CNN. Using the thoracoabdominal
sparsity image coding methods [35] elucidate the emerging LN detection datasets [22], we aim to surpass this CADe
techniques of the time and provide useful suggestions for their performance level, by testing different CNN architectures,
future stages of development, respectively. exploring various dataset re-sampling protocols, and applying
Two specific computer-aided detection (CADe) problems, transfer learning from ImageNet pre-trained CNN models.
namely thoraco-abdominal lymph node (LN) detection and Interstitial lung disease (ILD) comprises more than 150 lung
interstitial lung disease (ILD) classification are studied in this diseases affecting the interstitium, which can severely impair
work. On mediastinal LN detection, we surpass all currently the patient’s ability to breathe. Gao et al. [40] investigate
reported results. We obtain 86% sensitivity on 3 false positives the ILD classification problem in two scenarios: 1) slice-
(FP) per patient, versus the prior state-of-the-art sensitivities of level classification: assigning a holistic two-dimensional axial
78% [36] (stacked shallow learning) and 70% [22] (CNN). CT slice image with its occurring ILD disease label(s); and
For the first time, ILD classification 2) patch-level classification: a/ sampling patches within the
results under the patient-level five-fold cross-validation proto- 2D ROIs (Regions of Interest provided by [37]), then b/
col (CV5) are investigated and reported. The ILD dataset [37] classifying patches into seven category labels (six disease
contains 905 annotated image slices from 120 patients and labels and one “healthy” label). Song et al. [38], [39] only
6 ILD labels. Such sparsely annotated datasets are generally address the second sub-task of patch-level classification under
difficult for CNN learning, due to the paucity of labeled the “leave-one-patient-out” (LOO) criterion. By training on the
instances. moderate-to-small scale ILD dataset [37], our main objective
Evaluation protocols and details are critical to deriving is to exploit and benchmark CNN based ILD classification
significant empirical findings [34]. Our experimental results performances under the CV5 metric (which is more realistic
suggest that different CNN architectures and dataset re- and unbiased than LOO [38], [39] and hard-split [40]), with
sampling protocols are critical for the LN detection tasks and without transfer learning.
where the amount of labeled training data is sufficient and Thoracoabdominal Lymph Node Datasets. We use the
spatial contexts are local. Since LN images are more flexible publicly available dataset from [22], [41]. There are 388
than ILD images with respect to resampling and reformatting, mediastinal LNs labeled by radiologists in 90 patient CT scans,
LN datasets may be more readily augmented by such image and 595 abdominal LNs in 86 patient CT scans. To facilitate
transformations. As a result, LN datasets contain more training comparison, we adopt the data preparation protocol of [22],
and testing data instances (due to data augmentation) than where positive and negative LN candidates are sampled with
ILD datasets. They nonetheless remain less comprehensive the fields-of-view (FOVs) of 30mm to 45mm, surrounding the
than natural image datasets, such as ImageNet. Fine-tuning annotated and detected LN centers (obtained by a candidate
ImageNet-trained models for ILD classification is clearly generation process). More precisely, [22], [41], [36] follow
advantageous and yields early promising results, when the a coarse-to-fine CADe scheme, partially inspired by [42],
amount of labeled training data is highly insufficient and multi- which operates with ∼ 100% detection recalls at the cost of
class categorization is used, as opposed to the LN dataset’s approximately 40 false or negative LN candidates per patient
binary class categorization. Another significant finding is that scan. In this work, positive and negative LN candidates are
CNNs trained from scratch or fine-tuned from ImageNet mod- first sampled up to 200 times with translations and rotations.

Fig. 1. Some examples of abdominal and mediastinal lymph nodes sampled on axial (ax), coronal (co), and sagittal (sa) views, with four different fields-of-view (30mm: orange; 45mm: red; 85mm: green; 128mm: blue) surrounding lymph nodes.
Fig. 2. Some examples of CT image slices with six lung tissue types in the ILD dataset [37]. Disease tissue types are located with dark orange arrows.
Afterwards, negative LN samples are randomly re-selected at a

lower rate close to the total number of positives. LN candidates which orientation can be constrained along the attached vessel
are randomly extracted from fields-of-view (FOVs) spanning axis, vessel-aligned multi-planar image representation (MPR)
35mm to 128mm in soft-tissue window [-100, 200HU]. This is more effective than randomly aligned MPR.
allows us to capture multiple spatial scales of image context Interstitial Lung Disease Dataset. We utilize the publicly
[43], [44]. The samples are then rescaled to a 64 × 64 pixel available dataset of [37]. It contains 905 image slices from 120
resolution via B-spline interpolation. A few examples of LNs patients, with six lung tissue types annotations containing at
with axial, coronal, and sagittal views encoded in RGB color least one of the following: healthy (NM), emphysema (EM),
images [22] are shown in Figure 1. ground glass (GG), fibrosis (FB), micronodules (MN) and
Unlike the heart or the liver, lymph nodes have no pre- consolidation (CD) (Figure 3). At the slice level, the objective
determined anatomic orientation. Hence, the purely random is to classify the status of “presence/absence” of any of the six
image resampling (with respect to scale, displacement and ILD classes for an input axial CT slice [40]. Characterizing
orientation) and reformatting (the axial, coronal, and sagittal an arbitrary CT slice against any possible ILD type, without
views are in any randomly resampled coordinate system) any manual ROI (in contrast to [38], [39]), can be useful for
is a natural choice, which also happens to yield high CNN large-scale patient screening. For slice-level ILD classification,
performance. Although we integrate three channels of informa- we sampled the slices 12 times with random translations and
tion from three orthogonal views for LN detection, the pixel- rotations. After this, we balanced the numbers of CT slice
wise spatial correlations between or among channels are not samples for the six classes by randomly sampling several
necessary. The convolutional kernels in the lower level CNN instances at various rates. For patch-based classification, we
architectures can learn the optimal weights to linearly combine sampled up to 100 patches of size 64×64 from each ROI. This
the observations from the axial, coronal, and sagittal channels dataset is divided into five folds with disjoint patient subsets.
by computing their dot-products. Transforming axial, coronal, The average number of CT slices (training instances) per fold
and sagittal representations to RGB also facilitates transfer is small, as shown in Table I. Slice-level ILD classification
learning from CNN models trained on ImageNet. is a very challenging task where CNN models need to learn
This learning representation (i.e., “built-in CNN”) is flexi- from very small numbers of training examples and predict ILD
ble, in that it naturally combines multiple sources or channels labels on unseen patients.
of information. In the recent literature [45], even heteroge- In the publicly available ILD dataset, very few CT slices
neous class-conditional probability maps can be combined are labeled as normal or healthy. The remaining CT slices
with raw images to improve performance. This set-up is cannot be simply classified as normal, because many ILD
similar to that of other works in computer vision, such disease regions or slices have not yet been labeled. ILD
as [46], where heterogeneous image information channels [37] is a partially labeled database; this is one of its main
are jointly fed into the CNN convolutional layers for high- limitations. Research is being conducted to address this issue.
accuracy human parsing and segmentation. Finally, if there In particular, [47] has proposed to fully label the ILD dataset
are correlations among CNN input channels, one may observe pixel-wise via proposed segmentation label propagation.
the corresponding correlated patterns in the learned filters. To leverage the CNN architectures designed for color im-
In summary, the assumption that there are or must be pixel- ages and to transfer CNN parameters pre-trained on ImageNet,
wise spatial correlations among input channels does not apply we transform all gray-scale axial CT slice images via three CT
to the CNN model representation. For other medical imaging window ranges: lung window range [-1400, -200HU], high-
problems, such as pulmonary embolism detection [29], in attenuation range [-160, 240HU], and low-attenuation range

Fig. 3. Some examples of 64 × 64 pixel CT image patches for (a) NM, (b) EM, (c) GG, (d) FB, (e) MN, (f) CD.

TABLE I
AVERAGE NUMBER OF IMAGES IN EACH FOLD FOR DISEASE CLASSES, WHEN DIVIDING THE DATASET IN 5-FOLD PATIENT SETS.

normal   emphysema   ground glass   fibrosis   micronodules   consolidation
30.2     20.2        85.4           96.8       63.2           39.2

Fig. 4. An example of lung/high-attenuation/low-attenuation CT windowing for an axial lung CT slice. We encode the lung/high-attenuation/low-attenuation CT windowing into red/green/blue channels.

III. METHODS
In this study, we explore, evaluate and analyze the influence of various CNN architectures, dataset characteristics (when we need more training data or better models for object detection [51]) and CNN transfer learning from non-medical to medical image domains. These three key elements of building effective deep CNN models for CADe problems are described below.

A. Convolutional Neural Network Architectures
We mainly explore three convolutional neural network architectures (CifarNet [5], [22], AlexNet [4] and GoogLeNet [33]) with different model training parameter values. The current deep learning models [22], [52], [53] in medical image
[-1400, -950HU]. We then encode the transformed images into tasks are at least 2 ∼ 5 orders of magnitude smaller than even
RGB channels (to be aligned with the input channels of CNN AlexNet [4]. More complex CNN models [22], [52] have only
models [4], [33] pre-trained from natural image datasets [1]). about 150K or 15K parameters. Roth et al. [22] adopt the CNN
The low-attenuation CT window is useful for visualizing cer- architecture tailored to the Cifar-10 dataset [5] and operate on
tain texture patterns of lung diseases (especially emphysema). image windows of 32×32×3 pixels for lymph node detection,
The usage of different CT attenuation channels improves while the simplest CNN in [54] has only one convolutional,
classification results over the usage of a single CT windowing pooling, and FC layer, respectively.
channel, as demonstrated in [40]. More importantly, these CT We use CifarNet [5] as used in [22] as a baseline for
windowing processes do not depend on lung segmentation; the LN detection. AlexNet [4] and GoogLeNet [33] are also
they are instead directly defined in the CT HU space. Figure 4 modified to evaluate these state-of-the-art CNN architectures
shows a representative example of lung, high-attenuation, and from the ImageNet classification task [2] on our CADe prob-
low-attenuation CT windowing for an axial lung CT slice. lems and datasets. A simplified illustration of three CNN
As observed in [40], lung segmentation is crucial to holistic architectures exploited is shown in Figure 5. CifarNet always
slice-level ILD classification. We empirically compare per- takes 32 × 32 × 3 image patches as input while AlexNet
formance in two scenarios with a rough lung segmentation.1 and GoogLeNet are originally designed for the fixed image
There is no significant difference between the two setups. Due dimension of 256 × 256 × 3 pixels. We also reduced the
to the high precision of CNN based image processing, highly filter size, stride and pooling parameters of AlexNet and
accurate lung segmentation is not necessary. The localization GoogLeNet to accommodate a smaller input size of 64 ×
of ILD regions within the lung is simultaneously learned 64 × 3 pixels. We do so to produce and evaluate “simplified”
through selectively weighted CNN receptive fields in the AlexNet and GoogLeNet versions that are better suited to the
deepest convolutional layers during the classification based smaller scale training datasets common in CADe problems.
CNN training [49], [50]. Some areas outside of the lung Throughout the paper, we refer to the models as CifarNet
appear in both healthy and diseased images. CNN training learns (32x32) or CifarNet (dropping 32x32); AlexNet (256x256) or
to ignore them by setting very small filter weights around AlexNet-H (high resolution); AlexNet (64x64) or AlexNet-L
the corresponding regions (Figure 13). This observation is (low resolution); GoogLeNet (256x256) or GoogLeNet-H and
validated by [40]. GoogLeNet (64x64) or GoogLeNet-L (dropping 3 since all
image inputs are three channels).
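As an illustration of the three-window CT encoding described in this section, the following minimal sketch maps Hounsfield-unit (HU) values to red/green/blue intensities. It is not the implementation used in this study: the function names are hypothetical, a slice is assumed to be a nested list of HU values, and the linear rescaling of each clipped window to the range 0 to 255 is an assumption, since only the window ranges and the channel assignment are specified in the text.

```python
# Window ranges in Hounsfield units (HU), taken from the text.
LUNG_WINDOW = (-1400.0, -200.0)      # encoded as the red channel
HIGH_ATTENUATION = (-160.0, 240.0)   # encoded as the green channel
LOW_ATTENUATION = (-1400.0, -950.0)  # encoded as the blue channel


def window_to_byte(hu, window):
    """Clip a HU value to `window`, then rescale linearly to 0..255.

    The linear mapping is an assumption made for this sketch.
    """
    low, high = window
    clipped = min(max(hu, low), high)
    return round(255 * (clipped - low) / (high - low))


def encode_slice_rgb(slice_hu):
    """Map a 2D nested list of HU values to a 2D nested list of (R, G, B) tuples."""
    return [
        [
            (
                window_to_byte(hu, LUNG_WINDOW),
                window_to_byte(hu, HIGH_ATTENUATION),
                window_to_byte(hu, LOW_ATTENUATION),
            )
            for hu in row
        ]
        for row in slice_hu
    ]


if __name__ == "__main__":
    toy_slice = [[-1400.0, -200.0], [0.0, 300.0]]  # made-up HU values
    for row in encode_slice_rgb(toy_slice):
        print(row)
```

Because each channel is computed directly from HU thresholds, the sketch needs no lung mask, which mirrors the point that the windowing does not depend on lung segmentation.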
a) CifarNet: CifarNet, introduced in [5], was the state-
1 This can be achieved by segmenting the lung using simple label-fusion of-the-art model for object recognition on the Cifar10 dataset,
methods [48]. In the first case, we overlay the target image slice with the which consists of 32 × 32 images of 10 object classes. The
average lung mask among the training folds. In the second, we perform objects are normally centered in the images. Some example
simple morphology operations to obtain the lung boundary. In order to retain
information from the inside of the lung, we apply Gaussian smoothing to the images and class categories from the Cifar10 dataset are
regions outside of the lung boundary. shown in Figure 7. CifarNet has three convolution layers,

three pooling layers, and one fully-connected layer. This CNN architecture, also used in [22], has about 0.15 million free parameters. We adopt it as a baseline model for the LN detection.

Fig. 5. A simplified illustration of the CNN architectures used. GoogLeNet [33] contains two convolution layers, three pooling layers, and nine inception layers. Each inception layer of GoogLeNet consists of six convolution layers and one pooling layer.
Fig. 6. Illustration of the inception3a layer of GoogLeNet. Inception layers of GoogLeNet consist of six convolution layers with different kernel sizes and one pooling layer.
Fig. 7. Some examples from the Cifar10 dataset and some images of the “tennis ball” class from the ImageNet dataset. Images of the Cifar10 dataset are small (32 × 32) images with the object of the image class category in the center. Images of the ImageNet dataset are larger (256 × 256), where the object of the image class category can be small, obscure, partial, and sometimes in a cluttered environment.

b) AlexNet: The AlexNet architecture, published in [4], achieved significantly improved performance over the other non-deep learning methods for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012. This success has revived the interest in CNNs [3] in computer vision. ImageNet consists of 1.2 million 256 × 256 images belonging to 1000 categories. At times, the objects in the image are small and obscure, and thus pose more challenges for learning a successful classification model. More details about the ImageNet dataset will be discussed in Sec. III-B. AlexNet has five convolution layers, three pooling layers, and two fully-connected layers with approximately 60 million free parameters. AlexNet is our default CNN architecture for evaluation and analysis in the remainder of the paper.
c) GoogLeNet: The GoogLeNet model proposed in [33] is significantly more complex and deep than all previous CNN architectures. More importantly, it also introduces a new module called “Inception”, which concatenates filters of different sizes and dimensions into a single new filter (refer to Figure 6). Overall, GoogLeNet has two convolution layers, two pooling layers, and nine “Inception” layers. Each “Inception” layer consists of six convolution layers and one pooling layer. An illustration of an “Inception” layer (inception3a) from GoogLeNet is shown in Figure 6. GoogLeNet is the current state-of-the-art CNN architecture for the ILSVRC challenge, where it achieved 5.5% top-5 classification error on the ImageNet challenge, compared to AlexNet’s 15.3% top-5 classification error.
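The free-parameter counts quoted for these architectures come from summing weights and biases over the convolutional and fully-connected layers (pooling layers contribute none). The sketch below shows that bookkeeping only. The helper names are hypothetical, and the example shapes (an 11 × 11 convolution with 96 filters on a 3-channel input, as in the first layer of the published AlexNet [4], plus a 4096-to-1000 fully-connected layer) are illustrative assumptions; they are not the modified layer configurations used in this paper and do not reproduce the totals quoted in the text.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels, bias=True):
    """Weights (and optional biases) of one 2D convolution layer."""
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def fc_params(in_features, out_features, bias=True):
    """Weights (and optional biases) of one fully-connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def total_params(layers):
    """Sum parameters over a list of ("conv" | "fc", shape_tuple) entries."""
    total = 0
    for kind, shape in layers:
        if kind == "conv":
            total += conv_params(*shape)
        elif kind == "fc":
            total += fc_params(*shape)
        else:
            raise ValueError(f"unknown layer kind: {kind}")
    return total


if __name__ == "__main__":
    # Illustrative two-layer example, not a full network definition.
    example = [("conv", (11, 11, 3, 96)), ("fc", (4096, 1000))]
    print(total_params(example))
```

The example makes visible why fully-connected layers tend to dominate the count: a single dense layer here holds far more parameters than the large-kernel convolution.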

B. ImageNet: Large Scale Annotated Natural Image Dataset tens of millions of free parameters to train, and thus require
sufficiently large numbers of labeled medical images.
ImageNet [1] has more than 1.2 million 256 × 256 images
For transfer learning, we follow the approach of [57],
categorized under 1000 object class categories. There are more
[6] where all CNN layers except the last are fine-tuned at
than 1000 training images per class. The database is organized
a learning rate 10 times smaller than the default learning
according to the WordNet [55] hierarchy, which currently
rate. The last fully-connected layer is randomly initialized and
contains only nouns in 1000 object categories. The image-
freshly trained, in order to accommodate the new object
object labels are obtained largely through crowd-sourcing,
categories in our CADe applications. Its learning rate is kept
e.g., Amazon Mechanical Turk, and human inspection. Some
at the original 0.01. We denote the models with random
examples of object categories in ImageNet are “sea snake”,
initialization or transfer learning as AlexNet-RI and AlexNet-
“sandwich”, “vase”, “leopard”, etc. ImageNet is currently the
TL, and GoogLeNet-RI and GoogLeNet-TL. We found that
largest image dataset among other standard datasets for visual
the transfer learning strategy yields the best performance
recognition. Indeed, Caltech101, Caltech256 and Cifar10 are
results. Determining the optimal learning rate for different
far smaller; Cifar10, for example, contains merely 60000 32 × 32 images in 10 object
layers is challenging, especially for very deep networks such
classes. Furthermore, due to the large number (1000+) of
as GoogLeNet.
object classes, the objects belonging to each ImageNet class
We also perform experiments using “off-the-shelf” CNN
category can be occluded, partial and small, relative to those in
features of AlexNet pre-trained on ImageNet and training only
the previous public image datasets. This significant intra-class
the final classifier layer to complete the new CADe classifica-
variation poses greater challenges to any data-driven learning
tion tasks. Parameters in the convolutional and fully connected
system that builds a classifier to fit given data and generalize
layers are fixed and are used as deep image feature extractors, as in
to unseen data. For comparison, some example images of
[10], [9], [12]. We refer to this model as AlexNet-ImNet in the
Cifar10 dataset and ImageNet images in the “tennis ball”
remainder of the paper. Note that [10], [9], [12] train support
class category are shown in Figure 7. The ImageNet dataset
vector machines and random forest classifiers using ImageNet
is publicly available, and the ImageNet Large Scale Visual
pre-trained CNN features. Our simplified implementation is
Recognition Challenge (ILSVRC) has become the standard
intended to determine whether fine-tuning the “end-to-end”
benchmark for large-scale object recognition.
CNN network is necessary to improve performance, as op-
posed to merely training the final classification layer. This is
C. Training Protocols and Transfer Learning a slight modification from the method described in [10], [9],
[12].
When learned from scratch, all the parameters of CNN Finally, transfer learning in CNN representation, as empiri-
models are initialized with random Gaussian distributions cally verified in previous literature [59], [60], [61], [11], [62],
and trained for 30 epochs with the mini-batch size of 50 can be effective in various cross-modality imaging settings
image instances. Training convergence can be observed within (RGB images to depth images [59], [60], natural images to
30 epochs. The other hyperparameters are momentum: 0.9; general CT and MRI images [11], and natural images to
weight decay: 0.0005; (base) learning rate: 0.01, decreased by neuroimaging [61] or ultrasound [62] data). More thorough
a factor of 10 at every 10 epochs. We use the Caffe framework theoretical studies on cross-modality imaging statistics and
[56] and NVidia K40 GPUs to train the CNNs. transferability will be needed for future studies.
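The step schedule for the base learning rate, together with the layer-wise rates used for fine-tuning in this section, can be summarized in a few lines. This is a sketch under stated assumptions rather than the Caffe configuration used in this study: the function names are hypothetical, epochs are assumed to be counted from zero, and applying the same step decay to the fine-tuned layers is an assumption.

```python
BASE_LR = 0.01         # (base) learning rate from the text
DECAY_FACTOR = 10      # decreased by a factor of 10 ...
STEP_EPOCHS = 10       # ... at every 10 epochs
TOTAL_EPOCHS = 30      # training length from the text
FINE_TUNE_SCALE = 0.1  # transferred layers: rate 10 times smaller


def step_lr(epoch, base_lr=BASE_LR):
    """Base learning rate at a zero-based epoch under the step schedule."""
    return base_lr / (DECAY_FACTOR ** (epoch // STEP_EPOCHS))


def layer_lr(epoch, is_last_layer, transfer_learning):
    """Learning rate for one layer.

    With transfer learning, every layer except the freshly initialized last
    fully-connected layer is updated at a rate 10 times smaller.
    """
    lr = step_lr(epoch)
    if transfer_learning and not is_last_layer:
        lr *= FINE_TUNE_SCALE
    return lr


if __name__ == "__main__":
    for epoch in range(0, TOTAL_EPOCHS, STEP_EPOCHS):
        print(epoch,
              layer_lr(epoch, is_last_layer=False, transfer_learning=True),
              layer_lr(epoch, is_last_layer=True, transfer_learning=True))
```

Keeping the last layer at the original rate lets the new classifier adapt quickly, while the smaller rate on transferred layers limits how far the pre-trained filters drift.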
AlexNet and GoogLeNet CNN models can be either learned
from scratch or fine-tuned from pre-trained models. Gir- IV. EVALUATIONS AND DISCUSSIONS
shick et al. [6] find that, by applying ImageNet pre-trained
In this section, we evaluate and compare the performances
AlexNet to the PASCAL dataset [8], performances of semantic
of nine CNN model configurations (CifarNet, AlexNet-ImNet,
20-class object detection and segmentation tasks significantly
AlexNet-RI-H, AlexNet-TL-H, AlexNet-RI-L, GoogLeNet-
improve over previous methods that use no deep CNNs.
RI-H, GoogLeNet-TL-H, GoogLeNet-RI-L and combined)
AlexNet can be fine-tuned on the PASCAL dataset to sur-
on two important CADe problems using publicly available
pass the performance of the ImageNet pre-trained AlexNet,
datasets [22], [41], [37].
although the difference is not as significant as that between
the CNN and non-CNN methods. Similarly, [57], [58] also
demonstrate that better performing deep models are learned A. Thoracoabdominal Lymph Node Detection
via CNN transfer learning from ImageNet to other datasets of We train and evaluate CNNs using three-fold cross-
limited scales. validation (folds are split into disjoint sets of patients), with the
Our hypothesis on CNN parameter transfer learning is the different CNN architectures described above. In testing, each
following: despite the disparity between natural images and LN candidate has multiple random 2.5D views tested by CNN
medical images, CNNs comprehensively trained on the large classifiers to generate LN class probability scores. We follow
scale well-annotated ImageNet may still be transferred to make the random view aggregation by averaging probabilities, as in
medical image recognition tasks more effective. Collecting [22].
and annotating large numbers of medical images still poses We first sample the LN image patches at a 64 × 64 pixel
significant challenges. On the other hand, the mainstream deep resolution. We then up-sample the 64 × 64 pixel LN images
CNN architectures (e.g., AlexNet and GoogLeNet) contain via bi-linear interpolation to 256 × 256 pixels, in order to

7
accommodate AlexNet-RI-H, AlexNet-TL-H, GoogLeNet-RI-H and GoogLeNet-TL-H. For the modified AlexNet-RI-L at (64 × 64) pixel resolution, we reduce the number of first layer convolution filters from 96 to 64 and reduce the stride from 4 to 2. For the modified GoogLeNet-RI (64 × 64), we decrease the number of first layer convolution filters from 64 to 32, the pad size from 3 to 2, the kernel size from 7 to 5, the stride from 2 to 1 and the stride of the subsequent pooling layer from 2 to 1. We slightly reduce the number of convolutional filters in order to accommodate the smaller input image sizes of target medical image datasets [22], [37], while preventing over-fitting. This eventually improves performance on patch-based classification. CifarNet is used in [22] to detect LN samples of 32 × 32 × 3 images. For consistency purposes, we down-sample 64 × 64 × 3 resolution LN sample images to the dimension of 32 × 32 × 3.

Results for lymph node detection in the mediastinum and abdomen are reported in Table II. FROC curves are illustrated in Figure 8. The area-under-the-FROC-curve (AUC) and true positive rate (TPR, recall or sensitivity) at three false positives per patient (TPR/3FP) are used as performance metrics. Of the nine investigated CNN models, CifarNet, AlexNet-ImNet and GoogLeNet-RI-H generally yielded the least competitive detection accuracy results. Our LN datasets are significantly more complex (i.e., display much larger within-class appearance variations), especially due to the extracted fields-of-view (FOVs) of 35mm-128mm, compared to 30mm-45mm in [22], where CifarNet is also employed. In this experiment, CifarNet is under-trained with respect to our enhanced LN datasets, due to its limited input resolution and parameter complexity. The inferior performance of AlexNet-ImNet implies that using the pre-trained ImageNet CNNs alone as “off-the-shelf” deep image feature extractors may not be optimal or adequate for mediastinal and abdominal LN detection tasks. To complement “off-the-shelf” CNN features, [10], [9], [12] all add and integrate various other hand-crafted image features as hybrid inputs for the final CADe classification.

Fig. 8. FROC curves averaged on three-fold CV for the abdominal (left) and mediastinal (right) lymph nodes using different CNN models.

TABLE II
COMPARISON OF MEDIASTINAL AND ABDOMINAL LN DETECTION RESULTS USING VARIOUS CNN MODELS. NUMBERS IN BOLD INDICATE THE BEST PERFORMANCE VALUES ON CLASSIFICATION ACCURACY.

Region             Mediastinum        Abdomen
Method             AUC    TPR/3FP     AUC    TPR/3FP
[41]               -      0.63        -      0.70
[22]               0.92   0.70        0.94   0.83
[36]               -      0.78        -      0.78
CifarNet           0.91   0.70        0.81   0.44
AlexNet-ImNet      0.89   0.63        0.80   0.41
AlexNet-RI-H       0.94   0.79        0.92   0.67
AlexNet-TL-H       0.94   0.81        0.92   0.69
GoogLeNet-RI-H     0.85   0.61        0.80   0.48
GoogLeNet-TL-H     0.94   0.81        0.92   0.70
AlexNet-RI-L       0.94   0.77        0.88   0.61
GoogLeNet-RI-L     0.95   0.85        0.91   0.69
Combined           0.95   0.85        0.93   0.70

GoogLeNet-RI-H performs poorly, as it is susceptible to over-fitting. Insufficient data samples are available to train GoogLeNet-RI-H with random initialization. Indeed, due to GoogLeNet-RI-H’s complexity and 22-layer depth, million-image datasets may be required to properly train this model. However, GoogLeNet-TL-H significantly improves upon GoogLeNet-RI-H (0.81 versus 0.61 TPR/3FP in mediastinum; 0.70 versus 0.48 TPR/3FP in abdomen). This indicates that transfer learning offers a much better initialization of CNN parameters than random initialization. Likewise, AlexNet-TL-H consistently outperforms AlexNet-RI-H, though by smaller margins (0.81 versus 0.79 TPR/3FP in mediastinum; 0.69 versus 0.67 TPR/3FP in abdomen). This is also consistent with the findings reported for ILD detection in Table III and Figure 11.

GoogLeNet-TL-H yields results similar to AlexNet-TL-H’s for the mediastinal LN detection, and slightly outperforms AlexNet-TL-H for abdominal LN detection. AlexNet-RI-H exhibits less severe over-fitting than GoogLeNet-RI-H. We also evaluate a simple ensemble by averaging the probability scores from five CNNs: AlexNet-RI-H, AlexNet-TL-H, AlexNet-RI-L, GoogLeNet-TL-H and GoogLeNet-RI-L. This combined ensemble yields classification accuracies matching or slightly exceeding those of the best performing individual CNN models on the mediastinal or abdominal LN detection tasks, respectively.
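The score-averaging ensemble described above can be sketched as follows. This is a minimal illustration in plain Python, assuming each CNN outputs one LN probability per candidate; the function name `average_ensemble` and the example scores are hypothetical and are not taken from our experiments.

```python
def average_ensemble(model_scores):
    """Average per-candidate probabilities across several models.

    model_scores: list of equal-length lists, one list per CNN, each
    holding the LN probability that model assigns to every candidate.
    """
    if not model_scores:
        raise ValueError("need at least one model")
    n_candidates = len(model_scores[0])
    if any(len(scores) != n_candidates for scores in model_scores):
        raise ValueError("all models must score the same candidates")
    n_models = len(model_scores)
    # zip(*...) groups the scores of one candidate across all models.
    return [sum(column) / n_models for column in zip(*model_scores)]


# Hypothetical scores from three models on four LN candidates.
scores = [
    [0.9, 0.2, 0.6, 0.1],
    [0.8, 0.4, 0.3, 0.2],
    [0.7, 0.3, 0.6, 0.0],
]
combined = average_ensemble(scores)
```

The same averaging applies to the random view aggregation of a single model, with one list per 2.5D view instead of one list per CNN.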
Many of our CNN models achieve notably better (FROC-AUC and TPR/3FP) results than the previous state-of-the-art models [36] for mediastinal LN detection: GoogLeNet-RI-L obtains an AUC=0.95 and 0.85 TPR/3FP, versus AUC=0.92 and 0.70 TPR/3FP [22] and 0.78 TPR/3FP [36], which uses stacked shallow learning. This difference lies in the fact that annotated lymph node segmentation masks are required to learn a mid-level semantic boundary detector [36], whereas CNN approaches only need LN locations for training [22]. In abdominal LN detection, [22] obtains the best trade-off between its CNN model complexity and sampled data configuration. Our best performing CNN model is GoogLeNet-TL (256x256), which obtains an AUC=0.92 and 0.70 TPR/3FP.

The main difference between our dataset preparation protocol and that from [22] is a more aggressive extraction of random views within a much larger range of FOVs. The usage of larger FOVs to capture more image spatial context is inspired by deep zoom-out features [44] that improve semantic segmentation. This image sampling scheme contributes to our best reported performance results in both mediastinal LN detection (in this paper) and automated pancreas segmentation [45]. As shown in Figure 1, abdominal LNs are surrounded by many other similar looking objects. Meanwhile, mediastinal LNs are more easily distinguishable, due to the images’ larger spatial contexts. Finally, from the perspective of the data-model trade-off: “Do We Need More Training Data or Better Models?” [51], more abdomen CT scans from distinct patient populations need to be acquired and annotated, in order to take full advantage of deep CNN models of high capacity. Nevertheless, deeper and wider CNN models (e.g., GoogLeNet-RI-L and GoogLeNet-TL-H versus CifarNet [22]) have shown improved results in the mediastinal LN detection.

Figure 9 provides examples of misclassified lymph nodes (in axial view), both false negatives (Left) and false positives (Right), from the Abdomen and Mediastinum datasets. The overall reported LN detection results are clinically significant, as indicated in [63].

Fig. 9. Examples of misclassified lymph nodes (in axial view) of both false negatives (Left) and false positives (Right). Mediastinal LN examples are shown in the upper row, and abdominal LN examples in the bottom row.

B. Interstitial Lung Disease Classification

The CNN models evaluated in this experiment are 1) AlexNet-RI (training from scratch on the ILD dataset with random initialization); 2) AlexNet-TL (with transfer learning from [4]); 3) AlexNet-ImNet: pre-trained ImageNet-CNN model [4] with only the last cost function layer retrained from random initialization, according to the six ILD classes (similar to [9] but without using additional hand-crafted non-deep feature descriptors, such as GIST and BoVW); 4) GoogLeNet-RI (random initialization); 5) GoogLeNet-TL (GoogLeNet with transfer learning from [33]). All ILD images (patches of 64 × 64 and CT axial slices of 512 × 512) are re-sampled to a fixed dimension of 256 × 256 pixels.

TABLE III
COMPARISON OF INTERSTITIAL LUNG DISEASE CLASSIFICATION ACCURACIES ON BOTH SLICE-LEVEL (SLICE-CV5) AND PATCH-BASED (PATCH-CV5) CLASSIFICATION USING FIVE-FOLD CV. BOLD NUMBERS INDICATE THE BEST PERFORMANCE VALUES ON CLASSIFICATION ACCURACY.

Method       AlexNet-ImNet   AlexNet-RI   AlexNet-TL   GoogLeNet-RI   GoogLeNet-TL   Avg-All
Slice-CV5    0.45            0.44         0.46         0.41           0.57           0.53
Patch-CV5    0.76            0.74         0.76         0.75           0.76           0.79

TABLE IV
COMPARISON OF INTERSTITIAL LUNG DISEASE CLASSIFICATION RESULTS USING F-SCORES: NM, EM, GG, FB, MN AND CD.

                   NM     EM     GG     FB     MN     CD
Patch-LOO [38]     0.84   0.75   0.78   0.84   0.86   -
Patch-LOO [39]     0.88   0.77   0.80   0.87   0.89   -
Patch-CV10 [54]    0.84   0.55   0.72   0.76   0.91   -
Patch-CV5          0.64   0.81   0.74   0.78   0.82   0.64
Slice-Test [40]    0.40   1.00   0.75   0.80   0.56   0.50
Slice-CV5          0.22   0.35   0.56   0.75   0.71   0.16
Slice-Random       0.90   0.86   0.85   0.94   0.98   0.83

TABLE V
CONFUSION MATRIX FOR ILD CLASSIFICATION (PATCH-LEVEL) WITH FIVE-FOLD CV USING GOOGLENET-TL.

Ground truth   Prediction
               NM     EM     GG     FB     MN     CD
NM             0.68   0.18   0.10   0.01   0.03   0.01
EM             0.03   0.91   0.00   0.02   0.03   0.01
GG             0.06   0.01   0.70   0.09   0.06   0.08
FB             0.01   0.02   0.05   0.83   0.05   0.05
MN             0.09   0.00   0.07   0.04   0.79   0.00
CD             0.02   0.01   0.10   0.18   0.01   0.68

We evaluate the ILD classification task with five-fold CV on patient-level split, as it is more informative for real clinical performance than LOO. The classification accuracy rates for interstitial lung disease detection are shown in Table III. Two sub-tasks on ILD patch and slice classifications are conducted. In general, patch-level ILD classification is less challenging than slice-level classification, as far more data samples can be sampled from the manually annotated ROIs (up to 100 image patches per ROI), available from [37]. From Table III, all five deep models evaluated obtain comparable results within the range of classification accuracy rates [0.74, 0.76]. Their averaged model achieves a slightly better accuracy of 0.79. F1-scores [38], [39], [54] and the confusion matrix (Table V) for patch-level ILD classification using GoogLeNet-TL under five-fold cross-validation (which we denote as Patch-CV5) are
also computed. F1-scores are reported on patch classification only (32×32 pixel patches extracted from manual ROIs) [38], [39], [54], as shown in Table IV. Both [38] and [39] use the evaluation protocol of “leave-one-patient-out” (LOO), which is arguably much easier and not directly comparable to 10-fold CV [54] or our Patch-CV5. In this study, we classify six ILD classes by adding a consolidation (CD) class to the five classes of healthy (normal - NM), emphysema (EM), ground glass (GG), fibrosis (FB), and micronodules (MN) in [38], [39], [54]. Patch-CV10 [54] and Patch-CV5 report similar medium to high F-scores. This implies that the ILD dataset (although one of the mainstream public medical image datasets) may not adequately represent ILD disease CT lung imaging patterns, over a population of only 120 patients. Patch-CV5 yields higher F-scores than [54] for EM, GG and FB (though lower ones for NM and MN), and additionally classifies the extra consolidation (CD) class. At present, the most pressing task is to drastically expand the dataset or to explore across-dataset deep learning on the combined ILD and LTRC datasets [64].

Recently, Gao et al. [40] have argued that a new CADe protocol on holistic classification of ILD diseases directly, using axial CT slice attenuation patterns and CNN, may be more realistic for clinical applications. We refer to this as slice-level classification, as image patch sampling from manual ROIs can be completely avoided (hence, no manual ROI inputs will be provided). The experimental results in [40] are conducted with a patient-level hard split of 100 (training) and 20 (testing). The method’s testing F-scores (i.e., Slice-Test) are given in Table IV. Note that the F-scores in [40] are not directly comparable to our results, due to different evaluation criteria. Only Slice-Test is evaluated and reported in [40], and we find that F-scores can change drastically from different rounds of the five-fold CV.

While it is a more practical CADe scheme, slice-level CNN learning [40] is very challenging, as it is restricted to only 905 CT image slices with tagged ILD labels. We only benchmark the slice-level ILD classification results in this section. Even with the help of data augmentation (described in Sec. II), the classification accuracy of GoogLeNet-TL from Table III is only 0.57. However, transfer learning from an ImageNet pre-trained model is consistently beneficial, as evidenced by AlexNet-TL (0.46) versus AlexNet-RI (0.44), and GoogLeNet-TL (0.57) versus GoogLeNet-RI (0.41). It especially prevents GoogLeNet from over-fitting on the limited CADe datasets. Finally, when the cross-validation is conducted by randomly splitting the set of all 905 CT axial slices into five folds, markedly higher F-scores are obtained (Slice-Random in Table IV). This further validates the claim that the dataset poorly generalizes ILDs for different patients. Figure 10 shows examples of misclassified ILD patches (in axial view), with their ground truth labels and inaccurately classified labels.

Fig. 10. Visual examples of misclassified ILD 64x64 patches (in axial view), with their ground truth labels and inaccurately classified labels.

No existing work has reached the performance requirements for a realistic clinical setting [40], in which simple ROI-guided image patch extraction and classification (which requires manual ROI selection by clinicians) is implemented. The main goal of this paper is to investigate the three factors (CNN architectures, dataset characteristics and transfer learning) that affect performance on a specific medical image analysis problem and to ultimately deliver clinically relevant results. For ILD classification, the most critical performance bottlenecks are the challenge of cross-dataset learning and the limited patient population size. We attempt to overcome these obstacles by merging the ILD [37] and LTRC datasets. Although the ILD [37] and LTRC datasets [64] (used in [19]) were generated and annotated separately, they contain many common disease labels. For instance, the ILD disease classes emphysema (EM), ground glass (GG), fibrosis (FB), and micronodules (MN) belong to both datasets, and thus can be jointly trained/tested to form a larger and unified dataset.

Adapting fully convolutional CNN or FCNN to parse every pixel location in the ILD lung CT images or slices, or adapting other methods from CNN based semantic image segmentation using PASCAL or ImageNet, may improve accuracy and efficiency. However, current FCNN approaches [65], [66] lack adequate spatial resolution in their directly output label space. A segmentation label propagation method was recently proposed [47] to provide full pixel-wise labeling of the ILD data images. In this work, we sample image patches from the slice using the ROIs for the ILD provided in the dataset, in order to be consistent with previous methods in patch-level [38], [39], [54] and slice-level classification [40].

C. Evaluation of Five CNN Models using ILD Classification

In this work, we mainly focus on AlexNet and GoogLeNet. AlexNet is the first notably successful CNN architecture on the ImageNet challenge and has rekindled significant research interest in CNNs. GoogLeNet is the state-of-the-art deep model, which has outperformed other notable models, such as AlexNet, OverFeat, and VGGNet [67], [68] in various computer vision benchmarks. Likewise, a reasonable assumption is that OverFeat and VGGNet may generate quantitative performance results ranked between AlexNet’s and GoogLeNet’s. For completeness, we include OverFeat and VGGNet in the following evaluations, to bolster our hypothesis.

d) Overfeat: OverFeat is described in [67] as an integrated framework for using CNN for classification, localization and detection. Its architecture is similar to that of AlexNet, but contains far more parameters (e.g., 1024 convolution filters in both “conv4” and “conv5” layers compared to 384 and 256 convolution kernels in the “conv4” and “conv5” layers of AlexNet), and operates more densely (e.g., smaller kernel size of 2 in the “pool2” and “pool5” layers compared to the kernel size 3 in “pool2” and “pool5” of AlexNet) on the input image. OverFeat is the winning model of the ILSVRC 2013 in detection and classification tasks.

e) VGGNet: The VGGNet architecture is introduced in [68], where it is designed to significantly increase the depth of the existing CNN architectures with 16 or 19 layers. Very small 3×3 size convolutional filters are used in all convolution layers with a convolutional stride of size 1, in order to reduce the number of parameters in deeper networks. Since VGGNet is substantially deeper than the other CNN models, VGGNet is more susceptible to the vanishing gradient problem [69], [70], [71]. Hence, the network may be more difficult to train. Training the network requires far more memory and computation time than AlexNet. We use the 16 layer variant as our default VGGNet model in our study.

The classification accuracy results for ILD slice and patch level classification of five CNN architectures (CifarNet, AlexNet, Overfeat, VGGNet and GoogLeNet) are shown in Table VI. Based on the analysis in Sec. IV-B, transfer learning is only used for the slice level classification task. From Table VI, quantitative classification accuracy rates increase as the CNN model becomes more complex (CifarNet, AlexNet, Overfeat, VGGNet and GoogLeNet, in ascending order), for both ILD slice and patch level classification problems. The reported results validate our assumption that OverFeat’s and VGGNet’s performance levels fall between AlexNet’s and GoogLeNet’s (this observation is consistent with the computer vision findings). CifarNet is designed for images with smaller dimensions (32 × 32 images), and thus is not suited to classification tasks involving 256 × 256 images.

To investigate the performance difference between five-fold cross-validation (CV) in Sec. IV-B and leave-one-patient-out (LOO) validation, this experiment is performed under the LOO protocol. By comparing results in Table III (CV-5) to those in Table VI (LOO), one can see that LOO’s quantitative performances are remarkably better than CV-5’s. For example, in ILD slice-level classification, the accuracy level drastically increases from 0.46 to 0.867 using AlexNet-TL, and from 0.57 to 0.902 for GoogLeNet-TL.

TABLE VI
CLASSIFICATION RESULTS ON ILD AND LN DETECTION WITH LOO.

Method          ILD-Slice   ILD-Patch
CifarNet        -           0.799
AlexNet-TL      0.867       0.865
Overfeat-TL     0.877       0.879
VGG-16-TL       0.90        0.893
GoogLeNet-TL    0.902       0.911

TABLE VII
TRAINING TIME AND MEMORY REQUIREMENTS OF THE FIVE CNN ARCHITECTURES ON ILD PATCH-BASED CLASSIFICATION UP TO 90 EPOCHS.

          CifarNet   AlexNet   Overfeat   VGG-16    GoogLeNet
Time      7m16s      1h2m      1h26m      20h24m    2h49m
Memory    2.25 GB    3.45 GB   4.22 GB    9.26 GB   5.37 GB

CNN training is implemented with the Caffe [56] deep learning framework, using an NVidia K40 GPU on Ubuntu 14.04 Linux OS. All models are trained for up to 90 epochs with early stopping criteria, where a model snapshot with low validation loss is taken for the final model. Other hyper-parameters are fixed as follows: momentum: 0.9; weight decay: 0.0005; and a step learning rate schedule with base learning rate of 0.01, decreased by a factor of 10 every 30 epochs. The image batch size is set to 128, except for GoogLeNet’s (64) and VGG-16’s (32), which are the maximum batch sizes that can fit in the NVidia K40 GPU with 12GB of memory capacity. Table VII illustrates the training time and memory requirements of the five CNN architectures on ILD patch-based classification up to 90 epochs.

D. Training with “Equal Prior” vs. “Biased Prior”

Medical datasets are often “biased”, in that the number of healthy samples is much larger than the number of diseased instances, or that the numbers of images per class are uneven. In the ILD dataset, the number of fibrosis samples is about 3.5 times greater than the number of emphysema samples. The number of non-LNs is 3 ∼ 4 times greater than the number of LNs in lymph node detection. Different sampling or resampling rates are routinely applied to both ILD and LN detection to balance the data sample number or scale per class, as in [22]. We refer to this as “Equal Prior”. If we use the same sampling rate, that will lead to a “Biased Prior” across different classes.

Without loss of generality, after GoogLeNet is trained on the training sets under “Equal” or “Biased” priors, we compare its classification results on the balanced validation sets. Evaluating a classifier on a biased validation set will cause unfair assessment of its performance. For instance, a classifier that predicts every image patch as “non-LN” will still achieve an accuracy rate of about 78% (3.5/4.5) on a biased set with 3.5 times as many non-LN samples as LN samples. The classification accuracy results of GoogLeNet trained under the two configurations are shown in Table VIII. Overall, it achieves lower accuracy results when trained with a “biased prior” in both tasks, although the accuracy difference for ILD patch-based classification is small (0.953 versus 0.952).

TABLE VIII
CLASSIFICATION ACCURACIES FOR ILD SLICE AND LN PATCH-LEVEL DETECTION WITH “EQUAL PRIOR” AND “BIASED PRIOR”, USING GOOGLENET-TL.

                ILD-Slice   ILD-Patch
Equal Prior     0.902       0.953
Biased Prior    0.872       0.952

V. ANALYSIS VIA CNN LEARNING TRACES & VISUALIZATION

In this section, we determine and analyze, via CNN visualization, the reasons for which transfer learning is beneficial to achieve better performance on CAD applications.

Thoracoabdominal LN Detection. In Figure 12, the first layer convolution filters from five different CNN architectures are visualized. We notice that without transfer learning [57], [6], somewhat blurry filters are learned (AlexNet-RI (256x256), AlexNet-RI (64x64), GoogLeNet-RI (256x256) and GoogLeNet-RI (64x64)). However, in AlexNet-TL (256x256), many higher orders of contrast- or edge-preserving patterns (that enable capturing image appearance details) are evidently learned through fine-tuning from ImageNet. With a smaller input resolution, AlexNet-RI (64x64) and GoogLeNet-RI (64x64) can learn image contrast filters to some degree, whereas GoogLeNet-RI (256x256) and AlexNet-RI (256x256) have over-smooth low-level filters throughout.

ILD classification. We focus on analyzing visual CNN optimization traces and activations from the ILD dataset, as its slice-level setting is most similar to ImageNet’s. Indeed, both datasets use full-size images. The traces of the training loss, validation loss and validation accuracy of AlexNet-RI and AlexNet-TL are shown in Figure 11. For AlexNet-RI in Figure 11 (a), the training loss significantly decreases as the number of training epochs increases, while the validation loss notably increases and the validation accuracy does not improve much before reaching a plateau. With transfer learning and fine-tuning, much better and more consistent traces of training loss, validation loss and validation accuracy are obtained (see Figure 11 (b)). We begin the optimization problem, that of fine-tuning the ImageNet pre-trained CNN to classify a comprehensive set of images, by initializing the parameters close to an optimal solution. One could compare this process to making adults learn to classify ILDs, as opposed to babies. During the process, the validation loss remains at lower values throughout, and the model achieves higher final accuracy levels than on a similar problem with random initialization. Meanwhile, the training losses in both cases decrease to values near zero. This indicates that both AlexNet-RI and AlexNet-TL over-fit on the ILD dataset, due to its small instance size. The quantitative results in Table III indicate that AlexNet-TL and GoogLeNet-TL have consistently better classification accuracies than AlexNet-RI and GoogLeNet-RI, respectively.

The last pooling layer (pool-5) activation maps of the ImageNet pre-trained AlexNet [4] (analogous to AlexNet-ImNet) and AlexNet-TL, obtained by processing two input images of Figure 2 (b,c), are shown in Figure 13 (a,b). The last pooling layer activation map summarizes the entire input image by highlighting which relative locations or neural reception fields relative to the image are activated. There are a total of 256 (6x6) reception fields in AlexNet [4]. Pooling units where the relative image location of the disease region is present in the image are highlighted with green boxes. Next, we reconstruct the original ILD images using the process of de-convolution, back-propagating with convolution and un-pooling from the activation maps of the chosen pooling units [72]. From the reconstructed images (Figure 13 bottom), we observe that with fine-tuning, AlexNet-TL detects and localizes objects of interest (ILD disease regions depicted in Figure 2 (b) and (c)) better than AlexNet-ImNet. The filters shown in Figure 13 that better localize regions on the input images (Figure 2 (b) and (c), respectively) produce relatively higher activations (in the top 5%) among all 512 reception field responses in the fine-tuned AlexNet-TL model. As observed in [73], the final CNN classification score cannot be driven solely by a single strong activation in the reception fields, but is often driven by a sparse set of high activations (i.e., varying selective or sparse activations per input image).

Fig. 11. Traces of training and validation loss (blue and green lines) and validation accuracy (orange lines) during (a) training AlexNet from random initialization and (b) fine-tuning from ImageNet pre-trained CNN, for ILD classification.

Fig. 12. Visualization of first layer convolution filters of CNNs trained on abdominal and mediastinal LNs in RGB color, from random initialization (AlexNet-RI (256x256), AlexNet-RI (64x64), GoogLeNet-RI (256x256) and GoogLeNet-RI (64x64)) and with transfer learning (AlexNet-TL (256x256)).

VI. FINDINGS AND FUTURE DIRECTIONS

We summarize our findings as follows.
• Deep CNN architectures with 8, even 22 layers [4], [33], can be useful even for CADe problems where the available training datasets are limited. Previously, CNN models used in medical image analysis applications have often been 2 ∼ 5 orders of magnitude smaller.
• The trade-off between using better learning models and using more training data [51] should be carefully considered when searching for an optimal solution to any CADe problem (e.g., mediastinal and abdominal LN detection).
• Limited datasets can be a bottleneck to further advancement of CADe. Building progressively growing (in scale), well annotated datasets is at least as crucial as developing new algorithms. This has been accomplished, for instance, in the field of computer vision. The well-known scene recognition problem has made tremendous progress, thanks to the steady and continuous development of Scene-15, MIT Indoor-67, SUN-397 and Place datasets [58].
• Transfer learning from the large scale annotated natural image datasets (ImageNet) to CADe problems has been consistently beneficial in our experiments. This sheds some light on cross-dataset CNN learning in the medical image domain, e.g., the union of the ILD [37] and LTRC datasets [64], as suggested in this paper.
• Finally, applications of off-the-shelf deep CNN image features to CADe problems can be improved by either exploring the performance-complementary properties of hand-crafted features [10], [9], [12], or by training CNNs from scratch and better fine-tuning CNNs on the target medical image dataset, as evaluated in this paper.

VII. CONCLUSION

In this paper, we exploit and extensively evaluate three important, previously under-studied factors in employing deep convolutional neural networks (CNN): CNN architecture, dataset characteristics, and transfer learning. We evaluate CNN performance on two different computer-aided diagnosis applications: thoraco-abdominal lymph node detection and interstitial lung disease classification. The empirical evaluation, CNN model visualization, CNN performance analysis, and conclusive insights can be generalized to the design of high performance CAD systems for other medical imaging tasks.

ACKNOWLEDGMENT

This work was supported in part by the Intramural Research Program of the National Institutes of Health Clinical Center, and in part by a grant from the KRIBB Research Initiative Program (Korean Biomedical Scientist Fellowship Program), Korea Research Institute of Bioscience and Biotechnology, Republic of Korea. This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD (http://biowulf.nih.gov). We thank NVIDIA for the K40 GPU donation.

Fig. 13. Visualization of the last pooling layer (pool-5) activations (top). Pooling units where the relative image location of the disease region is located in
the image are highlighted with green boxes. The original images reconstructed from the units are shown in the bottom [72]. The examples in (a) and (b) are
computed from the input ILD images in Figure 2 (b) and (c), respectively.
REFERENCES

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE CVPR, 2009.
[2] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei, “Imagenet large scale visual recognition challenge,” arXiv:1409.0575, 2014.
[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
[5] A. Krizhevsky, “Learning multiple layers of features from tiny images,” in Master’s Thesis. Dept. of Comp. Science, University of Toronto, 2009.
[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based convolutional networks for accurate object detection and semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., 2015.
[7] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell., 2015.
[8] M. Everingham, S. M. A. Eslami, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: A retrospective,” International journal of computer vision, vol. 111, no. 1, pp. 98–136, 2015.
[9] B. van Ginneken, A. Setio, C. Jacobs, and F. Ciompi, “Off-the-shelf convolutional neural network features for pulmonary nodule detection in computed tomography scans,” in IEEE ISBI, 2015, pp. 286–289.
[10] Y. Bar, I. Diamant, H. Greenspan, and L. Wolf, “Chest pathology detection using deep learning with non-medical training,” in IEEE ISBI, 2015.
[11] H. Shin, L. Lu, L. Kim, A. Seff, J. Yao, and R. Summers, “Interleaved text/image deep mining on a large-scale radiology image database,” in IEEE Conf. on CVPR, 2015, pp. 1–10.
[12] F. Ciompi, B. de Hoop, S. J. van Riel, K. Chung, E. Scholten, M. Oudkerk, P. de Jong, M. Prokop, and B. van Ginneken, “Automatic classification of pulmonary peri-fissural nodules in computed tomography using an ensemble of 2d views and a convolutional neural network out-of-the-box,” Medical Image Analysis, 2015.
[13] B. Menze, M. Reyes, and K. Van Leemput, “The multimodal brain tumor image segmentation benchmark (brats),” Medical Imaging, IEEE Trans. on, vol. 34, no. 10, pp. 1993–2024, 2015.
[14] Y. Pan, W. Huang, Z. Lin, W. Zhu, J. Zhou, J. Wong, and Z. Ding, “Brain tumor grading based on neural networks and convolutional neural networks,” in IEEE EMBC, 2015, pp. 699–702.
[15] W. Shen, M. Zhou, F. Yang, C. Yang, and J. Tian, “Multi-scale convolutional neural networks for lung nodule classification,” in IPMI, 2015, pp. 588–599.
[16] G. Carneiro, J. Nascimento, and A. P. Bradley, “Unregistered multiview mammogram analysis with pre-trained deep learning models,” in MICCAI, 2015, pp. 652–660.
[17] J. M. Wolterink, T. Leiner, M. A. Viergever, and I. Išgum, “Automatic coronary calcium scoring in cardiac ct angiography using convolutional neural networks,” in MICCAI, 2015, pp. 589–596.
[18] T. Schlegl, J. Ofner, and G. Langs, “Unsupervised pre-training across image domains improves lung tissue classification,” in Medical Computer Vision: Algorithms for Big Data. Springer, 2014, pp. 82–93.
[19] J. Hofmanninger and G. Langs, “Mapping visual features to semantic profiles for retrieval in medical imaging,” in IEEE Conf. on CVPR, 2015.
[20] G. Carneiro and J. Nascimento, “Combining multiple dynamic models and deep learning architectures for tracking the left ventricle endocardium in ultrasound data,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11, pp. 2592–2607, 2013.
[21] R. Li, W. Zhang, H. Suk, L. Wang, J. Li, D. Shen, and S. Ji, “Deep learning based imaging data completion for improved brain disease diagnosis,” in MICCAI, 2014.
[22] H. Roth, L. Lu, J. Liu, J. Yao, A. Seff, K. M. Cherry, E. Turkbey, and R. M. Summers, “Improving computer-aided detection using convolutional neural networks and random view aggregation,” in IEEE Trans. on Medical Imaging, 2016.
[23] A. Barbu, M. Suehling, X. Xu, D. Liu, S. K. Zhou, and D. Comaniciu, “Automatic detection and segmentation of lymph nodes from ct data,” Medical Imaging, IEEE Trans. on, vol. 31, no. 2, pp. 240–250, 2012.
[24] J. Feulner, S. K. Zhou, M. Hammon, J. Hornegger, and D. Comaniciu, “Lymph node detection and segmentation in chest ct data using discriminative learning and a spatial prior,” Medical image analysis, vol. 17, no. 2, pp. 254–270, 2013.
[25] M. Feuerstein, B. Glocker, T. Kitasaka, Y. Nakamura, S. Iwano, and K. Mori, “Mediastinal atlas creation from 3-d chest computed tomography images: application to automated detection and station mapping of lymph nodes,” Medical image analysis, vol. 16, no. 1, pp. 63–74, 2012.
[48] H. Wang, J. W. Suh, S. R. Das, J. B. Pluta, C. Craige, P. Yushkevich et al., “Multi-atlas segmentation with joint label fusion,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 3, pp. 611–623, 2013.
[49] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Is object localization for free?–weakly-supervised learning with convolutional neural networks,” in IEEE CVPR, 2015, pp. 685–694.
[26] L. Lu, P. Devarakota, S. Vikal, D. Wu, Y. Zheng, and M. Wolf, [50] M. Oquab, L. Bottou, I. Laptev, and S. Josef, “Learning and transferring
“Computer aided diagnosis using multilevel image features on large- mid-level image representations using convolutional neural networks,”
scale evaluation,” in Medical Computer Vision. Large Data in Medical in IEEE CVPR, 2015, pp. 1717–1724.
Imaging. Springer, 2014, pp. 161–174. [51] X. Zhu, C. Vondrick, D. Ramanan, and C. Fowlkes, “Do we need more
[27] L. Lu, J. Bi, M. Wolf, and M. Salganicoff, “Effective 3d object detection training data or better models for object detection?” in BMVC, 2012.
and regression using probabilistic segmentation features in ct images,” [52] D. Ciresan, A. Giusti, L. Gambardella, and J. Schmidhuber, “Mitosis
in IEEE CVPR, 2011. detection in breast cancer histology images with deep neural networks,”
[28] L. Lu, A. Barbu, M. Wolf, J. Liang, M. Salganicoff, and D. Comaniciu, in MICCAI, 2013.
“Accurate polyp segmentation for 3d ct colonography using multi-staged [53] W. Zhang, R. Li, H. Deng, L. Wang, W. Lin, S. Ji, and D. Shen, “Deep
probabilistic binary learning and compositional model,” in IEEE CVPR, convolutional neural networks for multi-modality isointense infant brain
2008. image segmentation,” NeuroImage, vol. 108, pp. 214–224, 2015.
[29] N. Tajbakhsh, M. B. Gotway, and J. Liang, “Computer-aided pulmonary [54] Q. Li, W. Cai, X. Wang, Y. Zhou, D. D. Feng, and M. Chen, “Med-
embolism detection using a novel vessel-aligned multi-planar image ical image classification with convolutional neural network,” in IEEE
representation and convolutional neural networks,” in MICCAI, 2015. ICARCV, 2014, pp. 844–848.
[30] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” [55] G. A. Miller, “Wordnet: a lexical database for english,” Communications
International journal of computer vision, vol. 60, no. 2, pp. 91–110, of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
2004. [56] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick,
[31] N. Dalal and B. Triggs, “Histograms of oriented gradients for human S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for
detection,” in IEEE CVPR, vol. 1, 2005, pp. 886–893. fast feature embedding.” in ACM Multimedia, vol. 2, 2014, p. 4.
[32] A. Torralba, R. Fergus, and Y. Weiss, “Small codes and large image [57] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn features
databases for recognition,” in IEEE CVPR, 2008, pp. 1–8. off-the-shelf: an astounding baseline for recognition,” in IEEE CVPRW,
[33] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, 2014, pp. 512–519.
and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conf. [58] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning
on CVPR, 2015. deep features for scene recognition using places database,” in NIPS,
[34] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of 2014, pp. 487–495.
the devil in the details: Delving deep into convolutional nets,” in BMVC, [59] S. Gupta, R. Girshick, P. Arbelez, and J. Malik, “Learning rich features
2014. from rgb-d images for object detection and segmentation,” in ECCV,
2014, pp. 345–360.
[35] K. Chatfield, V. S. Lempitsky, A. Vedaldi, and A. Zisserman, “The devil
[60] S. Gupta, P. Arbelez, R. Girshick, and J. Malik, “Indoor scene under-
is in the details: an evaluation of recent feature encoding methods.” in
standing with rgb-d images: Bottom-up segmentation, object detection
BMVC, 2011.
and semantic segmentation,” International Journal of Computer Vision,
[36] A. Seff, L. Lu, A. Barbu, H. Roth, H.-C. Shin, and R. M. Sum-
vol. 112, no. 2, pp. 133–149, 2015.
mers, “Leveraging mid-level semantic boundary cues for computer-aided
[61] A. Gupta, M. Ayhan, and A. Maida, “Natural image bases to represent
lymph node detection,” in MICCAI, 2015.
neuroimaging data,” in ICML, 2013, pp. 987–994.
[37] A. Depeursinge, A. Vargas, A. Platon, A. Geissbuhler, P.-A. Poletti, and [62] H. Chen, Q. Dou, D. Ni, J. Cheng, J. Qin, S. Li, and P. Heng, “Automatic
H. Müller, “Building a reference multimedia database for interstitial lung fetal ultrasound standard plane detection using knowledge transferred
diseases,” Computerized medical imaging and graphics, vol. 36, no. 3, recurrent neural networks,” in MICCAI, 2015, pp. 507–514.
pp. 227–238, 2012. [63] L. Kim, H. Roth, L. Lu, S. Wang, E. Turkbey, and S. M. Ronald,
[38] Y. Song, W. Cai, Y. Zhou, and D. D. Feng, “Feature-based image patch “Performance assessment of retroperitoneal lymph node computer-
approximation for lung tissue classification,” Medical Imaging, IEEE assisted detection using random forest and deep convolutional neural
Trans. on, vol. 32, no. 4, pp. 797–808, 2013. network learning algorithms in tandem,” in the 102nd Annual Meeting
[39] Y. Song, W. Cai, H. Huang, Y. Zhou, D. Feng, Y. Wang, M. Fulham, of Radiological Society of North America, 2014.
and M. Chen, “Large margin local estimate with applications to medical [64] D. Holmes III, B. Bartholmai, R. Karwoski, V. Zavaletta, and R. Robb,
image classification.” IEEE Trans. on Medical Imaging, 2015. “The lung tissue research consortium: an extensive open database
[40] M. Gao, U. Bagci, L. Lu, A. Wu, M. Buty, H.-C. Shin, H. Roth, containing histological, clinical, and radiological data to study chronic
Z. G. Papadakis, A. Depeursinge, M. R. Summers, Z. Xu, and J. D. lung disease,” in 2006 MICCAI Open Science Workshop, 2006.
Mollura, “Holistic classification of ct attenuation patterns for interstitial [65] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks
lung diseases via deep convolutional neural networks,” in MICCAI first for semantic segmentation,” in IEEE CVPR, 2015.
Workshop on Deep Learning in Medical Image Analysis, 2015. [66] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille,
[41] A. Seff, L. Lu, K. M. Cherry, H. R. Roth, J. Liu, S. Wang, J. Hoffman, “Semantic image segmentation with deep convolutional nets and fully
E. B. Turkbey, and R. M. Summers, “2d view aggregation for lymph connected crfs,” ICLR, 2015.
node detection using a shallow hierarchy of linear classifiers,” in [67] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. Le-
MICCAI, 2014, pp. 544–552. Cun, “Overfeat: Integrated recognition, localization and detection using
[42] L. Lu, M. Liu, X. Ye, S. Yu, and H. Huang, “Coarse-to-fine classification convolutional networks,” in ICLR, 2014.
via parametric and nonparametric models for computer-aided diagnosis,” [68] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
in ACM Conf. on CIKM, 2011, pp. 2509–2512. large-scale image recognition,” ICLR, 2014.
[43] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical [69] S. Hochreiter, “The vanishing gradient problem during learning recurrent
features for scene labeling,” IEEE Trans. Pattern Anal. Mach. Intell., neural nets and problem solutions,” Int. J. of Uncertainty, Fuzziness and
vol. 35, no. 8, pp. 1915–1929, 2013. Knowledge-Based Systems, vol. 6, no. 02, pp. 107–116, 1998.
[44] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich, “Feedfor- [70] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for
ward semantic segmentation with zoom-out features,” arXiv preprint deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554,
arXiv:1412.0774, 2014. 2006.
[45] H. Roth, L. Lu, A. Farag, H.-C. Shin, J. Liu, E. Turkbey, and R. M. [71] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies
Summers, “Deeporgan: Multi-level deep convolutional networks for with gradient descent is difficult,” Neural Networks, IEEE Transactions
automated pancreas segmentation,” in MICCAI, 2015. on, vol. 5, no. 2, pp. 157–166, 1994.
[46] X. Liang, C. Xu, X. Shen, J. Yang, S. Liu, J. Tang, L. Lin, and S. Yan, [72] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolu-
“Human parsing with contextualized convolutional neural network,” in tional networks,” in ECCV, 2014, pp. 818–833.
IEEE ICCV, 2015, pp. 1386–1394. [73] P. Agrawal, R. Girshick, and J. Malik, “Analyzing the performance of
[47] M. Gao, L. Lu, I. Nogues, M. R. Summers, and D. Mollura, “Segmen- multilayer neural networks for object recognition,” in ECCV, 2014.
tation label propagation using deep convolutional neural networks and
dense conditional random field,” in IEEE ISBI. IEEE, 2016.

U.S. Government work not protected by U.S. copyright.
