Valkonen2017 PDF
Mira Valkonen,1,2 Kimmo Kartasalo,1,2 Kaisa Liimatainen,1,2 Matti Nykter,1,2 Leena Latonen,1 Pekka Ruusuvuori1,3*
Abstract
Digital pathology has led to a demand for automated detection of regions of interest, such as cancerous tissue, from scanned whole slide images. With accurate methods using image analysis and machine learning, a significant speed-up and cost savings through increased throughput in histological assessment could be achieved. This article describes a machine learning approach for detection of cancerous tissue from scanned whole slide images. Our method is based on feature engineering and supervised learning with a random forest model. The features extracted from the whole slide images include several local descriptors related to image texture, spatial structure, and distribution of nuclei. The method was evaluated in breast cancer metastasis detection from lymph node samples. Our results show that the method detects metastatic areas with high accuracy (AUC = 0.97–0.98 for tumor detection within whole image area, AUC = 0.84–0.91 for tumor vs. normal tissue detection) and that the method generalizes well for images from more than one laboratory. Further, the method outputs an interpretable classification model, enabling the linking of individual features to differences between tissue types. © 2017 International Society for Advancement of Cytometry
Key terms
metastasis detection; digital pathology; computer aided diagnosis; whole slide images; machine learning; random forest; breast cancer; sentinel lymph nodes
1 BioMediTech and Faculty of Medicine and Life Sciences, University of Tampere, Tampere, Finland
2 BioMediTech Institute and Faculty of Biomedical Science and Engineering, Tampere University of Technology, Tampere, Finland
3 Faculty of Computing and Electrical Engineering, Tampere University of Technology, Pori, Finland
Grant sponsor: Academy of Finland, Grant number: 269474
Grant sponsor: Tekes - The Finnish Funding Agency for Innovation, Grant number: 269/31/2015
Grant sponsor: Cancer Society of Finland, Sigrid Juselius Foundation and Doctoral Programme of Computing and Electrical Engineering, Tampere University of Technology
Additional Supporting Information may be found in the online version of this article.
Mira Valkonen and Kimmo Kartasalo contributed equally to this work.
*Correspondence to: Pekka Ruusuvuori; Tampere University of Technology, P.O. Box 300, 28101, Pori, Finland. E-mail: [email protected]
Published online 00 Month 2017 in Wiley Online Library (wileyonlinelibrary.com)
DOI: 10.1002/cyto.a.23089
INTRODUCTION
In recent years, improvements in computational power and whole slide digital scanners have allowed digitalization of histopathological tissue sections and enabled the development of digital pathology into a routine practice (1). Histopathological whole slide images (WSI) contain vast amounts of data, for which digital pathology enables quantitative analysis and the utilization of all available data, allowing for more information to be gained from the images (2,3). This has led to increased interest in the development of image analysis tools for tasks such as automatic detection of regions of interest (4), stain normalization (5), and nuclei detection (6). These advances hold great promise for providing clinical decision support systems for pathologists (7).
Breast cancer is the most common malignant disease in women worldwide (8). In less developed countries, it is the most frequent cause of cancer death in women, while in developed countries it is the second most common cause of cancer death after lung cancer (8). With over 1.7 million new cancer cases diagnosed annually, the diagnosis and treatment of breast cancer pose a human as well as an economic challenge all over the world.
In breast cancer patients, the main cause of death is metastasis at distant sites of the body. Metastasis in sentinel lymph nodes is one of the most important prognostic
variables in breast cancer (9). Traditional histopathological diagnosis is, however, time-consuming as well as prone to misinterpretation and subjectivity. Automated detection of lymph node metastasis could facilitate the task of pathologists by reducing their workload in breast cancer diagnostics and overcome the subjective interpretation problem (10). Ideally, automated analysis would screen the samples and provide the detected regions for pathologist review, or even proceed directly to decisions. A more realistic scenario is to use automated analysis for pre-screening the images in order to give supporting information and to potentially exclude areas not relevant for diagnosis.
As diagnosis of cancer requires a significant amount of expertise (in practice, that of a pathologist), it is natural that any automated methods should be capable of incorporating or mimicking such knowledge in their decision making process. Certain qualitative decision rules apply in the diagnosis, and in order to automatize the process, such rules should be replaced by quantitative analysis of numerical data. Supervised machine learning provides a powerful tool for deriving decision rules based on example data. Traditionally, supervised learning involves the process of feature extraction from images prior to applying the learning algorithm. Thus, in addition to providing the teaching samples by outlining regions of tumor content and normal tissue, expert and prior knowledge can be included in the feature generation step.
A number of studies available in the literature show the great potential of machine learning tools in digital pathology applications, such as in the detection of regions of interest (ROI), or in phenotype, cell type, or tissue type classification; see Refs. 11–15 for recent examples. In order to use learning based methods, a training dataset is required, that is, slides/images for which the ground truth segmentation/annotation of ROIs is available. Typically, this approach utilizes available training data both for determining the decision rules and for selecting the features to be used in the decision process, where the latter property may be either a separate step or belong intrinsically to the classifier design (16,17). Recently, methods relying on built-in automated feature extraction and deep learning, such as convolutional neural networks, have gained ground in classification and detection tasks (18–21). Using the deep learning approach, several breakthrough results in contest challenges and image classification tasks have been achieved (22–24). While appealing due to the high accuracy in tasks where a large amount of training data is available, methodology for interpreting a deep classifier model is currently lacking.
The requirement of a large and representative annotated dataset when applying machine learning for image segmentation poses a challenge in practice (2). Generation of such annotations is expensive, since it requires expertise and time of pathologists, and an extensive amount of manual work especially when considering pixelwise annotations. Thus, datasets of decent size paired with ground truth information are extremely valuable for the community developing the detection and segmentation methods. Recently, challenges and contests organized within conferences in the field of biomedical image analysis have gained interest from the community of image analysis developers. Such events facilitate the sharing of new ideas and best practices. More importantly, they provide annotated datasets for the use of the community. In this study, we use data from the Camelyon16 breast cancer metastasis detection challenge which was organized in conjunction with the IEEE International Symposium on Biomedical Imaging 2016 (http://camelyon16.grand-challenge.org). The challenge dataset contains altogether 270 images obtained at two separate laboratories, each equipped with a different scanner device. The set consists of images from 160 normal samples and 110 tumor samples with cancer metastases outlined by experts, providing a valuable resource for method development and validation purposes.
In this article, we present a method for automated detection of cancer hot-spots in hematoxylin and eosin (H&E) stained WSI of sentinel lymph node sections. Our method is based on feature engineering and machine learning, and it is an extension of the learning-based analysis presented in Ref. 25 into a fully automated WSI analysis pipeline. The proposed system also enables learning about tissue texture, potentially linking the extracted features with growth properties in normal and metastatic tissue. We evaluated the performance of the method in breast cancer metastasis detection via blockwise receiver operating characteristic (ROC) analysis.
MATERIALS AND METHODS
Image Data
The first dataset used in this study consists of 170 whole slide images of sentinel lymph node sections collected at the Radboud University Medical Center (Nijmegen, the Netherlands). A total of 100 WSIs presented normal lymph node sections and 70 WSIs contained micro- and macro-metastases. Altogether 60 of these cancerous lymph node sections were fully annotated and 10 partially annotated. The second dataset of 100 WSIs was collected at the University Medical Center Utrecht (Utrecht, the Netherlands) and it contains 60 WSIs of normal lymph node sections and 40 WSIs with lymph node metastases. Of the 40 cancerous slides, 37 were fully annotated and 3 partially annotated. Both datasets were provided for the Camelyon16 challenge (http://camelyon16.grand-challenge.org). The whole slide images and the corresponding annotation masks were provided as multi-resolution pyramids in Philips BigTIFF format. The pixel size of the images at the full resolution level was 243 nm. We used the fully annotated slides to obtain both positive and negative training examples. The partially annotated slides were only used to obtain positive examples to avoid the risk of using unannotated metastatic regions as negative training data.
System Overview
An overview of the system presented in this study is shown in Figure 1. As preprocessing steps, we segment the tissue region, and apply color correction through matching the color space to that of a reference image. Color correction is needed for the purpose of generalizing the method to inputs with different characteristics due to scanner and staining
Figure 1. The analysis workflow for training (upper half) and classification (lower half). During model training, the lymph node tissue (blue outline) is first segmented from the whole slide image containing annotated metastatic regions (yellow outline). The detected tissue sections are then divided into 8,192 × 8,192 pixel RGB subimages and subjected to an optional stain normalization step. Eosin and hematoxylin channels are separated from each subimage using a color deconvolution approach. Tissue blocks of 200 × 200 pixels are then randomly sampled from normal (boxes outlined in green) and cancerous (boxes outlined in red) regions from both channels. Features are extracted from each tissue block to get feature vector representations, which are fed to a random forest model as training data. During classification, the workflow proceeds similarly until the extraction of eosin and hematoxylin channels. Instead of random sampling, all 200 × 200 pixel blocks (boxes outlined in blue) are analyzed from each stain channel and fed to the feature extraction module. The trained random forest model is then used to classify each test block and as an output the model assigns a confidence value associated with its choice. Confidence value is an estimate of probability for a sample block to belong to the group of cancerous tissue. This confidence value is assigned for each tissue block to get a confidence map for the entire WSI as an output. Here, the ground truth annotations are overlaid in yellow on the confidence map for reference. Depending on the application, the confidence maps can be further refined to obtain different final outputs, such as binary classification of entire slides, visualizations of cancer hotspots or quantification of the properties of detected lesions.
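The classification half of the workflow in Figure 1 (tile the image into 200 × 200 pixel blocks, score each block, assemble the scores into a confidence map) can be illustrated with a minimal sketch. This is not the authors' implementation: the `predict` callback is a hypothetical stand-in for feature extraction plus the trained random forest, the scaling of probabilities to the 0–255 range is an assumption, and partial blocks at the image border are simply skipped here.

```python
BLOCK = 200  # block edge length in pixels, as stated in the caption


def block_grid(height, width, block=BLOCK):
    """Yield (row, col, top, left) for every full block that fits in the image."""
    for r in range(height // block):
        for c in range(width // block):
            yield r, c, r * block, c * block


def confidence_map(height, width, predict, block=BLOCK):
    """Return a 2-D list holding one 8-bit confidence value per block.

    `predict(top, left)` is a hypothetical callback that must return the
    estimated tumor probability (0.0 to 1.0) of the block whose upper-left
    corner is at (top, left). Scaling to 0..255 is an assumption.
    """
    rows, cols = height // block, width // block
    cmap = [[0] * cols for _ in range(rows)]
    for r, c, top, left in block_grid(height, width, block):
        p = min(max(predict(top, left), 0.0), 1.0)  # clamp to [0, 1]
        cmap[r][c] = round(255 * p)
    return cmap


def dummy_predict(top, left):
    """Made-up rule for demonstration: only the upper-left corner is 'tumor'."""
    return 1.0 if (top < 400 and left < 400) else 0.0


# One 8,192 x 8,192 subimage yields a 40 x 40 map of full blocks.
cmap = confidence_map(8192, 8192, dummy_predict)
```

Because one value is produced per block, the map is 200 times smaller than the input along each dimension.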
3. Apply a threshold of 0.5 × t_Otsu to S, where t_Otsu is the value obtained using Otsu's method (26), to obtain a binary image B.
4. Exclude objects in B with aspect ratio (defined as major axis length per minor axis length) over 10 or mean value of the V component under a fixed threshold (here: 0.3). These objects are dark and thin artifacts caused by coverslip edges.
5. Perform dilation for B using a disk-shaped structuring element with a radius of 50 pixels to obtain smooth object boundaries.
6. Fill holes within objects in B.
7. Exclude pixels close to the image's edges in B. Pixels on the left and right side or the top and bottom are excluded if their distance from the closest edge is less than 2% of the image's width or height, respectively.
8. Exclude objects in B with area under a desired limit (500,000 pixels). These small objects represent remaining debris or very small torn-off pieces of tissue.
The value of 50 pixels (12 µm) was selected for the smoothing operations in steps 2 and 5 based on the consideration that details smaller than this are mainly subcellular and can be neglected when detecting the gross boundaries of the tissue slice. A constant multiplier of 0.5 was introduced in step 3 to avoid losing faintly stained lymph node tissue, while still excluding the background and most of the weakly stained adipose tissue. The thresholds in steps 4, 7 and 8 were selected experimentally to exclude most of the debris and imaging artifacts present in the images. For the tissue segmentation, we used the fifth image in the resolution pyramid stored within the input TIF files. The images on this level had been downsampled by a factor of 16. All values given above in pixels are reported relative to the full resolution and were scaled accordingly and rounded to the nearest integer. The numerical parameter values are given as applied for the data in this study, and should be modified when data with a different resolution or different characteristics is processed.
Color Normalization
We used histogram matching, applied separately to each color channel, to correct for color variation across the WSIs (27). The training image Tumor_015.tif was selected as the reference based on visual examination, and the histograms of the image were used as templates for the other images. For each WSI and color channel, we computed the mapping function required for matching the original histogram to the template histogram. When estimating the histograms and the resulting mapping functions, we again used images downsampled with a factor of 16 and only considered lymph node tissue pixels detected in the previous step (i.e., the pixels indicated by TRUE in B). As a result, a mapping function was obtained for each color channel of each WSI.
Data Handling and Storage
After the detection of lymph nodes in an image, we computed the bounding box for each of the remaining objects in B and merged any overlapping bounding boxes into larger boxes. Regions of the WSIs corresponding to each bounding box were then retrieved at full resolution. The histogram mapping functions estimated in the previous step were then applied to the lymph node tissue pixels of the full resolution bounding box image. Each region was saved into a separate file first in uncompressed BigTiff format using a tile size of 1,024 × 1,024, followed by conversion into JPEG2000 format with a compression ratio of 50 using the JP2 WSI Converter (28).
In addition to the actual images, we also saved the segmentation masks in B corresponding to each bounding box. The masks were scaled up to full resolution by nearest-neighbor interpolation and saved in BigTiff format using one bit per sample, a tile size of 1,024 × 1,024 and PackBits compression. In the case of training images containing tumor, the parts of the ground truth masks corresponding to each bounding box were retrieved at full resolution and saved in separate image files using the same format as the segmentation masks.
For convenient handling of the image data during model training and classification, we further divided the images obtained in the previous step into smaller subimages and stored them in JPEG2000 format. Each resulting subimage had dimensions of 8,192 × 8,192, except for partial subimages at the edges of the bounding boxes. The segmentation and ground truth masks were processed similarly and saved in TIF format. The location of each subimage relative to the corresponding full-resolution WSI was also stored.
Preprocessing of Subimages
Color deconvolution and nuclei segmentation steps were applied to each train and test subimage. A color deconvolution algorithm (29) was used to convert the image's RGB channels into hematoxylin stain, eosin stain and background. In H&E stained images, hematoxylin stains mainly the cell nuclei and therefore the hematoxylin channel was used in the nuclei segmentation. The hematoxylin channel was filtered with a 10 × 10 pixel Gaussian filter (standard deviation = 5 pixels) and then an adaptive thresholding method was applied to get the binary image. The applied adaptive thresholding method (30) separates the cell nuclei from the background based on an individual threshold for each pixel. The individual threshold is selected based on the mean intensity in 20 × 20 pixel local neighborhood. Watershed segmentation was used to separate the overlapping and touching nuclei from each other. The separation lines of the watershed segmentation were computed from the distance transform of the binary image using 8-connected neighborhood.
Sampling
Random sampling was performed to reduce the amount of training data. Approximately 200,000 sample blocks were randomly selected from the subimages containing normal tissue and 200,000 sample blocks from subimages containing tumor. Sample blocks were selected from the full resolution subimages and the block size was 200 × 200 pixels. These negative and positive samples were selected only from the segmented lymph node tissue mask area and ground truth mask
area, respectively, while excluding the background. As all provided tissue area from all training images was covered, this led to approximately 200 sample blocks per tumor subimage and 15 sample blocks per normal tissue subimage. From each sample block, 214 descriptive features related to image texture, spatial structure, and distribution of nuclei were extracted.
Feature Extraction
The properties of each tissue sample block were described with 104 texture features extracted from both hematoxylin and eosin channels. See Supplementary Table for a full list of features with descriptions. Texture features included, for example, contrast, correlation and energy, calculated from the gray level co-occurrence matrix (GLCM). Spatial sampling parameters for the gray level co-occurrence matrix were distance of one pixel and 8 directions. More specifically, the co-occurrences of gray values were computed for all adjacent pixels including corner pixels at distance of one pixel. The texture of each tissue sample block was further described using local binary patterns (LBP) (31,32). This texture operator is a measure of the spatial structure of local image intensities. The basic idea of the LBP operator is to transform a local circular neighborhood into a binary pattern by thresholding the neighborhood with the gray value of the center pixel. Due to this thresholding, the features are robust to gray scale variations arising from changes in illumination caused by, for example, different scanners. The circularly symmetric neighborhood is determined by assigning parameters that control the quantization of the angular space and radius of the neighborhood. In our method, we used radius of 2 pixels with angular space of 8 points. By applying a shift operation, the extracted LBP features are also rotation-invariant. Other extracted texture features were scale-invariant descriptors obtained via the Scale-invariant feature transform (SIFT) (33), the histogram of oriented gradients (HOG) descriptor (34,35), and maximally stable extremal regions (MSER) (36). In this work, the VLFeat (37) implementation of MSER and SIFT was used.
In addition to the texture features, six nuclei density features were extracted, calculated from a nuclei location map. This location map was generated by marking the center point of each segmented nucleus. Nuclei density features included descriptors related to inter-nuclei distance inside the sample block, such as mean, maximum, minimum and standard deviation. Also density and number of nuclei inside the sample block were calculated. The density feature was the mean value of the Gaussian filtered sample block from the nuclei location map.
Model Comparison
For selecting the learning algorithm, we compared the performance of a number of different models for classifying the sample blocks as either normal or tumor tissue based on the extracted features. Approximately 1,000,000 sample blocks were randomly selected and used to train a logistic regression model, a support vector machine (SVM), a random forest model and two nearest neighbor (NN) classifiers, one using all the features and one using a subset of 28 manually selected features which roughly corresponds to the feature set in Ref. 38 in single resolution. The trained regression model is a generalized linear regression model for the binomial distribution using logit link function. The SVM model utilizes a nonlinear radial basis function as a kernel function and grid-search was used to find the optimal values for kernel size and soft margin. NN classifiers utilize kd-tree search to find the Euclidean distance to the closest neighbor.
Sensitivity, specificity, F-score and the percentage of correctly classified samples are shown for each method in Table 1. The random forest model outperformed the other models in terms of correctly classified samples, sensitivity, and F-score. The specificity of the NN classifier was higher than that of the random forest (96.8% vs. 93.3%). However, as this was at the expense of much lower sensitivity (85.7% vs. 92.6%), and the random forest model had a higher percentage of correctly classified samples (93.0% vs. 91.3%), and a higher F-score (0.93 vs. 0.91), we selected the random forest model as the learning algorithm for our system.
Random Forest Model
We used the feature representations of tissue samples to train a random forest model (17). The model was an ensemble of 50 classification trees. The number of features selected randomly for each decision split was the square root of the total number of features. Bootstrap aggregation was used to improve the stability and accuracy of the model. Bootstrap aggregation is a machine learning algorithm that combines multiple versions of decision trees into a random forest model. Each decision tree version is constructed from a randomly sampled dataset with replacement. The trained model was then used to evaluate the test images. A total of 214 features were
Table 1. Approximately 1,000,000 sample blocks were classified using the following models: logistic regression, nearest neighbor (NN) using either all or a subset of features, support vector machine (SVM) and a random forest model. Percentage of correctly classified samples, sensitivity, specificity, and F-score are shown for each model.
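The four metrics named in the caption can be computed from the counts of a binary confusion matrix. The sketch below assumes that tumor is the positive class and that the F-score is the balanced F1 score (the source does not state which variant was used). The counts in the usage example are hypothetical, chosen only so that the resulting sensitivity and specificity mirror the random forest percentages quoted in the text.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute block-level metrics from confusion-matrix counts.

    tp/fn: tumor blocks classified as tumor/normal.
    tn/fp: normal blocks classified as normal/tumor.
    """
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    # F1 score (assumed variant): harmonic mean of precision and sensitivity
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fn + tn + fp)  # fraction correctly classified
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_score": f_score,
        "accuracy": accuracy,
    }


# Hypothetical counts for 1,000 tumor and 1,000 normal blocks.
metrics = classification_metrics(tp=926, fn=74, tn=933, fp=67)
```

With balanced classes, as in this example, the accuracy is the mean of sensitivity and specificity, which is consistent with the percentages reported for the random forest and NN classifiers.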
Figure 2. Relative importance of the 10 most significant features selected by the random forest model (A). Example H&E images of normal tissue (B) and metastatic tissue (C) are shown with the corresponding features computed from the hematoxylin (H) or eosin (E) channel: local binary pattern 3 (H) (B1 and C1), number of nuclei (B2 and C2), local binary pattern 7 (H) (B3 and C3), local binary pattern 9 (E) (B4 and C4), contrast (H) (B5 and C5), local binary pattern 8 (H) (B6 and C6), kurtosis of the intensity distribution (H) (B7 and C7), correlation (H) (B8 and C8), local binary pattern 6 (H) (B9 and C9) and local binary pattern 4 (H) (B10 and C10). The intensity scales in 1–10 are comparable between each feature pair B and C.
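The contrast feature shown in panels B5 and C5 is derived from the gray level co-occurrence matrix, which the Feature Extraction section describes as using a distance of one pixel and 8 directions. The following is a minimal pure-Python sketch of that idea, not the authors' code: the number of gray levels, the normalization, and the exact definitions of contrast and energy follow common textbook conventions and are assumptions here.

```python
# The 8 neighbor offsets at a distance of one pixel (including corners).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def glcm(image, levels):
    """Normalized co-occurrence matrix of an image given as a list of rows.

    Pixel values must be integers in range(levels).
    """
    counts = [[0] * levels for _ in range(levels)]
    height, width = len(image), len(image[0])
    for y in range(height):
        for x in range(width):
            for dy, dx in OFFSETS:
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width:
                    counts[image[y][x]][image[ny][nx]] += 1
    total = sum(map(sum, counts))
    return [[c / total for c in row] for row in counts]


def contrast(p):
    """Sum of p[i][j] * (i - j)^2: large when neighbors differ strongly."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


def energy(p):
    """Sum of squared matrix entries: 1.0 for a perfectly uniform image."""
    return sum(v * v for row in p for v in row)


flat = [[1, 1], [1, 1]]        # uniform toy image
checker = [[0, 1], [1, 0]]     # checkerboard toy image
flat_contrast = contrast(glcm(flat, levels=2))
checker_contrast = contrast(glcm(checker, levels=2))
```

In the checkerboard toy image, each pixel has two differing orthogonal neighbors and one identical diagonal neighbor, so two thirds of all neighbor pairs contribute to the contrast.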
extracted from each 200 × 200 pixel block in test subimages. The confidence for being either a normal tissue block or a tumor tissue block was predicted with the trained random forest model. These subimage confidences were stored in unsigned 8-bit integer format and pieced together to form a metastasis confidence image for each test WSI. Since a single confidence value is predicted for each 200 × 200 pixel block, the size of the resulting confidence images corresponds to a 1:200 downsampling of the original WSIs along each dimension.
Training of one random forest model with 700,000 training samples takes approximately 90 minutes. To classify a new WSI with a trained random forest model, our method takes approximately 130 minutes. The processing time varies depending on the amount of tissue in the WSI. These computation times for training the random forest model and processing of one WSI are obtained using parallel computing with 95 GB of memory and two six-core Intel X5660 processors.
RESULTS
Detection of metastatic regions from whole slide images was evaluated with the data from the Camelyon 2016 contest. First, we determined the performance for a set of 170 images from a single scanner, eliminating the variability of source images due to technical reasons. Leave-one-out cross-validation (LOOCV) was used to assess the performance of our random forest classification approach. Each sample from one WSI not used in training was scored with confidence levels using a random forest model trained with all the samples from 169 other images.
To interpret our random forest model, we visualized predictor importance weights assigned by the model for each feature. These weights are higher for the features that have higher impact on the correct classification result. Weight estimates for every feature are based on changes in the mean squared error due to splits in every decision tree. The averaged feature importances of the 10 most significant features for the LOOCV experiment are shown in Figure 2A. An example area of normal (Fig. 2B) and tumor tissue (Fig. 2C), as well as the feature values for the same areas, are shown in Figures 2B1–10 and 2C1–10. The majority of the ten most significant features were calculated from the hematoxylin channel, excluding the NumberOfNuc-feature, which is based on the binary image of segmented nuclei and eLBP-9, which is calculated from the eosin channel. Differences in feature values between normal and tumor samples are clearly visible for most of the ten features. LBP-3, number of nuclei, LBP-7, contrast, LBP-8, correlation, LBP-6, and LBP-4 all tend to be higher in normal lymph node tissue than in cancerous areas (Figs. 2B and 2C). Of these, LBP-3 (Figs. 2B1 and 2C1) and correlation (Figs. 2B8 and 2C8) seem most robust in tolerating the follicular material in addition to lymph node cortex, representing the normal variation in the lymph node tissue. eLBP-9 (Figs. 2B4 and 2C4) and kurtosis (Figs. 2B7 and 2C7) signals were higher in cancerous material than in the normal tissue. Contrast (Figs. 2B5 and 2C5) is especially low in cytoplasm-rich cancer cells and high in lymph node cortex and helpful in finding especially large areas of metastases.
An example of a classification result for a WSI is shown in Figure 3. An original image of a tumor sample with pathologist's annotations overlaid in yellow is presented in Figure 3A. The corresponding confidence values given by the random forest classifier are shown as an image in Figure 3B. The higher confidence values are concentrated in areas marked as tumor by the pathologist, while confidences in normal tissue area are generally lower, with occasional higher hits scattered around the tissue. The visual appearance of the example result in Figures 3A and 3B suggests that the classifier is able to detect the metastatic areas.
In order to evaluate the performance of our system numerically, we collected all confidence values within normal and tumor tissue areas for all 170 images of the first dataset in
Figure 3. An example whole slide image from the first dataset (A) with the corresponding confidence map (B) and an example whole slide
image from the second dataset (C) with the corresponding confidence map (D). Ground truth annotations are shown in yellow.
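Comparing a confidence map against ground truth annotations, as in Figure 3, leads naturally to the blockwise ROC analysis used in this study. A minimal sketch of computing the area under the ROC curve from per-block confidence values and binary ground truth labels is given below. It relies on the rank-sum (Mann-Whitney) identity for the AUC rather than on explicitly tracing the curve; the source does not describe how the authors computed their AUC values or confidence intervals, so this is only an illustration, and the toy inputs are made up.

```python
def roc_auc(confidences, labels):
    """Area under the ROC curve via the rank-sum identity.

    confidences: per-block scores (higher means more tumor-like).
    labels: 1 for tumor blocks, 0 for normal blocks.
    Equals the probability that a random tumor block scores higher than a
    random normal block, with ties counted as one half.
    """
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    ranks = [0.0] * len(confidences)
    i = 0
    while i < len(order):
        # Find the end of the group of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and confidences[order[j + 1]] == confidences[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy 8-bit confidence values for two normal and two tumor blocks.
auc_perfect = roc_auc([10, 40, 200, 250], [0, 0, 1, 1])
auc_partial = roc_auc([10, 200, 40, 250], [0, 0, 1, 1])
```

In the second toy case, three of the four tumor-normal pairs are ranked correctly, so the AUC is 0.75.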
the LOOCV experiment (Fig. 4A) and calculated the blockwise ROC curve both for the whole image area (Fig. 4B) and for the lymph node tissue areas with the background excluded (Fig. 4C). Next, we applied the same computational pipeline to the second image dataset containing 100 WSIs scanned with another device to obtain the corresponding confidence WSIs. We again collected all confidence values within normal and tumor tissue areas (Fig. 4D) and calculated the blockwise ROC curves for all blocks and tissue blocks only (Figs. 4E and 4F, respectively). Partially annotated images were excluded from all numerical evaluations. The mean area under the curve (AUC) value for metastatic tumor versus all image blocks including background was 0.983 for the first image set (Fig. 4B) and 0.975 for the second set (Fig. 4E). For metastatic tumor versus normal tissue, the mean AUC value was 0.905 for the first image set (Fig. 4C) and 0.887 for the second set (Fig. 4F). The numerical results in Figure 4 support the conclusions drawn from the visual example in Figure 3.
In order to determine the generalizability of our approach to datasets with more variability, containing images originating from different laboratories and imaged with different scanners, we combined the two datasets. Although representing the same tissue and in principle processed with a similar H&E staining procedure, the visual appearance of the tissues differs between the images from the two laboratories, as can be seen from the example images in Figures 3A and 3C. We trained our RF model with 700,000 samples from the combined dataset and conducted the LOOCV experiment for all of the 270 images. The confidence values from normal and metastatic tumor tissue areas (Fig. 4G) and the blockwise ROC curves from all image blocks (Fig. 4H, mean AUC = 0.985) or tissue blocks only (Fig. 4I, mean AUC = 0.902) indicate that the method generalizes well to datasets containing images from different laboratories. The effect of metastasis size on the detection accuracy was examined by separately considering tissue blocks from metastatic regions larger and smaller than the median area (0.1867 mm²) of all regions in the LOOCV ROC analysis of the combined dataset. In line with the approach adopted in the Camelyon16 competition, we considered all regions annotated in the ground truth masks with area larger than that of a circle having a radius of 100 µm. This analysis resulted in AUC values of 0.801, 95% CI [0.787, 0.814] and 0.906, 95% CI [0.896, 0.916] for the small and large metastatic regions, respectively.
Finally, we used the two independent image sets in turn as a training set and as a testing set to determine if the system is capable of handling the situation where the testing data are markedly different from the data used for training. First, we trained our RF model with 350,000 samples collected from the first set of 170 WSIs and evaluated the 100 WSIs from the second set. Then, we trained the RF model with 350,000 samples collected from the second set of 100 WSIs and evaluated the 170 WSIs from the first set. The results of this experiment are presented in Figures 4J–4L for the former and in Figures 4M–4O for the latter case. The distributions of confidence values and the ROC analysis for all image blocks (mean AUC = 0.970 and mean AUC = 0.978) and tissue blocks only (mean AUC = 0.839 and mean AUC = 0.855) indicate that classification accuracy remains relatively high even when the testing data are completely independent of the training data and have different characteristics, although a slight decrease in performance is observed compared with the LOOCV results.
Most false positive signals were detected where normal lymph node medulla was misinterpreted as cancerous tissue (Fig. 5A). The reticular cells forming the lymph node stroma have partly similar color tones and size of nuclei as certain breast cancer cell phenotypes, especially in areas surrounding lymph node trabeculae and/or vasculature. False positive signals also occasionally resulted from nerve bundles cut in
Figure 4. Results obtained using leave-one-out cross validation for dataset 1 (A–C), dataset 2 (D–F) or the combined dataset (G–I) and for
a classifier trained on dataset 1 and evaluated on dataset 2 (J–L) or for a classifier trained on dataset 2 and evaluated on dataset 1 (M–O).
Distribution of confidence values for all normal and tumor tissue blocks in the dataset is shown in (A, D, G, J, M). The red line represents
the median, the edges of the blue box correspond to the 25th and 75th percentiles and the length of the whiskers is 1.5 times the interquar-
tile range. Outliers beyond this limit are shown in red. Blockwise ROC curves are shown for all blocks in (B, E, H, K, N) and for tissue blocks
only in (C, F, I, L, O). The solid lines represent the mean and the dashed lines represent the pointwise 95% confidence interval. Corre-
sponding AUC values are shown above each ROC curve. The total number of classified blocks was 85,545,658 (dataset 1, all blocks),
6,393,412 (dataset 1, tissue blocks), 29,660,702 (dataset 2, all blocks), or 5,301,888 (dataset 2, tissue blocks). [Color figure can be viewed at
wileyonlinelibrary.com]
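As a rough illustration of the blockwise ROC analysis summarized in Figure 4, the following minimal sketch computes an AUC from per-block confidence values and averages it over images. It is not the implementation used in this study: the function names `block_auc` and `mean_auc`, the toy data, and the choice to average per-image AUC values are assumptions made for illustration only. The AUC is obtained from the rank-sum (Mann-Whitney U) statistic, which equals the area under the ROC curve.

```python
from statistics import mean


def block_auc(confidences, labels):
    """Blockwise AUC from the rank-sum (Mann-Whitney U) statistic.

    confidences: classifier confidence per block (higher = more tumor-like).
    labels: 1 for tumor blocks, 0 for normal blocks.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one tumor block and one normal block")

    # Sort block indices by confidence and assign 1-based ranks,
    # giving tied confidences the average of the ranks they span.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    ranks = [0.0] * len(confidences)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and confidences[order[end + 1]] == confidences[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1

    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)


def mean_auc(per_image):
    """Average the blockwise AUC over (confidences, labels) pairs, one per image."""
    return mean(block_auc(conf, lab) for conf, lab in per_image)


# Toy data (hypothetical): two images with four blocks each.
images = [
    ([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]),
    ([0.9, 0.2, 0.3, 0.1], [1, 1, 0, 0]),
]
print(mean_auc(images))
```

In this sketch an image without any tumor blocks (or without any normal blocks) raises an error, since an AUC is undefined for a single class; how such images are handled in practice is a separate design decision.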
Figure 5. Examples of false positives caused by normal tissue texture resembling metastatic tissue (A, B) or an out-of-focus region (C)
and an example of a false negative where a small lesion has been falsely detected as normal tissue (D). The H&E images are shown on the
left and the corresponding confidence maps on the right. The ground truth annotation in (D) is shown as a green outline.
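Error examples such as those in Figure 5 can be located by comparing a thresholded confidence map with the blockwise ground truth. The sketch below is only an illustration under stated assumptions: the function name `error_blocks`, the grid representation (one confidence value per block, stored as nested lists), and the threshold of 0.5 are hypothetical and are not taken from the method described in this article.

```python
def error_blocks(confidence_map, truth_map, threshold=0.5):
    """Return (false_positives, false_negatives) as lists of (row, col) block indices.

    confidence_map: 2D grid of classifier confidences, one value per block.
    truth_map: 2D grid of booleans, True where the block is annotated as tumor.
    threshold: assumed cut-off above which a block is called tumor.
    """
    false_positives = []
    false_negatives = []
    for r, (conf_row, truth_row) in enumerate(zip(confidence_map, truth_map)):
        for c, (conf, is_tumor) in enumerate(zip(conf_row, truth_row)):
            predicted_tumor = conf >= threshold
            if predicted_tumor and not is_tumor:
                false_positives.append((r, c))
            elif is_tumor and not predicted_tumor:
                false_negatives.append((r, c))
    return false_positives, false_negatives


# Toy 2 x 3 grid (hypothetical values): the bottom row is annotated as tumor.
confidences = [[0.1, 0.8, 0.2],
               [0.9, 0.3, 0.7]]
annotation = [[False, False, False],
              [True, True, True]]
fp, fn = error_blocks(confidences, annotation)
print("false positive blocks:", fp)
print("false negative blocks:", fn)
```

The returned block indices can then be mapped back to image coordinates for visual review of the corresponding H&E regions.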
such an orientation that an approximately similar ratio of blue nuclei to surrounding light pink material was created, where myelin sheaths in nerve bundles resembled the appearance of the cytoplasm of cancer cells (Fig. 5B). Some out-of-focus image areas also resulted in false positive signals (Fig. 5C). False negative signals were detected especially in infiltrative areas (Fig. 5D) or small metastases, where single or only a few cancer cells are surrounded by lymphocytic cells.

The blockwise confidence output can be used as a starting point for other tasks. Ideally, automated analysis would screen the WSIs and, for example, provide the detected cancerous regions for the pathologist’s review or perform slide-level classification to exclude some slides as completely negative for cancer. To provide an example of further analysis of the WSI confidence maps and to determine the generalization capability of our computational pipeline, we finally used our approach for slide-level binary classification. We used the same feature extraction and random forest classification approach as in the earlier experiments, but this time the input to the classifier was the WSI confidence map (in other words, the output from the classification model for an H&E WSI) instead of the underlying tissue image. The same 104 texture features, which were extracted from each hematoxylin or eosin sample block, were now extracted from the WSI confidence map. These features were then used to train our RF model to separate the normal WSIs from the WSIs containing metastasis. LOOCV was used to determine one confidence value for each of the 270 WSIs, indicating the likelihood that the whole slide contains any metastatic tissue. We collected all whole slide confidence values, calculated the image-wise ROC curve, and obtained a mean AUC value of 0.73 for metastasis-containing WSIs versus normal WSIs. This example demonstrates the generic nature of the features used in our system and exemplifies one possible approach for utilizing the WSI confidence maps for downstream analysis, such as slide-level classification of cancer versus normal.

DISCUSSION

Automated processing of whole slide images and detection of regions of interest is an open challenge in digital pathology based cancer diagnosis (14). Herein, we developed a method for automated detection of hot-spot regions in whole slide images. The feature based classification approach presented here is generic and can be applied to a variety of segmentation and detection tasks. We evaluated the performance of the method in detection of breast cancer metastases in lymph node sections from H&E stained WSIs. This detection task represents an interesting challenge for digital pathology, since one of the major factors in breast cancer prognostics is metastasis of cancer cells to sentinel lymph nodes (9). The diagnostic procedure for pathologists is currently tedious and time-consuming, as well as prone to misinterpretation. Automated detection of lymph node metastases has great potential to help the pathologist improve diagnostics as well as to reduce both workload and costs. We anticipate that the method presented in this study is useful for the detection of hot-spots, including the task of separating regions of metastatic breast cancer cells from normal lymphatic tissue composed of lymphocytes. Qualitative (Fig. 3) and quantitative (Fig. 4) results support this anticipation. From the pathologist’s viewpoint, the sensitivity of the method (Table 1) and the confidence map of possible hotspots that the method provides for each slide are the most useful parameters for pre-screening the slides to help focus on
suspect areas. In addition to hot-spot (here: metastatic tumor tissue) detection, our method enables linking the differences between tissue types in hot-spot areas versus normal tissue to specific features describing the tissue properties. This can potentially provide insights into the tissue type characteristics or even suggest differences in growth patterns. The average random forest model obtained in the cross validation study was illustrated in Figure 2A. The top ten most important features contributing to the classifier model are in practice the descriptors which behave differently in normal and metastatic tumor tissue areas. While some of them are not straightforwardly interpretable, there are also features that either support existing knowledge (e.g., nuclear count in local neighborhood, Figs. 2B2 and 2C2) or stand out as candidates for straightforward computational readouts (e.g., local contrast, Figs. 2B5 and 2C5).

Evaluating the performance of methods for cancer detection from digitized slides is a non-trivial task (2). Obtaining ground truth annotations can be a very laborious process and represents a significant bottleneck in the development of new methods. Even if this issue can be overcome to obtain large, annotated datasets, as in the case of the Camelyon16 challenge, the problem of designing a relevant performance metric remains. The selection of a suitable evaluation metric depends heavily on the way the method is intended to be used in a practical setting. If the aim is, for example, to classify entire WSIs as either normal or tumor containing, it is sensible to evaluate performance using slide-level ROC analysis. We adopted this approach in our slide-level classification experiment, and it was also used as the first metric in the Camelyon16 challenge. If, on the other hand, the intention is to use the method to pinpoint suspicious areas in the images to speed up the work of pathologists, as in the case of metastasis detection from lymph node sections, performance must be evaluated in a pixelwise, blockwise, or region-based manner for each WSI. As an example, for the second evaluation metric of the Camelyon16 challenge, participants of the competition had to provide a single coordinate and a confidence value for each metastatic region detected from the images. Coordinates located within annotated tumor regions were considered correct detections, and the teams were ranked according to the AUC metric computed based on free-response receiver operating characteristic (FROC) analysis. This metric relies on scoring a single coordinate point per region as either a hit or a miss, instead of evaluating the identification of the actual regions. However, accurate detection of the boundaries of metastatic areas is a prerequisite for further computational analysis of their size, shape and numerous other characteristics. Moreover, selecting a single coordinate to represent the entire cancerous region in a meaningful way is problematic, especially for regions with a complicated shape featuring, for example, protrusions.

Considering the above, in this study we treated the metastasis detection task as a blockwise classification problem and evaluated the performance of our method by ROC analysis applied to the 200 × 200 pixel blocks. A similar approach has been used, for example, to evaluate the performance of classifiers applied to non-small cell lung cancer samples (39). In comparison to the Camelyon16 measure, blockwise or pixelwise metrics take into consideration the entire tumor regions and avoid the artificial coordinate selection step. The downside of blockwise evaluation is that larger tumor regions attain more weight in the final scoring, as they consist of a larger number of pixels than smaller lesions.

This is problematic in the sense that examining the slides for micrometastases or individual tumor cells can be very time-consuming for the pathologist, while large macrometastases can often be spotted more easily. In the context of computer aided diagnosis, the capability to accurately detect small tumor regions should thus not be neglected during evaluation. Still, in the absence of a universal evaluation metric suitable for all intended applications, the blockwise metrics represent a straightforward application-independent approach for quantifying detection performance in a task that can be seen as the basis for all further steps: discrimination between target and non-target areas in an image. Good performance in this task is a prerequisite for the subsequent delineation of entire metastatic regions, binary classification of entire WSIs and other more refined analysis steps, and should thus be a common characteristic of all well-performing methods.

In addition to performing large-scale numerical evaluation using the entire dataset, we also visually examined examples of different normal and metastatic tissue areas, which had been either successfully or unsuccessfully detected. Normal lymph nodes are composed primarily of lymphocytic cells and follicles structured along a supportive reticular network. Cancer cells of epithelial origin are most often well distinguishable, especially from the lymph cell component of lymph nodes, by their relatively large size, prominent presence of cytoplasm and light staining of nuclei. However, cancer cell types vary phenotypically, and the growth pattern within the lymph node may affect the classification outcome. Most nodular metastatic lesions are easily distinguishable with our method. In contrast, small metastatic lesions with only a few cells, especially those with an invasive growth pattern alongside normal tissue structures, are more challenging for the method to detect.

False positives occasionally emerged at certain areas of normal lymph node medulla. This seems to be because the reticular cells forming the lymph node stroma partly resemble certain breast cancer cell phenotypes in color tone and nuclear size, especially in areas surrounding lymph node trabeculae. Another source of error was out-of-focus image areas, emphasizing the importance of consistently high technical quality of the images. False negative signals were mainly associated with small metastases with a small number of cancer cells or especially infiltrative metastatic growth patterns. In these cases, cancer cells appeared as single cells, or small groups of cells were surrounded by lymphocytic cells. A probable reason for the weaker performance observed in such tissue regions is that many of the analyzed subimages in these regions contain some normal tissue in addition to cancer cells. The feature values computed from such subimages partly resemble those obtained from entirely normal tissue,
which leads to false negatives. Improved performance in these kinds of regions could possibly be achieved by using a multi-scale approach, where the size of the analysis window would be varied over a certain range, and/or by utilizing superpixels (4).

In conclusion, the machine learning based approach for detecting metastatic tissue regions presented in this article performs well in blockwise detection of breast cancer metastases from lymph node tissue sections. The method was applied to whole slide images of H&E stained tissue obtained using two different scanners at two separate laboratories. Even though H&E images were used here, the presented method is generic in nature, and the information extracted from other histological images can be included in our analysis pipeline in a straightforward manner. The method is also extendable in the sense that it allows the incorporation of any number of new features that can be extracted from H&E images and, when available, other measurements from the same spatial location, such as images of immunohistochemically stained samples. Other potential directions for improvement and further study include applying more advanced strategies for training, such as using the misclassifications from the cross validation step for boosting the classifier in a re-training step. Furthermore, deep learning based methods have been used in similar tasks with very high detection accuracy (40,41). The presented classification pipeline could benefit from complementing the feature extraction phase with convolutional neural networks or autoencoders, gaining the benefits of deep learning methods while also preserving the interpretable features.

ACKNOWLEDGMENT
The authors declare that there are no conflicts of interest.

LITERATURE CITED
1. Pantanowitz L, Valenstein PN, Evans AJ, Kaplan KJ, Pfeifer JD, Wilbur DC, Collins LC, Colgan TJ. Review of the current state of whole slide imaging in pathology. J Pathol Inform 2011;2:36.
2. Gurcan MN, Boucheron LE, Can A, Madabhushi A, Rajpoot NM, Yener B. Histopathological image analysis: A review. IEEE Rev Biomed Eng 2009;2:147–171.
3. Ghaznavi F, Evans A, Madabhushi A, Feldman M. Digital imaging in pathology: Whole-slide imaging and beyond. Annu Rev Pathol 2013;8:331–359.
4. Bejnordi BE, Litjens G, Hermsen M, Karssemeijer N, van der Laak JA. A multi-scale superpixel classification approach to the detection of regions of interest in whole slide histopathology images. In: Proceedings of the International Society for Optics and Photonics (SPIE) Conference on Medical Imaging, Orlando, FL, USA; 2015. p 94200H-94200H.
5. Khan AM, Rajpoot N, Treanor D, Magee D. A nonlinear mapping approach to stain normalization in digital histopathology images using image-specific color deconvolution. IEEE Trans Biomed Eng 2014;61:1729–1738.
6. Veta M, van Diest PJ, Kornegoor R, Huisman A, Viergever MA, Pluim JP. Automatic nuclei segmentation in H&E stained breast cancer histopathology images. PLoS One 2013;8:e70221.
7. Kothari S, Phan JH, Stokes TH, Wang MD. Pathology imaging informatics for quantitative analysis of whole-slide images. J Am Med Inform Assoc 2013;20:1099–1108.
8. Ferlay J, Soerjomataram I, Ervik M, Dikshit R, Eser S, Mathers C, Rebelo M, Parkin DM, Forman D, Bray F. GLOBOCAN 2012 v1.0, Cancer Incidence and Mortality Worldwide: IARC CancerBase No. 11 [Internet]. Lyon, France: International Agency for Research on Cancer; 2013. Available from: http://globocan.iarc.fr, accessed on 27/4/2016.
9. Ran S, Volk L, Hall K, Flister MJ. Lymphangiogenesis and lymphatic metastasis in breast cancer. Pathophysiology 2010;17:229–251.
10. Veta M, Pluim JP, van Diest PJ, Viergever MA. Breast cancer histopathology image analysis: A review. IEEE Trans Biomed Eng 2014;61:1400–1411.
11. Niwas SI, Palanisamy P, Sujathan K, Bengtsson E. Analysis of nuclei textures of fine needle aspirated cytology images for breast cancer diagnosis using complex Daubechies wavelets. Signal Process 2013;93:2828–2837.
12. Abas FS, Gokozan HN, Goksel B, Otero JJ, Gurcan MN. Intraoperative neuropathology of glioma recurrence: Cell detection and classification. In: Proceedings of the International Society for Optics and Photonics (SPIE) Conference on Medical Imaging, San Diego, CA, USA; 2016. p 979109-979109.
13. Kornaropoulos EN, Niazi M, Lozanski G, Gurcan MN. Histopathological image analysis for centroblasts classification through dimensionality reduction approaches. Cytometry Part A 2014;85A:242–255.
14. Doyle S, Feldman M, Tomaszewski J, Madabhushi A. A boosted Bayesian multiresolution classifier for prostate cancer detection from digitized needle biopsies. IEEE Trans Biomed Eng 2012;59:1205–1218.
15. Turkki R, Linder N, Kovanen PE, Pellinen T, Lundin J. Identification of immune cell infiltration in hematoxylin-eosin stained breast cancer samples: Texture-based classification of tissue morphologies. In: Proceedings of the International Society for Optics and Photonics (SPIE) Conference on Medical Imaging, San Diego, CA, USA; 2016. p 979110-979110.
16. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw 2010;33:1–22.
17. Breiman L. Random forests. Mach Learn 2001;45:5–32.
18. Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA; 2012. pp 1097–1105.
19. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436–444.
20. Sirinukunwattana K, Raza SEA, Tsang YW, Snead D, Cree IA, Rajpoot NM. Locality sensitive deep learning for detection and classification of nuclei in routine colon cancer histology images. IEEE Trans Med Imaging 2016;35:1196–1206.
21. Wang H, Cruz-Roa A, Basavanhally A, Gilmore H, Shih N, Feldman M, Tomaszewski J, Gonzalez F, Madabhushi A. Mitosis detection in breast cancer pathology images by combining handcrafted and convolutional neural network features. J Med Imaging 2014;1:034003.
22. Cireşan DC, Giusti A, Gambardella LM, Schmidhuber J. Mitosis Detection in Breast Cancer Histology Images with Deep Neural Networks. Medical Image Computing and Computer-Assisted Intervention–MICCAI 2013. Berlin, Heidelberg: Springer; 2013. pp 411–418.
23. Chen H, Qi X, Yu L, Heng PA. DCAN: Deep Contour-Aware Networks for Accurate Gland Segmentation. arXiv preprint 2016; arXiv:1604.02677.
24. Sirinukunwattana K, Pluim JPW, Chen H, Qi X, Heng P-A, Guo YB, Wang LY, Matuszewski BJ, Bruni E, Sanchez U, et al. Gland Segmentation in Colon Histology Images: The GlaS Challenge Contest. arXiv preprint 2016; arXiv:1603.00275.
25. Ruusuvuori P, Valkonen M, Nykter M, Visakorpi T, Latonen L. Feature-based analysis of mouse prostatic intraepithelial neoplasia in histological tissue sections. J Pathol Inform 2016;7:5.
26. Otsu N. A threshold selection method from gray-level histograms. Automatica 1975;11:23–27.
27. Gonzalez RC, Woods RE. Digital Image Processing, 2nd ed. New Jersey, Upper Saddle River: Prentice Hall, Inc.; 2002.
28. Tuominen VJ, Isola J. Linking whole-slide microscope images with DICOM by using JPEG2000 interactive protocol. J Digit Imaging 2010;23:454–462.
29. Ruifrok AC, Johnston DA. Quantification of histochemical staining by color deconvolution. Anal Quant Cytol Histol 2001;23:291–299.
30. Bradley D, Roth G. Adaptive thresholding using the integral image. J Graph Gpu Game Tools 2007;12:13–21.
31. Ojala T, Pietikäinen M, Mäenpää T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans Pattern Anal Mach Intell 2002;24:971–987.
32. Pietikäinen M, Ojala T, Xu Z. Rotation-invariant texture classification using feature distributions. Pattern Recognit 2000;33:43–52.
33. Lowe DG. Distinctive image features from scale-invariant keypoints. Int J Comput Vis 2004;60:91–110.
34. Ludwig O, Delgado D, Goncalves V, Nunes U. Trainable Classifier-fusion Schemes: An Application to Pedestrian Detection. In: Proceedings of the 12th International IEEE Conference on Intelligent Transportation Systems, St. Louis, MO, USA; 2009. pp 432–437.
35. Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: Proceedings of the 1st IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA; 2005. pp 886–893.
36. Matas J, Chum O, Urban M, Pajdla T. Robust wide-baseline stereo from maximally stable extremal regions. Image Vis Comput 2004;22:761–767.
37. Vedaldi A, Fulkerson B. VLFeat: An Open and Portable Library of Computer Vision Algorithms. In: Proceedings of the 18th ACM International Conference on Multimedia, Firenze, Italy; 2010. pp 1469–1472.
38. Sertel O, Kong J, Shimada H, Catalyurek UV, Saltz JH, Gurcan MN. Computer-aided prognosis of neuroblastoma on whole-slide images: Classification of stromal development. Patt Recogn 2009;42:1093–1103.
39. Yu KH, Zhang C, Berry GJ, Altman RB, Re C, Rubin DL, Snyder M. Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features. Nat Commun 2016;7:12474.
40. Wang D, Khosla A, Gargeya R, Irshad H, Beck AH. Deep learning for identifying metastatic breast cancer. arXiv preprint 2016; arXiv:1606.05718.
41. Chen R, Jing Y, Jackson H. Identifying metastases in sentinel lymph nodes with deep convolutional neural networks. arXiv preprint 2016; arXiv:1608.01658.