
Articles

https://doi.org/10.1038/s41591-019-0508-1

Clinical-grade computational pathology using weakly supervised deep learning on whole slide images

Gabriele Campanella1,2, Matthew G. Hanna1, Luke Geneslaw1, Allen Miraflor1, Vitor Werneck Krauss Silva1, Klaus J. Busam1, Edi Brogi1, Victor E. Reuter1, David S. Klimstra1 and Thomas J. Fuchs1,2*

The development of decision support systems for pathology and their deployment in clinical practice have been hindered by
the need for large manually annotated datasets. To overcome this problem, we present a multiple instance learning-based deep
learning system that uses only the reported diagnoses as labels for training, thereby avoiding expensive and time-consuming
pixel-wise manual annotations. We evaluated this framework at scale on a dataset of 44,732 whole slide images from 15,187
patients without any form of data curation. Tests on prostate cancer, basal cell carcinoma and breast cancer metastases to
axillary lymph nodes resulted in areas under the curve above 0.98 for all cancer types. Its clinical application would allow
pathologists to exclude 65–75% of slides while retaining 100% sensitivity. Our results show that this system has the ability
to train accurate classification models at unprecedented scale, laying the foundation for the deployment of computational
decision support systems in clinical practice.
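The abstract's screening claim (excluding 65–75% of slides while retaining 100% sensitivity) corresponds to choosing a decision threshold at the lowest-scoring positive slide. A minimal sketch of that operating-point calculation, using made-up slide scores rather than anything from the study:

```python
# Illustrative sketch (not the authors' code): pick the operating point that
# keeps 100% sensitivity, then measure how many slides could be excluded
# from manual review. Scores and labels below are invented.

def screening_threshold(scores, labels):
    """Highest threshold that still flags every positive slide:
    the minimum score among positives (100% sensitivity)."""
    return min(s for s, y in zip(scores, labels) if y == 1)

def fraction_excluded(scores, labels):
    """Fraction of slides scoring below the 100%-sensitivity threshold;
    these could be removed from the review queue without missing a cancer."""
    t = screening_threshold(scores, labels)
    return sum(1 for s in scores if s < t) / len(scores)

# Hypothetical tumor probabilities for eight slides (1 = cancer, 0 = benign).
scores = [0.97, 0.97, 0.62, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

print(screening_threshold(scores, labels))  # 0.62
print(fraction_excluded(scores, labels))    # 0.625, i.e. 5 of 8 slides excluded
```

In practice the threshold would be set on a validation set and the exclusion rate reported on a held-out test set, as the paper does.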

Pathology is the cornerstone of modern medicine and, in particular, cancer care. The pathologist's diagnosis on glass slides is the basis for clinical and pharmaceutical research and, more importantly, for the decision on how to treat the patient. Nevertheless, the standard practice of microscopy for diagnosis, grading and staging of cancer has remained nearly unchanged for a century [1,2]. While other medical disciplines, such as radiology, have a long history of research and clinical application of computational approaches, pathology has remained in the background of the digital revolution. Only in recent years has digital pathology emerged as a potential new standard of care where glass slides are digitized into whole slide images (WSIs) using digital slide scanners. As scanner technologies have become more reliable, and WSIs increasingly available in larger numbers, the field of computational pathology has emerged to facilitate computer-assisted diagnostics and to enable a digital workflow for pathologists [3–5]. These diagnostic decision support tools can be developed to empower pathologists' efficiency and accuracy to ultimately provide better patient care.

Traditionally, predictive models used in decision support systems for medical image analysis relied on manually engineered feature extraction based on expert knowledge. These approaches were intrinsically domain specific and their performance was, in general, not sufficient for clinical applications. This changed in recent years with the enormous success and advancement of deep learning [6] in solving image classification tasks, such as classification and categorization on ImageNet [7–10], where high-capacity deep neural network models have been reported to surpass human performance [10].

The medical image analysis field has seen widespread application of deep learning, showing in some cases that clinical impact can be achieved for diagnostic tasks. Notably, ref. 11 reported dermatologist-level diagnosis of dermoscopy images, while ref. 12 showed ophthalmologist-level performance on optical coherence tomography images.

Computational pathology, compared with other fields, has to face additional challenges related to the nature of pathology data generation. The lack of large annotated datasets is even more severe than in other domains. This is due in part to the novelty of digital pathology and the high cost associated with the digitization of glass slides. Furthermore, pathology images are tremendously large: glass slides scanned at 20× magnification (0.5 µm per pixel) produce image files of several gigapixels; about 470 WSIs contain roughly the same number of pixels as the entire ImageNet dataset. The peculiarity of pathology datasets has led most efforts in computational pathology to apply supervised learning for classifying small tiles within a WSI [13–22]. This usually requires extensive annotations at the pixel level by expert pathologists. For these reasons, state-of-the-art pathology datasets are small and heavily curated. The CAMELYON16 challenge for breast cancer metastasis detection [23] contains one of the largest labeled datasets in the field, with a total of 400 non-exhaustively annotated WSIs.

Applying deep learning for supervised classification on these small datasets has achieved encouraging results. Of note, the CAMELYON16 challenge reported performance on par with that of pathologists in discerning between benign tissue and metastatic breast cancer [23]. Yet, the applicability of these models in clinical practice remains in question because of the wide variance of clinical samples that is not captured in small datasets. Experiments presented in this article will substantiate this claim.

To properly address the shortcomings of current computational approaches and enable clinical deployment of decision support tools requires training and validation of models on large-scale datasets representative of the wide variability of cases encountered every day in the clinic. At that scale, reliance on expensive and

1Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, NY, USA. 2Weill Cornell Graduate School of Medical Sciences, New York, NY, USA. *e-mail: [email protected]

Nature Medicine | www.nature.com/naturemedicine



time-consuming, manual annotations is impossible. We address all of these issues by collecting a large computational pathology dataset and by proposing a new framework for training classification models at a very large scale without the need for pixel-level annotations. Furthermore, in light of the results we present in this work, we will formalize the concept of clinical-grade decision support systems, proposing, in contrast with the existing literature, a new measure for clinical applicability.

One of the main contributions of this work is the scale at which we learn classification models. We collected three datasets in the field of computational pathology: (1) a prostate core biopsy dataset consisting of 24,859 slides; (2) a skin dataset of 9,962 slides; and (3) a breast metastasis to lymph nodes dataset of 9,894 slides. Each of these datasets is at least one order of magnitude larger than all other datasets in the field. To put this in the context of other computer vision problems, we analyzed a number of pixels equivalent to 88 ImageNet datasets (Fig. 1a). It is important to stress that the data were not curated. The slides collected for each tissue type represent the equivalent of at least 1 year of clinical cases and are thus representative of slides generated in a true pathology laboratory, including common artifacts, such as air bubbles, microtomy knife slicing irregularities, fixation problems, cautery, folds and cracks, as well as digitization artifacts, such as striping and blurred regions. Across the three tissue types, we included 17,661 external slides, which were produced in the pathology laboratories of their respective institutions within the United States and another 44 countries (Extended Data Fig. 1), illustrating the unprecedented technical variability included in a computational pathology study.

The datasets chosen represent different but complementary views of clinical practice, and offer insight into the types of challenges a flexible and robust decision support system should be able to solve. Prostate cancer is the leading source of new cancer cases and the second most frequent cause of death among men after lung cancers [24]. Multiple studies have shown that prostate cancer diagnosis has a high inter- and intraobserver variability [25–27] and is frequently based on the presence of very small lesions that comprise <1% of the entire tissue surface area (Fig. 1b). Making diagnosis more reproducible and aiding in the diagnosis of cases with low tumor volume are examples of how decision support systems can improve patient care. The skin cancer basal cell carcinoma (BCC) rarely causes metastases or death [28]. In its most common form (nodular), pathologists can readily identify and diagnose the lesion. With approximately 4.3 million individuals diagnosed annually in the United States [29], it is the most common form of cancer. In this scenario, a decision support system should increase clinical efficiency by streamlining the work of the pathologist.

To fully leverage the scale of our datasets, it is unfeasible to rely on supervised learning, which requires manual annotations. Instead, we propose to use the slide-level diagnosis, which is readily available from anatomic pathology laboratory information systems (LISs) or electronic health records, to train a classification model in a weakly supervised manner. Crucially, diagnostic data retrieved from pathology reports are easily scalable, as opposed to expert annotation for supervised learning, which is time prohibitive at scale. To be more specific, the slide-level diagnosis casts a weak label on all tiles within a particular WSI. In addition, we know that if the slide is negative, all of its tiles must also be negative and not contain tumor. In contrast, if the slide is positive, it must be true that at least one of all of the possible tiles contains tumor. This formalization of the WSI classification problem is an example of the general standard multiple instance assumption, for which a solution was first described in ref. 30. Multiple instance learning (MIL) has since been widely applied in many machine learning domains, including computer vision [31–34].

Current methods for weakly supervised WSI classification rely on deep learning models trained under variants of the MIL assumption. Typically, a two-step approach is used, where first a classifier is trained with MIL at the tile level and then the predicted scores for each tile within a WSI are aggregated, usually by combining (pooling) their results with various strategies [35], or by learning a fusion model [36]. Inspired by these works, we developed a novel framework that leverages MIL to train deep neural networks, resulting in a semantically rich tile-level feature representation. These representations are then used in a recurrent neural network (RNN) to integrate the information across the whole slide and report the final classification result (Fig. 1c,d).

Results
Test performance of ResNet34 models trained with MIL for each tissue type. We trained ResNet34 models to classify tiles using MIL. At test time, a slide is predicted positive if at least one tile is predicted positive within that particular slide. This slide-level aggregation derives directly from the standard multiple instance assumption and is generally referred to as max-pooling. Performance on the test set was measured for models trained at different magnifications for each dataset (Extended Data Fig. 2). Histology contains information at different scales, and pathologists review patient tissue on glass slides at varying zoom levels. For example, in prostate histopathology, architectural and cytological features are both important for diagnosis and are more easily appreciated at different magnifications. For prostate, the highest magnification consistently gave better results (Extended Data Fig. 2a), while for BCC detection, 5× magnification showed higher accuracy (Extended Data Fig. 2b). Interestingly, the error modes on the test set across magnification conditions were complementary: in prostate, the 20× model performed better in terms of false negatives, while the 5× model performed better on false positives. Simple ensemble models were generated by max-pooling the response across the different magnifications. We note that these naive multiscale models outperformed the single-scale models for the prostate dataset in terms of accuracy and area under the curve (AUC), but not for the other datasets. Models trained at 20× achieved AUCs of 0.986, 0.986 and 0.965 on the test sets of the prostate, BCC and axillary lymph node datasets, respectively, highlighting the efficacy of the proposed method in discerning tumor regions from benign regions in a wide variety of tissue types.

Dataset size dependence of classification accuracy. We conducted experiments to determine whether the dataset was large enough to saturate the error rate on the validation set. For these experiments, the prostate dataset (excluding the test portion) was split into a common validation set with 2,000 slides and training sets of different sizes (100, 200, 500, 1,000, 2,000, 4,000, 6,000 and 8,000), with each training dataset being a superset of all of the previous datasets. The results indicate that while the validation error is starting to saturate, further improvement can be expected from even larger datasets than the one collected for this study (Fig. 2a). Although the number of slides needed to achieve satisfactory results may vary by tissue type, we observed that, in general, at least 10,000 slides are necessary for good performance.

Model introspection by visualization of the feature space in two dimensions. To gain insight into the model's representation of histopathology images, we visualized the learned feature space in two dimensions so that tiles that have similar features according to the model are shown close to each other (see Fig. 2b,c for the prostate model and Extended Data Fig. 3 for the BCC and axillary lymph nodes models). The prostate model shows a large region of different stroma tiles at the center of the plot in Fig. 2c, extending towards the top right corner. The top left corner is where benign-looking glands are represented. The bottom portion contains background and edge tiles. The discriminative tiles with high tumor probability

[Fig. 1 | panels. a, Dataset summary table:

Dataset                 Years       Slides   Patients   Positive slides   External slides   ImageNet
Prostate in house       2016        12,132   836        2,402             0                 19.8×
Prostate external       2015–2017   12,727   6,323      12,413            12,727            29.0×
Skin                    2016–2017   9,962    5,325      1,659             3,710             21.4×
Axillary lymph nodes    2013–2018   9,894    2,703      2,521             1,224             18.2×
Total                               44,732   15,187                                         88.4×

b, Biopsy slide (63,744 px / 31.9 mm by 28,649 px / 14.3 mm) with successive zooms (3,000 px / 1.5 mm; 1,200 px / 600 µm; 300 px / 150 µm). c, Schematic of MIL training: slide tiling, classifier (CNN), tile probabilities, ranked tiles, top tiles, slide targets. d, Schematic of MIL-RNN slide-level aggregation producing the diagnosis from tile-level tumor probabilities and the MIL feature representation.]
Fig. 1 | Overview of the data and proposed deep learning framework presented in this study. a, Description of the datasets. This study is based on a total
of 44,732 slides from 15,187 patients across three different tissue types: prostate, skin and axillary lymph nodes. The prostate dataset was divided into
in-house slides and consultation slides to test for staining bias. The class imbalance varied from 1:4 for prostate to 1:3 for breast. A total of 17,661 slides
were submitted to MSK from more than 800 outside institutions in 45 countries for a second opinion. To put the size of our dataset into context, the last
column shows a comparison, in terms of the pixel count, with ImageNet—the state of the art in computer vision, containing over 14 million images. b, Left,
hematoxylin and eosin slide of a biopsy showing prostatic adenocarcinoma. The diagnosis can be based on very small foci of cancer that account for <1%
of the tissue surface. In the slide to the left, only about six small tumor glands are present. The right-most image shows an example of a malignant gland.
Its relation to the entire slide is put in perspective to reiterate the difficulty of the task. c, The MIL training procedure includes a full inference pass through
the dataset, to rank the tiles according to their probability of being positive, and learning on the top-ranking tiles per slide. CNN, convolutional neural
network. d, Slide-level aggregation with a recurrent neural network (RNN). The S most suspicious tiles in each slide are sequentially passed to the RNN to
predict the final slide-level classification.
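The ranking step in Fig. 1c (score every tile, select the top-ranked tile per slide, train on those tiles with slide-level labels) and the max-pooling inference rule can be sketched in a few lines. This is an illustrative stand-in, not the authors' implementation: the `score` function below plays the role of the CNN, and tiles are plain dicts with a precomputed probability.

```python
# Sketch of the MIL ranking and max-pooling rules (stand-in code, not the
# paper's implementation). A tile scorer substitutes for the CNN.

def rank_tiles(slide_tiles, score_fn):
    """Sort a slide's tiles by predicted tumor probability, descending."""
    return sorted(slide_tiles, key=score_fn, reverse=True)

def top_tile(slide_tiles, score_fn):
    """The tile most likely to be tumor; under the standard MIL assumption
    it carries the slide label (a positive slide has at least one tumor tile),
    so it is the tile used for the next training step."""
    return rank_tiles(slide_tiles, score_fn)[0]

def slide_prediction(slide_tiles, score_fn, threshold=0.5):
    """Max-pooling aggregation: the slide is positive iff its top tile is."""
    return 1 if score_fn(top_tile(slide_tiles, score_fn)) >= threshold else 0

# Stand-in scorer: tiles are dicts with a precomputed 'p_tumor' field.
score = lambda tile: tile["p_tumor"]

slide = [{"id": 0, "p_tumor": 0.1}, {"id": 1, "p_tumor": 0.9}, {"id": 2, "p_tumor": 0.3}]
print(top_tile(slide, score)["id"])    # 1
print(slide_prediction(slide, score))  # 1 (positive)
```

In the full procedure this selection is repeated every epoch: a complete inference pass re-ranks all tiles, and the model is updated only on the current top-ranking tiles per slide.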

are clustered in two regions at the bottom and left of the plot. A closer look reveals the presence of malignant glands. Interestingly, a subset of the top-ranked tiles with a tumor probability close to 0.5, indicating uncertainty, are tiles that contain glands suspicious of being malignant.

Comparison of different slide aggregation approaches. The max-pooling operation that leads to the slide prediction under the MIL assumption is not robust. A single spurious misclassification can change the slide prediction, possibly resulting in a large number of false positives. One way to mitigate this type of mistake is to learn a slide aggregation model on top of the MIL classification results. For example, Hou et al. [36] learned a logistic regression based on the number of tiles per class as predicted by an ensemble of tile classifiers. Similarly, Wang et al. [18] extracted geometrical features from the tumor probability heat map generated by a tile-level classifier and trained a random forest model, winning the CAMELYON16 challenge. Following the latter approach, we trained a random forest model on manually engineered features extracted from the heat map generated by our MIL-based tile classifier. For prostate cancer classification, the random forest trained on the validation split at 20× magnification produced an AUC of 0.98 on the test set, which was not statistically significantly different from MIL alone (Extended Data Fig. 4). Although this procedure drastically decreased the false


[Fig. 2 | panels. a, Box plots of minimum balanced validation error versus number of training WSIs (10^2 to 10^4). b, Hexagonal heat map of the two-dimensional t-SNE embedding (t-SNE1 versus t-SNE2), with top-ranked tiles shown as points colored by tumor probability and bins shaded by count. c, Tiles sampled from different regions of the embedding (112 µm and 56 µm fields), grouped as malignant, benign and suspicious.]

Fig. 2 | Dataset size impact and model introspection. a, Dataset size plays an important role in achieving clinical-grade MIL classification performance.
Training of ResNet34 was performed with datasets of increasing size; for every reported training set size, five models were trained, and the validation
errors are reported as box plots (n = 5). This experiment underlies the fact that a large number of slides are necessary for generalization of learning
under the MIL assumption. b,c, The prostate model has learned a rich feature representation of histopathology tiles. b, A ResNet34 model trained at 20×
was used to obtain the feature embedding before the final classification layer for a random set of tiles in the test set (n = 182,912). The embedding was
reduced to two dimensions with t-SNE and plotted using a hexagonal heat map. Top-ranked tiles coming from negative and positive slides are represented
by points colored by their tumor probability. c, Tiles corresponding to points in the two-dimensional t-SNE space were randomly sampled from different
regions. Abnormal glands are clustered together on the bottom and left sides of the plot. A region of tiles with a tumor probability of ~0.5 contains glands
with features suspicious for prostatic adenocarcinoma. Normal glands are clustered on the top left region of the plot.
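The nested-subset design described in the Fig. 2a experiment (each training set a superset of all smaller ones) can be reproduced by shuffling the slide list once and taking prefixes. A small sketch with synthetic slide IDs, not the study's actual sampling code:

```python
# Sketch of the nested training-set construction for the dataset-size
# experiment: shuffle once, then take prefixes, so every larger set
# contains every smaller one and error differences reflect size alone.
# Slide IDs here are synthetic.

import random

def nested_training_sets(slide_ids, sizes, seed=0):
    """Return {size: list of slide IDs}, where each list is a prefix of a
    single shuffled ordering (hence a superset of all shorter prefixes)."""
    rng = random.Random(seed)
    shuffled = slide_ids[:]
    rng.shuffle(shuffled)
    return {n: shuffled[:n] for n in sizes}

sizes = [100, 200, 500, 1000, 2000, 4000, 6000, 8000]
sets_ = nested_training_sets(list(range(10_000)), sizes)

print([len(sets_[n]) for n in sizes])  # [100, 200, 500, 1000, 2000, 4000, 6000, 8000]
```

Fixing the seed makes the nesting reproducible; for the box plots in Fig. 2a, the study instead trained five models per size to capture run-to-run variation.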

positive rate, and at 20× achieved a better balanced error than the basic max-pooling aggregation, this came with an unacceptable decrease in sensitivity.

The previous aggregation methods do not take advantage of the information contained in the feature representation learned during training. Given a vector representation of tiles, even if singularly they were not classified as positive by the tile classifier, taken together they could be suspicious enough to trigger a positive response by a representation-based slide-level classifier. Based on these ideas and empirical support from ref. 37, we introduce an RNN-based model that can integrate information at the representation level to emit a final slide classification (Fig. 1d). Interestingly, information can also be integrated across the various magnifications to produce a multiscale classification. At 20×, the MIL-RNN models resulted in AUCs of 0.991, 0.989 and 0.965 for the prostate, BCC and breast metastases datasets, respectively (Fig. 3). For the prostate experiment, the MIL-RNN method was statistically significantly better than max-pooling aggregation. The multiscale approach was tested on the prostate

[Fig. 3 | panels a–c. ROC curves (sensitivity versus specificity, with inset zoom on the high-specificity region) comparing MIL and MIL-RNN models: a, prostate, MIL AUC 0.986 versus MIL-RNN AUC 0.991 (P = 0.00023); b, BCC, MIL AUC 0.986 versus MIL-RNN AUC 0.988 (P = 0.1); c, breast metastases, MIL AUC 0.965 versus MIL-RNN AUC 0.966 (P = 0.9).]

Fig. 3 | Weakly supervised models achieve high performance across all tissue types. The performances of the models trained at 20× magnification on
the respective test datasets were measured in terms of AUC for each tumor type. a, For prostate cancer (n = 1,784) the MIL-RNN model significantly
(P < 0.001) outperformed the model trained with MIL alone, resulting in an AUC of 0.991. b,c, The BCC model (n = 1,575) performed at 0.988 (b), while
breast metastases detection (n = 1,473) achieved an AUC of 0.966 (c). For these latter datasets, adding an RNN did not significantly improve performance.
Statistical significance was assessed using DeLong’s test for two correlated ROC curves.
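The AUC reported throughout equals the probability that a randomly chosen positive slide receives a higher score than a randomly chosen negative one (the Mann–Whitney U statistic divided by the number of positive–negative pairs), which is also the quantity DeLong's test compares. A minimal pure-Python sketch on invented scores, not the paper's evaluation code:

```python
# Rank-statistic AUC sketch (illustrative, not the paper's evaluation code):
# AUC = fraction of positive-negative pairs ranked correctly, ties = 1/2.

def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Invented slide scores: three positives, three negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(auc(scores, labels))  # 0.888... (8 of 9 pairs ranked correctly)
```

The quadratic pair loop is fine for illustration; on thousands of slides one would use the equivalent rank-sum formulation or a library routine.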

data, but its performance was not better than that achieved by the single-scale model trained at 20×.

Pathology expert analysis of the MIL-RNN error modes. Pathologists specialized in each discipline analyzed the test set errors made by MIL-RNN models trained at 20× magnification (a selection of cases is presented in Fig. 4a–c). Several discrepancies (six in prostate, eight in BCC and 23 in axillary lymph nodes; see Fig. 4d) were found between the reported case diagnosis and the true slide class (that is, presence/absence of tumor). Because the ground truth is reliant on the diagnosis reported in the LIS, the observed discrepancies can be due to several factors: (1) under the current WSI scanning protocol, as only select slides are scanned in each case, there exists the possibility of a mismatch between the slide scanned and the reported LIS diagnosis linked to each case; (2) a deeper slide level with no carcinoma present could be selected for scanning; and (3) tissue was removed to create tissue microarrays before slide scanning. Encouragingly, the training procedure proved robust to the ground truth noise in our datasets.

For the prostate model, three of the 12 false negatives were correctly predicted as negative by the algorithm. Three other slides showed atypical morphological features, but they were not sufficient to diagnose carcinoma. The confirmed six false negatives were characterized by having very low tumor volume. Taking into account the corrections to the ground truth, the AUC for the prostate test set improved from 0.991 to 0.994. The 72 false positives were reviewed as well. The algorithm falsely identified small foci of glands as cancer, focusing on small glands with hyperchromatic nuclei that contained at least a few cells with prominent nucleoli. Many of the flagged glands also showed intraluminal secretions. Overall, the algorithm was justified in reporting the majority of these cases as suspicious, thus fulfilling the requisites of a screening tool.

For the BCC model, four false negatives were corrected to true negatives, and four false positives were corrected to true positives. Given these corrections, the AUC improved from 0.988 to 0.994. The 12 cases determined to be false negatives were characterized by low tumor volume. The 15 false positives included squamous cell carcinomas and miscellaneous benign neoplastic and non-neoplastic skin lesions.

For the breast metastasis model, 17 of the initially classified false negatives were correctly classified as negatives, while four slides contained suspicious morphology that would likely require follow-up tests. A total of 21 false negatives were corrected to true negatives. In addition, two false positives were corrected to true positives. False negative to true negative corrections were due to the tissue of interest not being present on a deeper hematoxylin and eosin slide, or sampling error at the time the frozen section was prepared. False positive to true positive corrections were due to soft tissue metastatic deposits or tumor emboli. The AUC improved from 0.965 to 0.989 given these corrections. Of the 23 false negatives, eight were macro-metastasis, 13 were micro-metastasis and two were isolated tumor cells (ITCs). Notably, 12 cases (four false negatives and eight false positives) showed signs of treatment effect from neoadjuvant chemotherapy.

Investigation of technical variability introduced by slide preparation at multiple institutions and different scanners. Several sources of variability come into play in computational pathology. In addition to all of the morphological variability, technical variability is introduced during glass slide preparation and scanning. How this variability can affect the prediction of an assistive model is a question that must be investigated thoroughly.

Assessing the performance of models on slides digitized on different scanners is crucial for enabling the application of the same model in departments with varied scanner vendor workflows or smaller clinics that operate scanners from different vendors and do not have the infrastructure to train a model tailored to their needs. To test the effect of the whole slide scanner type on model performance, we scanned a substantial subset of the in-house prostate test set (1,274 out of 1,784) on a Philips IntelliSite Ultra Fast Scanner that was recently approved by the Food and Drug Administration for primary diagnostic use. We observed a decrease in performance in terms of AUC of 3% points (Fig. 5a and Extended Data Fig. 5a). Analyzing the mismatches between the predictions on Leica Aperio WSIs and their matching Philips digital slides revealed a perceived difference in brightness, contrast and sharpness that could affect the prediction performance. In practice, an effective solution to reducing the generalization error even further could be training on a mixed dataset or fine-tuning the model on data from the new scanner.

To measure the effects of slide preparation on model performance, we gathered a very large set consisting of over 12,000 prostate consultation slides submitted to the Memorial Sloan Kettering Cancer Center (MSK) from other institutions in the United States and abroad. It should be noted that these slides are typically diagnostically challenging and are the basis for the requested expert pathologist review. We applied the MIL-RNN model trained


[Fig. 4 | panels a–c. Example true positive, false negative and false positive tiles (200 µm scale bars) for prostate (a), BCC (b) and axillary lymph nodes (c). d, Pathology review of the misclassified test slides:

                             Prostate              BCC                   Axillary lymph nodes
                             FN        FP          FN        FP          FN        FP
Benign/negative              3         56          3         2           17        1
Atypical/other/suspicious    3         16          1         11          4         31
Carcinoma/positive           6         0           12        4           23        2
True error rate              6/345     72/1,439    12/255    13/1,320    23/403    32/1,070]
Fig. 4 | Pathology analysis of the misclassification errors on the test sets. a–c, Randomly selected examples of classification results on the test
set. Examples of true positive, false negative and false positive classifications are shown for each tumor type. The MIL-RNN model trained at 20×
magnification was run with a step size of 20 pixels across a region of interest, generating a tumor probability heat map. On every slide, the blue square
represents the enlarged area. For the prostate dataset (a), the true positive represents a difficult diagnosis due to tumor found next to atrophy and
inflammation; the false negative shows a very low tumor volume; and for the false positive the model identified atypical small acinar proliferation, showing
a small focus of glands with atypical epithelial cells. For the BCC dataset (b), the true positive has a low tumor volume; the false negative has a low tumor
volume; and for the false positive the tongue of the epithelium abutting from the base of the epidermis shows an architecture similar to BCC. For the
axillary lymph nodes dataset (c), the true positive shows ITCs with a neoadjuvant chemotherapy treatment effect; the false negative shows a slightly out of
focus cluster of ITCs missed due to the very low tumor volume and blurring; and the false positive shows displaced epithelium/benign papillary inclusion
in a lymph node. d, Subspecialty pathologists analyzed the slides that were misclassified by the MIL-RNN models. While slides can either be positive or
negative for a specific tumor, sometimes it is not possible to diagnose a single slide with certainty based on morphology alone. These cases were grouped
into the categories ‘atypical’ and ‘suspicious’ for prostate and breast lesions, respectively. The ‘other’ category consisted of skin biopsies that contained
tumors other than BCC. We observed that some of the misclassifications stem from incorrect ground truth labels.

at 20× to the large submitted slides dataset and observed a drop of about 6% points in terms of AUC (Fig. 5a and Extended Data Fig. 5a). Importantly, the decrease in performance was mostly seen in the specificity on the new test set, while sensitivity remained high.

Comparison of fully supervised learning with weakly supervised learning. To substantiate the claim that models trained under full supervision on small, curated datasets do not translate well to clinical practice, several experiments were performed with the CAMELYON16 dataset [23], which includes pixel-wise annotations for 270 training slides and is one of the largest annotated, public digital pathology datasets available. We implemented a model for automatic detection of metastatic breast cancer on the CAMELYON16 dataset, modeled after Wang et al. [18], the winning team of the CAMELYON16 challenge. The approach can be considered state of the art for this task and relies on fully supervised learning and pixel-level expert annotations. The main differences in our implementation of ref. 18 are the architecture used (ResNet34 instead of GoogLeNetv3), their usage of hard negative mining, and the features extracted to train the slide-level random forest classifier. Our implementation achieved an AUC of 0.930 on the CAMELYON16 test set, similar to the 0.925 achieved in ref. 18. This model would have won the classification portion of the CAMELYON16 challenge and would be ranked fifth on the open leaderboard. The same model, trained under full supervision on CAMELYON16, was applied to the MSK test set of the axillary lymph nodes dataset and resulted in an AUC of 0.727, constituting a 20% drop compared with its performance on the CAMELYON16 test set (Fig. 5b, right panel).

Nature Medicine | www.nature.com/naturemedicine


[Figure 5 bar charts. a, AUC of the prostate MIL model on the MSK in-house test set scanned on Aperio (n = 1,784), the in-house test set scanned on Philips (n = 1,274; −2.65%) and the MSK external test set (n = 12,727; −5.84%). b, Left: AUC of the MIL model trained on the MSK dataset, tested on the MSK test set (n = 1,473) and the CAMELYON16 test set (n = 129; −7.15%). Right: AUC of the fully supervised model trained on the CAMELYON16 dataset, tested on the CAMELYON16 test set (n = 129) and the MSK test set (n = 1,473; −20.2%).]

Fig. 5 | Weak supervision on large datasets leads to higher generalization performance than fully supervised learning on small curated datasets. The
generalization performance of the proposed prostate and breast models were evaluated on different external test sets. a, Results of the prostate model
trained with MIL on MSK in-house slides and tested on: (1) the in-house test set (n = 1,784) digitized on Leica Aperio AT2 scanners; (2) the in-house test
set digitized on a Philips Ultra Fast Scanner (n = 1,274); and (3) external slides submitted to MSK for consultation (n = 12,727). Performance in terms of
AUC decreased by 3 and 6% for the Philips scanner and external slides, respectively. b, Comparison of the proposed MIL approach with state-of-the-art
fully supervised learning for breast metastasis detection in lymph nodes. Left, the model was trained on MSK data with our proposed method (MIL-RNN)
and tested on the MSK breast data test set (n = 1,473) and on the test set of the CAMELYON16 challenge (n = 129), showing a decrease in AUC of 7%.
Right, a fully supervised model was trained following ref. 18 on CAMELYON16 training data. While the resulting model would have won the CAMELYON16
challenge (n = 129), its performance drops by over 20% when tested on a larger test set representing real-world clinical cases (n = 1,473). Error bars
represent 95% confidence intervals for the true AUC calculated by bootstrapping each test set.
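The paper computes these intervals with the pROC package in R; the same nonparametric, unstratified bootstrap can be sketched in Python using the Mann–Whitney formulation of the AUC. This is an illustrative sketch, not the authors' code; the toy labels/scores, `n_boot` and the seed are assumptions.

```python
import numpy as np

def auc_mann_whitney(y_true, y_score):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = y_score[y_true == 1][:, None]
    neg = y_score[y_true == 0][None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (nonparametric, unstratified resampling)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, n)      # resample slides with replacement
        yb, sb = y_true[idx], y_score[idx]
        if yb.min() == yb.max():         # need both classes in the resample
            continue
        aucs.append(auc_mann_whitney(yb, sb))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return auc_mann_whitney(y_true, y_score), (lo, hi)

# Toy data: 100 slides whose scores are informative but noisy
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 100)
s = 0.6 * y + 0.7 * rng.random(100)
auc, (lo, hi) = bootstrap_auc_ci(y, s)
```

Resampling whole slides (rather than tiles) mirrors the per-test-set bootstrap described in the caption.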

The reverse experiment, done by training our MIL model on the MSK axillary lymph node data and testing it on the CAMELYON16 test data, produced an AUC of 0.899, representing a much smaller drop in performance compared with the 0.965 on the MSK test set (Fig. 5b, left panel).

These results illustrate that current deep learning models, trained on small datasets, even with the advantage of exhaustive, pixel-wise labels, are not able to generalize to clinical-grade, real-world data. We hypothesize that small, well-curated datasets are not sufficient to capture the vast biological and morphological variability of cancer, as well as the technical variability introduced by the staining and preparation processes in histopathology. Our observations urge caution and in-depth evaluation on real-world datasets before applying deep learning models for decision support in clinical practice. These results also show that weakly supervised approaches such as the one proposed here have a clear advantage over conventional fully supervised learning in that they enable training on massive, diverse datasets without the necessity for data curation.

Discussion
The main hypothesis addressed in this work is that clinical-grade performance can be reached without annotating WSIs at the pixel level. To test our hypothesis, we developed a deep learning framework that combines convolutional neural networks with RNNs under a MIL approach. We compiled a large dataset comprising 44,732 slides from 15,187 patients across three different cancer types. We built a state-of-the-art compute cluster that was essential for the feasibility of the project. Extensive validation experiments confirmed the hypothesis and showed that clinical-grade decision support is feasible.

The implications of these results are wide ranging. (1) The fact that manual pixel-level annotation is not necessary allows for the compilation of datasets that are magnitudes larger than in previous studies. (2) This, in turn, allows our algorithm to learn from the full breadth of slides presented to clinicians from real-life clinical practice, representing the full wealth of biological and technical variability. (3) As a result, no data curation is necessary because the model can learn that artifacts are not important for the classification task. (4) The previous two points allow the model trained with the proposed method to generalize better to real data that would be observed in pathology practice. (5) The generalization performance is clinically relevant with AUCs greater than 0.98 for all cancer types tested. (6) We rigorously define clinical grade and propose a strategy to integrate this system in the clinical work flow.

Most literature refers to clinical grade in terms of comparison with a human performing the same task, usually under time or other constraints. We suggest that these comparisons are artificial and offer little insight into how to use such systems in clinical practice. We propose a different approach to measure clinical-grade performance. In clinical practice, a case, especially if challenging, is reviewed by multiple pathologists with the help of immunohistochemistry and molecular information in addition to hematoxylin and eosin morphology. On the basis of this companion information, one can assume that a team of pathologists at a comprehensive cancer center will operate with 100% sensitivity and specificity. Under these assumptions, clinical grade for a decision support system does not mean surpassing the performance of pathologists, which is impossible, but achieving 100% sensitivity with an acceptable false positive rate. This formulation lends itself to a clinical application as follows.

At a fully operational digital pathology department, the predictive model is run on each scanned slide. The algorithm sorts cases, and slides within each case, based on the predicted tumor probability, as soon as they are available from the pathology laboratory. During diagnostic reporting, the pathologist is presented with the model's recommendations through an interface that would flag positive slides for rapid review in a screening scenario, or disregard all benign slides in a diagnostic scenario. In this latter case, we show


[Figure 6 panels. a, Schematic of cases ordered by predicted tumor probability and split into predicted positive and predicted negative. b, Sensitivity and tumor probability plotted against the percentage of slides reviewed.]

Fig. 6 | Impact of the proposed decision support system on clinical practice. a, By ordering the cases, and slides within each case, based on their tumor
probability, pathologists can focus their attention on slides that are probably positive for cancer. b, Following the algorithm’s prediction would allow
pathologists to potentially ignore more than 75% of the slides while retaining 100% sensitivity for prostate cancer at the case level (n = 1,784).
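The triage logic behind Fig. 6 can be sketched as follows: sort slides by descending predicted tumor probability and find how deep into the ranking a reviewer must go before every true positive has been seen. This is an illustrative sketch with synthetic labels and scores (assumptions, not the study's data).

```python
import numpy as np

def reviewable_fraction_at_full_sensitivity(y_true, y_score):
    """Fraction of slides that must be reviewed, in descending-probability order,
    so that every positive slide is included (that is, 100% sensitivity)."""
    order = np.argsort(-np.asarray(y_score))      # highest tumor probability first
    sorted_labels = np.asarray(y_true)[order]
    positives = np.flatnonzero(sorted_labels == 1)
    if len(positives) == 0:
        return 0.0
    last_positive = positives[-1]                 # deepest positive in the ranking
    return (last_positive + 1) / len(sorted_labels)

# Synthetic example: 20 slides, 4 positives scored above all negatives
y = np.array([1] * 4 + [0] * 16)
scores = np.concatenate([np.linspace(0.9, 0.99, 4), np.linspace(0.01, 0.4, 16)])
frac = reviewable_fraction_at_full_sensitivity(y, scores)
```

With perfectly separated scores, only the four positive slides (20% of the workload) need review; in the paper's prostate test set this fraction is below 25%.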

in Fig. 6 (see Extended Data Fig. 6 for BCC and breast metastases) that our prostate model would allow the removal of more than 75% of the slides from the workload of a pathologist without any loss in sensitivity at the patient level. For pathologists who must operate in the increasingly complex, detailed and data-driven environment of cancer diagnostics, tools such as this will allow non-subspecialized pathologists to confidently and efficiently classify cancer with 100% sensitivity.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, statements of code and data availability and associated accession codes are available at https://doi.org/10.1038/s41591-019-0508-1.

Received: 23 October 2018; Accepted: 3 June 2019; Published: xx xx xxxx

References
1. Ball, C. S. The early history of the compound microscope. Bios 37, 51–60 (1966).
2. Hajdu, S. I. Microscopic contributions of pioneer pathologists. Ann. Clin. Lab. Sci. 41, 201–206 (2011).
3. Fuchs, T. J., Wild, P. J., Moch, H. & Buhmann, J. M. Computational pathology analysis of tissue microarrays predicts survival of renal clear cell carcinoma patients. In Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention 1–8 (Lecture Notes in Computer Science Vol. 5242, Springer, 2008).
4. Fuchs, T. J. & Buhmann, J. M. Computational pathology: challenges and promises for tissue analysis. Comput. Med. Imaging Graph. 35, 515–530 (2011).
5. Louis, D. N. et al. Computational pathology: a path ahead. Arch. Pathol. Lab. Med. 140, 41–50 (2016).
6. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
7. Deng, J. et al. ImageNet: a large-scale hierarchical image database. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, 2009).
8. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 1097–1105 (2012).
9. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. Preprint at https://arxiv.org/abs/1409.1556 (2014).
10. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. Preprint at https://arxiv.org/abs/1512.03385 (2015).
11. Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
12. De Fauw, J. et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24, 1342–1350 (2018).
13. Liu, Y. et al. Detecting cancer metastases on gigapixel pathology images. Preprint at https://arxiv.org/abs/1703.02442 (2017).
14. Das, K., Karri, S. P. K., Guha Roy, A., Chatterjee, J. & Sheet, D. Classifying histopathology whole-slides using fusion of decisions from deep convolutional network on a collection of random multi-views at multi-magnification. In 2017 IEEE 14th International Symposium on Biomedical Imaging 1024–1027 (IEEE, 2017).
15. Valkonen, M. et al. Metastasis detection from whole slide images using local features and random forests. Cytom. Part A 91, 555–565 (2017).
16. Bejnordi, B. E. et al. Using deep convolutional neural networks to identify and classify tumor-associated stroma in diagnostic breast biopsies. Mod. Pathol. 31, 1502–1512 (2018).
17. Mobadersany, P. et al. Predicting cancer outcomes from histology and genomics using convolutional networks. Proc. Natl Acad. Sci. USA 115, E2970–E2979 (2018).
18. Wang, D., Khosla, A., Gargeya, R., Irshad, H. & Beck, A. H. Deep learning for identifying metastatic breast cancer. Preprint at https://arxiv.org/abs/1606.05718 (2016).
19. Janowczyk, A. & Madabhushi, A. Deep learning for digital pathology image analysis: a comprehensive tutorial with selected use cases. J. Pathol. Inform. 7, 29 (2016).
20. Litjens, G. et al. Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Sci. Rep. 6, 26286 (2016).
21. Coudray, N. et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat. Med. 24, 1559–1567 (2018).
22. Olsen, T. et al. Diagnostic performance of deep learning algorithms applied to three common diagnoses in dermatopathology. J. Pathol. Inform. 9, 32 (2018).
23. Ehteshami Bejnordi, B. et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. J. Am. Med. Assoc. 318, 2199–2210 (2017).
24. Siegel, R. L., Miller, K. D. & Jemal, A. Cancer statistics, 2016. CA Cancer J. Clin. 66, 7–30 (2016).
25. Ozdamar, S. O. et al. Intraobserver and interobserver reproducibility of WHO and Gleason histologic grading systems in prostatic adenocarcinomas. Int. Urol. Nephrol. 28, 73–77 (1996).
26. Svanholm, H. & Mygind, H. Prostatic carcinoma reproducibility of histologic grading. APMIS 93, 67–71 (1985).
27. Gleason, D. F. Histologic grading of prostate cancer: a perspective. Hum. Pathol. 23, 273–279 (1992).
28. LeBoit, P. E. et al. Pathology and Genetics of Skin Tumours (IARC Press, 2006).
29. Rogers, H. W., Weinstock, M. A., Feldman, S. R. & Coldiron, B. M. Incidence estimate of nonmelanoma skin cancer (keratinocyte carcinomas) in the US population, 2012. JAMA Dermatol. 151, 1081–1086 (2015).
30. Dietterich, T. G., Lathrop, R. H. & Lozano-Pérez, T. Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 89, 31–71 (1997).
31. Andrews, S., Hofmann, T. & Tsochantaridis, I. Multiple instance learning with generalized support vector machines. In AAAI/IAAI 943–944 (AAAI, 2002).
32. Nakul, V. Learning from Data with Low Intrinsic Dimension (Univ. California, 2012).

33. Zhang, C., Platt, J. C. & Viola, P. A. Multiple instance boosting for object detection. Adv. Neural Inf. Process. Syst. 1417–1424 (2006).
34. Zhang, Q. & Goldman, S. A. EM-DD: an improved multiple-instance learning technique. Adv. Neural Inf. Process. Syst. 1073–1080 (2002).
35. Kraus, O. Z., Ba, J. L. & Frey, B. J. Classifying and segmenting microscopy images with deep multiple instance learning. Bioinformatics 32, i52–i59 (2016).
36. Hou, L. et al. Patch-based convolutional neural network for whole slide tissue image classification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 2424–2433 (IEEE, 2016).
37. Bychkov, D. et al. Deep learning based tissue analysis predicts outcome in colorectal cancer. Sci. Rep. 8, 3395 (2018).

Acknowledgements
We thank The Warren Alpert Center for Digital and Computational Pathology and MSK's high-performance computing team for their support. We also thank J. Samboy for leading the digital scanning initiative and E. Stamelos and F. Cao, from the pathology informatics team at MSK, for their invaluable help querying the digital slide and LIS databases. We are indebted to P. Schueffler for extending the digital whole slide viewer specifically for this study and for supporting its use by the whole research team. Finally, we thank C. Virgo for managing the project, D. V. K. Yarlagadda for development support and D. Schnau for help editing the manuscript. This research was funded in part through the NIH/NCI Cancer Center Support Grant P30 CA008748.

Author contributions
G.C. and T.J.F. designed the experiments. G.C. wrote the code, performed the experiments and analyzed the results. L.G. queried MSK's WSI database and transferred the digital slides to the compute cluster. V.W.K.S. and V.E.R. reviewed the prostate cases. K.J.B. reviewed the BCC cases. M.G.H. and E.B. reviewed the breast metastasis cases. A.M. classified the free text diagnoses for the BCC cases. G.C., D.S.K. and T.J.F. conceived the project. All authors contributed to preparation of the manuscript.

Competing interests
T.J.F. is the Chief Scientific Officer of Paige.AI. T.J.F. and D.S.K. are co-founders and equity holders of Paige.AI. M.G.H., V.W.K.S., D.S.K. and V.E.R. are consultants for Paige.AI. V.E.R. is a consultant for Cepheid. M.G.H. is on the medical advisory board of PathPresenter. D.S.K. has received speaking/consulting compensation from Merck. G.C. and T.J.F. have intellectual property interests relevant to the work that is the subject of this paper. MSK has financial interests in Paige.AI and intellectual property interests relevant to the work that is the subject of this paper.

Additional information
Extended data is available for this paper at https://doi.org/10.1038/s41591-019-0508-1.
Supplementary information is available for this paper at https://doi.org/10.1038/s41591-019-0508-1.
Reprints and permissions information is available at www.nature.com/reprints.
Correspondence and requests for materials should be addressed to T.J.F.
Peer review information: Javier Carmona was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© The Author(s), under exclusive licence to Springer Nature America, Inc. 2019

Methods
Hardware and software. All experiments were conducted on MSK's high-performance computing cluster. In particular, we took advantage of seven NVIDIA DGX-1 compute nodes, each containing eight V100 Volta graphics processing units (GPUs) and 8 TB SSD local storage. Each model was trained on a single GPU. We used OpenSlide38 (version 3.4.1) to access the WSI files on the fly, and PyTorch39 (version 1.0) for data loading, building models and training. The final statistical analysis was performed in R40 (version 3.3.3), using pROC41 (version 1.9.1) for receiver operating characteristic (ROC) statistics and ggplot2 (version 3.0.0)42 for generating plots.

Statistics. AUCs for the various ROC curves were calculated in R with pROC. Confidence intervals were computed with the pROC package41 using bootstrapping with nonparametric, unstratified resampling, as described by Carpenter and Bithell43. Pairs of AUCs were compared with the pROC package41 using the two-tailed DeLong's test for two correlated ROC curves44.

WSI datasets. We collected three large datasets of hematoxylin and eosin-stained digital slides for the following tasks: (1) prostatic carcinoma classification; (2) BCC classification; and (3) the detection of breast cancer metastasis in axillary lymph nodes. A description is given in Fig. 1a. Unless otherwise stated, glass slides were scanned at MSK with Leica Aperio AT2 scanners at 20× equivalent magnification (0.5 µm pixel−1). Each dataset was randomly divided at the patient level into training (70%), validation (15%) and test (15%) sets. The training and validation sets were used for hyper-parameter tuning and model selection. The final models were run once on the test set to estimate generalization performance.

The prostate dataset consisted of 12,132 core needle biopsy slides produced and scanned at MSK (we refer to these as in-house slides). A subset of 2,402 slides were positive for prostatic carcinoma (that is, contained Gleason patterns 3 and above). An in-depth stratification by Gleason grade and tumor size is included in Supplementary Table 1. In addition to the in-house set, we also retrieved a set of 12,727 prostate core needle biopsies submitted to MSK for a second opinion from other institutions around the world. These slides were produced at their respective institutions but scanned on the whole slide scanners at MSK. For prostate only, the external slides were not used during training, but only at test time to estimate generalization to various sources of technical variability in glass slide preparation. A portion of the prostate test set (1,274 out of 1,784 slides) was scanned on a Philips IntelliSite Ultra Fast Scanner to test generalization performance to scanning variability.

The skin dataset consisted of 9,962 slides from biopsies and excisions of a wide range of neoplastic and non-neoplastic skin lesions, including 1,659 BCCs, with all common histological variants (superficial, nodular, micronodular and infiltrative) represented. The breast cancer metastases dataset of axillary lymph nodes consisted of 9,894 slides, 2,521 of which contained macro-metastases, micro-metastases or ITCs. Included in this dataset were slides generated from intraoperative consultations (for example, frozen section slides), in which the quality of staining varied from the standardized hematoxylin and eosin staining protocols used on slides from formalin-fixed, paraffin-embedded tissue. The dataset also included patients treated with neoadjuvant chemotherapy, which may be diagnostically challenging in routine pathology practice (that is, a small volume of metastatic tumor and therapy-related changes in tumor morphology) and is known to lead to high false negative rates45. For the skin and axillary lymph nodes data, external slides were included during training.

Slide diagnosis retrieval. Pathology reports are recorded in the LIS of the pathology department. For the prostate and axillary lymph nodes datasets, the ground truth labels (that is, the slide-level diagnoses) are retrieved directly by querying the LIS database. This is made possible by the structured nature of the reporting done for these subspecialties. In dermatopathology, BCCs are not reported in structured form. To overcome this problem, a trained dermatopathologist (A.M.) checked the free text diagnoses and assigned final binary labels to each case manually.

Dataset curation. The datasets were not curated, to test the applicability of the proposed system in a real-world, clinical scenario. Across all datasets, fewer than ten slides were removed due to excessive pen markings.

MIL-based slide diagnosis. Classification of a whole digital slide (for example, a WSI) based on a tile-level classifier can be formalized under the classic MIL approach when only the slide-level class is known and the classes of each tile in the slide are unknown. Each slide s_i from our slide pool S = {s_i : i = 1, 2, …, n} can be considered a bag consisting of a multitude of instances (we used tiles of size 224 × 224 pixels). For positive bags, there must exist at least one instance that is classified as positive by some classifier. For negative bags, instead, all instances must be classified as negative. Given a bag, all instances are exhaustively classified and ranked according to their probability of being positive. If the bag is positive, the top-ranked instance should have a probability of being positive that approaches 1; if it is negative, its probability of being positive should approach 0. Solving the MIL task induces the learning of a tile-level representation that can linearly separate the discriminative tiles in positive slides from all other tiles. This representation will be used as input to an RNN. The complete pipeline for the MIL classification algorithm (Fig. 1c) comprises the following steps: (1) tiling of each slide in the dataset (for each epoch, which consists of an entire pass through the training data); (2) a complete inference pass through all of the data; (3) intra-slide ranking of instances; and (4) model learning based on the top-ranked instance for each slide.

Slide tiling. The instances were generated by tiling each slide on a grid (Extended Data Fig. 7). Otsu's method is used to threshold the slide thumbnail image to efficiently discard all background tiles, thus drastically reducing the amount of computation per slide. Tiling can be performed at different magnification levels and with various levels of overlap between adjacent tiles. We investigated three magnification levels (5×, 10× and 20×). The amount of overlap used was different at each magnification during training and validation: no overlap at 20×, 50% overlap at 10× and 67% overlap at 5×. For testing, we used 80% overlap at every magnification. Given a tiling strategy, we produce bags B = {B_{s_i} : i = 1, 2, …, n}, where B_{s_i} = {b_{i,1}, b_{i,2}, …, b_{i,m_i}} is the bag for slide s_i containing m_i total tiles.

Model training. The model is a function f_θ with current parameters θ that maps input tiles b_{i,j} to class probabilities for 'negative' and 'positive' classes. Given our bags B, we obtain a list of vectors O = {o_i : i = 1, 2, …, n}, one for each slide s_i, containing the probabilities of class 'positive' for each tile b_{i,j} : j = 1, 2, …, m_i in B_{s_i}. We then obtain the index k_i of the tile within each slide that shows the highest probability of being 'positive': k_i = argmax(o_i). This is the most stringent version of MIL, but we can relax the standard MIL assumption by introducing a hyper-parameter K and assume that at least K tiles exist in positive slides that are discriminative. For K = 1, the highest-ranking tile in bag B_{s_i} is then b_{i,k_i}. The output of the network ỹ_i = f_θ(b_{i,k_i}) can then be compared to y_i, the target of slide s_i, through the cross-entropy loss l as in equation (1). Similarly, if K > 1, all selected tiles from a slide share the same target y_i and the loss can be computed with equation (1) for each one of the K tiles:

$l = -w_1[y_i \log \tilde{y}_i] - w_0[(1 - y_i)\log(1 - \tilde{y}_i)]$    (1)

Given the unbalanced frequency of classes, weights w_0 and w_1, for the negative and positive classes, respectively, can be used to give more importance to the under-represented examples. The final loss is the weighted average of the losses over a mini-batch. Minimization of the loss is achieved via stochastic gradient descent (SGD) using the Adam optimizer and a learning rate of 0.0001. We used mini-batches of size 512 for AlexNet, 256 for ResNets and 128 for VGGs and DenseNet201. All models were initialized with ImageNet-pretrained weights. Early stopping was used to avoid overfitting.

Model testing. At validation/test time, all of the tiles for each slide are fed through the network. Given a threshold (usually 0.5), if at least one tile is positive, the entire slide is called positive; if all of the instances are negative, the slide is negative. In addition, we assume the probability of a slide being positive to be the highest probability among all of the tiles in that slide. This max-pooling over the tile probability is the simplest aggregation technique. We explore different aggregation techniques below.

Naive multiscale aggregation. Given models f_{20×}, f_{10×} and f_{5×} trained at 20×, 10× and 5× magnifications, a multiscale ensemble can be created by pooling the predictions of each model with an operator. We used average and max-pooling to obtain naive multiscale models.

Random forest-based slide integration. Given a model f trained at a particular resolution, and a WSI, we can obtain a heat map of tumor probability over the slide. We can then extract several features from the heat map to train a slide aggregation model. For example, Hou et al.36 used the count of tiles in each class to train a logistic regression model. Here, we extend that approach by adding several global and local features, and train a random forest to emit a slide diagnosis. The features extracted are: (1) total count of tiles with probability ≥0.5; (2–11) ten-bin histogram of tile probability; (12–30) count of connected components at a probability threshold of 0.1 with sizes in the ranges 1–10, 11–15, 16–20, 21–25, 26–30, 31–40, 41–50, 51–60, 61–70 and >70, respectively; (31–40) ten-bin local histogram with a window of size 3 × 3 aggregated by max-pooling; (41–50) ten-bin local histogram with a window of size 3 × 3 aggregated by averaging; (51–60) ten-bin local histogram with a window of size 5 × 5 aggregated by max-pooling; (61–70) ten-bin local histogram with a window of size 5 × 5 aggregated by averaging; (71–80) ten-bin local histogram with a window of size 7 × 7 aggregated by max-pooling; (81–90) ten-bin local histogram with a window of size 7 × 7 aggregated by averaging; (91–100) ten-bin local histogram with a window of size 9 × 9 aggregated by max-pooling; (101–110) ten-bin local histogram with a window of size 9 × 9 aggregated by averaging; (111–120) ten-bin histogram of all tissue edge tiles; (121–130) ten-bin local histogram of edges with a linear window of size 3 × 3 aggregated by max-pooling; (131–140) ten-bin local histogram of edges with a linear window of size 3 × 3 aggregated by averaging; (141–150) ten-bin local histogram of edges with a linear window of size 5 × 5 aggregated by max-pooling; (151–160) ten-bin local histogram of edges with a linear window of size 5 × 5 aggregated by

averaging; (161–170) ten-bin local histogram of edges with a linear window of size 7 × 7 aggregated by max-pooling; and (171–180) ten-bin local histogram of edges with a linear window of size 7 × 7 aggregated by averaging. The random forest was learned on the validation set instead of the training set to avoid over-fitting.

RNN-based slide integration. Model f mapping a tile to class probability consists of two parts: a feature extractor f_F that transforms the pixel space to representation space, and a linear classifier f_C that projects the representation variables into the class probabilities. The output of f_F for the ResNet34 architecture is a 512-dimensional vector representation. Given a slide and model f, we can obtain a list of the S most interesting tiles within the slide in terms of positive class probability. The ordered sequence of vector representations e = e_1, e_2, …, e_S is the input to an RNN along with a state vector h. The state vector is initialized with a zero vector. Then, for step i = 1, 2, …, S of the recurrent forward pass, the new state vector h_i is given by equation (2):

$h_i = \mathrm{ReLU}(W_e e_i + W_h h_{i-1} + b)$    (2)

where W_e and W_h are the weights of the RNN model. At step S, the slide classification is simply o = W_o h_S, where W_o maps a state vector to class probabilities. With S = 1, the model does not recur and the RNN should learn the f_C classifier. This approach can be easily extended to integrate information at multiple scales. Given models f_{20×}, f_{10×} and f_{5×} trained at 20×, 10× and 5× magnifications, we obtain the S most interesting tiles from a slide by averaging the predictions of the three models on tiles extracted at the same center pixel but at different magnifications. Now, the inputs to the RNN at each step i are e_{20×,i}, e_{10×,i}, e_{5×,i} and the state vector h_{i−1}. The new state vector is then given by equation (3):

$h_i = \mathrm{ReLU}(W_{20\times} e_{20\times,i} + W_{10\times} e_{10\times,i} + W_{5\times} e_{5\times,i} + W_h h_{i-1} + b)$    (3)

In all of the experiments, we used 128-dimensional vectors for the state representation of the recurrent unit, ten recurrent steps (S = 10), and weighted the positive class to give more importance to the sensitivity of the model. All RNN models were trained with cross-entropy loss and SGD with a batch size of 256.

MIL exploratory experiments. We performed a set of exploratory experiments on the prostate dataset. At least five training runs were completed for each condition. The minimum balanced error on the validation set for each run was used to

slides scanned at MSK, a tiling method was developed to extract tiles containing tissue from both inside and outside the annotated regions at MSK's 20× equivalent magnification (0.5 µm pixel−1) to enable direct comparison with our datasets. This method generates a grid of possible tiles, excludes background via Otsu thresholding and determines whether a tile is inside an annotation region by solving a point-in-polygon problem.

We used 80% of the training data to train our model, and we left 20% for model selection. We extracted at random 1,000 tiles from each negative slide, and 1,000 negative tiles and 1,000 positive tiles from the positive slides. A ResNet34 model was trained augmenting the dataset on the fly with 90° rotations, horizontal flips and color jitter. The model was optimized with SGD. The best-performing model on the validation set was selected. Slide-level predictions were generated with the random forest aggregation approach explained before and trained on the entire training portion of the CAMELYON16 dataset. To train the random forest model, we exhaustively tiled with no overlap the training slides to generate the tumor probability maps. The trained random forest was then evaluated on the CAMELYON16 test dataset and on our large breast lymph node metastasis test datasets.

Data protection. This project was governed by an Institutional Review Board-approved retrospective research protocol under which consent/authorization was waived before research was carried out. All data collection, research and analysis was conducted exclusively at MSK. All publicly shared WSIs were de-identified and do not contain any protected health information or label text.

Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The publicly shared MSK breast cancer metastases dataset is available at http://thomasfuchslab.org/data/. The dataset consists of 130 de-identified WSIs of axillary lymph node specimens from 78 patients (see Extended Data Fig. 8). The tissue was stained with hematoxylin and eosin and scanned on Leica Biosystems AT2 digital slide scanners at MSK. Metastatic carcinoma is present in 36 whole slides from 27 patients, and the corresponding label is included in the dataset. The remaining data that support the findings of this study were offered to editors and peer reviewers at the time of submission for the purposes of evaluating
decide the best condition in each experiment. ResNet34 achieved the best results the manuscript upon request. The remaining data are not publicly available, in
over other architectures tested. The relative balanced error rates with respect to accordance with institutional requirements governing human subject privacy
ResNet34 were: +0.0738 for AlexNet, −0.003 for VGG11BN, +0.025 for ResNet18, protection.
+0.0265 for ResNet101 and +0.0085 for DenseNet201. Using a class-weighted loss
led to better performance overall, and we adopted weights in the range of 0.80–0.95
in subsequent experiments. Given the scale of our data, augmenting the data with
Code availability
The source code of this work can be downloaded from https://github.com/
rotations and flips did not significantly affect the results: the best balanced error
MSKCC-Computational-Pathology/MIL-nature-medicine-2019.
rate on the model trained with augmentation was 0.0095 higher than without
augmentation. During training, we weighted the false negative errors more heavily
to obtain models with high sensitivity. References
38. Goode, A., Gilbert., B., Harkes, J., Jukic., D. & Satyanarayanan., M.
Visualization of feature space. For each dataset, we sampled 100 tiles from each OpenSlide: a vendor-neutral software foundation for digital pathology.
test slide, in addition to its top-ranked tile. Given the trained 20× models, we J. Pathol. Inform. 4, 27 (2013).
extracted for each of the sampled tiles the final feature embedding before the 39. Paszke, A. et al. Automatic differentiation in PyTorch. In 31st Conference on
classification layer. We used t-distributed stochastic neighbor embedding (t-SNE)46 Neural Information Processing Systems (2017).
for dimensionality reduction to two dimensions. 40. R Development Core Team R: A Language and Environment for Statistical
Computing (R Foundation for Statistical Computing, 2017).
Pathology analysis of model errors. A genitourinary subspecialized pathologist 41. Robin, X. et al. pROC: an open-source package for R and S+ to analyze and
(V.E.R.) reviewed the prostate cases. A dermatopathology subspecialized compare ROC curves. BMC Bioinformatics 12, 77 (2011).
pathologist (K.J.B.) reviewed the BCC cases. Two breast subspecialized pathologists 42. Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer, 2016).
(E.B. and M.G.H.) jointly reviewed the breast cases. For each tissue type, the 43. Carpenter, J. & Bithell, J. Bootstrap confidence intervals: when, which, what?
respective pathologists were presented with all of the test errors and a randomly A practical guide for medical statisticians. Stat. Med. 19, 1141–1164 (2000).
selected sample of 20 true positives. They were tasked with evaluating the 44. DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas
model’s predictions and interpreting possible systematic error modalities. During under two or more correlated receiver operating characteristic curves: a
the analysis, the pathologists had access to the model’s prediction and the full nonparametric approach. Biometrics 44, 837–845 (1988).
pathology report for each case. 45. Yu, Y. et al. Sentinel lymph node biopsy after neoadjuvant chemotherapy for
breast cancer: retrospective comparative evaluation of clinically axillary
CAMELYON16 experiments. The CAMELYON16 dataset consists of 400 total lymph node positive and negative patients, including those with axillary
patients for whom a single WSI is provided in a tag image file format (TIFF). lymph node metastases confirmed by fine needle aspiration. BMC Cancer 16,
Annotations are given in extensible markup language (XML) format, one per each 808 (2016).
positive slide. For each annotation, several regions, defined by vertex coordinates, 46. Van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach.
may be present. Since these slides were scanned at a higher resolution than the Learn. Res. 9, 2579–2605 (2008).
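The slide-level recurrence in equations (2) and (3) is small enough to sketch directly. Below is a minimal pure-Python illustration of equation (2); the published implementation uses PyTorch, and the weights, dimensions and function names here are invented for illustration only.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    # Multiply a weight matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def vadd(*vs):
    return [sum(xs) for xs in zip(*vs)]

def rnn_aggregate(tiles, We, Wh, Wo, b):
    """Slide-level aggregation per equation (2):
    h_i = ReLU(We e_i + Wh h_{i-1} + b), output o = Wo h_S.
    `tiles` holds the embeddings of the S top-ranked tiles."""
    h = [0.0] * len(b)  # h_0 is the zero vector
    for e in tiles:     # one recurrent step per tile, i = 1 ... S
        h = relu(vadd(matvec(We, e), matvec(Wh, h), b))
    return matvec(Wo, h)  # unnormalized class scores
```

In the paper's experiments the state vector has 128 dimensions and S = 10; the multiscale variant of equation (3) adds three per-magnification terms to the pre-activation sum instead of one.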

Nature Medicine | www.nature.com/naturemedicine



Extended Data Fig. 1 | Geographical distribution of the external consultation slides submitted to MSKCC. We included in our work a total of 17,661
consultation slides: 17,363 came from other US institutions located across 48 US states, Washington DC and Puerto Rico; 298 cases came from
international institutions spread across 44 countries in all continents. a, Distribution of consultation slides coming from other US institutions. Top,
geographical distribution of slides in the continental United States. Red points correspond to pathology laboratories. Bottom, consultation slides
distribution per state (including Washington DC and Puerto Rico). b, Distribution of consultation slides coming from international institutions. Top,
geographical locations of consultation slides across the world (light gray, countries that did not contribute slides; light blue, countries that contributed
slides; dark blue, United States). Bottom, distribution of external consultation slides per country of origin (excluding the United States).


Extended Data Fig. 2 | MIL model classification performance for different cancer datasets. Performance on the respective test datasets was measured
in terms of AUC. a, Best results were achieved on the prostate dataset (n = 1,784), with an AUC of 0.989 at 20× magnification. b, For BCC (n = 1,575),
the model trained at 5× performed the best, with an AUC of 0.990. c, The worst performance came on the breast metastasis detection task (n = 1,473),
with an AUC of 0.965 at 20×. The axillary lymph node dataset is the smallest of the three datasets, which is in agreement with the hypothesis that larger
datasets are necessary to achieve lower error rates on real-world clinical data.
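For reference, the AUC reported in each panel is the probability that a randomly chosen positive slide receives a higher score than a randomly chosen negative one. A minimal rank-based sketch follows (not the pROC implementation used for the paper's statistics):

```python
def auc(labels, scores):
    """Mann-Whitney formulation of the AUC: count positive/negative
    pairs where the positive slide scores higher; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```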


Extended Data Fig. 3 | t-SNE visualization of the representation space for the BCC and axillary lymph node models. Two-dimensional t-SNE projections of
the 512-dimensional representation space were generated for 100 randomly sampled tiles per slide. a, BCC representation (n = 144,935). b, Axillary lymph
nodes representation (n = 139,178).


Extended Data Fig. 4 | Performance of the MIL-RF model at multiple scales on the prostate dataset. The MIL model was run on each slide of the test
dataset (n = 1,784) with a stride of 40 pixels. From the resulting tumor probability heat map, hand-engineered features were extracted for classification
with the random forest (RF) model. The best MIL-RF model (ensemble model; AUC = 0.987) was not statistically significantly better than the MIL-only
model (20× model; AUC = 0.986; see Fig. 3), as determined using DeLong’s test for two correlated ROC curves.


Extended Data Fig. 5 | ROC curves of the generalization experiments summarized in Fig. 5. a, Prostate model trained with MIL on MSK in-house slides
tested on: (1) an in-house slides test set (n = 1,784) digitized on Aperio scanners; (2) an in-house slides test set digitized on a Philips scanner (n = 1,274);
and (3) external slides submitted to MSK for consultation (n = 12,727). b,c, Comparison of the proposed MIL approach with state-of-the-art fully
supervised learning for breast metastasis detection in lymph nodes. For b, the breast model was trained on MSK data with our proposed method
(MIL-RNN) and tested on the MSK breast data test set (n = 1,473) and on the test set of the CAMELYON16 challenge (n = 129), and achieved AUCs of
0.965 and 0.895, respectively. For c, the fully supervised model was trained on CAMELYON16 data and tested on the CAMELYON16 test set (n = 129),
achieving an AUC of 0.930. Its performance dropped to AUC = 0.727 when tested on the MSK test set (n = 1,473).


Extended Data Fig. 6 | Decision support with the BCC and breast metastases models. For each dataset, slides are ordered by their probability of being
positive for cancer, as predicted by the respective MIL-RNN model. The sensitivity is computed at the case level. a, BCC (n = 1,575): given a positive
prediction threshold of 0.025, it is possible to ignore roughly 68% of the slides while maintaining 100% sensitivity. b, Breast metastases (n = 1,473):
given a positive prediction threshold of 0.21, it is possible to ignore roughly 65% of the slides while maintaining 100% sensitivity.
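The screening computation behind both panels can be sketched in a few lines: with slide-level scores, the exclusion threshold that preserves 100% sensitivity is the score of the lowest-ranked positive slide. This sketch computes sensitivity per slide, whereas the legend computes it at the case level; the function name is ours.

```python
def screening_fraction(labels, scores):
    """Return the largest threshold that keeps every positive slide, and
    the fraction of slides scoring below it (safe to exclude from review)."""
    threshold = min(s for y, s in zip(labels, scores) if y == 1)
    ignored = sum(1 for s in scores if s < threshold)
    return threshold, ignored / len(scores)
```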


Extended Data Fig. 7 | Example of a slide tiled on a grid with no overlap at different magnifications. A slide represents a bag, and the tiles constitute the
instances in that bag. In this work, instances at different magnifications are not part of the same bag. mpp, microns per pixel.
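The grid tiling that turns a slide into a bag of instances can be sketched as follows; the 224-pixel tile size is our assumption for illustration.

```python
def tile_grid(width, height, tile=224):
    """Top-left coordinates of non-overlapping tiles covering a slide.
    Partial tiles at the right and bottom edges are dropped."""
    return [(x, y)
            for y in range(0, height - tile + 1, tile)
            for x in range(0, width - tile + 1, tile)]
```

Each coordinate pair identifies one instance; repeating the tiling at 20×, 10× and 5× yields three separate bags for the same slide.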


Extended Data Fig. 8 | The publicly shared MSK breast cancer metastases dataset is representative of the full MSK breast cancer metastases test
set. We created an additional dataset the size of the test set of the CAMELYON16 challenge (130 slides) by subsampling the full MSK breast cancer
metastases test set, ensuring that the models achieved similar performance for both datasets. Left, the model was trained on MSK data with our proposed
method (MIL-RNN) and tested on: the full MSK breast data test set (n = 1,473; AUC = 0.968), the public MSK dataset (n = 130; AUC = 0.965); and the test
set of the CAMELYON16 challenge (n = 129; AUC = 0.898). Right, the model was trained on CAMELYON16 data with supervised learning18 and tested on:
the test set of the CAMELYON16 challenge (n = 129; AUC = 0.932); the full MSK breast data test set (n = 1,473; AUC = 0.731); and the public MSK dataset
(n = 130; AUC = 0.737). Error bars represent 95% confidence intervals for the true AUC calculated by bootstrapping each test set.
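These error bars can be reproduced with a percentile bootstrap: resample the test set with replacement, recompute the AUC each time, and read off the 2.5th and 97.5th percentiles. A pure-Python sketch follows (the paper's analysis used R and pROC; the helper names here are ours):

```python
import random

def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    def auc(ys, ss):
        pos = [s for y, s in zip(ys, ss) if y == 1]
        neg = [s for y, s in zip(ys, ss) if y == 0]
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return wins / (len(pos) * len(neg))

    rng = random.Random(seed)
    n, stats = len(labels), []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < len(ys):  # a resample needs both classes
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```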

nature research | reporting summary
Corresponding author(s): Thomas J. Fuchs
Last updated by author(s): May 22, 2019

Reporting Summary
Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency
in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.

Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.
n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
The statistical test(s) used AND whether they are one- or two-sided
Only common tests should be described solely by name; describe more complex techniques in the Methods section.

A description of all covariates tested


A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient)
AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)

For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted
Give P values as exact values whenever suitable.

For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated
Our web collection on statistics for biologists contains articles on many of the points above.

Software and code


Policy information about availability of computer code
Data collection Glass slides were digitized with Leica Aperio AT2 scanners and Philips Ultra Fast Scanner at a resolution of 0.5 microns per pixel.

Data analysis The algorithms were written in Python. We used OpenSlide (version 3.4.1) to access the whole slide images and PyTorch (version 1.0) to
train deep learning models.
R (version 3.3.3) was used for the statistical analysis of the results.
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers.
We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A list of figures that have associated raw data
- A description of any restrictions on data availability
October 2018

The publicly shared MSK breast cancer metastases dataset is available at http://thomasfuchslab.org/data/ . The dataset consists of 130 de-identified whole slide
images of axillary lymph node specimens from 78 patients (see Supplemental Figure 6). The tissue was stained with H&E and scanned on Leica Biosystems AT2
digital slide scanners at Memorial Sloan Kettering Cancer Center. Metastatic carcinoma is present in 36 whole slides from 27 patients and the corresponding label is
included in the dataset.
The remaining data that support the findings of this study were offered to editors and peer reviewers at the time of submission for the purposes of evaluating the
manuscript upon request. The remaining data are not publicly available, in accordance with institutional requirements governing human subject privacy protections.

Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.
Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design


All studies must disclose on these points even when the disclosure is negative.
Sample size No sample-size calculations were performed. Within the enrollment years listed in Figure 1a, all cases with digitized whole slides were included
in the study without data curation.

Data exclusions Fewer than ten whole slide images were excluded because of excessive pen ink marks present on the image. The exclusion criterion was
pre-established.

Replication Models were trained five times with each condition to ensure the stability of the training procedure. Replication was successful for all
conditions for which test results were reported.

Randomization Patients were randomly divided into three groups: training, validation and test sets. No other covariates were controlled for.

Blinding Since our experiments are based on digitized pathology slides, blinding is not necessary.

Reporting for specific materials, systems and methods


We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material,
system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response.

Materials & experimental systems Methods


n/a Involved in the study n/a Involved in the study
Antibodies ChIP-seq
Eukaryotic cell lines Flow cytometry
Palaeontology MRI-based neuroimaging
Animals and other organisms
Human research participants
Clinical data

Human research participants


Policy information about studies involving human research participants
Population characteristics Digital images of microscope slides from patients who were diagnosed at MSKCC over a period of at least 1 year and up to 5
years, depending on the tissue type.

Recruitment No patient recruitment was performed. All digital images that were available for the pre-established collecting period were
analyzed.

Ethics oversight Memorial Sloan Kettering Cancer Center

Note that full information on the approval of the study protocol must also be provided in the manuscript.