Article 5
Cancer Research

Deep Learning Predicts Lung Cancer Treatment Response from Serial Medical Imaging

Yiwen Xu1, Ahmed Hosny1,2, Roman Zeleznik1,2, Chintan Parmar1, Thibaud Coroller1, Idalid Franco1, Raymond H. Mak1, and Hugo J.W.L. Aerts1,2,3
Abstract
Purpose: Tumors are continuously evolving biological systems, and medical imaging is uniquely positioned to monitor changes throughout treatment. Although qualitatively tracking lesions over space and time may be trivial, the development […]

[…] patients with NSCLC treated with chemoradiation and surgery (178 scans).

Results: Deep learning models using time series scans were significantly predictive of survival and cancer-specific outcomes […]
Introduction

Lung cancer is one of the most common cancers worldwide and the highest contributor to cancer death in both the developed and developing worlds (1). Among these patients, most are diagnosed with non–small cell lung cancer (NSCLC) and have a 5-year survival rate of only 18% (1, 2). Despite recent advancements in medicine spurring a large increase in overall cancer survival rates, this improvement is less consequential in lung cancer, as most symptomatic and diagnosed patients have late-stage disease (3). These late-stage lesions are often treated with nonsurgical approaches, including radiation, chemotherapy, targeted, or immunotherapies. This signals the dire need for monitoring therapy response using follow-up imaging and tracking radiographic changes of tumors over time (4). Clinical response assessment criteria, such as RECIST (5), analyze time series data using simple size-based measures such as the axial diameter of lesions.

Artificial intelligence (AI) allows for a quantitative, instead of a qualitative, assessment of radiographic tumor characteristics, a process also referred to as "radiomics" (6). Indeed, several studies have demonstrated the ability to noninvasively describe tumor phenotypes with more predictive power than routine clinical measures (7–10). Traditional machine learning techniques involved the derivation of engineered features for the quantitative description of images, with success in detecting biomarkers for response assessment and clinical outcome prediction (11–15). Recent advancements in deep learning (6) have demonstrated successful applications in image analysis without human feature definition (16). The use of convolutional neural networks (CNN)
allows for the automated extraction of imaging features and the identification of nonlinear relationships in complex data. CNNs trained on millions of photographic images can be applied to medical images through transfer learning (17). This has been demonstrated in cancer research with regard to tumor detection and staging (18). AI developments can be clinically applicable to enhance patient care by providing accurate and efficient decision support (6, 11).

The majority of quantitative imaging studies have focused on the development of imaging biomarkers for a single timepoint (19, 20). However, the tumor is a dynamic biological system with vascular and stem cell contributions, which may respond to treatment; thus, the phenotype may not be completely captured at a single timepoint (21, 22). It may be beneficial to incorporate posttreatment […]

1 Department of Radiation Oncology, Brigham and Women's Hospital, Dana-Farber Cancer Institute, Harvard Medical School, Boston, Massachusetts. 2 Radiology and Nuclear Medicine, GROW, Maastricht University Medical Centre, Maastricht, the Netherlands. 3 Department of Radiology, Brigham and Women's Hospital, Dana-Farber Cancer Institute, Harvard Medical School, Boston, Massachusetts.

Note: Supplementary data for this article are available at Clinical Cancer Research Online (http://clincancerres.aacrjournals.org/).

Corresponding Author: Hugo J.W.L. Aerts, Harvard–Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, MA 02115. Phone: 617-525-7156; Fax: 617-525-7156; E-mail: hugo_aerts@dfci.harvard.edu

Clin Cancer Res 2019;25:3266–75
doi: 10.1158/1078-0432.CCR-18-2495
©2019 American Association for Cancer Research.
Figure 1.
Serial patient scans. Representative CT images of patients with stage III nonsurgical NSCLC before radiation […]
apart; the center slice is on the same axial slice as the seed point. 5 mm was the maximum slice thickness of the CT images. A transfer learning approach was applied using the pretrained ResNet CNN that was trained on natural RGB images. The three axial slices were used as input to the CNN. Using three 2D slices gives the network information to learn from while keeping the number of features lower than a full 3D approach, reducing GPU memory usage and training time and limiting overfitting. Image augmentation was performed on the training data and involved image flipping, translation, rotation, and deformation, which is conventional good practice and has been shown to improve performance (26). The same augmentation was performed on the pretreatment and follow-up images, such that the network generates a mapping for the entire input series of images. The deformation was on the order of millimeters and did not noticeably change the morphology of the tumor or surrounding tissues.

Neural network structure

The network structure was implemented in Python, using Keras with a TensorFlow backend (Python 2.7, Keras 2.0.8, TensorFlow 1.3.0). The proposed network has a base ResNet CNN trained on the ImageNet database containing over 14 million natural images (Fig. 3). One CNN was defined for each timepoint input, such that an input with scans at three timepoints would involve input into three CNNs. The output of the pretrained network model was then input into recurrent layers with gated recurrent units (GRU), which take the time domain into account. To ensure the network was able to handle missing scans (27, 28),
[Figure 2 schematic: ImageNet (n = 1.4m) → ResNet + RNN; dataset A, chemoRT (n = 107 training, n = 72 test); dataset B, chemoRT + surgery (n = 89); compared against volume, diameter, and clinical models.]
Figure 2.
Analysis design. Depiction of the deep learning–based workflow with two datasets and additional comparative models. Dataset A included patients treated with
chemotherapy and definitive radiation therapy, and was used to train and fine-tune a ResNet CNN combined with an RNN for predictions of survival. A separate
test set from this cohort was used to assess performance and compared with the performance of radiographic and clinical features. Dataset B included patients
treated with chemotherapy and surgery. This cohort was used as an additional test set to predict pathologic response, and the model predictions were compared
with the change in volume.
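The dataset division underlying this design — a held-out test split of dataset A plus repeated 3:2 training:tuning resamples for Monte Carlo cross-validation, as detailed under Transfer learning — can be sketched in plain Python. The function below is a hypothetical illustration, not the authors' code; the split fractions are parameters:

```python
import random

def monte_carlo_splits(patient_ids, n_splits=10, test_frac=1/3, tune_frac=2/5, seed=0):
    """Hold out a test set once, then repeatedly re-split the remaining
    pool into training and tuning subsets (Monte Carlo cross-validation)."""
    rng = random.Random(seed)
    ids = list(patient_ids)
    rng.shuffle(ids)
    n_test = round(len(ids) * test_frac)
    test, pool = ids[:n_test], ids[n_test:]
    rounds = []
    for _ in range(n_splits):
        resample = pool[:]
        rng.shuffle(resample)
        n_tune = round(len(resample) * tune_frac)
        rounds.append({"tune": resample[:n_tune], "train": resample[n_tune:]})
    return test, rounds

# 179 hypothetical patient IDs, standing in for dataset A:
test, rounds = monte_carlo_splits(range(179), n_splits=10)
```

Each of the 10 rounds trains on its "train" subset and selects models on its "tune" subset; the test patients never enter training, mirroring the independent evaluation described in the text.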
3268 Clin Cancer Res; 25(11) June 1, 2019 Clinical Cancer Research
Longitudinal Deep Learning to Track Treatment Response
Figure 3.
[Network schematic: per-timepoint inputs (pretreatment and follow-ups 1–3, with missed scans masked) feed pretrained CNNs, followed by an RNN, average pooling, fully connected layers, and a softmax output.]
RNN algorithms were used, which allowed for the amalgamation of several timepoints and the ability to learn from samples with missed patient scans at certain timepoints. The output of the pretrained network was masked to skip the timepoint when a scan was not available. Averaging and fully connected layers are then applied after the GRU, with batch normalization (29) and dropout (30) after each fully connected layer to prevent overfitting. The final softmax layer allows for a binary classification output. To test a model without the input of follow-up scans, the pretreatment image alone was input into the proposed model, with the recurrent and average pooling layers replaced by a fully connected layer, as there was only one input timepoint.

Transfer learning

Weights trained with ImageNet (26, 31), a set of 14 million 2D color images, were used for the ResNet (31) CNN, and the additional weights following the CNN were randomized at initialization for transfer learning. Dataset A was randomly split 2:1 into training/tuning and test sets. Training was performed with Monte Carlo cross-validation, using 10 different splits (a further 3:2 split of training:tuning) on 107 patients with class weight balancing for up to 300 epochs. The model was evaluated on an independent test set of 72 patients who were not used in the training process. The surviving fractions for the training/tuning (n = 107) and test (n = 72) sets were comparable (Supplementary Table S1). Only the pretreatment image was input into the proposed model, and the recurrent and average pooling layers were replaced with a fully connected layer.

Statistical analysis

Statistical analyses were performed in Python version 2.7. All predictions were evaluated on the independent test set of dataset A for survival and for prognostic factors after definitive radiation therapy. The clinical endpoints included distant metastasis, progression, and locoregional recurrence, as well as overall survival at 1 and 2 years following radiation therapy. The analyses were compared with a random forest clinical model with features of stage, gender, age, tumor grade, performance, smoking status, and clinical tumor size (primary maximum axial diameter).

Statistical differences between positive and negative survival groups in dataset A were assessed using the area under the receiver operating characteristic curve (AUC) and the Wilcoxon rank-sum test (also known as the Mann–Whitney U test). Prognostic and survival estimates were calculated using the Kaplan–Meier method between low and high mortality risk groups, stratified at the median prediction probability of the training set and controlled using a log-rank test. Hazard ratios were calculated through the Cox proportional-hazards model.

An additional test was performed on dataset B, the trimodality cohort, using the 1-year survival model from the definitive radiation cohort with two timepoints. Survival predictions were made from the 1-year survival model trained on dataset A. The model predictions were used to stratify the trimodality patients based on survival and tumor response to radiation therapy prior to surgery. The groups were assessed using their respective AUCs and were tested with the Wilcoxon rank-sum test. This was compared with the volume change after radiation therapy and a random forest clinical model with the same features used for dataset A.

Results

Clinical characteristics

To evaluate the value of deep learning–based biomarkers to predict overall survival using patient images prior to and post radiation therapy (Fig. 1), a total of 268 patients with stage III NSCLC with 739 CT scans were analyzed (Fig. 2). Dataset A consisted of 179 patients treated with definitive radiation therapy and was used as a cohort to train and test deep learning biomarkers (Supplementary Table S2). There was no significant difference between the patient parameters in the training and test sets of dataset A (P > 0.1; group summary values in Supplementary Table S2). The patients were 52.8% female (median age of 63 years; age range 32–93 years) and were predominantly diagnosed as having stage IIIA (58.9%) NSCLC at the time of diagnosis, with 58.1% in the adenocarcinoma histology category. The median radiation
Figure 4.
Performance of deep learning biomarkers on validation datasets. The deep learning models were evaluated on an independent test set for performance. The 2-year overall survival Kaplan–Meier curves were performed with median stratification (derived from the training set) of the low and high […] [Panels C and D: pretreatment + follow-up 1–2 and pretreatment + follow-up 1–3; survival probability versus time (months), stratified at >median vs. ≤median.]
dose was 66 Gy for the definitive radiation cohort (range 45–70 Gy; median follow-up of 31.4 months). Another cohort of 89 patients treated with trimodality therapy served as an external test set (dataset B). The median radiation dose for the trimodality patients was lower, at 54 Gy (range 50–70 Gy; median follow-up of 37.1 months).

Deep learning–based prognostic biomarker development and evaluation

To develop deep learning–based biomarkers for overall survival, distant metastasis, disease progression, and locoregional recurrence, training was performed using the discovery part of dataset A (Fig. 2). To leverage the information from millions of photographic images, the ResNet CNN model was pretrained on ImageNet and then applied to our dataset using transfer learning. The CNN-extracted features of the CT images at each timepoint were fed into a recurrent network for longitudinal analysis. We observed that the baseline model with only pretreatment scans demonstrated low performance for predicting 2-year overall survival (AUC = 0.58; P = 0.3; Wilcoxon test). Improved performance in predicting 2-year overall survival was observed with the addition of each follow-up scan: at 1 month (AUC = 0.64, P = 0.04), 3 months (AUC = 0.69, P = 0.007), and 6 months (AUC = 0.74, P = 0.001; Supplementary Fig. S2). We also observed a similar trend in performance for other clinical endpoints, that is, 1-year survival, metastasis, progression, and locoregional recurrence-free survival (Supplementary Fig. S3). A clinical model incorporating stage, gender, age, tumor grade, performance, smoking status, and clinical tumor size did not yield a statistically significant prediction of survival (2-year survival AUC = 0.51, P = 0.93) or treatment response (Supplementary Table S3).

Further survival analyses were performed with Kaplan–Meier estimates for low and high mortality risk groups based on median stratification of patient prediction scores (Fig. 4). The models for 2-year overall survival yielded significant differences between the groups with two (P = 0.023, log-rank test) and three (P = 0.027, log-rank test) follow-up scans. Comparable results were found for the following predictions, with their respective hazard ratios: 1-year overall survival (6.16; 95% CI, 2.17–17.44; P = 0.0004), distant metastasis-free (3.99; 95% CI, 1.31–12.13; P = 0.01), progression-free (3.20; 95% CI, 1.16–8.87; P = 0.02), and no locoregional recurrence (2.74; 95% CI, 1.18–6.34; P = 0.02), each with significant differences at three follow-up timepoint scans.

Predicting pathologic response

As an additional independent validation and to evaluate the relationship between delta imaging analysis and pathologic response, the trimodality pre-radiation therapy and post-radiation therapy (prior to surgery) scans were input into the neural network model trained on dataset A. First, for survival prediction evaluation, the model was tested on dataset B. To match the number of input timepoints, the 1-year survival model with the pretreatment and first follow-up at 1 month was used. The model significantly predicted distant metastasis, progression, and locoregional recurrence (Supplementary Table S4). Although, for […]
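The median-stratified Kaplan–Meier comparison used throughout these results can be illustrated with a minimal, self-contained sketch. The helper names and toy data below are hypothetical, not the authors' code; the study additionally used log-rank tests and Cox proportional-hazards models, which are not reproduced here:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve S(t) from follow-up times (months)
    and event indicators (1 = death observed, 0 = censored)."""
    # Distinct observed event times, in increasing order
    event_times = sorted({t for t, e in zip(times, events) if e})
    curve, s = [], 1.0
    for t in event_times:
        at_risk = sum(ti >= t for ti in times)
        deaths = sum(ti == t and ei for ti, ei in zip(times, events))
        s *= 1.0 - deaths / at_risk          # product-limit update
        curve.append((t, s))
    return curve

def median_stratify(scores, threshold):
    """Assign patients to the high mortality-risk group when their
    prediction probability exceeds the training-set median."""
    return [s > threshold for s in scores]

# Toy illustration (hypothetical follow-up data, not the study cohort):
times = [6, 12, 18, 24, 30, 36, 42, 48]
events = [1, 1, 0, 1, 0, 1, 0, 0]
curve = kaplan_meier(times, events)
groups = median_stratify([0.2, 0.8, 0.5], threshold=0.5)
```

In practice, one such curve would be estimated per risk group and the two curves compared with a log-rank test, as described under Statistical analysis.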
[…] the scan performed, and thus decrease the ability to predict survival.

Survival is associated with tumor pathologic response (34, 35). Thus, we tested the relationship between the probabilities of the survival network model on similar patients with stage III NSCLC who were in different treatment cohorts (definitive radiation therapy and trimodality). Dataset B included the follow-up timepoint after radiation therapy and prior to surgery, for the prediction of response and for further validation of our model. This also serves as a test for generalizability in locally advanced NSCLC patients treated with different standard-of-care treatment protocols. To match the number of input timepoints, the 1-year overall survival model with the pretreatment and first follow-up at 1 month was used. The model was able to separate the pathologic responders from those with gross residual disease in the trimodality cohort. This was the case even though the model development was completely blinded to this cohort.

[…] outcome the network was trained to predict. The use of transfer learning has demonstrated its effectiveness in improving the performance of lung nodule detection in CT images (18). Our study contained a sample size not on the order of studies based on photographic images, but the current performance was made possible with the incorporation of networks pretrained on ImageNet. Transfer learning may also be used to test the feasibility of clinically applicable utilities prior to the collection of a full cohort for analysis.

The incorporation of follow-up timepoints to capture dynamic tumor changes was key to the prediction of survival and tumor prognosis. This was feasible with the use of RNNs, which allowed for the amalgamation of several timepoints and the ability to learn from samples with missed patient scans at a certain timepoint, which is inevitable in retrospective studies such as this one. Although this type of network has not been applied to medical images, similar network architectures have […]
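The skip-on-missing behavior described in the Methods (the pretrained network's output is masked so the recurrent state is carried forward unchanged when a scan is absent) can be made concrete with a small NumPy sketch. This is illustrative only: random weights and toy dimensions stand in for the trained Keras model, and `run_masked_gru` is a hypothetical helper, not the authors' code:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h, x, params):
    """One gated recurrent unit (GRU) update (Cho et al., ref. 45)."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h @ Uz)               # update gate
    r = sigmoid(x @ Wr + h @ Ur)               # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
    return (1 - z) * h + z * h_tilde

def run_masked_gru(features, mask, params, hidden=4):
    """Fold per-timepoint CNN feature vectors into one hidden state,
    carrying the state forward unchanged wherever mask is 0
    (i.e., the scan for that timepoint is missing)."""
    h = np.zeros(hidden)
    for x, m in zip(features, mask):
        h = gru_step(h, x, params) if m else h  # skip masked timepoint
    return h

# Toy dimensions: 3 CNN features per timepoint, hidden size 4.
rng = np.random.default_rng(0)
d, hidden = 3, 4
params = [rng.normal(size=(d, hidden)) if i % 2 == 0
          else rng.normal(size=(hidden, hidden)) for i in range(6)]
seq = rng.normal(size=(4, d))                  # pretreatment + 3 follow-ups

h_full = run_masked_gru(seq, [1, 1, 1, 1], params)
h_masked = run_masked_gru(seq, [1, 0, 1, 1], params)   # follow-up 1 missing
h_skip = run_masked_gru(seq[[0, 2, 3]], [1, 1, 1], params)
# h_masked and h_skip are identical: masking a timepoint is equivalent
# to dropping it from the input sequence.
```

This equivalence is what lets one trained model accept patients with different numbers of available follow-up scans.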
Ideally, after training on a larger, diverse population and after extensive external validation and benchmarking against current clinical standards, quantitative prognostic prediction models can be implemented in the clinic (48). There are several lung nodule detection algorithms available in the literature, and with the aid of the pretreatment tumor contours routinely delineated by the radiation oncologist, the location of the tumor on the follow-up images can be detected automatically (49). The input of our model would simply be the bounding box surrounding the detected tumor, which can be cropped automatically as well. The trained network can generate probabilities of prognosis within a few seconds and thus would not hinder current clinical efficiency. The probabilities can then be presented to the physician along with other clinical images and measures, such as the RECIST criteria (5), to aid in the process of patient assessment.

This proof-of-principle study has its limitations, one of which is the sample size of the study cohorts. Thus, a pretrained CNN was […]

[…] addressed. Further research in this direction could make these automatically learned feature representations more interpretable.

Conclusions

This study demonstrated the impact of deep learning on tumor phenotype tracking before and after definitive radiation therapy through pretreatment and follow-up CT scans. There were increases in the performance of survival and prognosis prediction with the incorporation of additional timepoints using CNN and RNN networks. This was compared with the performance of clinical factors, which were not significant. The survival neural network model could predict pathologic response in a separate cohort with trimodality treatment after radiation therapy. Although the input of this model consisted of a single seed point at the center of the lesion, without the need for volumetric segmentation, our model had predictive power comparable with tumor volume, acquired through time-consuming manual contours.
References

1. Torre LA, Bray F, Siegel RL, Ferlay J, Lortet-Tieulent J, Jemal A. Global cancer statistics, 2012. CA Cancer J Clin 2015;65:87–108.
2. Ettinger DS, Akerley W, Borghaei H, Chang AC, Cheney RT, Chirieac LR, et al. Non-small cell lung cancer. J Natl Compr Canc Netw 2012;10:1236–71.
3. Siegel RL, Miller KD, Jemal A. Cancer statistics, 2016. CA Cancer J Clin 2016;66:7–30.
4. Goldstraw P, Chansky K, Crowley J, Rami-Porta R, Asamura H, Eberhardt WEE, et al. The IASLC lung cancer staging project: proposals for revision of the TNM stage groupings in the forthcoming (eighth) edition of the TNM Classification for Lung Cancer. J Thorac Oncol 2016;11:39–51.
5. Eisenhauer EA, Therasse P, Bogaerts J, Schwartz LH, Sargent D, Ford R, et al. New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). Eur J Cancer 2009;45:228–47.
6. Hosny A, Parmar C, Quackenbush J, Schwartz LH, Aerts HJWL. Artificial intelligence in radiology. Nat Rev Cancer 2018;18:500–10.
7. Parmar C, Grossmann P, Bussink J, Lambin P, Aerts HJWL. Machine learning methods for quantitative radiomic biomarkers. Sci Rep 2015;5:13087.
8. Aerts HJWL. Data science in radiology: a path forward. Clin Cancer Res 2018;24:532–4.
9. Aerts HJWL, Velazquez ER, Leijenaar RTH, Parmar C, Grossmann P, Carvalho S, et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun 2014;5:4006.
[…]
25. Fedorov A, Beichel R, Kalpathy-Cramer J, Finet J, Fillion-Robin J-C, Pujol S, et al. 3D Slicer as an image computing platform for the Quantitative Imaging Network. Magn Reson Imaging 2012;30:1323–41.
26. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Commun ACM 2017;60:84–90.
27. Rubins J, Unger M, Colice GL. Follow-up and surveillance of the lung cancer patient following curative intent therapy. Chest 2007;132:355S–367S.
28. Calman L, Beaver K, Hind D, Lorigan P, Roberts C, Lloyd-Jones M. Survival benefits from follow-up of patients with lung cancer: a systematic review and meta-analysis. J Thorac Oncol 2011;6:1993–2004.
29. Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint 2015. arxiv.org/abs/1502.03167.
30. Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 2014;15:1929–58.
31. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings: 29th IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2016; 2016 Jun 26–Jul 1; Las […]
[…] 23–28; Columbus, OH. Washington (DC): IEEE Computer Society; 2014. p. 1725–32.
44. Che Z, Purushotham S, Cho K, Sontag D, Liu Y. Recurrent neural networks for multivariate time series with missing values. Sci Rep 2018;8:6085.
45. Cho K, van Merrienboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2014 Oct 25–29; Doha, Qatar. Stroudsburg (PA): Association for Computational Linguistics; 2014. p. 1724–34.
46. Kumar D, Wong A, Clausi DA. Lung nodule classification using deep features in CT images. In: Proceedings: 2015 12th Conference on Computer and Robot Vision. CRV 2015; 2015 Jun 3–5; Halifax, Nova Scotia, Canada. Washington (DC): IEEE Computer Society; 2015. p. 133–8.
47. Hua K-L, Hsu C-H, Hidayati SC, Cheng W-H, Chen Y-J. Computer-aided classification of lung nodules on computed tomography images via deep learning technique. Onco Targets Ther 2015;8:2015–22.
48. Lehman CD, Yala A, Schuster T, Dontchos B, Bahl M, Swanson K, et al. Mammographic breast density assessment using deep learning: clinical implementation. Radiology 2018;180694.
49. Valente IRS, Cortez PC, Neto EC, Soares JM, de Albuquerque VHC, Tavares JMRS. Automatic 3D pulmonary nodule detection in CT images: a survey. Comput Methods Programs Biomed 2016;124:91–107.
50. Wang G. A perspective on deep imaging. IEEE Access 2016;4:8914–24.