Review
A Review of Explainable Deep Learning Cancer Detection
Models in Medical Imaging
Mehmet A. Gulum *,† , Christopher M. Trombley † and Mehmed Kantardzic †
Computer Science & Engineering Department, J.B. Speed School of Engineering, University of Louisville,
Louisville, KY 40292, USA; [email protected] (C.M.T.);
[email protected] (M.K.)
* Correspondence: [email protected]
† These authors contributed equally to this work.
Abstract: Deep learning has demonstrated remarkable accuracy in analyzing images for cancer detection tasks in recent years. The accuracy that has been achieved rivals radiologists and is suitable for implementation as a clinical tool. However, a significant problem is that these models are black-box algorithms and are therefore intrinsically unexplainable. This creates a barrier for clinical implementation due to the lack of trust and transparency that is characteristic of black-box algorithms. Additionally, recent regulations prevent the implementation of unexplainable models in clinical settings, which further demonstrates a need for explainability. To mitigate these concerns, recent studies have attempted to overcome these issues by modifying deep learning architectures or providing after-the-fact explanations. A review of the deep learning explanation literature focused on cancer detection using MR images is presented here. The gap between what clinicians deem explainable and what current methods provide is discussed, and future suggestions to close this gap are provided.
Keywords: deep learning; explainability; cancer detection; MRI; XAI

Citation: Gulum, M.A.; Trombley, C.M.; Kantardzic, M. A Review of Explainable Deep Learning Cancer Detection Models in Medical Imaging. Appl. Sci. 2021, 11, 4573. https://doi.org/10.3390/app11104573

Academic Editor: Jose Antonio Iglesias Martinez

Received: 27 February 2021; Accepted: 15 April 2021; Published: 17 May 2021

1. Introduction

Deep learning has grown in recent years, both in terms of research and in terms of private and public sector application. A part of this growth is the application of deep learning algorithms to big data in the healthcare sector [1–11], as depicted in Figure 1. These algorithms have demonstrated results suitable for clinical implementation, but explanation remains a barrier for wide-spread clinical adoption.

Deep learning is a subset of machine learning that has grown in recent years due to advances in computational power and access to large datasets. Deep learning has proven to be successful in a wide breadth of applications [12–16]. There are three main categories of deep learning algorithms: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning fits a non-linear function using features as input data and labels as output data, where the labels, or outputs, are known when training the algorithm. Common supervised deep learning algorithms are convolutional neural networks and recurrent neural networks. These algorithms are commonly used for classification tasks such as classifying lesions as benign or malignant. Unsupervised learning, on the other hand, does not use labeled output data; it looks for patterns in the input data distribution. Some examples of unsupervised deep learning are generative adversarial networks and autoencoders. These are used for tasks such as data compression and data augmentation. The final category is reinforcement learning. In reinforcement learning, a reward function is defined and the neural network optimizes its weights to maximize the reward function.
Figure 1. The number of published papers per year since 2008 that use deep learning models for cancer risk prediction, classification, and detection. The data were collected using keyword searches through PubMed.
Deep neural networks consist of an input layer, output layer, and many hidden layers
in the middle. Each layer consists of neurons which are inspired by how neurons in the
brain process information. Each neuron has a weight that is updated using a technique
called gradient descent during the training process. The weights of the network are
optimized by minimizing a cost function. Nonlinearities are applied between layers to fit
nonlinear functions from the input data to the output. Due to this, neural networks are
able to model complex mappings. After training, the model is able to classify unseen data.
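To make the training mechanics described above concrete, the following minimal sketch (our illustration with synthetic data, not drawn from any study reviewed here) builds a small fully connected network, trains it by gradient descent to minimize a cost function, and then classifies an unseen example:

```python
# A minimal, self-contained sketch (synthetic data, unrelated to the reviewed
# studies) of the mechanics described above: a small fully connected network
# with nonlinearities, trained by gradient descent to minimize a cost function.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 16)                     # synthetic input features
y = (X.sum(dim=1) > 0).long()                # synthetic binary labels (e.g., benign/malignant)

model = nn.Sequential(                       # input layer -> hidden layers -> output layer
    nn.Linear(16, 32), nn.ReLU(),            # nonlinearity applied between layers
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 2),
)
loss_fn = nn.CrossEntropyLoss()              # the cost function being minimized
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # gradient descent

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)              # forward pass
    loss.backward()                          # gradients of the cost w.r.t. the weights
    optimizer.step()                         # weight update

with torch.no_grad():                        # after training, classify unseen data
    prediction = model(torch.randn(1, 16)).argmax(dim=1)
```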
Deep learning has achieved high performance for numerous types of cancers. The authors of [17] demonstrated accuracy up to 98.42% for breast cancer detection. In [18], they found that they could detect lung cancer with accuracy up to 97.67% using deep transfer learning. In
the fall of 2018, researchers at Seoul National University Hospital and College of Medicine
developed a machine learning algorithm called Deep Learning based Automatic Detection
(DLAD) to analyze chest radiographs and detect abnormal cell growth, such as poten-
tial cancers. In [19], the algorithm’s performance was compared to multiple physician’s
detection abilities on the same images and outperformed 17 of 18 doctors. These results
demonstrate the potential for using deep learning to aid medical practitioners. Despite this
performance, adoption of these systems in clinical environments has been stunted due to
lack of explainability.
Deep learning techniques are considered black boxes as depicted in Figure 2. Explain-
ability attempts to mitigate the black box nature of these algorithms. Explainability is
needed for a variety of reasons. First, there are legal and ethical requirements along with
laws and regulations that are required for deep learning cancer detection systems to be
implemented in a clinical setting. An example of a regulation is the European Union’s
General Data Protection Regulation (GDPR) requiring organizations that use patient data
for classifications and recommendations to provide on-demand explanations [20]. The
inability to provide such explanations on demand may result in large penalties for the
organizations involved. There are also monetary incentives associated with explainable
deep learning models. Beyond ethical and legal issues, clinicians and patients need to
be able to trust the classifications provided by these systems [21]. Explanation is also
needed for trust and transparency [22–28]. Explanation methods attempt to show the
reasoning behind the model’s classification therefore building a degree of trust between
the system, clinician, and patient. This can reduce the number of misinformed results that
would be a possible consequence of non-explainable systems. Finally, explainable deep
learning systems will provide the clinician with additional usefulness such as localization
and segmentation of the lesion location [29–32]. How explainability brings value to the end user is depicted in Figure 3.
The terms explainable and interpretable are used interchangeably in parts of the literature [33], yet they are defined separately in others. This work uses the terms interchangeably. There is not an agreed-upon definition of explainability. As Lipton mentioned, “the term explainability holds no agreed upon meaning, and yet machine learning conferences frequently publish papers which wield the term in a quasi-mathematical way”. This demonstrates the ambiguity of the term and the need for a formal definition. Providing such a definition is outside the scope of this work, but for the context of this paper, we adopt the definition of explainability from [34]: “explainability is the degree to which a human can understand the cause of a decision”.
We conducted an extensive literature review by examining relevant papers from six major academic databases. A keyword-based search was used to select papers; it consists of searching for index keywords based on common variations in the literature of the terms “explainable”, “transparency”, “black box”, “understandable”, and “comprehensible”, and their union with a set of terms related to deep learning cancer detection, including “deep learning”, “machine learning”, and “convolutional neural network”. The search was restricted to articles published between 2012 and 2020, with one exception published in 1991. The papers were skimmed and then filtered by quality and relevancy. More papers were then added by using a snowballing strategy that consists of using the reference lists of the selected papers to identify additional papers.
The contributions of this review are threefold. First, this work focuses on explainability for cancer detection using MR images. Second, we provide a thorough overview of the current approaches to measuring explainability that are commonly used in the literature and why this is important for medical explainability. There is not a universally agreed upon metric for an explanation. This review will discuss the advantages and disadvantages of metrics commonly found in the literature. Third, this review will discuss papers related to what clinicians consider important for explanation systems. This highlights a gap between ML researchers who design the algorithms and clinicians who are the end users. This paper attempts to discuss directions for closing that gap.
This paper is organized in the following way. First, a general taxonomy of explanation approaches for deep learning cancer detection systems is described. Second, current measures for explanation quality are examined and analyzed. Third, we discuss explanation techniques applied to cancer detection deep learning models. Finally, the paper concludes with a discussion and suggested future directions in Sections 4 and 5, respectively.
2. Background
Although the research volume around explainability is expanding rapidly [35–62], a review focusing on image interpretability within the context of deep learning cancer detection systems is missing. Indeed, there are existing research studies of deep learning explainability for medical applications [63–74] in the literature, as well as review articles, notably [75,76]. The survey of Tjoa and Guan reviews post-hoc explanation methods for medical applications. The review looks at medical explainability as a whole. The authors provide future directions such as including textual explanation and propose creating an authoritative body to oversee algorithmic explainability. In the review by [76], the authors provide a review of explainable deep learning models for medical imaging analysis. The authors discuss medical applications of brain, breast, retinal, skin, CT, and X-ray images. In contrast to the previously noted reviews, this review focuses solely on analyzing medical images for cancer detection while discussing the gap between clinicians and ML researchers.
2.1. Taxonomy
There are various proposed taxonomies to classify explanation methods [77–79]. Generally, these classification systems are task dependent; therefore, there is not a universally accepted taxonomy for explanation methods. This section outlines how explanation methods are grouped throughout this paper, which is guided by the taxonomy proposed by the authors of [78].
explanations explainable in the context of verification. The authors conduct a user study and identify which kinds of increases in complexity have the greatest effect on the time it takes for humans to verify the rationale, and which seem relatively insensitive. Another factor to consider is the question of who will be the end user of the algorithm. The end user of the deep learning system should guide the approach to evaluation. Designing explanation techniques for clinical settings requires different explanations than designing explanations for robotics. A paper published in 2018 by [92] discusses the different roles involving explanation techniques and how the explanations differ depending on which role the explanation is provided for. They show that an explanation for one role
should be measured differently for other roles. This sounds trivial but is often ignored. For
example, measuring explanations for doctors should have different criteria than measuring
explanations for patients receiving the diagnosis or engineers designing the system. Consequently, this suggests the need to test explanations on a variety of human subjects to evaluate for whom the explanations fail and for whom the explanation provides adequate information. For example, an explanation can be thorough and complete for a clinician but fail for a patient, when ideally it should address the needs of each end user. A limitation of the current literature is the lack of studies that verify explanation methods with a population of radiologists or doctors. To the best of our knowledge, there are no studies that incorporate radiologists into the design process of deep learning explanation algorithms. The majority of human studies involve general, non-specialized individuals recruited through a crowd-sourcing service such as Amazon Mechanical Turk [93]. The authors
of [94] also highlight the importance of keeping the end-user in mind when defining what
is explainable and how to measure it. It is also essential to have quantitative measures
to be able to compare different methodologies quantitatively when qualitative measures
fail. A paper published by the authors of [95] proposes Remove and Retrain (ROAR) to quantitatively compute feature importance. ROAR ranks feature importance, removes the features ranked as most important, and then retrains the model and measures
the difference during testing. This method is widely used but there still lacks a standard
procedure for quantitatively measuring explanation performance. For measuring qualita-
tively, Doshi-Velez and Kim discuss factors related to qualitative assessment. First, what
are explanations composed of? For example, feature importance or examples from the training set? In other words, what units are being used to provide an explanation? Second, what is the number of units that explanations are composed of? If the explanation finds similar training examples to provide an explanation, does it just provide the most relevant examples or all of them? Third, if there are only a select few examples, how are these ordered and what determines that criteria? If they are ranked in terms of similarity, how is similarity
defined? Or if they are ranked in terms of feature importance, how do you measure feature
importance? Fourth, how do you show the interaction among these units in a human-
digestible way? Fifth, how do you quantitatively show the uncertainty in an explanation?
While there is no silver bullet for measuring explanation quality, these are all important
questions to consider when evaluating the quality of an explanation and choosing a metric
to measure results.
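As a rough illustration of the ROAR procedure discussed above, the sketch below assumes hypothetical `train_model`, `evaluate`, and `rank_feature_importance` helpers and NumPy-style feature matrices; it is a schematic of the remove-and-retrain loop, not the reference implementation from [95]:

```python
# Schematic of Remove-and-Retrain (ROAR) [95]: drop the features an explanation
# method ranks as most important, retrain from scratch, and measure the loss in
# test accuracy. train_model, evaluate, and rank_feature_importance are
# hypothetical placeholders standing in for an actual pipeline.
def roar(X_train, y_train, X_test, y_test, fractions=(0.1, 0.3, 0.5, 0.7, 0.9)):
    base_model = train_model(X_train, y_train)                # hypothetical helper
    base_acc = evaluate(base_model, X_test, y_test)           # hypothetical helper
    ranking = rank_feature_importance(base_model, X_train)    # most important first
    drops = {}
    for frac in fractions:
        k = int(frac * X_train.shape[1])
        removed = ranking[:k]
        # One simple way to "remove" features: replace them with their training mean.
        X_tr = X_train.copy()
        X_te = X_test.copy()
        X_tr[:, removed] = X_train[:, removed].mean(axis=0)
        X_te[:, removed] = X_train[:, removed].mean(axis=0)
        retrained = train_model(X_tr, y_train)                # retrain from scratch
        drops[frac] = base_acc - evaluate(retrained, X_te, y_test)
    return drops  # larger accuracy drops suggest more faithful importance rankings
```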
more human-verified explanation techniques. This section will discuss human-based and
proxy-based methods to measure explanation quality.
“Human-based” can be broken down into two more categories which are specific tasks
or general tasks. For specific tasks, studies can be designed to evaluate specific applications.
An example of this would be explaining predictions for ICU duration. The other category
is when the application is more general or not known. This is when the desire is to test
the explanation method in a general setting. Examples of this are Grad-CAM++ [93], SHAP [80], and LIME [85], all of which were tested with randomized human studies using Amazon Mechanical Turk. The authors produced heatmaps for 5 classes in the ImageNet testing set totaling 250 images. These maps, along with their corresponding original images, were shown to 13 human subjects, who were asked which explanation algorithm invoked more trust in the underlying model. They consider the algorithm with more votes to be the model with greater trustworthiness.
Sometimes designing and carrying out human studies is infeasible or inconvenient
due to various reasons such as IRB, recruitment, or funding. There are also cases where
human-based studies do not measure what is desired. In these studies, a formal definition of explanation is used as a proxy for explanation quality. The non-triviality here is choosing which proxy is best to use. Some examples are showing increased performance for a method that has been deemed explainable based on human studies, or showing that a method performs better with respect to certain regularizers. An example of this approach is saliency maps, which measure explanation quality using a localization task on benchmark datasets.
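A minimal sketch of the saliency map idea [87], assuming a generic differentiable PyTorch classifier `model` (hypothetical); the gradient of the class score with respect to the input pixels is used as the explanation heatmap:

```python
# Minimal sketch of a vanilla gradient saliency map [87]: the magnitude of the
# class-score gradient with respect to each input pixel serves as the heatmap.
# `model` is assumed to be a differentiable image classifier (hypothetical).
import torch

def saliency_map(model, image, target_class):
    """image: tensor of shape (1, C, H, W); returns an (H, W) saliency heatmap."""
    model.eval()
    image = image.clone().requires_grad_(True)
    score = model(image)[0, target_class]             # unnormalized class score
    score.backward()                                  # gradient of the score w.r.t. pixels
    return image.grad.abs().max(dim=1)[0].squeeze(0)  # max over channels -> (H, W)
```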
3.1. Post-Hoc
A majority of deep learning cancer detection models utilize post-hoc methodologies
due to their ease of implementation. Post-hoc methods can be used to analyze the learned
features of the model and to diagnose overfitting. This provides a feedback loop for the
researcher and allows them to improve the model. It also shows the discriminative areas of
the image. This section will review recent studies that apply post-hoc explanation methods
to breast cancer, prostate cancer, lung cancer, brain cancer, and liver cancer.
The predecessor for post-hoc techniques such as Respond-CAM, Pyramid-CAM, Grad-
CAM, Grad-CAM+, and Grad-CAM++ is class activation maps (CAM) which generates a
heatmap using global average pooling. In [99], the method shows the region of the image used to identify the label by outputting the spatial average of the feature maps of each unit at the last layer. It was tested as weakly supervised localization on the ILSVRC dataset with a top-5 error rate of 37.1 percent, which is only 3 percent away from the top score.
CAM is formally defined as follows. Let $f_k(x, y)$ represent the activation of unit $k$ in the last convolutional layer at spatial location $(x, y)$. The result of performing global average pooling is $F_k = \sum_{x,y} f_k(x, y)$; therefore, for a given class $c$, the input to the softmax, $S_c$, is $\sum_k w_k^c F_k$, where $w_k^c$ is the weight corresponding to class $c$ for unit $k$. Essentially, $w_k^c$ indicates the importance of $F_k$ for class $c$. Finally, the output of the softmax for class $c$, $P_c$, is given by $\frac{\exp(S_c)}{\sum_c \exp(S_c)}$. By plugging $F_k = \sum_{x,y} f_k(x, y)$ into the class score, $S_c$, we obtain

$$S_c = \sum_k w_k^c \sum_{x,y} f_k(x, y) = \sum_{x,y} \sum_k w_k^c f_k(x, y) \tag{1}$$

The authors define $M_c$ as the class activation map for class $c$, where each spatial element is given by

$$M_c(x, y) = \sum_k w_k^c f_k(x, y) \tag{2}$$
Grad-CAM [84] extends CAM to architectures without global average pooling by weighting feature maps with gradient information; the neuron-importance weight for feature map $k$ and class $c$ is

$$\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k} \tag{3}$$

where $y^c$ is the score for class $c$, $A^k$ is the $k$-th feature map of the last convolutional layer, and $Z$ is the number of spatial locations. The Grad-CAM heatmap is then the ReLU of the weighted combination of feature maps,

$$L_{\text{Grad-CAM}}^c = \mathrm{ReLU}\left(\sum_k \alpha_k^c A^k\right) \tag{4}$$
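The following sketch illustrates Equations (2)–(4) with plain NumPy, assuming the feature maps, final-layer class weights, and gradients have already been extracted from a trained CNN (hypothetical inputs); it is an illustration of the formulas, not the original authors' code:

```python
# Sketch of CAM (Equation (2)) and Grad-CAM (Equations (3) and (4)) in plain
# NumPy; the feature maps, final-layer class weights, and gradients are assumed
# to have been extracted from a trained CNN beforehand (hypothetical inputs).
import numpy as np

def cam(feature_maps, class_weights):
    """feature_maps: (K, H, W) activations f_k of the last convolutional layer.
    class_weights: (K,) weights w_k^c of the target class in the final FC layer.
    Returns M_c(x, y) = sum_k w_k^c f_k(x, y)."""
    return np.tensordot(class_weights, feature_maps, axes=1)     # (H, W)

def grad_cam(feature_maps, gradients):
    """gradients: (K, H, W) gradients of the class score w.r.t. each feature map."""
    alphas = gradients.mean(axis=(1, 2))                         # Equation (3)
    heatmap = np.tensordot(alphas, feature_maps, axes=1)         # weighted combination
    return np.maximum(heatmap, 0)                                # ReLU, Equation (4)
```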
The authors of [29] propose a modified version of CAM called pyramid gradient-based class activation mapping (PG-CAM). They use a densely connected encoder–decoder-based feature pyramid network (DC-FPN) as a backbone structure and extract a multi-scale Grad-CAM that captures hierarchical features of a tumor. Mathematically, PG-CAM is defined as follows:

$$\mathrm{PGCAM}_c = \sum_{p=1}^{5} \mathrm{ReLU}\left(\sum_{k_p} \sum_i \sum_j \frac{\partial L_c}{\partial f_{ij}^{s_p}} f_{k_p}(X, Y)\right) = \sum_{p=1}^{5} \mathrm{GradCAM}_c^{s_p} \tag{5}$$

where $k_p$ is the unit of the feature map $s_p$ and $\mathrm{GradCAM}_c^{s_p}$ is a Grad-CAM generated from the feature maps delivered at each scale $p$; the PG-CAM aggregating these Grad-CAMs will consequently contain information on multiple scales of the feature pyramid.
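A small sketch of the multi-scale aggregation idea in Equation (5): per-scale Grad-CAM heatmaps (assumed to have been computed beforehand) are resized to a common resolution and summed. This illustrates the aggregation step only, not the DC-FPN backbone:

```python
# Sketch of the multi-scale aggregation behind PG-CAM (Equation (5)): per-scale
# Grad-CAM heatmaps are resized to a common grid and summed. The per-scale
# heatmaps are assumed to have been computed already (e.g., with grad_cam above).
import numpy as np
from skimage.transform import resize

def aggregate_pg_cam(scale_heatmaps, out_shape):
    """scale_heatmaps: list of 2D Grad-CAM arrays, one per pyramid scale.
    out_shape: (H, W) of the aggregated heatmap."""
    aggregated = np.zeros(out_shape)
    for heatmap in scale_heatmaps:
        aggregated += resize(heatmap, out_shape, order=1)   # upsample to common grid
    return np.maximum(aggregated, 0)
```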
Integrated Gradients has become popular due to its ability to explain models that use different data modalities, in addition to its ease of use, theoretical foundation, and computational efficiency. Formally, Integrated Gradients is defined as follows:

$$\mathrm{IG}_i(x) = (x_i - x_i') \times \int_{\alpha=0}^{1} \frac{\partial F(x' + \alpha \times (x - x'))}{\partial x_i} \, d\alpha \tag{6}$$

where $i$ is the feature, $x$ is the input, $x'$ is the baseline, and $\alpha$ is the interpolation constant by which features are perturbed. Note that this integral is often approximated due to computational costs and numerical limits.
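The following sketch shows the common Riemann-sum approximation of Equation (6) in PyTorch, assuming a generic differentiable classifier `model` (hypothetical) and an all-zeros baseline, which is a common default choice:

```python
# Sketch of the Riemann-sum approximation of Integrated Gradients (Equation (6)),
# assuming a differentiable PyTorch classifier `model` (hypothetical) and an
# all-zeros baseline.
import torch

def integrated_gradients(model, x, target_class, baseline=None, steps=50):
    """x: input tensor of shape (1, ...); returns attributions of the same shape."""
    if baseline is None:
        baseline = torch.zeros_like(x)
    total_grads = torch.zeros_like(x)
    for step in range(1, steps + 1):
        alpha = step / steps                                 # interpolation constant
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        score = model(point)[0, target_class]
        total_grads += torch.autograd.grad(score, point)[0]  # dF/dx_i at this point
    return (x - baseline) * total_grads / steps              # (x_i - x'_i) * average gradient
```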
Occlusion is a simple-to-perform, model-agnostic approach that reveals the feature importance of a model. It can reveal whether a model is overfitting and learning irrelevant features, as in the case of adversarial examples. Adversarial examples are inputs designed to cause the model to make a false decision; in that case, the model misclassifies the image (say, malignant as benign) despite the presence of discriminating features. Occluding all features (pixels) one-by-one and running the forward pass each time can be computationally expensive and can take several hours per image. It is common to use patches of sizes such as 5 × 5, 10 × 10, or even larger depending on the size of the target features and the computational resources available.
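A minimal occlusion-sensitivity sketch, assuming a generic PyTorch classifier `model` (hypothetical); a patch is slid over the image, filled with a constant value, and the resulting drop in class score is recorded:

```python
# Sketch of occlusion sensitivity: slide a patch over the image, fill it with a
# constant value, and record the drop in the class score. `model` is assumed to
# be a PyTorch image classifier (hypothetical); patch and stride are in pixels.
import torch

def occlusion_map(model, image, target_class, patch=10, stride=10, fill=0.0):
    """image: tensor of shape (1, C, H, W); returns an occlusion-importance heatmap."""
    model.eval()
    with torch.no_grad():
        base_score = model(image)[0, target_class].item()
        _, _, H, W = image.shape
        rows = (H - patch) // stride + 1
        cols = (W - patch) // stride + 1
        heatmap = torch.zeros(rows, cols)
        for i, top in enumerate(range(0, H - patch + 1, stride)):
            for j, left in enumerate(range(0, W - patch + 1, stride)):
                occluded = image.clone()
                occluded[:, :, top:top + patch, left:left + patch] = fill
                score = model(occluded)[0, target_class].item()
                heatmap[i, j] = base_score - score   # large drop = important region
    return heatmap
```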
CAM and Grad-CAM calculate the change in class score based on features extracted
by the model. These two methods then visualize the features that cause the largest change
in class score. On the other hand, occlusion occludes part of the image and then calculates
the change in class score using the occluded image. It repeats this for many patches of the
image and the patch that produces the largest change in class score is deemed the most
important. Both methods produce a heatmap highlighting the most discriminative areas of
the image. Therefore, the largest difference between the two techniques is how they
compute the most discriminative area of the image. The disadvantage of occlusion is that
it is computationally expensive and takes longer to compute than CAM or Grad-CAM.
The advantage of Grad-CAM and CAM is that they can compute their results in one
pass instead of many passes like occlusion techniques. The disadvantage of Grad-CAM
and CAM is that they only take into account the feature maps therefore could produce
misleading results whereas occlusion takes into account different patches of the image.
A summary of these methods is presented in Table 1.
produce Grad-CAM heatmaps for post-hoc explanation. They report accuracy ranging
from 0.69 to 0.88.
of an average error of 6.93 pixels. The authors also made the qualitative observation that
different explanation methods highlight different aspects of the resulting classification.
3.2. Ad-Hoc
Ad-Hoc explanation techniques modify the training procedure and/or network archi-
tecture to learn explainable features. An additional proposed approach to learn explainable
features was the work by the authors of [130] using supervised iterative descent. The
authors’ approach learns explainable features for image registration by incorporating a
shared modality-agnostic feature space and using segmentations to derive a weak supervi-
sory signal for registration. The authors report their experiments “demonstrate that these
features can be plugged into conventional iterative optimizers and are more robust than
state-of-the-art hand-crafted features for aligning MRI and CT images”. This is significant
with respect to medical imaging because of the multi-modality aspect which is common to
medical imaging.
The authors of [131] propose a two part explainable deep learning network. One part
extracts features and predicts while the other part generates an explanation. The authors
create a mathematical model to extract high-level explainable features. The explanation
model reports information such as what are the most important features, what is their rank,
and how does each one influence the prediction.
Capsule networks have recently been utilized due to the impression that they are more explainable compared to CNNs [132]. The authors of [133] state that convolutional neural networks struggle with explainability; therefore, they propose explainable capsules, or X-Caps. X-Caps utilize a capsule network architecture that is designed to be inherently explainable. The authors design the capsule network to learn an explainable feature space by encoding high-level visual attributes within the vectors of a capsule network to perform explainable image-based diagnosis. The authors report accuracy similar to that of unexplainable models, demonstrating that accuracy and explainability are not mutually exclusive. The authors define a routing sigmoid as follows:
$$r_{i,j} = \frac{\exp(b_{i,j})}{\exp(b_{i,j}) + 1} \tag{7}$$

where $r_{i,j}$ are the routing coefficients determined by the dynamic routing algorithm for child capsule $i$ to parent capsule $j$, and the initial logits $b_{i,j}$ are the prior probabilities that the prediction vector for capsule $i$ should be routed to parent capsule $j$.
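For illustration, the routing sigmoid in Equation (7) is simply an element-wise logistic sigmoid over the routing logits; the sketch below contrasts it with the softmax used in standard dynamic routing (our comparison, not code from [133]):

```python
# Sketch of the X-Caps routing sigmoid (Equation (7)), which is an element-wise
# logistic sigmoid over the routing logits b_ij, contrasted with the softmax
# used in standard dynamic routing (shown only for comparison).
import numpy as np

def routing_sigmoid(b):
    """b: array of routing logits b_ij; returns routing coefficients r_ij."""
    return np.exp(b) / (np.exp(b) + 1.0)     # equivalent to 1 / (1 + exp(-b))

def standard_routing_softmax(b):
    """Softmax over parent capsules j (last axis), as in standard dynamic routing."""
    e = np.exp(b - b.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```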
The authors also regularize training with a reconstruction loss computed over the input image masked by the lesion segmentation:

$$L_r = \frac{\gamma}{H \times W} \sum_x^W \sum_y^H \left\| R_{x,y} - O_{x,y}^r \right\| \quad \text{with} \quad R_{x,y} = I_{x,y} \times S_{x,y}, \; S_{x,y} \in \{0, 1\} \tag{8}$$

where $I$ is the input image, $S$ is the binary segmentation mask, $O^r$ is the reconstruction, and $\gamma$ is a weighting factor.
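A hedged sketch of the masked reconstruction term in Equation (8); the choice of norm and the default value of gamma below are placeholder assumptions, since the equation leaves them unspecified:

```python
# Sketch of the masked reconstruction regularization in Equation (8): the
# reconstruction O_r is compared against the input image masked by the binary
# segmentation S, averaged over the H x W grid and weighted by gamma.
# The L1 norm and gamma default below are placeholder assumptions.
import torch

def masked_reconstruction_loss(image, segmentation, reconstruction, gamma=1.0):
    """image, segmentation, reconstruction: tensors of shape (H, W)."""
    H, W = image.shape
    target = image * segmentation            # R_xy = I_xy * S_xy, with S_xy in {0, 1}
    return gamma / (H * W) * (target - reconstruction).abs().sum()
```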
Let the critical response maps A( x |c) for a given CT scan slice image x for each
prediction c be computed via back-propagation from the last layer of the last explainable
sequencing cell in the proposed SISC radiomic sequencer. The notation used in this study
is based on the study done by the authors of [134] for consistency. The last layer in the
explainable sequencing cell at the end of the proposed SISC radiomic sequencer contains
N = 2 nodes, equal to the number of possible predictions (i.e., benign and malignant).
The output activations of this layer are followed by global average pooling and then a
softmax output layer. So, to create the critical maps for each possible prediction, the back-
propagation starts with the individual prediction nodes in the last layer to the input space.
For a single layer $l$, the deconvolved output response $h_l$ is given by

$$h_l = \sum_{k=1}^{K} f_{k,l} * p_{k,l} \tag{9}$$

where $f_{k,l}$ is the $k$-th feature map and $p_{k,l}$ is the corresponding kernel of layer $l$; the symbol $*$ represents the convolution operator. Therefore, the critical response map $A(x \mid c)$ for a given prediction $c$ is defined as

$$A(x \mid c) = D_1' M_1 \cdots D_{L-1} M_{L-1}' D_L^c F_L \tag{10}$$

where $M'$ is the unpooling operation and $D_L^c$ is the convolution operation at the last layer with kernel $p_L$ replaced by zero except at the $c$-th location corresponding to the prediction $c$.
The majority of ad-hoc explanation techniques attempt to learn an explainable fea-
ture space using modifications to traditional architectures that vary from modifying loss
functions to modifying the procedure for computing convolutional features. Some papers
argue that the best approach is to move away from convolutional neural networks to capsule networks due to greater explainability with similar predictive performance, though data is limited [133].
4. Discussion
There has been significant research progress with regards to explainable deep learning
cancer detection systems. The majority of methods utilized in the literature are local,
post-hoc methods that are usually model and data specific. Current research demonstrates a trajectory heading towards explainable deep learning for cancer diagnosis, but there is still work to be done to fully realize the implementation of these systems in a clinical setting.
When a methodology provides an obviously wrong explanation it is important to
be able to probe more to find out what went wrong. There is little research around
this area despite being important for clinical implementation. This would help alleviate
misclassification errors and provide greater insight to the clinician. Being able to quantify
uncertainty in explanations is important for clinicians as well. This would give the clinician
a gauge for the amount of trust to have in an explanation. This would also help alleviate
potential errors introduced by explanation techniques.
Since the end user of explainability approaches for cancer detection will be clinicians
and their patients, it is essential to take into account what they consider to be explainable.
This is an important aspect that is often ignored by explanation studies. Most algorithms are
measured using quantitative methods such as log loss or human studies of non-clinicians
recruited from a service such as Amazon Mechanical Turk. There is a need to incorporate
clinicians into the design process of these algorithms and to evaluate the algorithms with
the doctors that will be using them. There are studies that attempt to design clinical
visualization development and evaluation with clinicians and researchers. The authors
of [139] introduce Clinical-Vis which is an EHR visualization based prototype system for
task-focused design evaluation of interactions between healthcare providers (HCPs) and
EHRs. They design a study where they recruit 14 clinicians. They compare the clinicians' interaction with their system against a baseline system. They qualitatively analyzed the differences between systems by having the clinicians think out loud during the assigned tasks. The authors also conducted a survey after the task was complete and analyzed the surveys. They measured time taken to arrive at a decision, accuracy of decision, confidence in decision, and TLX scores, which the authors define as “Self-reported mental demand,
physical demand, effort level, hurriedness, success and discouragement on a scale of 0–10”.
Overall, they found their system performed above baseline across all measures. The authors
of [59] perform an iterative design of an explanation system for recurrent neural networks
(RNN) including machine learning researchers and clinicians. They use feedback from
clinicians while designing the algorithm. They evaluate it using a mixture of techniques
including quantitative, qualitative, and case-studies. To the best of our knowledge, there
are no studies such as this for explanation techniques. There is a study that interviews a
cohort of clinicians to gain insight on what they consider to be explainable deep learning.
The authors of [86] interviewed 10 clinicians belonging to two groups—Intensive Care Unit
and Emergency Department. The goal of this study was to shed light on what clinicians
look for in explainable machine learning and developing a metric to measure it. The
authors interviewed each clinician for 45–60 min. They explored the clinicians' notions of explainability to determine what each clinician understood by the term and what he/she expected from ML models in their specific clinical setting. They designed a hypothetical
scenario regarding the use of an explainable deep learning system to gain insights into
what the clinicians looked for in explainable systems. The authors found that the clinicians
viewed explainability as a means of justifying their clinical decision-making (for instance,
to patients and colleagues) in the context of the model’s prediction. The clinicians desired
to understand the clinically relevant model features that align with current evidence-
based medical practice. The implemented system/model needs to provide clinicians with
information about the context within which the model operates and promote awareness
of situations where the model may fall short (e.g., model did not use specific history or
did not have information around a certain aspect of a patient). Models that lacked accuracy were deemed suitable as long as an explanation of the prediction was provided. The authors state that a quote that stood out was “the variables that have derived the decision of the model”, which was brought up by three ICU and one ED clinician. The authors report that the clinical thought process for acting on predictions of any assistive tool appears to consist of two primary steps following presentation of the model's prediction: (i) understanding, and (ii) rationalizing the predictions. The clinicians stated they wanted to see both global and local explanations. Clinicians found finding similar training examples to be useful in certain applications such as diagnosis but not as much in
other cases. The clinicians reported one of the most sought after features was a measure
for uncertainty. ‘Patient trajectories that are influential in driving model predictions’ was
reported to be an important aspect of model explanation. These studies provide direction
for future algorithmic development as well as a call for more research regarding the clinical
performance of current explanation techniques.
Future Directions
There are numerous works that propose novel explanation methods but little focus
on what makes an explanation suitable for the context in which it will be applied. This is essential for
domains such as cancer diagnosis. Furthermore, more studies are needed to identify what
clinicians deem explainable and use these insights as a measuring stick for explanation
methods for deep learning cancer detection systems. Another important direction is
using these insights to derive quantitative explanation metrics. The majority of metrics
used to measure explanation quality are designed without clinicians involved. To create
explanation techniques suitable for clinician integration is is necessary to measure them
in a way guided by end-user criteria. Explanation techniques also can be useful to extract
clinically relevant features. There are studies that use explanation techniques to localize
lesions but these models do not extract information such as shape, volume, area, and
other relevant characteristics. A future opportunity is to extract these features without
computationally expensive segmentation. With this, clinicians do not need to extract these
features manually. If explanation techniques are implemented into clinical settings, the
system can automatically extract these characteristics for the clinicians thus aiding in the
diagnosis process.
The common approach for providing an explanation for image classification is to
produce a heatmap showing the most discriminative region. More work needs to be done
to provide greater insight than using this direct approach. For example, it would provide
more insight if explanation methods could show why a classification was not made and quantify the uncertainty of the explanation. Being able to reason for and against
is also important to provide greater insight and explainability. More studies should be
conducted on where current methods fail and why. This would provide the community
more insight on how to create more robust explanation methods. It would also provide
greater confidence in the methods we currently have.
Currently, the majority of cancer detection studies utilize post-hoc techniques. As
stated in [33,140] it is advantageous to develop intrinsically explainable (i.e., ad-hoc)
deep learning models. Post-hoc explanation methods may be useful for diagnosing the
deep network model but the insight gained is unsubstantial. A model suited for clinical
implementation should be fully explainable with the predictive power of modern deep
learning approaches. Post-hoc methods only explain where the network is looking which is
not enough information for high stakes decision making. Post-hoc methods are fragile and
easily manipulated [141,142] and can provide misleading results which hinders clinical
implementation. Some approaches attempting ad-hoc models were discussed in the paper
with results that show that intrinsically explainable models can achieve accuracy similar
to that of unexplainable models. More research is needed to verify this and evaluate the
clinical performance.
5. Conclusions
A review of explainability techniques for cancer detection using MR images was
presented. Different evaluation methods for these methods were discussed with their
respective significance. In addition to this, what clinicians deem to be explainable was
also discussed along with the gap between what current methods provide versus what
clinicians want. The importance of designing explanation methods with this in mind was
highlighted. A critical analysis was then presented to conclude, and future efforts were suggested based on the current state of the literature.
In summary, there is a need to incorporate clinicians in the design process for these
algorithms and a need to build intrinsically explainable deep learning algorithms for cancer
detection. The majority of explanation techniques are designed for general use and do
not incorporate context. This is important for clinical use cases due to the unique, high-
risk environment. It is important to evaluate proposed methods alongside clinicians to
determine the clinical strengths and weaknesses of the methods. This has the potential to
help guide future efforts. Furthermore, there is a need to go beyond basic visualization
of the discriminative regions. Some promising future directions include quantifying the
explanation uncertainty, providing counter examples, and designing ad-hoc models that
are intrinsically explainable.
Author Contributions: All authors contributed equally to this work. All authors have read and
agreed to the published version of the manuscript.
Funding: The authors received no funding for this work.
Informed Consent Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Kooi, T.; van Ginneken, B.; Karssemeijer, N.; den Heeten, A. Discriminating solitary cysts from soft tissue lesions in mammography
using a pretrained deep convolutional neural network. Med. Phys. 2017, 44, 1017–1027. [CrossRef]
2. Akselrod-Ballin, A.; Karlinsky, L.; Alpert, S.; Hasoul, S.; Ben-Ari, R.; Barkan, E. A Region Based Convolutional Network for
Tumor Detection and Classification in Breast Mammography, in Deep Learning and Data Labeling for Medical Applications; Springer:
Berlin/Heidelberg, Germany, 2016; pp. 197–205.
3. Zhou, X.; Kano, T.; Koyasu, H.; Li, S.; Zhou, X.; Hara, T.; Matsuo, M.; Fujita, H. Automated assessment of breast tissue density
in non-contrast 3D CT images without image segmentation based on a deep CNN. In Medical Imaging 2017: Computer-Aided
Diagnosis; International Society for Optics and Photonics: Bellingham, WA, USA, 2017; Volume 10134.
4. Gao, F.; Wu, T.; Li, J.; Zheng, B.; Ruan, L.; Shang, D.; Patel, B. SD-CNN: A shallow-deep CNN for improved breast cancer
diagnosis. Comput. Med. Imaging Graph. 2018, 70, 53–62. [CrossRef] [PubMed]
5. Li, J.; Fan, M.; Zhang, J.; Li, L. Discriminating between benign and malignant breast tumors using 3D convolutional neural
network in dynamic contrast enhanced-MR images. In Medical Imaging 2017: Imaging Informatics for Healthcare, Research, and
Applications; International Society for Optics and Photonics: Bellingham, WA, USA, 2017; Volume 10138.
6. Becker, A.S.; Marcon, M.; Ghafoor, S.; Wurnig, M.C.; Frauenfelder, T.; Boss, A. Deep Learning in Mammography: Diagnostic
Accuracy of a Multipurpose Image Analysis Software in the Detection of Breast Cancer. Investig. Radiol. 2017, 52, 434–440.
[CrossRef] [PubMed]
7. Jiang, F.; Jiang, Y.; Zhi, H.; Dong, Y.; Li, H.; Ma, S.; Wang, Y.; Dong, Q.; Shen, H.; Wang, Y. Artificial intelligence in healthcare past,
present and future. Stroke Vasc. Neurol. 2017, 2. [CrossRef] [PubMed]
8. Chiang, T.-C.; Huang, Y.-S.; Chen, R.-T.; Huang, C.-S.; Chang, R.-F. Tumor Detection in Automated Breast Ultrasound Using 3-D
CNN and Prioritized Candidate Aggregation. IEEE Trans. Med. Imaging 2019, 38, 240–249. [CrossRef] [PubMed]
9. Langlotz, C.P.; Allen, B.; Erickson, B.J.; Kalpathy-Cramer, J.; Bigelow, K.; Flanders, T.S.C.A.E.; Lungren, M.P.; Mendelson, D.S.;
Rudie, J.D.; Wang, G.; et al. A Roadmap for Foundational Research on Artificial Intelligence in Medical Imaging. Radiology 2019,
291, 781–791. [CrossRef] [PubMed]
10. Gao, J.; Jiang, Q.; Zhou, B.; Chen, D. Convolutional neural networks for computer-aided detection or diagnosis in medical image
analysis: An overview. Math. Biosci. Eng. 2019, 16, 6536–6561. [CrossRef]
11. Kooi, T.; Litjens, G.; van Ginneken, B.; Gubern-Mérida, A.; Sánchez, C.I.; Mann, R.; den Heeten, A.; Karssemeijer, N. Large scale
deep learning for computer aided detection of mammographic lesions. Med. Image Anal. 2017, 35, 303–312. [CrossRef]
12. Elsisi, M.; Tran, M.-Q.; Mahmoud, K.; Lehtonen, M.; Darwish, M.M.F. Deep Learning-Based Industry 4.0 and Internet of Things
towards Effective Energy Management for Smart Buildings. Sensors 2021, 21, 1038. [CrossRef]
13. Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the 13th European Conference,
Zurich, Switzerland, 6–12 September 2014; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2014; Volume 8689.
[CrossRef]
14. Zintgraf, L.M.; Cohen, T.S.; Adel, T.; Welling, M. Visualizing deep neural network decisions: Prediction difference analysis.
In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
15. Elsisi, M.; Mahmoud, K.; Lehtonen, M.; Darwish, M.M.F. Reliable Industry 4.0 Based on Machine Learning and IoT for Analyzing,
Monitoring, and Securing Smart Meters. Sensors 2021, 21, 487. [CrossRef]
16. Ali, M.N.; Mahmoud, K.; Lehtonen, M.; Darwish, M.M.F. Promising MPPT Methods Combining Metaheuristic, Fuzzy-Logic and
ANN Techniques for Grid-Connected Photovoltaic. Sensors 2021, 21, 1244. [CrossRef]
17. Khan, S.; Islam, N.; Jan, Z.; Din, I.U.; Rodrigues, J.J.P.C. A novel deep learning based framework for the detection and classification
of breast cancer using transfer learning. Pattern Recognit. Lett. 2019, 125, 1–6.
18. Shakeel, P.M.; Burhanuddin, M.A.; Desa, M.I. Lung cancer detection from CT image using improved profuse clustering and deep
learning instantaneously trained neural networks. Measurement 2019, 145, 702–712. [CrossRef]
19. Nam, J.G.; Park, S.; Hwang, E.J.; Lee, J.H.; Jin, K.N.; Lim, K.Y.; Vu, T.H.; Sohn, J.H.; Hwang, S.; Goo, J.M.; et al. Development and
Validation of Deep Learning-based Automatic Detection Algorithm for Malignant Pulmonary Nodules on Chest Radiographs.
Radiology 2019, 290, 218–228. [CrossRef]
20. Hoofnagle, C.J.; Sloot, B.V.; Borgesius, F.Z. The European Union general data protection regulation: What it is and what it means.
Inf. Commun. Technol. Law 2019, 28, 65–98. [CrossRef]
21. Holzinger, A.; Langs, G.; Denk, H.; Zatloukal, K.; Müller, H. Causability and explainability of artificial intelligence in medicine.
Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2019, 9, e1312. [CrossRef]
22. Mitchell, M.; Wu, S.; Zaldivar, A.; Barnes, P.; Vasserman, L.; Hutchinson, B.; Spitzer, E.; Raji, I.D.; Gebru, T. Model cards for model
reporting. arXiv 2018, arXiv:1810.03993.
23. Chen, R.; Rodriguez, V.; Grossman Liu, L.V.; Mitchell, E.G.; Averitt, A.; Bear Don’t Walk, O.; Bhave, S.; Sun, T.; Thangaraj, P.
Columbia DBMI CMS AI Challenge Team. Engendering Trust and Usability in Clinical Prediction of Unplanned Admissions:
The CLinically Explainable Actionable Risk (CLEAR) Model. In Proceedings of the Conference on Machine Learning for Health
(MLHC), Stanford, CA, USA, 16–18 August 2018.
24. Zhang, X.; Solar-Lezama, A.; Singh, R. Interpreting neural network judgments via minimal, stable, and symbolic corrections.
In Proceedings of the 32nd International Conference on Neural Information Processing Systems NIPS2018, Montreal, QC, Canada,
3–8 December 2018; Curran Associates Inc.: Red Hook, NY, USA, 2018; pp. 4879–4890.
25. Nolan, M.E.; Cartin-Ceba, R.; Moreno-Franco, P.; Pickering, B.; Herasevich, V. A Multisite Survey Study of EMR Review Habits,
Information Needs, and Display Preferences among Medical ICU Clinicians Evaluating New Patients. Appl. Clin. Inform. 2017, 8,
1197–1207. [CrossRef]
26. Ahern, I.; Noack, A.; Guzman-Nateras, L.; Dou, D.; Li, B.; Huan, J. NormLime: A New Feature Importance Metric for Explaining
Deep Neural Networks. arXiv 2019, arXiv:1909.04200.
27. Hegselmann, S.; Volkert, T.; Ohlenburg, H.; Gottschalk, A.; Dugas, M.; Ertmer, C. An Evaluation of the Doctor-explainability of
Generalized Additive Models with Interactions. In Proceedings of the 5th Machine Learning for Healthcare Conference, PMLR,
Vienna, Austria, 7–8 August 2020; Volume 126, pp. 46–79.
28. Kowsari, K.; Jafari Meimandi, K.; Heidarysafa, M.; Mendu, S.; Barnes, L.; Brown, D. Text Classification Algorithms: A Survey.
Information 2019, 10, 150. [CrossRef]
29. Lee, S.; Lee, J.; Lee, J.; Park, C.; Yoon, S. Robust Tumor Localization with Pyramid Grad-CAM. arXiv 2018, arXiv:1805.11393.
30. Hwang, Y.; Lee, H.H.; Park, C.; Tama, B.A.; Kim, J.S.; Cheung, D.Y.; Chung, W.C.; Cho, Y.-S.; Lee, K.-M.; Choi, M.-G.; et al.
An Improved Classification and Localization Approach to Small Bowel Capsule Endoscopy Using Convolutional Neural Network.
Dig. Endosc. Off. J. Jpn. Gastroenterol. Endosc. Soc. 2021, 33, 598–607.
31. Shinde, S.; Chougule, T.; Saini, J.; Ingalhalikar, M. HR-CAM: Precise Localization of Pathology Using Multi-Level Learning in
CNNs. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2019, 22nd International
Conference, Shenzhen, China, 13–17 October 2019; Part IV, pp. 298–306.
32. Gulum, M.A.; Trombley, C.M.; Kantardzic, M. Multiple Explanations Improve Deep Learning Transparency for Prostate Lesion
Detection. In Proceedings of the DMAH 2020, Waikoloa, HI, USA, 31 August–4 September 2020.
33. Zachary, C.L. The mythos of model interpretability. Commun. ACM 2018, 61, 36–43. [CrossRef]
34. Tim, M. Explanation in Artificial Intelligence: Insights from the Social Sciences, arXiv e-prints, Computer Science—Artificial
Intelligence. arXiv 2017, arXiv:1706.07269.
35. Narayanan, M.; Chen, E.; He, J.; Kim, B.; Gershman, S.; Doshi-Velez, F. How do Humans Understand Explanations from Machine
Learning Systems? An Evaluation of the Human-explainability of Explanation. arXiv 2018, arXiv:1902.00006.
36. Hicks, S.; Riegler, M.; Pogorelov, K.; Anonsen, K.V.; de Lange, T.; Johansen, D.; Jeppsson, M.; Randel, K.R.; Eskeland, S.L.;
Halvorsen, P.; et al. Dissecting Deep Neural Networks for Better Medical Image Classification and Classification Understanding.
In Proceedings of the IEEE 31st International Symposium on Computer-Based Medical Systems (CBMS), Karlstad, Sweden, 18–21
June 2018; pp. 363–368. [CrossRef]
37. Zhang, Z.; Beck, M.W.; Winkler, D.A.; Huang, B.; Sib, A.W.; Goyal, H. Opening the black box of neural networks: Methods for
interpreting neural network models in clinical applications. Ann. Transl. Med. 2018, 6, 216. [CrossRef] [PubMed]
38. Bhatt, U.; Ravikumar, P.; Moura, J.M.F. Building Human-Machine Trust via Interpretability. In Proceedings of the AAAI
Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 9919–9920. [CrossRef]
39. Sun, Y.; Ravi, S.; Singh, V. Adaptive Activation Thresholding: Dynamic Routing Type Behavior for Interpretability in Convolu-
tional Neural Networks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul,
Korea, 27 October–3 November 2019; pp. 4937–4946.
40. Zhou, B.; Bau, D.; Oliva, A.; Torralba, A. Interpreting Deep Visual Representations via Network Dissection. IEEE Trans. Pattern
Anal. Mach. Intell. 2019, 41, 2131–2145. [CrossRef] [PubMed]
41. Jakab, T.; Gupta, A.; Bilen, H.; Vedaldi, A. Self-Supervised Learning of Interpretable Keypoints From Unlabelled Videos.
In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–18
June 2020; pp. 8784–8794.
42. Donglai, W.; Bolei, Z.; Antonio, T.; William, F. Understanding Intra-Class Knowledge Inside CNN. Computer Science—Computer
Vision and Pattern Recognition. arXiv 2015, arXiv:1507.02379.
43. Chang, C.; Creager, E.; Goldenberg, A.; Duvenaud, D. Explaining Image Classifiers by Counterfactual Generation. In Proceedings
of the International Conference on Learning Representations ICLR, 2019, Ernest N. Morial Convention Center, New Orleans, LA,
USA, 6–9 May 2019.
44. Yang, Y.; Song, L. Learn to Explain Efficiently via Neural Logic Inductive Learning. arXiv 2020, arXiv:1910.02481.
45. Oh, S.J.; Augustin, M.; Fritz, M.; Schiele, B. Towards Reverse-Engineering Black-Box Neural Networks. In Proceedings of the
ICLR: 2018, Vancouver Convention Center, Vancouver, BC, Canada, 30 April–3 May 2018.
46. Wang, T. Gaining Free or Low-Cost Transparency with explainable Partial Substitute. arXiv 2019, arXiv:1802.04346.
47. Kim, B.; Wattenberg, M.; Gilmer, J.; Cai, C.J.; Wexler, J.; Viégas, F.B.; Sayres, R. Explainability Beyond Feature Attribution:
Quantitative Testing with Concept Activation Vectors (TCAV). In Proceedings of the ICML, Stockholm, Sweden, 10–15 July 2018.
48. Chen, J.; Song, L.; Wainwright, M.; Jordan, M. Learning to Explain: An Information-Theoretic Perspective on Model explanation.
arXiv 2018, arXiv:1802.07814.
49. Singla, S.; Wallace, E.; Feng, S.; Feizi, S. Understanding Impacts of High-Order Loss Approximations and Features in Deep
Learning explanation. arXiv 2019, arXiv:1902.00407.
50. Schwab, P.; Karlen, W. CXPlain: Causal Explanations for Model explanation under Uncertainty. arXiv 2019, arXiv:1910.12336.
51. Guo, W.; Huang, S.; Tao, Y.; Xing, X.; Lin, L. Explaining Deep Learning Models-A Bayesian Non-parametric Approach. arXiv
2018, arXiv:1811.03422.
52. Lage, I.; Ross, A.; Gershman, S.J.; Kim, B.; Doshi-Velez, F. Human-in-the-Loop explainability Prior. arXiv 2018, arXiv:1805.11571.
53. Alvarez-Melis, D.; Jaakkola, T. Towards Robust explainability with Self-Explaining Neural Networks. arXiv 2018, arXiv:1806.07538.
54. Chen, C.; Li, O.; Barnett, A.; Su, J.; Rudin, C. This Looks Like That: Deep Learning for Explainable Image Recognition.
In Proceedings of the NeurIPS: 2019, Vancouver Convention Center, Vancouver, BC, Canada, 8–14 December 2019.
55. Shrikumar, A.; Greenside, P.; Kundaje, A. Learning Important Features Through Propagating Activation Differences. arXiv 2017,
arXiv:1704.02685.
56. Webb, S.J.; Hanser, T.; Howlin, B.; Krause, P.; Vessey, J.D. Feature combination networks for the explanation of statistical machine
learning models: Application to Ames mutagenicity. J. Cheminform. 2014, 6, 8. [CrossRef]
57. Elish, M.C. The Stakes of Uncertainty: Developing and Integrating Machine Learning in Clinical Care. In Proceedings of the 2018
EPIC Proceedings, Oxford, UK, 7 July 2018; University of Oxford–Oxford Internet Institute: Oxford, UK.
58. Ahmad, A.M.; Teredesai, E.C. Interpretable Machine Learning in Healthcare. In Proceedings of the 2018 IEEE International
Conference on Healthcare Informatics (ICHI), New York, NY, USA, 4–7 June 2018; p. 447. [CrossRef]
59. Kwon, B.C.; Choi, M.J.; Kim, J.T.; Choi, E.; Kim, Y.B.; Kwon, S.; Sun, J.; Choo, J. RetainVis: Visual Analytics with explainable and
Interactive Recurrent Neural Networks on Electronic Medical Records. Proc. IEEE Trans. Vis. Comput. Graph. 2019, 25, 299–309.
[CrossRef]
60. Fong, R.C.; Vedaldi, A. Explainable explanations of black boxes by meaningful perturbation. In Proceedings of the 2017 IEEE
International Conference on Computer Vision (ICCV), Venice, Italy, 27–29 October 2017; pp. 3449–3457.
61. Binder, A.; Montavon, G.; Lapuschkin, S.; Muller, K.; Samek, W. Layer-Wise Relevance Propagation for Neural Networks with
Local Renormalization Layers. In Proceedings of the 25th International Conference on Artificial Neural Networks and Machine
Learning (ICANN 2016), Barcelona, Spain, 6–9 September 2016.
62. Montavon, G.; Bach, S.; Binder, A.; Samek, W.; Müller, K. Explaining nonlinear classification decisions with deep Taylor
decomposition. Pattern Recognit. 2017, 65, 211–222. [CrossRef]
63. Elshawi, R.; Al-Mallah, M.H.; Sakr, S. On the explainability of machine learning-based model for predicting hypertension. BMC
Med. Inform. Decis. Mak. 2019, 19, 146. [CrossRef]
64. Gale, W.; Oakden-Rayner, L.; Gustavo, C.; Andrew, P.B.; Lyle, J.P. Producing radiologist-quality reports for explainable artificial
intelligence. arXiv 2018, arXiv:1806.00340.
65. Xie, Y.; Chen, X.A.; Gao, G. Outlining the Design Space of Explainable Intelligent Systems for Medical Diagnosis. In Proceedings
of the ACM IUI 2019, Los Angeles, CA, USA, 20 March 2019.
66. Xie, P.; Zuo, K.; Zhang, Y.; Li, F.; Yin, M.; Lu, K. Interpretable Classification from Skin Cancer Histology Slides Using Deep
Learning: A Retrospective Multicenter Study. arXiv 2019, arXiv:1904.06156.
67. Cruz-Roa, A.; Arevalo, J.; Madabhushi, A.; González, F. A Deep Learning Architecture for Image Representation, Visual
explainability and Automated Basal-Cell Carcinoma Cancer Detection. In Proceedings of the MICCAI International Conference
on Medical Image Computing and Computer-Assisted Intervention, Nagoya, Japan, 22–26 September 2013.
68. Zhang, R.; Weber, C.; Grossman, R.; Khan, A.A. Evaluating and interpreting caption prediction for histopathology images.
In Proceedings of the 5th Machine Learning for Healthcare Conference, in PMLR, Online Metting, 7–8 August 2020; Volume 126,
pp. 418–435.
69. Hao, J.; Kim, Y.; Mallavarapu, T.; Oh, J.H.; Kang, M. Explainable deep neural network for cancer survival analysis by integrating
genomic and clinical data. BMC Med. Genom. 2019, 12, 189. [CrossRef]
70. Al-Hussaini, I.; Xiao, C.; Westover, M.B.; Sun, J. SLEEPER: Interpretable Sleep staging via Prototypes from Expert Rules.
In Proceedings of the 4th Machine Learning for Healthcare Conference, in PMLR, Ann Arbor, MI, USA, 9–10 August 2019;
Volume 106, pp. 721–739.
71. Essemlali, A.; St-Onge, E.; Descoteaux, M.; Jodoin, P. Understanding Alzheimer disease’s structural connectivity through
explainable AI. In Proceedings of the Third Conference on Medical Imaging with Deep Learning, PMLR, Montreal, QC, Canada,
6–8 July 2020; Volume 121, pp. 217–229.
72. Louis, D.N.; Perry, A.; Reifenberger, G.; von Deimling, A.; Figarella-Branger, D.; Cavenee, W.K.; Ohgaki, H.; Wiestler, O.D.;
Kleihues, P.; Ellison, D.W. The 2016 World Health Organization Classification of Tumors of the Central Nervous System:
A summary. Acta Neuropathol. 2016, 131, 803–820. [CrossRef]
73. Li, X.; Cao, R.; Zhu, D. Vispi: Automatic Visual Perception and explanation of Chest X-rays. arXiv 2019, arXiv:1906.05190.
74. Grigorescu, I.; Cordero-Grande, L.; Edwards, A.; Hajnal, J.; Modat, M.; Deprez, M. Interpretable Convolutional Neural Networks
for Preterm Birth Classification. arXiv 2019, arXiv:1910.00071.
75. Tjoa, E.; Guan, C. A Survey on Explainable Artificial Intelligence (XAI): Toward Medical XAI. IEEE Trans. Neural Netw. Learn.
Syst. 2020. [CrossRef]
76. Singh, A.; Sengupta, S.; Lakshminarayanan, V. Explainable Deep Learning Models in Medical Image Analysis. J. Imaging 2020, 6,
52. [CrossRef]
77. Arrieta, A.B.; Díaz-Rodríguez, N.; Ser, J.D.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, R.;
et al. Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI. Inf.
Fusion 2020, 58, 82–115. [CrossRef]
78. Singh, C.; Murdoch, W.J.; Yu, B. Hierarchical explanations for neural network predictions. arXiv 2019, arXiv:1806.05337.
79. Adadi, A.; Berrada, M. Peeking inside the black-box: A survey on Explainable Artificial Intelligence (XAI). IEEE Access 2018.
[CrossRef]
80. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Proceedings of the 31st International
Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 4–9 December 2017; Curran Associates
Inc.: Red Hook, NY, USA; pp. 4768–4777.
81. de Sousa, I.P.; Rebuzzi Vellasco, M.B.; da Silva, E.C. Local explainable Model-Agnostic Explanations for Classification of Lymph
Node Metastases. Sensors 2019, 19, 2969. [CrossRef] [PubMed]
82. Yang, C.; Rangarajan, A.; Ranka, S. Global Model Interpretation Via Recursive Partitioning. In Proceedings of the 2018 IEEE 20th
International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart
City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), Exeter, UK, 28–30 June 2018;
pp. 1563–1570. [CrossRef]
83. Garson, G.D. Interpreting neural network connection weights. Artif. Intell. Expert 1991, 6, 46–51.
84. Selvaraju, R.R.; Das, A.; Vedantam, R.; Cogswell, M.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks
via Gradient-Based Localization. Int. J. Comput. Vis. 2019, 128, 336–359. [CrossRef]
85. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the
22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August
2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 1135–1144.
86. Tonekaboni, S.; Joshi, S.; McCradden, M.D.; Goldenberg, A. What Clinicians Want: Contextualizing Explainable Machine Learning
for Clinical End Use. In Proceedings of the 4th Machine Learning for Healthcare Conference, in PMLR, Ann Arbor, MI, USA,
9–10 August 2019; Volume 106, pp. 359–380.
87. Simonyan, K.; Vedaldi, A.; Zisserman, A. Deep inside convolutional networks: Visualising image classification models and
saliency maps. arXiv 2013, arXiv:1312.6034.
88. Kim, B.; Seo, J.; Jeon, S.; Koo, J.; Choe, J.; Jeon, T. Why are Saliency Maps Noisy? Cause of and Solution to Noisy Saliency
Maps. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Korea,
27 October–2 November 2019; pp. 4149–4157. [CrossRef]
89. Zhao, G.; Zhou, B.; Wang, K.; Jiang, R.; Xu, M. Respond-CAM: Analyzing Deep Models for 3D Imaging Data by Visualizations; Medical
Image Computing and Computer Assisted Intervention—MICCAI: Granada, Spain, 2018; pp. 485–492.
90. Doshi-Velez, F.; Kim, B. Towards A Rigorous Science of Interpretable Machine Learning. arXiv 2017, arXiv:1702.08608.
91. Narayanan, B.; Silva, M.S.; Hardie, R.; Kueterman, N.K.; Ali, R.A. Understanding Deep Neural Network Predictions for Medical
Imaging Applications. arXiv 2019, arXiv:1912.09621.
92. Tomsett, R.J.; Braines, D.; Harborne, D.; Preece, A.; Chakraborty, S. Interpretable to Whom? A Role-based Model for Analyzing
Interpretable Machine Learning Systems. arXiv 2018, arXiv:1806.07552.
93. Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-CAM++: Generalized Gradient-Based Visual Explana-
tions for Deep Convolutional Networks. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision
(WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 839–847. [CrossRef]
94. Preece, A.D.; Harborne, D.; Braines, D.; Tomsett, R.; Chakraborty, S. Stakeholders in Explainable AI. arXiv 2018, arXiv:1810.00184.
95. Hooker, S.; Erhan, D.; Kindermans, P.; Kim, B. A Benchmark for Interpretability Methods in Deep Neural Networks. In Proceedings
of the NeurIPS, Vancouver, BC, Canada, 8–14 December 2019.
96. Cassel, C.K.; Jameton, A.L. Dementia in the elderly: An analysis of medical responsibility. Ann. Intern. Med. 1981, 94, 802–807.
[CrossRef] [PubMed]
97. Croskerry, P.; Cosby, K.; Graber, M.L.; Singh, H. Diagnosis: Interpreting the Shadows; CRC Press: Boca Raton, FL, USA, 2017; 386p,
ISBN 9781409432333.
98. Kallianos, K.; Mongan, J.; Antani, S.; Henry, T.; Taylor, A.; Abuya, J.; Kohli, M. How far have we come? Artificial intelligence for
chest radiograph interpretation. Clin. Radiol. 2019, 74, 338–345. [CrossRef]
99. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning Deep Features for Discriminative Localization. In Proceedings
of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016;
pp. 2921–2929. [CrossRef]
100. Windisch, P.; Weber, P.; Fürweger, C.; Ehret, F.; Kufeld, M.; Zwahlen, D.; Muacevic, A. Implementation of model explainability for a
basic brain tumor detection using convolutional neural networks on MRI slices. Neuroradiology 2020. [CrossRef]
101. Wang, C.J.; Hamm, C.A.; Savic, L.J.; Ferrante, M.; Schobert, I.; Schlachter, T.; Lin, M.; Weinreb, J.C.; Duncan, J.S.; Chapiro, J.; et al.
Deep learning for liver tumor diagnosis part II: Convolutional neural network interpretation using radiologic imaging features.
Eur. Radiol. 2019, 29, 3348–3357. [CrossRef]
102. Giger, M.L.; Karssemeijer, N.; Schnabel, J.A. Breast image analysis for risk assessment, detection, diagnosis, and treatment of
cancer. Annu. Rev. Biomed. Eng. 2013, 15, 327–357. [CrossRef]
103. Samala, R.K.; Chan, H.P.; Hadjiiski, L.M.; Cha, K.; Helvie, M.A. Deep-learning convolution neural network for computer-aided
detection of microcalcifications in digital breast tomosynthesis. In Proceedings of the Medical Imaging 2016: Computer-Aided
Diagnosis, San Diego, CA, USA, 27 February–3 March 2016; International Society for Optics and Photonics: Bellingham, WA,
USA, 2016; Volume 9785, p. 97850Y.
104. Posada, J.G.; Zapata, D.M.; Montoya, O.L.Q. Detection and Diagnosis of Breast Tumors using Deep Convolutional Neural
Networks. In Proceedings of the XVII Latin American Conference on Automatic Control, Medellín, Colombia, 13–15 October
2016; pp. 11–17.
105. Dhungel, N.; Carneiro, G.; Bradley, A.P. The Automated Learning of Deep Features for Breast Mass Classification from Mammo-
grams. In Medical Image Computing and Computer-Assisted Intervention, Proceedings of the MICCAI 2016 19th International Conference,
Athens, Greece, 17–21 October 2016; Part I; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2016;
Volume 9901. [CrossRef]
106. Samala, R.K.; Chan, H.-P.; Hadjiiski, L.; Helvie, M.A.; Richter, C.D.; Cha, K.H. Breast Cancer Diagnosis in Digital Breast
Tomosynthesis: Effects of Training Sample Size on Multi-Stage Transfer Learning Using Deep Neural Nets. IEEE Trans. Med.
Imaging 2019, 38, 686–696. [CrossRef]
107. Samala, R.K.; Chan, H.P.; Hadjiiski, L.; Helvie, M.A.; Wei, J.; Cha, K. Mass detection in digital breast tomosynthesis: Deep
convolutional neural network with transfer learning from mammography. Med. Phys. 2016, 43, 6654. [CrossRef]
108. Zhou, Y.; Xu, J.; Liu, Q.; Li, C.; Liu, Z.; Wang, M.; Zheng, H.; Wang, S. A Radiomics Approach with CNN for Shear-wave
Elastography Breast Tumor Classification. IEEE Trans. Biomed. Eng. 2018, 65, 1935–1942. [CrossRef]
109. Shen, Y.; Wu, N.; Phang, J.; Park, J.; Liu, K.; Tyagi, S.; Heacock, L.; Kim, S.G.; Moy, L.; Cho, K.; et al. An interpretable classifier for
high-resolution breast cancer screening images utilizing weakly supervised localization. arXiv 2020, arXiv:2002.07613.
110. Saffari, N.; Rashwan, H.A.; Abdel-Nasser, M.; Kumar Singh, V.; Arenas, M.; Mangina, E.; Herrera, B.; Puig, D. Fully Automated
Breast Density Segmentation and Classification Using Deep Learning. Diagnostics 2020, 10, 988. [CrossRef]
111. Singh, V.K.; Abdel-Nasser, M.; Akram, F.; Rashwan, H.A.; Sarker, M.M.K.; Pandey, N.; Romani, S.; Puig, D. Breast Tumor
Segmentation in Ultrasound Images Using Contextual-Information-Aware Deep Adversarial Learning Framework. Expert Syst.
Appl. 2020, 162, 113870. [CrossRef]
112. Wang, H.; Feng, J.; Bu, Q.; Liu, F.; Zhang, M.; Ren, Y.; Lv, Y. Breast Mass Detection in Digital Mammogram Based on Gestalt
Psychology. J. Healthc. Eng. 2018, 2018, 4015613. [CrossRef]
113. Ha, R.; Chin, C.; Karcich, J.; Liu, M.Z.; Chang, P.; Mutasa, S.; Van Sant, E.P.; Wynn, R.T.; Connolly, E.; Jambawalikar, S. Prior to
Initiation of Chemotherapy, Can We Predict Breast Tumor Response? Deep Learning Convolutional Neural Networks Approach
Using a Breast MRI Tumor Dataset. J. Digit Imaging 2019, 32, 693–701. [CrossRef]
114. Li, R.; Shinde, A.; Liu, A.; Glaser, S.; Lyou, Y.; Yuh, B.; Wong, J.; Amini, A. Machine Learning-Based Interpretation and Visualization
of Nonlinear Interactions in Prostate Cancer Survival. JCO Clin. Cancer Inform. 2020, 4, 637–646. [CrossRef]
115. Chen, Q.; Xu, X.; Hu, S.; Li, X.; Zou, Q.; Li, Y. A transfer learning approach for classification of clinical significant prostate cancers
from mpMRI scans. In Proceedings of the Medical Imaging 2017: Computer-Aided Diagnosis, Orlando, FL, USA, 11–16 February
2017; International Society for Optics and Photonics: Bellingham, WA, USA, 2017; Volume 10134, p. 101344F.
116. Li, W.; Li, J.; Sarma, K.V.; Ho, K.C.; Shen, S.; Knudsen, B.S.; Gertych, A.; Arnold, C.W. Path R-CNN for Prostate Cancer Diagnosis
and Gleason Grading of Histological Images. IEEE Trans. Med. Imaging 2019, 38, 945–954. [CrossRef]
117. Song, Y.; Zhang, Y.D.; Yan, X.; Liu, H.; Zhou, M.; Hu, B.; Yang, G. Computer-aided diagnosis of prostate cancer using a deep
convolutional neural network from multiparametric MRI. J. Magn. Reson. Imaging 2018, 48, 1570–1577. [CrossRef]
118. Wang, Z.; Liu, C.; Cheng, D.; Wang, L.; Yang, X.; Cheng, K.T. Automated Detection of Clinically Significant Prostate Cancer in
mp-MRI Images Based on an End-to-End Deep Neural Network. IEEE Trans. Med. Imaging 2018, 37, 1127–1139. [CrossRef]
119. Ishioka, J.; Matsuoka, Y.; Uehara, S.; Yasuda, Y.; Kijima, T.; Yoshida, S.; Yokoyama, M.; Saito, K.; Kihara, K.; Numao, N.; et al.
Computer-aided diagnosis of prostate cancer on magnetic resonance imaging using a convolutional neural network algorithm.
BJU Int. 2018, 122, 411–417. [CrossRef]
120. Kohl, S.A.; Bonekamp, D.; Schlemmer, H.; Yaqubi, K.; Hohenfellner, M.; Hadaschik, B.; Radtke, J.; Maier-Hein, K. Adversarial
Networks for the Detection of Aggressive Prostate Cancer. arXiv 2017, arXiv:1702.08014.
121. Yang, X.; Liu, C.; Wang, Z.; Yang, J.; Min, H.L.; Wang, L.; Cheng, K.T. Co-trained convolutional neural networks for automated
detection of prostate cancer in multi-parametric MRI. Med. Image Anal. 2017, 42, 212–227. [CrossRef]
122. Kwak, J.T.; Hewitt, S.M. Nuclear Architecture Analysis of Prostate Cancer via Convolutional Neural Networks. IEEE Access 2017,
5, 18526–18533.
123. Wang, X.; Yang, W.; Weinreb, J.; Han, J.; Li, Q.; Kong, X.; Yan, Y.; Ke, Z.; Luo, B.; Liu, T.; et al. Searching for prostate cancer by
fully automated magnetic resonance imaging classification: Deep learning versus non-deep learning. Sci. Rep. 2017, 7, 15415.
[CrossRef] [PubMed]
124. Liu, S.; Zheng, H.; Feng, Y.; Li, W. Prostate Cancer Diagnosis using Deep Learning with 3D Multiparametric MRI. In Proceedings
of the Medical Imaging 2017: Computer-Aided Diagnosis, Orlando, FL, USA, 11–16 February 2017; Volume 10134, p. 1013428.
125. Le, M.H.; Chen, J.; Wang, L.; Wang, Z.; Liu, W.; Cheng, K.T.; Yang, X. Automated diagnosis of prostate cancer in multi-parametric
MRI based on multimodal convolutional neural networks. Phys. Med. Biol. 2017, 62, 6497–6514. [CrossRef] [PubMed]
126. Akatsuka, J.; Yamamoto, Y.; Sekine, T.; Numata, Y.; Morikawa, H.; Tsutsumi, K.; Yanagi, M.; Endo, Y.; Takeda, H.; Hayashi, T.;
et al. Illuminating Clues of Cancer Buried in Prostate MR Image: Deep Learning and Expert Approaches. Biomolecules 2019, 9,
673. [CrossRef] [PubMed]
127. Yang, X.; Wang, Z.; Liu, C.; Le, H.M.; Chen, J.; Cheng, K.T.T.; Wang, L. Joint Detection and Diagnosis of Prostate Cancer in Multi-
Parametric MRI Based on Multimodal Convolutional Neural Networks. In Proceedings of the Medical Image Computing and
Computer Assisted Intervention, 20th International Conference, Quebec City, QC, Canada, 11–13 September 2017; pp. 426–434.
128. Venugopal, V.K.; Vaidhya, K.; Murugavel, M.; Chunduru, A.; Mahajan, V.; Vaidya, S.; Mahra, D.; Rangasai, A.; Mahajan, H.
Unboxing AI-Radiological Insights Into a Deep Neural Network for Lung Nodule Characterization. Acad. Radiol. 2020, 27, 88–95.
[CrossRef]
129. Papanastasopoulos, Z.; Samala, R.K.; Chan, H.P.; Hadjiiski, L.; Paramagul, C.; Helvie, M.A.; Neal, C.H. Explainable AI for
medical imaging: Deep-learning CNN ensemble for classification of estrogen receptor status from breast MRI. In Proceedings of
the Medical Imaging 2020: Computer-Aided Diagnosis, Oxford, UK, 20–21 January 2020; International Society for Optics and
Photonics: Bellingham, WA, USA, 2020; Volume 11314, p. 113140Z. [CrossRef]
130. Blendowski, M.; Heinrich, M.P. Learning interpretable multi-modal features for alignment with supervised iterative descent.
In Proceedings of the 2nd International Conference on Medical Imaging with Deep Learning, in PMLR, London, UK, 8–10 July
2019; Volume 102, pp. 73–83.
131. Pintelas, E.; Liaskos, M.; Livieris, I.E.; Kotsiantis, S.; Pintelas, P. Explainable Machine Learning Framework for Image Classification
Problems: Case Study on Glioma Cancer Prediction. J. Imaging 2020, 6, 37. [CrossRef]
132. Afshar, P.; Plataniotis, K.N.; Mohammadi, A. Capsule Networks’ Interpretability for Brain Tumor Classification Via Radiomics
Analyses. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September
2019; pp. 3816–3820. [CrossRef]
133. LaLonde, R.; Torigian, D.; Bagci, U. Encoding Visual Attributes in Capsules for Explainable Medical Diagnoses. In Proceedings of
the Medical Image Computing and Computer Assisted Intervention (MICCAI) 2020, Online, 4–8 October 2020.
134. Sankar, V.; Kumar, D.; Clausi, D.A.; Taylor, G.W.; Wong, A. SISC: End-to-End Interpretable Discovery Radiomics-Driven Lung
Cancer Prediction via Stacked Interpretable Sequencing Cells. IEEE Access 2019, 7, 145444–145454.
135. Shen, S.; Han, S.; Aberle, D.; Bui, A.; Hsu, W. An Interpretable Deep Hierarchical Semantic Convolutional Neural Network for
Lung Nodule Malignancy Classification. Expert Syst. Appl. 2019, 128, 84–95. [CrossRef]
136. Wu, J.; Zhou, B.; Peck, D.; Hsieh, S.; Dialani, V.; Mackey, L.; Patterson, G. DeepMiner: Discovering Interpretable
Representations for Mammogram Classification and Explanation. arXiv 2018, arXiv:1805.12323.
137. Xi, P.; Shu, C.; Goubran, R. Abnormality Detection in Mammography using Deep Convolutional Neural Networks. In Proceedings
of the 2018 IEEE International Symposium on Medical Measurements and Applications (MeMeA), Rome, Italy, 11–13 June 2018;
pp. 1–6. [CrossRef]
138. Zhen, S.H.; Cheng, M.; Tao, Y.B.; Wang, Y.F.; Juengpanich, S.; Jiang, Z.Y.; Jiang, Y.K.; Yan, Y.Y.; Lu, W.; Lue, J.M.; et al. Deep
Learning for Accurate Diagnosis of Liver Tumor Based on Magnetic Resonance Imaging and Clinical Data. Front. Oncol. 2020, 10,
680. [CrossRef]
139. Ghassemi, M.; Pushkarna, M.; Wexler, J.; Johnson, J.; Varghese, P. ClinicalVis: Supporting Clinical Task-Focused Design Evaluation.
arXiv 2018, arXiv:1810.05798.
140. Rudin, C. Please Stop Explaining Black Box Models for High Stakes Decisions. arXiv 2018, arXiv:1811.10154.
141. Arun, N.; Gaw, N.; Singh, P.; Chang, K.; Aggarwal, M.; Chen, B.; Hoebel, K.; Gupta, S.; Patel, J.; Gidwani, M.; et al. Assessing the
(Un)Trustworthiness of Saliency Maps for Localizing Abnormalities in Medical Imaging. arXiv 2020, arXiv:2008.02766.
142. Zhang, X.; Wang, N.; Shen, H.; Ji, S.; Luo, X.; Wang, T. Interpretable Deep Learning under Fire. In Proceedings of the 29th USENIX
Security Symposium (USENIX Security 20), Boston, MA, USA, 12–14 August 2020; pp. 1659–1676.