An Overview of Explanation Approaches For Deep Neural Networks (Ongoing Work-This Is A Draft)
Vahdat Abdelzad
University of Waterloo, Waterloo, Canada
[email protected]
[Figure 1: The set of explanations for an action, with interpretable subsets characterized by representational techniques such as natural language, visual representations, and logical expressions.]
Approaches devised for the explanation of DNNs (covered in Section 3) approach the problem from different perspectives and so have distinct characteristics. These approaches might be based entirely on a standalone concept, on a combination of different concepts, or on an advanced version of approaches used to explain non-deep ML models. Understanding and evaluating these approaches can be challenging because of the complexity involved. Furthermore, there is not much research that takes full advantage of these approaches for purposes such as improving already designed architectures, having explanation as a default output of DNNs, utilizing explanation in safety-critical systems, or learning new models based on explanation.

We believe that explaining these approaches in a structured manner (as provided in this article) can help to understand, evaluate, and use them better and more widely. It can also motivate future research in this direction. Therefore, we attempt to tackle the aforementioned issues by answering the following questions in this article:

• What does explanation mean?

• How does a specific explanation approach work?

• What are the characteristics (e.g., pros and cons) of a specific explanation approach?

• What methods are used for evaluating explanation approaches?

• What are other applications of explanation?

The rest of this article is organized as follows. We first clarify the meaning of explanation and interpretability and then discuss various explanation approaches in Section 3. In Section 4, we investigate the methods used to evaluate the characteristics of explanation approaches. Section 5 concentrates on applications arising from these approaches for final users. We cover future directions and conclude the article in Section 7.

2. Concepts

In this section, we elaborate on the meaning of explanation and interpretability and state which definitions are adopted in this article. Then, we discuss what kinds of explanation can be requested for DNNs. Finally, we categorize explanation approaches based on when and where they are applied.

2.1. Explanation and interpretability

We as humans have reasons for our decisions when they are requested by our peers or needed by ourselves to reassess our decisions. Such reasons are conceptually known as explanations. The same applies to machines, for which we want a description of how an action has been performed, for instance, a description of how a pedestrian has been recognized by a DNN in a given image [7].

Definition 1: An explanation is any kind of description that discloses the logic behind a phenomenon.

A phenomenon, for example, can be an action, a concept, or a model. In other words, explanation can be requested for anything that happens in a system (or even in our world). When an explanation can be understood by people, it is deemed interpretable. However, understanding is a relative concept and depends on many factors, such as the cognitive capabilities of people, the formats in which actions are explained (the kinds of representations), and the volume of information involved in the explanation. Therefore, interpretability becomes a relative concept as well. In other words, an explanation can turn interpretable or uninterpretable once one of these factors changes.

Figure 1 depicts an abstract view of interpretability. In this view, we assume that a set of explanations exists for an action (e.g., the prediction of a DNN).
Explanations can be simple or complex (the left and right sides of the ellipse). The boundary between simple and complex cannot be defined rigidly because each has a different meaning for different people. However, it is generally accepted that an explanation involving a lower volume of information is much simpler than one involving a higher volume.

Furthermore, explanation is represented via techniques such as visual representations, natural language, or logical expressions. These representational techniques can be used for both simple and complex explanations. For example, a tree-based representation can technically be used to explain simple or complex reasoning. The representational techniques also have their own complexity with respect to understanding them, which might vary among people. For example, natural language may be considered easier to understand than visual representations (or vice versa), or people might have different abilities to understand and explore an explanation based on rules [16].

We can see that complexity and representational techniques jointly influence how well we can understand an explanation. For instance, tree-based representations are known to be an interpretable technique for representing explanations, but when a high volume of information is involved in the tree (a very deep tree), it might not be interpretable anymore because understanding such a deep tree is challenging by itself.

As another example, the explanation associated with recognizing a pedestrian in an image can be based on a visual correlation of pixels or on recognizable parts such as hands, head, or legs. The second would be more interpretable because it takes advantage of conceptual (abstract) terms that can be understood and consciously related to the category of "pedestrian", while the first is purely at the perceptual (pre-conceptual) level, i.e., it is not at the level at which we think, and so an explanation at this level may not be interpretable. Of course, the judgment in the second example would arguably depend on the person (a final user or a developer) who needs the explanation.

Therefore, we can expect to have sets of interpretable explanations (the internal ellipses inside the explanation ellipse, which combine representational techniques and levels of complexity) for different people.

Definition 2: An interpretable explanation is a sweet spot in the explanation set in which the right volume of information is represented by the right technique for the right audience.

Definition 2 ensures that the explanation offered will be interpretable. The lack of a concrete boundary for interpretability has also been expressed by Lipton [28], who argues that interpretability is not a monolithic concept.

2.2. ML models

Explanation for any natural phenomenon can be requested at two different levels: the domain level and the model level [38]. Domain-level explanation deals with providing the true causal relationship between inputs and outputs. For example, it is not feasible to assign the exact relationship of each pixel in an image to the existence of a pedestrian, because there is no clear relationship between a pixel and a pedestrian. Achieving such a degree of explanation requires comprehensive information about the relationships of inputs, outputs, and their probabilities in advance. Such information is hard to obtain unless the domain is crafted with all relationships known.

Model-level explanation aims at shedding light on models built on top of domains. These models are more abstract and are optimized to capture the most important causal relationships of the domain; therefore, the explanation for these models involves less complexity. Since ML models are built to represent complex domains, model-level explanation is applicable to them.

For further discussion of this subject, we adopt the following definitions for ML models. Please note that there are various interpretations of these definitions, but we believe the following ones are the most accepted in ML domains and the most appropriate for our purposes.

• White box models are models whose internal working parts are transparent (easy to interpret). This means the properties of the modeled function are known to us. For example, decision trees (of reasonable depth), nearest neighbor-based methods, and decision sets [24] are white box models due to their self-explanatory characteristics.

• Black box models are models whose internal working parts are opaque (complex, hard to interpret). This means the relevant properties of the modeled function are not known to us (partially because of the nonlinearity in these models). For example, DNNs are black box models because we do not know why particular weights are used by the models or what their interpretations are. Other examples are random forests and Support Vector Machines (SVMs).

• Gray box models are models that are still black box, but for which there are ways to express the inner working parts. This means they are a mixture of black and white box models. For example, some modifications of SVMs and random forests can make them less opaque [17].
Black box and gray box models are the major candidates for requesting explanation, because we do not know whether these models have learned the correct attributes for resolving problems (for doing inference). Therefore, through explanation, we hope to shed light on this opaque part. We can also think of explanation for white box models. If white box models such as decision trees become large, it will not be feasible for humans to interpret them. In that case, explanation might be needed to bring them into a simpler form for better interpretation.

2.3. Explanation levels

DNNs (e.g., ResNets [18] have over 200 layers) have billions of parameters, which make them as complex as the entire domain they try to model (discussed in Section 2.2). In other words, the abstract form of a domain built by a DNN is still so large that it becomes infeasible to explain it in an interpretable manner, due to limitations in human cognitive capacity and in the available representational techniques. Therefore, it is logical to have a sub-level of model explanation which is possible to understand and represent. We can take a different perspective on this by noting that sometimes we are not interested in the entire behavior of a model and may want an explanation only for a specific behavior. For example, we might just want to know how a perception system recognizes a pedestrian in a specific image, or why a specific transaction is considered fraudulent by a DNN. Therefore, we distinguish the following levels of explanation for DNNs.

• Model-level or class-level explanation is an explanation of the entire model's behavior with respect to a specific output (class), with no assumption on the input. ML models are generally trained for several classes, and therefore the class is required to obtain the explanation for the entire model [24, 54, 43]. If we request explanation for the entire model without specifying the class, it becomes explanation over all classes. Thus, we consider model-level and class-level explanation to be the same in this article. This level of explanation is complex because it includes extensive rules about the domain. It is also an open question how such deep knowledge of a class should be explained in an interpretable format.

• Instance-level explanation is an explanation of a model's behavior with respect to a given input and output. This emerges from the way ML models are utilized in real scenarios: an input is given to the model and an inference is performed on it. This level is more aligned with the way people communicate with models and involves more meaningful explanation [37, 43, 59]. It is also less complex because it does not cover the entire behavior of the model for a class. Furthermore, since there is a defined input for this level of explanation, explanation approaches can use the input as a medium to represent the explanation to users. For instance, an instance-level explanation for a perception system requires an image and the class of an object existing in the image (e.g., car, vehicle, pedestrian). The explanation can then be represented by highlighting important features in the image.

The explanation obtained via instance-level explanation may not be valid for another instance, because the explanation has been conditioned on that specific instance [36]. Furthermore, instance-level explanation will not completely disclose the whole behavior learned by the model, but it may still carry information to interpret for better analysis and debugging. Instance-level explanation is the major focus of the approaches covered in this article.

2.4. Need more than interpretability

When humans provide explanations for their actions, we may understand them, but sometimes those are not the real reasons for the actions, or they do not include all the reasons. This may happen because people might pretend that they provided a true and complete explanation.

The same perspective holds for explanation approaches for ML models. Being able to obtain explanations for ML models and interpret them is essential, but we need to make sure those approaches accurately explain the functions learned by the models. In other words, the explanation must be based on the correct cause-and-effect relationships existing in the models. This aspect is expressed under the concept of faithfulness or fidelity of explanation. In order to have a faithful explanation, all relationships existing in the model must be explained. However, this can be challenging, because a faithful explanation for a DNN can be so large that it becomes uninterpretable. Thus, a balance between faithfulness and interpretability must be considered in such complex models.

With respect to the levels of explanation discussed in Section 2.3, there should be corresponding levels of fidelity. This includes the fidelity of explanation for instances and for classes (models). Model-level (global) fidelity is about explaining how the model behaves in the vicinity of the input domain (distribution), while instance-level (local) fidelity is about obtaining an explanation for the vicinity of the instance being inferred (predicted) [37]. The concept of vicinity is used to make the problem of explanation tractable and interpretable. Furthermore, local fidelity does not ensure global fidelity, while the converse does. More discussion of how fidelity can be evaluated, along with other factors, is given in Section 4.
2.5. Categorization of explanation approaches

Explanation approaches can be roughly categorized based on a) when the explanation is obtained and b) what assumptions the explanation approaches make about the models. The former divides them into post-hoc and joint approaches, while the latter divides them into model-agnostic and model-dependent approaches. An approach can belong to both categorizations at the same time.

2.5.1 Post-hoc approaches

In this category, explanation approaches first observe the output of the model and then begin to search for an explanation. These approaches generally require extra processing time and may not be applicable to time-critical applications. The approaches might not be able to find the explanation a) if they adopt a wrong method for searching or optimization, b) if the search domain is so large that it is not possible to explore all possible explanations, or c) if information involved in the inference process is not reachable (or has vanished). These approaches are developed or trained independently of the models and do not interfere with the models' training and inference processes.

2.5.2 Joint approaches

In this category, explanation approaches are part of the model itself. This means there is no need to perform post-processing to search for an explanation: once the output of the model is ready, the explanation is also ready. In order to have this capability, a) the model should be interpretable by itself, or b) another model has to be trained or developed along with the model. The latter case intervenes heavily in the architecture and training process and may or may not require extra processing time. It might even make the training process longer or lead to a more complicated architecture. Therefore, such approaches cannot easily be reused for different architectures. However, better accuracy of explanation is expected from these approaches.

2.5.3 Model-agnostic approaches

Model-agnostic explanation approaches fall into a category in which there is no need to access the internal representations of models or to make assumptions about their architecture in order to explain their behavior. These requirements are mostly ones that cannot be satisfied for off-the-shelf components. The main idea behind these approaches is that it should be possible to explain the behavior of the model (locally, in the vicinity of a given input) via its inputs and outputs. Therefore, these approaches can technically be applied to any ML model. This makes it easier to compare different approaches and to reuse approaches even though the underlying models (architectures) might change.

Most approaches in this category build explainable models, such as sparse linear models or decision trees, on top of the original models and use those explainable models to explain their behavior. Although linear models may not fully represent the original models (e.g., they may become as complex as the original model), they can at least establish valuable insights about the models. Furthermore, these approaches are not time efficient, because another model has to be trained and used for explanation. These approaches also fall into the category of post-hoc explanation approaches.

2.5.4 Model-dependent approaches

Model-dependent approaches assume that they have full access to the input, output, and internal representations of models (and, of course, know the architecture). This capability allows utilizing various exploration techniques to infer the behavior of models. These approaches are theoretically more precise than model-agnostic ones due to their flexibility in attaining the required information. However, they can only be applied to particular architectures; otherwise, minor or major modifications may be required. They cannot be applied to off-the-shelf components for which only inputs and outputs are accessible. Model-dependent approaches can be designed in a post-hoc or joint manner.

3. Explanation and visualization approaches

In this section, we study diverse categories of approaches proposed to explain (or visualize) deep ML models. Each approach is briefly explained and its pros and cons are discussed. We also explore the key factors related to the approaches, such as the supported architectures, the methods used to evaluate each approach, and, last but not least, the connections between each approach and the others. We believe such a structured explanation of existing approaches should help the final users of these approaches to choose the right approach for their problem. Moreover, it is expected to give concise information to researchers in this area. Please consider that this section is by no means an exhaustive list, but it gives an indication of the many approaches.

3.1. Based on perturbation

3.2. Based on gradients

Based on global average pooling

3.3. Based on probabilistic graphical models

Based on message passing
[Figure: A taxonomy of the covered approaches, including Sensitivity Analysis, perturbation-based approaches, Simple Taylor Decomposition, gradient-based approaches, deconvnet, CAM, guided backpropagation, differential analysis, Integrated Gradients, marginal winning probability (MWP), Grad-CAM++, DeepLIFT, LIME, CENs, and SHAP.]

One of the early works in the direction of providing explanations for individual instances fed to classifiers via marginalization was conducted by Robnik-Šikonja et al. [38]. The approach can be applied to any model that outputs class probabilities. The explanation is represented by the amount of contribution each attribute of an input has to the output of the model. This is achieved by calculating the information difference metric, but other metrics such as the difference of probabilities and the weight of evidence can be applied in the same manner. The variations in the explanations (or contribution values) produced by the different metrics lie only in the contribution scales; the effects (positive, no, and negative signs) stay the same for all metrics. It is also possible to obtain an explanation at the model level by averaging all explanations for the training data with respect to a specific class. This shows the importance of each attribute and its value. Since marginal effects have to be calculated for each attribute, the approach is computationally expensive.
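As a rough illustration of this marginalization idea, the sketch below (our own simplification, not the authors' implementation) scores each attribute by the difference between the model's class probability on the full instance and the probability obtained when that attribute is marginalized out over values seen in a background dataset; `predict_proba` and the background data are placeholders for whatever model and data are at hand.

```python
import numpy as np

def attribute_contributions(predict_proba, x, background, target_class):
    """Prediction-difference style contributions for one instance.

    predict_proba : callable mapping a batch of instances to class probabilities
    x             : 1-D array, the instance to explain
    background    : 2-D array of instances used to marginalize attributes
    target_class  : index of the class whose probability is explained
    """
    p_full = predict_proba(x[None, :])[0, target_class]
    contributions = np.zeros(len(x))
    for i in range(len(x)):
        # Replace attribute i with values observed in the background data
        # and average the resulting class probabilities (marginalization).
        perturbed = np.repeat(x[None, :], len(background), axis=0)
        perturbed[:, i] = background[:, i]
        p_marginal = predict_proba(perturbed)[:, target_class].mean()
        # Positive value: the attribute supports the class; negative: it opposes it.
        contributions[i] = p_full - p_marginal
    return contributions
```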
To represent the explanation, a visualization technique (tool) called explainVis is used, which is capable of showing the positive and negative effects of attributes on individual instances along with the average effect on the model. The visualization technique is more suitable for domains with few attributes, such as small medical prediction models, but it is not suitable for high-dimensional domains such as vision. Furthermore, this approach is not capable of providing explanations for cases in which more than one attribute needs to be changed at once to see the effect on the output of the model. This shortcoming was tackled by Štrumbelj and Kononenko [45], in which perturbation of the input is involved during the calculation of the information difference so as to cover interactions and redundancies between attributes.

The approach has been evaluated by calculating the Euclidean distance over the attributes with respect to the ground truth that comes from a model built on a decision tree. The final performance is calculated as the average over all test instances. The results reveal that a model with a better explanation (lower average distance) tends to have higher accuracy and a larger area under the ROC curve (AUC). Furthermore, different models built on the same data have different explanations (although overlaps exist). The approach is model-agnostic and post-hoc, and we call it "decomposition of the prediction" for future reference.

Zeiler and Fergus [56] proposed an approach based on a multi-layer deconvolutional network (deconvnet) and occlusion sensitivity to explain the behavior of Convolutional Neural Networks (CNNs). A deconvnet is built and trained along with a CNN and used to project the feature maps back to the input space. Indeed, through the deconvnet it becomes feasible to visualize intermediate feature layers (activities). In order to explain the behavior of the network at the instance level, the instance is perturbed by occluding different portions of the input (image) and finding the top feature map (the strongest response in the unoccluded image) at the top layer. Then, that feature map is projected back via the deconvnet to the input space as the explanation (a saliency map in pixel space). Due to the occlusion process, the approach is computationally demanding.
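The occlusion part of this idea can be sketched independently of the deconvnet. The toy code below (an illustrative sketch, not the implementation of [56]) slides a gray patch over the image and records how much the class probability drops; `predict_proba` is again a placeholder for the model under study.

```python
import numpy as np

def occlusion_sensitivity(predict_proba, image, target_class,
                          patch=16, stride=8, fill=0.5):
    """Map of class-probability drops caused by occluding image regions.

    predict_proba : callable mapping images of shape (N, H, W, C) to class probabilities
    image         : array of shape (H, W, C), values in [0, 1]
    """
    h, w, _ = image.shape
    baseline = predict_proba(image[None])[0, target_class]
    heatmap = np.zeros((h, w))
    counts = np.zeros((h, w))
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch, :] = fill  # gray patch
            score = predict_proba(occluded[None])[0, target_class]
            # A large drop means the occluded region was important for the class.
            heatmap[top:top + patch, left:left + patch] += baseline - score
            counts[top:top + patch, left:left + patch] += 1
    return heatmap / np.maximum(counts, 1)
```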
One of the challenges of this approach is that the deconvnet used for explanation needs to be designed carefully along with the main architecture. It also requires modifications to the main architecture, such as switches, to save the locations of the maxima within each pooling region. This requirement means the deconvnet has to be changed continuously as the main architecture changes, which makes the development process complicated. However, the extra model helps to boost the performance of obtaining explanations. Furthermore, this approach can only be applied to CNNs and their convolutional layers (not to fully connected layers), which limits its capabilities.

The approach does not work well on networks that have no max-pooling layers [44]. Moreover, there is no specific meaning for the assigned pixel values, except that they should provide a visually interpretable structure. The results of applying this approach revealed that it is possible to discover which patterns from the training set activate the feature maps. For example, by visualizing the first convolutional layer of the network proposed by Krizhevsky et al. [21], the authors could find the right filter size for the first layer and improve the performance of the network. This somehow demonstrates that obtaining explanations can indeed be beneficial for developers as well. This approach is model-dependent and post-hoc and is referred to as deconvnet for future reference.

Although the deconvnet [56] can be used to visualize CNNs, it cannot be applied to fully connected layers, which leads to incomplete explanations. Furthermore, the deconvnet still needs perturbation to obtain an explanation. In order to tackle this issue, Zhou et al. [58] took advantage of the global average pooling (GAP) layer [27] to find a better explanation for a given input. In this approach, the fully connected layers are replaced with convolutional layers plus a linear model on top. This means the approach still cannot be applied to a network with fully connected layers unless those layers are replaced with convolutional ones. More concretely, this approach can only be applied to the penultimate (pre-softmax) layers of a CNN. There are similar approaches in which global max pooling [33] and log-sum-exp pooling [35] have been used instead of GAP.

The importance of image regions can be obtained by projecting back (upsampling to the size of the input) the weights of the output layer onto the convolutional feature maps; the result is called a Class Activation Map (CAM). Since each feature map depicts the existence of a specific visual pattern in the input, CAM is simply a weighted linear sum of those visual patterns at different spatial locations. This characteristic means there is no need to sweep any information back through the entire stack of layers to obtain the explanation, and so the approach is fast in comparison to gradient-based approaches such as optimization-gradient [43] (described later). However, the linear classifiers (models) after the penultimate layer have to be trained for each class. The approach has been evaluated via weakly-supervised localization tasks (including pattern discovery, text detection, and concept localization). It has also been verified that the GAP used for explanation does not adversely impact classification accuracy, which might be a concern for the model itself. The CAM approach is model-dependent and post-hoc.
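In the notation commonly used for CAM, with $f_k(x, y)$ the activation of feature map $k$ at spatial location $(x, y)$ and $w_k^c$ the weight connecting the global average pool of feature map $k$ to class $c$, the class activation map described above is the weighted sum

\[
M_c(x, y) = \sum_k w_k^c \, f_k(x, y),
\]

which is then upsampled to the input resolution; the class score itself is $S_c = \sum_k w_k^c \sum_{x,y} f_k(x, y)$.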
Using the gradients of a model with respect to the input features at the prediction point is a clean and straightforward approach to explaining the importance of features. This approach was proposed by Baehrens et al. [5] and demonstrated on a Bayes classifier, but it could also be applied to an SVM if a differentiable model could be trained on top of the model to be explained. Although the approach has issues when the gradients are zero or the output of the model is flat, it can provide an explanation in a single backward pass through the model. The approach has been evaluated qualitatively by applying it to different datasets. This approach is model-dependent and post-hoc and will be referred to as gradient vector.

Simonyan et al. [43] proposed an approach based on gradients [5] to visualize deep CNNs. The approach supports both instance-level and class-level visualizations. In both cases, a saliency map is used for visualization. The class-level visualization is achieved by solving an optimization problem in which the goal is to obtain a high score (coming out of the network) for a given class (optimization with respect to the input image). The optimization is initialized with a zero image. Although such a visualization is barely interpretable, it can reveal patterns that matter for the class.

In order to approximate the weight of each pixel for instance-level explanation, they use a first-order Taylor expansion. This means computing the partial derivatives of the output of the model with respect to each pixel; the magnitude of the derivative reveals the weight. To obtain a better explanation, Simonyan et al. [43] cropped each instance 10 times, obtained the saliency map for each sub-image, and finally averaged them. The performance of the approach is good for instance-level explanation because only a single back-propagation pass is needed, but it is time-consuming for the model-level one.

The saliency map generated by the approach has the potential to be used for object segmentation, which shows that the visualization targets important parts of the input space. There is also a connection between the gradient-based and deconvnet [56] visualizations: the gradient-based visualization technique generalizes the deconvnet reconstruction procedure. The difference lies in the handling of the nonlinearity at rectified linear units (ReLUs). Furthermore, the gradient-based approach can be applied to the fully connected layers of CNNs.

The weakly-supervised localization task was performed on ILSVRC-2013 for evaluation purposes, and the approach achieved a 46.4% top-5 error on the ILSVRC-2013 test set. This demonstrates that the explanation obtained from the gradients includes discriminative information related to the class. The approach is model-dependent and post-hoc and is referred to as optimization-gradient for future reference.
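A minimal PyTorch sketch of the instance-level variant (assuming a classifier `model` that maps a batch of images to class scores) looks as follows; the saliency value of a pixel is the maximum absolute gradient over the color channels, as in [43].

```python
import torch

def gradient_saliency(model, image, target_class):
    """Instance-level saliency map from input gradients.

    model        : a differentiable classifier returning raw class scores
    image        : tensor of shape (1, C, H, W)
    target_class : index of the class to explain
    """
    model.eval()
    image = image.clone().requires_grad_(True)
    score = model(image)[0, target_class]
    score.backward()                      # single backward pass
    # Per-pixel weight: max of |gradient| over the channel dimension.
    return image.grad.detach().abs().max(dim=1)[0]
```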
Layer-Wise Relevance Propagation (LRP) is a general approach proposed by Bach et al. [4] to explain the behavior of multilayered neural networks and Bag-of-Words (BoW) models. In this approach, it is assumed that a model can be decomposed into several layers of computation. These layers, for instance, can be part of the feature extraction process. Each message among the units of these layers has a relevance score that is used to calculate the relevance score of each unit in a backward pass. Indeed, the relevance score of each message is its proportional contribution to the final output of the next unit in the feedforward pass. The backpropagation of relevances starts from the output layer, whose relevance is the output of the model itself. Once the relevance scores are computed for the units of the input layer, they become the explanation in the form of the contribution of each unit to the final decision of the model.
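As a concrete example of such a propagation rule, the commonly used $\epsilon$-rule (given here as a sketch in the usual LRP notation, not necessarily the exact variant of [4]) redistributes the relevance $R_j$ of a unit $j$ to its inputs $i$ in proportion to their contributions $z_{ij} = a_i w_{ij}$:

\[
R_i = \sum_j \frac{z_{ij}}{\sum_{i'} z_{i'j} + \epsilon \,\mathrm{sign}\!\Big(\sum_{i'} z_{i'j}\Big)} \, R_j ,
\]

where the small $\epsilon$ stabilizes the denominator; summing over the units of a layer approximately conserves the total relevance.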
A Taylor-type decomposition has also been proposed as an approximation to the general concept of LRP. It is different from the gradient vector [5] because the partial derivatives are calculated at a root point. A root point is an input at which the model has maximum uncertainty about its output (e.g., classification probability). Using a root point makes the explanation more accurate. The gradient vector calculates the partial derivatives at the prediction point, which is the nearest local optimum. This local optimum might have the same sign as the model output and is therefore not a good reference point for obtaining an explanation.

One issue with the simple Taylor decomposition is that finding a root point in the vicinity of the input can be time-consuming. It is also not necessarily solvable, because of the non-convexity of the optimization problem. The reference points can jump to a different mode (have different values) for two nearly equivalent data points with nearly equivalent predictions, which results in two different explanations. This means that if we consider explanation as a function, it will not be continuous.
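For reference, the simple Taylor decomposition described above expands the model output $f$ around a root point $\tilde{x}$ with $f(\tilde{x}) \approx 0$ and assigns each input dimension its first-order term as relevance (our summary of the standard formulation):

\[
f(x) \approx f(\tilde{x}) + \sum_i \frac{\partial f}{\partial x_i}\Big|_{x=\tilde{x}} (x_i - \tilde{x}_i)
\quad\Rightarrow\quad
R_i = \frac{\partial f}{\partial x_i}\Big|_{x=\tilde{x}} (x_i - \tilde{x}_i).
\]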
The direct implementation of LRP is different from the Taylor-series and gradient approaches in the sense that there is no need to calculate a root point or gradients. However, Shrikumar et al. [41] demonstrated that if all activations are piecewise linear, LRP reduces to gradient*input. LRP is also different from the deconvnet [56], because LRP uses the signed activations of the neurons from the layer below for weighting the relevance of the layer above, whereas the deconvnet uses rectified linear units to project information back to the inputs and to ensure that the feature maps are non-negative. LRP has been evaluated qualitatively by applying it to low- and high-dimensional datasets. Although this approach is general, it is computationally demanding and has issues with numerical stability. LRP is a model-dependent and post-hoc approach.

A more formally defined approach based on LRP and the Taylor series expansion, called Deep Taylor Decomposition (DTD), was proposed by Montavon et al. [32]. The execution process in DTD is almost equal to that of LRP, and it uses the Taylor expansion to calculate the relevance scores for the layers. Therefore, it estimates a root point for each unit while calculating the relevance scores. In order to find root points, DTD proposes two methods based on constraining the input domain. DTD also provides a technique called the relevance model to calculate the relevance scores for models (or parts of models) in which there is no well-defined connection between the units of the layers (e.g., CNNs). DTD is a model-dependent and post-hoc approach.

Zintgraf et al. [59] improved the explanation approach proposed by Robnik-Šikonja et al. [38] to visualize deep CNNs. The approach takes advantage of conditional sampling and multivariate analysis to approximate the weight (contribution) relevance of the features (i.e., attributes) in a given input. The representational technique utilized for explanation is a saliency map (red and blue colors for positive and negative effects, respectively). Since the approach is based on perturbation and optimization, it is expensive time-wise.

The approach has been evaluated by comparing its results with the ground truth. It has also been compared qualitatively with the results of the optimization-gradient approach proposed by Simonyan et al. [43], using randomly selected images from the ImageNet dataset and a medical imaging dataset over several pre-trained architectures. An interesting result of the evaluation is that changes in individual pixels can have a measurable effect on the class scores of the model, although the authors expected not to observe this (they expected to see the effect more at the level of patches, given the nature of convolution). This finding matches the concept of adversarial examples [48].

The results also reveal that the explanations of the output layer and of the penultimate layers of networks are different. The first gives a better description of unique cases (different breeds of dogs), while the latter is more about general explanation (just dogs). With respect to performance, the approach is computationally demanding and requires several minutes to compute the explanation for a given input. The approach is model-dependent and post-hoc and is referred to as prediction difference analysis for future reference.

Local Interpretable Model-agnostic Explanations (LIME) is the name of an approach proposed by Ribeiro et al. [37] to obtain the behavior of any classification model. The approach provides instance-level explanation by building a local interpretable model in the vicinity of a given instance. In order to achieve this, the input is transformed into an interpretable format, such as a binary vector, and then the vector is perturbed based on a locality measure. The data collected via the perturbation is used to train the interpretable model. Finally, the explanation given by the vector is projected back to the input space to represent the explanation.

LIME also allows explaining the behavior of the model through the explanation of multiple, judiciously selected instances. The instances, selected by solving an optimization problem, are diverse and do not have redundant explanations, so they can be good candidates to explore and understand how the model behaves globally. LIME needs to build a model for each explanation it offers, and so it is computationally expensive. The approach is also able to include pros (positive) and cons (negative) facts in its explanation. LIME has been evaluated by obtaining explanations for models that are interpretable by themselves and then calculating recall based on providing the correct explanation. It has also been evaluated with several user-involved methods so as to check whether or not the explanation is helpful for final users. LIME is a model-agnostic and post-hoc approach.
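The core of LIME can be sketched in a few lines: perturb the interpretable (binary) representation, weight each perturbation by its proximity to the original instance, and fit a weighted sparse linear surrogate. The sketch below is a simplified illustration under these assumptions, not the reference implementation; `predict_fn` maps reconstructed inputs to class probabilities and `reconstruct` maps a binary vector back to the input space.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_explain(predict_fn, reconstruct, n_features, target_class,
                 n_samples=1000, kernel_width=0.25):
    """Weights of a local linear surrogate for one instance."""
    rng = np.random.default_rng(0)
    # 1. Perturb the binary interpretable representation (1 = feature kept).
    z = rng.integers(0, 2, size=(n_samples, n_features))
    z[0] = 1                                   # include the original instance
    # 2. Query the black-box model on the reconstructed inputs.
    y = np.array([predict_fn(reconstruct(row))[target_class] for row in z])
    # 3. Weight samples by proximity (exponential kernel on the fraction removed).
    distance = 1.0 - z.mean(axis=1)
    weights = np.exp(-(distance ** 2) / kernel_width ** 2)
    # 4. Fit a weighted, regularized linear model; its coefficients are the explanation.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(z, y, sample_weight=weights)
    return surrogate.coef_
```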
Springenberg et al. [44] proposed an approach for the visualization of feature maps which is a combination of the optimization-gradient [43] and deconvnet [56] approaches. In this approach, instead of filtering out values based on the negative entries coming from the top gradient (deconvnet) or from the bottom data (backpropagation), the values are filtered out when at least one of these values is negative. The approach is called guided backpropagation because it provides additional signals from higher layers to normal backpropagation. The explanation produced by this approach, which is a pixel-space gradient visualization, is much sharper than those of the optimization-gradient and deconvnet approaches. However, Selvaraju et al. [39] indicated in their evaluation that the deconvnet is more class-discriminative than guided backpropagation, although guided backpropagation is more aesthetically pleasing.

This approach can be applied, like gradient-based approaches, to any layer of a DNN and does not need the switches required by the deconvnet to provide an explanation. The authors compared their results against the gradient and deconvolutional approaches through several examples, which shows the superiority of their work. A more detailed discussion of the gradient-based, guided-backpropagation, and deconvnet approaches can be found in [31]. Guided backpropagation is a model-dependent and post-hoc approach.

Zhang et al. [57] proposed an approach based on the selective tuning attention model [53] for the explanation of CNNs. In the selective tuning model, there is one bottom-up sweep through the network to process the input stimuli and then one top-down Winner-Take-All (WTA) process to localize the most relevant neurons. The output of WTA is a binary map, but Zhang et al. extended it to a probabilistic WTA, called the marginal winning probability (MWP). MWP is calculated through a backpropagation process called excitation backprop. In this approach, there is no need to do the full backpropagation to obtain the discriminative features for explanation purposes. The interpretable attention map can be obtained by excitation backprop at the intermediate convolutional layers. This avoids a backward sweep all the way to the input level, which reduces the computational demand.

The approach also uses the concept of contrastive top-down attention, named c-MWP, in which the differential effect between a pair of contrastive top-down signals is calculated. This significantly improves the discriminativeness of the generated attention map. Furthermore, it is possible to obtain a visualization for the decision of a model whether or not the decision is correct, which is a great benefit of this approach. Excitation backprop is equal to the CAM approach [58] if it is only applied to the feature maps before the GAP layer. However, c-MWP is more general, and it can work with models that do not use a GAP layer. The accuracy of weakly-supervised localization tasks has been used to evaluate the performance of the approach against others such as CAM [58], LRP [4], deconvnet [56], and optimization-gradient [43]. Furthermore, the approach demonstrated empirically a good performance in the phrase-to-region task of a tag classifier. This approach is model-dependent and post-hoc and is referred to as excitation backprop for future reference.

Selvaraju et al. [39] proposed an approach, called Gradient-weighted Class Activation Mapping (Grad-CAM), based on the CAM approach [58], that can be applied to CNNs with fully connected layers. The approach simply calculates the gradients (of the penultimate-layer score) with respect to the feature maps in the last convolutional layer in order to obtain the weights of the global average pooling neurons. The rest of the process to obtain the explanation is the same as the one described for the CAM approach. Although this approach is highly class-discriminative, it is not high-resolution (it cannot show fine-grained detail). Therefore, it has been combined with guided backpropagation [44] to obtain a class-discriminative and high-resolution visualization. It could be combined with the deconvnet approach as well, but the result would be much noisier (with artifacts in the output).

This approach is faster than excitation backprop [57] due to using only part of the gradient and GAP. It is also capable of being applied to problems other than classification, such as models used for structured outputs (e.g., captioning), multi-modal inputs (e.g., visual question answering), or reinforcement learning. The approach is capable of showing negative influences, but this has not been demonstrated empirically in the original paper. The weakly-supervised localization task, human verdict, faithfulness, and several other experiments related to image captioning and visual question answering have been used to evaluate the approach. Grad-CAM is a model-dependent and post-hoc approach.
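In the usual Grad-CAM notation, with $A^k$ the $k$-th feature map of the last convolutional layer and $y^c$ the score for class $c$, the weights described above and the resulting localization map are

\[
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A_{ij}^{k}},
\qquad
L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\Big(\sum_k \alpha_k^c A^k\Big),
\]

where $Z$ is the number of spatial locations; when the network ends in GAP plus a linear classifier, $\alpha_k^c$ reduces to the CAM weight $w_k^c$.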
Chattopadhay et al. [13] improved the Grad-CAM approach [39] with a method called Grad-CAM++, which weighs the derivatives of the feature maps. This allows better explanation (and localization) for inputs that contain single or multiple instances of a class in a given image. In this approach, it is necessary to use a smooth function (e.g., an exponential function) after the penultimate layer, while in Grad-CAM the penultimate-layer representation is used directly as the score to obtain gradients.

Lengerich et al. [25] implemented an approach based on the information flowing in the network to explain the behavior of neural networks. The approach provides instance-level explanation and performs it via perturbation. It first perturbs an instance n times (50 times in the paper) and then calculates two metrics based on these perturbed inputs, named activation-output correlation and activation precision. These metrics allow a unique understanding of the internal units of a trained model. For example, Lengerich et al. discovered via these metrics that the intermediate layers of VGG16 (layers 5-7) include more precise class-discriminative features than the others. Afterward, the neurons in each layer are ranked based on these metrics, and then input-related patches are obtained via a deconvolution process.

Evaluation of the approach was performed on the INRIA person dataset with a focus on the weakly-supervised localization task. An interesting result of Lengerich et al.'s research is that it is possible to discover the causal features of the inputs and then train a model on top of those features. This new model can have an accuracy close to that of the original model trained on the raw data. This approach is model-dependent and post-hoc and is referred to as activation correlation for future reference.

The Contextual Explanation Network (CEN) is an approach proposed by Al-Shedivat et al. [1, 2] that allows explanations to be generated along with the prediction. This means there is no need to do post-processing on the prediction to obtain an explanation. Therefore, there is no computational overhead for the explanation of the normal behavior of the model. The approach uses black box models such as neural networks to obtain the parameters of a linear model used for explanation. The explanation is then used along with other input features to perform the prediction. The interesting result is that this combination does not limit the capacity of classification and indeed helps the model to converge faster.

The explanations produced by CENs are provably equivalent to those generated by post-hoc approaches under certain conditions, but there are cases where post-hoc explanations are misleading. There are limitations associated with this approach, including 1) it is not possible to interpret the process of conditioning on the context used for explanation, and 2) all explanations obtained by the model come from the same graphical structure and parameterization (a simple sparse dictionary), which limits the explanatory power of the approach. The CENs approach is model-dependent and joint.

The approach called DeepLIFT was proposed by Shrikumar et al. [40] to explain the prediction of a model without the issues of gradient saturation (when the gradient is zero or inferred as zero) and discontinuities (because of biases) existing in gradient-based approaches. Although DeepLIFT is a gradient-based approach, it uses finite differences instead of infinitesimal ones for calculating the partial derivatives, which helps to avoid the aforementioned issues. Therefore, it is necessary to define a reference point for each input; for example, an all-zero input can be a reference point for an image. However, finding such a reference point might not be straightforward for some input domains. The execution performance of this approach is like that of other gradient-based approaches. DeepLIFT is a model-dependent and post-hoc approach.

A unified approach called SHapley Additive exPlanations (SHAP) has been proposed by Lundberg and Lee [30, 29] for additive feature attribution approaches. Additive feature attribution approaches have an explanation model that is a linear function of binary variables, which makes it interpretable (e.g., LIME [37], DeepLIFT [40]). The issue with additive feature attribution approaches is that there is no guarantee of a unique solution (explanation). Therefore, SHAP values are defined as a unified measure of feature importance that the various methods try to approximate. With such values, there is convergence to a unique explanation. The SHAP value estimation approaches are better aligned with human intuition (as measured by user studies) and can also more effectively discriminate among classes of models. This approach can be considered model-agnostic or model-dependent depending on whether the Shapley values are used in a model-agnostic or model-dependent approach, respectively. It is also a post-hoc approach.
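The quantity these methods approximate is the classical Shapley value of a feature $i$ with respect to a set function $f_x$ that evaluates the model on feature subsets (a reminder of the standard definition used by SHAP):

\[
\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!}\,\big[f_x(S \cup \{i\}) - f_x(S)\big],
\]

where $F$ is the set of all features and $f_x(S)$ is the expected model output when only the features in $S$ are known.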
3.4. Model-agnostic

The characteristics of these approaches are as follows:

• There is no specific restriction on the types of models that can be used.

• No assumption is made about the architecture, number of layers, or other parameters used for building and training an ML model.

3.5. Integrated Gradients

Sundararajan et al. [46] proposed an approach called Integrated Gradients to obtain an explanation for the decision of a model. Although the approach is based on gradients, which have been used in several other approaches, it tries to satisfy two axioms, called sensitivity and implementation invariance, that are considered critical for any explanation approach. The approach has been applied to image, text, and chemistry networks, and it is capable of specifying positive and negative contributions. The same issue exists as for the simple Taylor decomposition, in that a reference point needs to be identified [46]. Its complexity is on the order of the number of edges (connections) [32].
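Integrated Gradients attributes feature $i$ by accumulating gradients along the straight path from a baseline $x'$ to the input $x$: $\mathrm{IG}_i(x) = (x_i - x'_i)\int_0^1 \partial F\big(x' + \alpha(x - x')\big)/\partial x_i \, d\alpha$. A small PyTorch sketch of the usual Riemann-sum approximation (our illustration, with `model` and the baseline as assumptions) is:

```python
import torch

def integrated_gradients(model, x, baseline, target_class, steps=50):
    """Riemann-sum approximation of Integrated Gradients.

    model    : differentiable classifier returning class scores
    x        : input tensor of shape (1, ...)
    baseline : reference input of the same shape (e.g., all zeros)
    """
    total_grad = torch.zeros_like(x)
    for k in range(1, steps + 1):
        point = baseline + (k / steps) * (x - baseline)
        point = point.clone().requires_grad_(True)
        score = model(point)[0, target_class]
        grad, = torch.autograd.grad(score, point)
        total_grad += grad
    # Average gradient along the path, scaled by the input-baseline difference.
    return (x - baseline) * total_grad / steps
```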
3.6. Deconvolution-based approaches

...

[Figure: Categorization of the covered approaches into families: perturbation or sensitivity analysis, probabilistic graphical modeling, gradient-based, global average pooling, and excitation; covering Prediction Difference Analysis, Guided Backpropagation, LIME, SHAP, DeepLIFT, Integrated Gradients, Grad-CAM, and Grad-CAM++.]
4.1.1 Occlusion of main features

Since evaluating the exact faithfulness of an explanation is hard, one solution is to use local faithfulness. In order to measure it, we can use image occlusion [56], in which the most important patches of the image, which are known to us, are masked and then the explanation is requested for the image. Since the salient parts of the input have been removed, the explanation should not include those patches (it should give them less importance). This might be hard to quantify, but if those patches are tagged with their importance, it should be possible to quantify how the explanation behaves.

Another variation is to train an interpretable model, such as a sparse logistic regression or a decision tree, on top of the most important features of the inputs and then apply the explanation approaches to these models [37]. Since valid explanations can be observed in interpretable models, the faithfulness of explanation approaches to the models can be calculated through the number of correct features covered by the explanation.

4.1.2 Increase and decrease in confidence

If the explanation can be represented in the input space (which is the case for most vision-based systems), it is possible to evaluate how much valuable information is embedded in the explanation by feeding the explanation back to the model and measuring the increase and decrease in the confidence level of the model in comparison to the original input. These two factors are expressed as the average drop in confidence level and the increase in confidence level [13].

The idea behind the drop in confidence level is that, since the explanation occludes some parts of the input in order to demonstrate the other, important features, there should be a drop in the confidence level of the model. If the drop in the confidence level is low, it means the explanation is class-discriminative; in other words, the occluded features were not that important. This metric is represented as the average percentage drop in the model's confidence level for a specific class over the entire dataset.

The increase in confidence level takes the opposite perspective. It assumes that models might learn to pay attention solely to specific patterns. In this case, if only those patterns are exposed to the model via the explanation, the confidence level of the model should increase; in other words, the model learned from the most discriminative parts. This metric is quantified as the average number of times over the entire dataset that the confidence level has increased.
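For instance, following the style of the metrics in [13], the average drop can be written as a percentage over a dataset of $N$ images, where $Y_i^c$ is the model's confidence for class $c$ on the full image $i$ and $O_i^c$ is its confidence on the explanation-masked version (this formulation is our paraphrase and may differ in detail from the original definition):

\[
\text{Average Drop} = \frac{100}{N} \sum_{i=1}^{N} \frac{\max\!\big(0,\; Y_i^c - O_i^c\big)}{Y_i^c},
\qquad
\text{Increase in Confidence} = \frac{100}{N} \sum_{i=1}^{N} \mathbb{1}\big[O_i^c > Y_i^c\big].
\]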
4.2. Weakly-supervised localization

Object localization is about predicting an object and its boundaries in an image. In supervised localization, the label of an object and its coordinates are given in the training set, and a model is trained on them to predict a label and a bounding box for an unforeseen image. In weakly-supervised localization, however, only the label of an object is given in the training set, and the goal is to predict both the label and the bounding box for an unforeseen image. The accuracy of localization is calculated via the mean intersection over union (IoU).

This metric has been repurposed to demonstrate how good an explanation method is at providing information, just from the classification output, that can be used for calculating bounding boxes. In fact, the idea is to use the class-discriminative regions proposed by explanation approaches to estimate the bounding boxes (the more class-discriminative the explanation, the better the localization). However, the value calculated for localization cannot be an accurate metric of how discriminative the explanation approach is. The reason is that calculating the bounding boxes is still performed by another algorithm (e.g., segmentation [39]), and the explanation approaches just specify the most important regions. When comparing approaches, the one with the better localization score is the one with the better explanation.
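For completeness, the IoU between a predicted bounding box $B_p$ (derived from the explanation) and the ground-truth box $B_{gt}$ is

\[
\mathrm{IoU}(B_p, B_{gt}) = \frac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|},
\]

and a localization is typically counted as correct when the IoU exceeds a threshold such as 0.5.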
12
is more class-discriminative. exploit right reasons for the right answer it might put out.
For instance, in a picture with a man and tennis racquet,
4.4. Trust in a model the model might capture correctly the existence of a man
The goal of this type of evaluation method is to observe gender, but when the explanation is explored it is revealed
if the explanation approach represents information that hu- that the man gender has been detected due to tennis racquet
mans can use them to trust the model. In this regard, two and not because of the man. These kinds of applications
models which have known levels of trust are chosen or obviously emphasizes the strong potential of explanation in
crafted. Then, samples (inputs) that have the correct clas- improving the data and architecture of current DNNs.
sification or close scores in these two models are selected.
Explanation for the samples are represented to human par- 5.2. Explaining anomalies
ticipants and they are asked to specify which model is more ML models are trained on top of a dataset (a specific
trustworthy from their own point of view. The results which distribution) to perform inferences, for example, recognize
are scores of models regarding their trustworthiness com- vehicles and pedestrians. The expectation from these mod-
pared with the ground truth to observe if explanation help els is to perform their inferences on data coming from the
human participants to find the right one (higher score de- same distribution (inliers) and avoid inferring others (out-
notes the more trustworthy mode). liers, anomalies, novelties). However, the trained model
When it comes to comparing two explanation ap- always assume that inputs will come from the trained do-
proaches, the same process is followed for both approaches. main and so this assumption causes models to misbehave.
The approach which helps to select the right model is the For example, the model trained to recognize vehicles and
winner. However, it is possible that both explanation ap- pedestrians may start recognizing trees as pedestrians and a
proaches choose the correct model. In that case, the expla- bulk of concrete as vehicles. The goal of anomaly detection
nation approach with a higher score is the winner. approaches is to detect such anomalous input and stop the
model from inferring them [11]. In this situation, the model
5. Applications would say it is not capable of classification instead of doing
Explanation of ML models discloses rich insides about a wrong classification.
patterns related to inputs and the way models interpret in- There are plenty of techniques [34, 11] proposed to de-
puts. These are generally used to make sure the model be- tect outliers in different domains and some of them have
haves correctly, but they can also be used to tackle different shown a great progress in this regard. However, when these
challenges that might not be feasible just by the explained techniques recognize an input as outlier, they ungracefully
model. In the following, we look at such applications. fail to explain why they have considered this as outlier. Ex-
planation approaches discussed in this article used to pro-
5.1. Finding biases in a dataset

The quality of a dataset has a direct impact on the generalizability of trained models, and bias in a dataset is what reduces this quality. Dataset bias arises mainly because training samples are finite and cannot cover all aspects of the domain. In computer vision datasets, for instance, biases can be due to the way pictures have been captured (capture bias) or to how classes are labeled (label bias) [52]. There are methods to deal with dataset bias, but they do not always resolve the biases and may even make them worse [51]. Therefore, it would be beneficial to have approaches that disclose (identify) biases in a dataset; this becomes critical when there is no access to the training dataset. Selvaraju et al. [39] took advantage of explanation for DNNs to detect such biases. Applying their visualization approach to a dataset crafted to classify nurses and doctors, they noticed that the model looks at the person's face and hairstyle to distinguish nurses from doctors, which illustrates that the dataset was gender biased.

Burns et al. [9] also utilized explanation in the form of heatmaps to detect biases in image captioning models. Through explanation, they showed that a model may not rely on the right reasons for the right answer it puts out. For instance, in a picture with a man and a tennis racquet, the model might correctly report the presence of a man, but the explanation reveals that the gender has been detected because of the tennis racquet and not because of the man. These applications clearly emphasize the strong potential of explanation for improving the data and architectures of current DNNs.
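As an illustration of how such an inspection can be carried out in practice, the following sketch computes a Grad-CAM-style heatmap for a pretrained torchvision ResNet-18. The model choice, the placeholder input, and the choice of the last convolutional block are assumptions made for illustration; this is a simplified stand-in for the visualization used in [39], not their exact implementation. Overlaying the resulting heatmap on the image shows whether the model focuses on faces and hairstyles rather than task-relevant evidence.

import torch
import torch.nn.functional as F
from torchvision import models

# Assumptions: torchvision >= 0.13 with pretrained ImageNet weights, and a
# preprocessed image of shape (1, 3, 224, 224); a random tensor stands in here.
model = models.resnet18(weights="IMAGENET1K_V1").eval()
x = torch.randn(1, 3, 224, 224)

activations, gradients = {}, {}

def save_activation(module, inputs, output):
    activations["value"] = output.detach()

def save_gradient(module, grad_input, grad_output):
    gradients["value"] = grad_output[0].detach()

# Hook the last convolutional block of the network.
model.layer4.register_forward_hook(save_activation)
model.layer4.register_full_backward_hook(save_gradient)

logits = model(x)
target = int(logits.argmax(dim=1))
logits[0, target].backward()

# Grad-CAM style map: channel weights are globally averaged gradients; the map
# is a ReLU of the weighted sum of activations, upsampled to the input size.
weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
# `cam` (1, 1, 224, 224) can now be overlaid on the image to see where the model looks.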
5.2. Explaining anomalies

ML models are trained on top of a dataset (a specific distribution) to perform inferences, for example, to recognize vehicles and pedestrians. The expectation is that these models perform inference on data coming from the same distribution (inliers) and avoid inferring on others (outliers, anomalies, novelties). However, a trained model always assumes that inputs come from the training domain, and this assumption causes models to misbehave. For example, a model trained to recognize vehicles and pedestrians may start recognizing trees as pedestrians and a block of concrete as a vehicle. The goal of anomaly detection approaches is to detect such anomalous inputs and stop the model from inferring on them [11]. In this situation, the model would state that it is not capable of classification instead of producing a wrong classification.

Plenty of techniques [34, 11] have been proposed to detect outliers in different domains, and some of them have shown great progress. However, when these techniques recognize an input as an outlier, they fail to explain why it has been considered an outlier. The explanation approaches discussed in this article provide reasons for the correct classification of inliers performed by the model, not for the cases in which the model should refrain from outputting a decision. In a recent work, Kauffmann et al. [19] presented an approach to provide explanations for outliers. They define different scores for inliers and outliers and then apply deep Taylor decomposition on top of the outlier score to produce an explanation. The results demonstrate that explanation can disclose reasons in the input that are invisible to human eyes; these subtle reasons might be the cause of adversarial examples [48].
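The sketch below only illustrates the general idea of attributing an outlier score back to the input features; the toy encoder, the distance-to-prototype outlier score, and the gradient-times-input attribution are assumptions made for illustration and are not the deep Taylor decomposition of [19].

import torch
import torch.nn as nn

# Toy encoder standing in for a trained feature extractor (assumption).
encoder = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 8))

# Class prototypes that would be computed from training (inlier) data; random here.
prototypes = torch.randn(3, 8)

def outlier_score(x):
    """Distance of the embedding to the nearest class prototype; a large value
    suggests the input lies outside the training distribution."""
    z = encoder(x)                       # (1, 8)
    d = torch.cdist(z, prototypes)       # (1, 3) distances to each prototype
    return d.min(dim=1).values.squeeze()

x = torch.randn(1, 10, requires_grad=True)   # candidate outlier
score = outlier_score(x)
score.backward()

# Gradient x input: which input features drive the outlier score upward.
relevance = (x.grad * x.detach()).squeeze()
print("outlier score:", float(score), "feature relevance:", relevance)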
5.3. Extracting relevant features

Models trained on data such as images, whether to detect objects or to steer a car, are expected to achieve their goals without paying attention to background and noise. This happens when there is enough variation in the training set and the classes have strong features. However, this is not always feasible in high-dimensional domains, so models may end up correlating noise with the true classes, which can hurt their performance in unforeseen cases. One way to deal with this is to make sure that, as much as possible, models are given the right features.
With such a perspective, Kim et al. [20] proposed an approach in which they use visual attention (a form of explanation) to extract relevant features (in their case, the features that matter while driving a car) and then train a steering model on top of those explanations. Their results show that 1) this does not degrade the performance of end-to-end learning, and 2) it exposes causal effects on the output of the model. Such an approach can be adapted to other areas such as image classification, where a trained model can be queried for explanations and those explanations can then be used to train a stronger and shallower model that could outperform the first one.
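A minimal sketch of this idea is given below. It is not the causal-attention driving architecture of [20]; the teacher/student networks, the gradient-based saliency mask, and the toy data are assumptions made for illustration. Explanations from a trained "teacher" are used to mask inputs before training a smaller "student" model on the masked data.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumptions: `teacher` stands in for an already-trained classifier and
# `student` for a smaller model to be trained on explanation-masked inputs.
teacher = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
student = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

def saliency_mask(x, labels):
    """Gradient-based saliency from the teacher, used as a soft input mask."""
    x = x.clone().requires_grad_(True)
    scores = teacher(x).gather(1, labels.view(-1, 1)).sum()
    grads, = torch.autograd.grad(scores, x)
    mask = grads.abs().amax(dim=1, keepdim=True)                 # (N, 1, H, W)
    mask = mask / (mask.amax(dim=(2, 3), keepdim=True) + 1e-8)   # normalize per image
    return mask.detach()

# One toy training step on random data standing in for a real batch.
x = torch.randn(4, 3, 32, 32)
y = torch.randint(0, 10, (4,))
masked_x = x * saliency_mask(x, y)          # keep only the "relevant" regions
loss = F.cross_entropy(student(masked_x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()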
5.4. Detecting adversarial samples

Tao et al. [49] exploited the interpretability of a network to build a model that can detect adversarial samples.
6. Future Directions

...

7. Conclusion

...
References

[1] M. Al-Shedivat, A. Dubey, and E. P. Xing. Contextual explanation networks. CoRR, abs/1705.10301, 2017.
[2] M. Al-Shedivat, A. Dubey, and E. P. Xing. The intriguing properties of model explanations. CoRR, abs/1801.09808, 2018.
[3] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in AI safety. arXiv:1606.06565, pages 1–29, 2016.
[4] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7):1–46, 07 2015.
[5] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller. How to explain individual classification decisions. J. Mach. Learn. Res., 11:1803–1831, Aug. 2010.
[6] BBC. Google apologises for Photos app's racist blunder. https://www.bbc.com/news/technology-33347866, 2015.
[7] O. Biran and K. McKeown. Justification narratives for individual classifications. In ICML 2014 AutoML Workshop, 2014.
[8] B. G. Buchanan and E. H. Shortliffe.
[9] K. Burns, L. A. Hendricks, T. Darrell, and A. Rohrbach. Women also snowboard: Overcoming bias in captioning models. CoRR, abs/1803.09797, 2018.
[10] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, pages 1721–1730, New York, NY, USA, 2015. ACM.
[11] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Comput. Surv., 41(3):1–58, July 2009.
[12] B. Chandrasekaran, M. C. Tanner, and J. R. Josephson. Explaining control strategies in problem solving. IEEE Expert, 4(1):9–15, Spring 1989.
[13] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 839–847, March 2018.
[14] M. W. Craven and J. W. Shavlik. Extracting tree-structured representations of trained networks. In Proceedings of the 8th International Conference on Neural Information Processing Systems, NIPS'95, pages 24–30, Cambridge, MA, USA, 1995. MIT Press.
[15] DARPA. Explainable artificial intelligence (XAI). DARPA-BAA-16-53, pages 1–52, 2016.
[16] A. A. Freitas. Comprehensible classification models: A position paper. SIGKDD Explor. Newsl., 15(1):1–10, Mar. 2014.
[17] I. Grau, D. Sengupta, M. M. G. Lorenzo, and A. Nowe. Grey-box model: An ensemble approach for addressing semi-supervised classification problems. BENELEARN 2016, 2016.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016.
[19] J. Kauffmann, K.-R. Müller, and G. Montavon. Towards explaining anomalies: A deep Taylor decomposition of one-class models. CoRR, abs/1805.06230, 2018.
[20] J. Kim and J. F. Canny. Interpretable learning for self-driving cars by visualizing causal attention. CoRR, abs/1703.10631, 2017.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, pages 1097–1105, USA, 2012. Curran Associates Inc.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2015.
[23] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander. Joint 3D proposal generation and object detection from view aggregation. CoRR, abs/1712.02294, 2017.
[24] H. Lakkaraju, S. H. Bach, and J. Leskovec. Interpretable decision sets: A joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 1675–1684, New York, NY, USA, 2016. ACM.
[25] B. J. Lengerich, S. Konam, E. P. Xing, S. Rosenthal, and M. M. Veloso. Toward visual explanations for convolutional neural networks via input resampling. CoRR, abs/1707.09641, 2017.
[26] B. Letham, C. Rudin, T. H. McCormick, and D. Madigan. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. Ann. Appl. Stat., 9(3):1350–1371, 09 2015.
[27] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.
[28] Z. C. Lipton. The mythos of model interpretability. CoRR, abs/1606.03490, 2016.
[29] S. Lundberg and S. Lee. An unexpected unity among methods for interpreting model predictions. CoRR, abs/1611.07478, 2016.
[30] S. Lundberg and S. Lee. A unified approach to interpreting model predictions. CoRR, abs/1705.07874, 2017.
[31] A. Mahendran and A. Vedaldi. Salient deconvolutional networks. In European Conference on Computer Vision, 2016.
[32] G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K.-R. Müller. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition, 65:211–222, 2017.
[33] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? - Weakly-supervised learning with convolutional neural networks. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 685–694, June 2015.
[34] M. A. F. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko. Review: A review of novelty detection. Signal Process., 99:215–249, June 2014.
[35] P. H. O. Pinheiro and R. Collobert. Weakly supervised semantic segmentation with convolutional networks. CoRR, abs/1411.6228, 2014.
[36] M. T. Ribeiro, S. Singh, and C. Guestrin. Nothing else matters: Model-agnostic explanations by identifying prediction invariance. CoRR, abs/1611.05817, 2016.
[37] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 1135–1144, New York, NY, USA, 2016. ACM.
[38] M. Robnik-Šikonja and I. Kononenko. Explaining classifications for individual instances. IEEE Transactions on Knowledge and Data Engineering, 20(5):589–600, May 2008.
[39] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 618–626, Oct 2017.
[40] A. Shrikumar, P. Greenside, and A. Kundaje. Learning important features through propagating activation differences. CoRR, abs/1704.02685, 2017.
[41] A. Shrikumar, P. Greenside, A. Shcherbina, and A. Kundaje. Not just a black box: Learning important features through propagating activation differences. CoRR, abs/1605.01713, 2016.
[42] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 2016.
[43] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.
[44] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller. Striving for simplicity: The all convolutional net. CoRR, abs/1412.6806, 2014.
[45] E. Štrumbelj and I. Kononenko. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3):647–665, Dec 2014.
[46] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. CoRR, abs/1703.01365, 2017.
[47] W. R. Swartout and J. D. Moore. Explanation in second generation expert systems. In Second Generation Expert Systems, pages 543–585. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1993.
[48] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.
[49] G. Tao, S. Ma, Y. Liu, and X. Zhang. Attacks meet interpretability: Attribute-steered detection of adversarial samples. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7727–7738. Curran Associates, Inc., 2018.
[50] The Guardian. Tesla car that crashed and killed driver was running on Autopilot, firm says. https://www.theguardian.com/technology/2018/mar/31/tesla-car-crash-autopilot-mountain-view, 2018.
[51] T. Tommasi, N. Patricia, B. Caputo, and T. Tuytelaars. A deeper look at dataset bias. CoRR, abs/1505.01257, 2015.
[52] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In CVPR 2011, pages 1521–1528, June 2011.
[53] J. K. Tsotsos, S. M. Culhane, W. Y. K. Wai, Y. Lai, N. Davis, and F. Nuflo. Modeling visual attention via selective tuning. Artif. Intell., 78(1-2):507–545, Oct. 1995.
[54] B. Ustun and C. Rudin. Supersparse linear integer models for optimized medical scoring systems. Mach. Learn., 102(3):349–391, Mar. 2016.
[55] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, pages 1–15, 2016.
[56] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pages 818–833. Springer International Publishing, 2014.
[57] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Top-down neural attention by excitation backprop. International Journal of Computer Vision, Dec 2017.
[58] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2921–2929, June 2016.
[59] L. M. Zintgraf, T. S. Cohen, T. Adel, and M. Welling. Visualizing deep neural network decisions: Prediction difference analysis. CoRR, abs/1702.04595, 2017.
Approach | Based on | Model agnostic | Post hoc | Architectures | Instance level | Model level | Neg. & Pos. Exp.
Decomposition of the Prediction [38] | Perturbation | Yes | Yes | - | Yes | Partially | both
Deconvnet [56] | Occlusion sensitivity | No | Yes | CNNs without FC | Yes | No | pos.
Gradient vector [5] | Gradients | No | Yes | Bayes/SVM | Yes | No | both
Optimization Gradient [43] | Gradients | No | Yes | CNNs | Yes | Partially | pos.
Prediction difference analysis [59] | Perturbation | No | Yes | CNNs | Yes | No | both
LIME [37] | Perturbation | Yes | Yes | - | Yes | Partially | both
CAM [58] | GAP | No | Yes | Fully CNN | Yes | No | pos.
Guided backpropagation [44] | Gradients | No | Yes | CNN | Yes | No | pos.
Excitation backprop [57] | Selective tuning | No | Yes | CNN | Yes | No | pos.
Grad-CAM [39] / Grad-CAM++ [13] | GAP, Gradients | No | Yes | CNN, RI | Yes | No | pos.
Activation correlation [25] | Perturbation | No | Yes | CNN | Yes | No | pos.
CENs [1] | Linear model | No | No | CNN | Yes | No | both
DeepLIFT [40] | Gradients | No | Yes | CNN | Yes | No | pos.
SHAP [30] | Perturbation | Yes | Yes | CNN | Yes | No | both
LRP [4] | Message passing | No | Yes | DNN, BoW | Yes | No | both