Benchmarking Machine Reading Comprehension: A Psychological Perspective - arXiv.org

 
Benchmarking Machine Reading Comprehension:
                                                                   A Psychological Perspective

Saku Sugawara¹, Pontus Stenetorp², Akiko Aizawa¹
¹National Institute of Informatics, ²University College London
{saku,aizawa}@nii.ac.jp, [email protected]

arXiv:2004.01912v2 [cs.CL] 26 Jan 2021

Abstract

Machine reading comprehension (MRC) has received considerable attention as a benchmark for natural language understanding. However, the conventional task design of MRC lacks explainability beyond the model interpretation, i.e., reading comprehension by a model cannot be explained in human terms. To this end, this position paper provides a theoretical basis for the design of MRC datasets based on psychology as well as psychometrics, and summarizes it in terms of the prerequisites for benchmarking MRC. We conclude that future datasets should (i) evaluate the capability of the model for constructing a coherent and grounded representation to understand context-dependent situations and (ii) ensure substantive validity by shortcut-proof questions and explanation as a part of the task design.

1 Introduction

Evaluation of natural language understanding (NLU) is a long-standing goal in the field of artificial intelligence. Machine reading comprehension (MRC) is a task that tests the ability of a machine to read and understand unstructured text and could be the most suitable task for evaluating NLU because of its generic formulation (Chen, 2018). Recently, many large-scale datasets have been proposed, and deep learning systems have achieved human-level performance for some of these datasets.

However, analytical studies have shown that MRC models do not necessarily achieve human-level understanding. For example, Jia and Liang (2017) use manually crafted adversarial examples to show that successful systems are easily distracted. Sugawara et al. (2020) show that a significant part of already solved questions is solvable even after shuffling the words in a sentence or dropping content words. These studies demonstrate that we cannot explain what type of understanding is required by the datasets and is actually acquired by models. Although benchmarking MRC is related to the intent behind questions and is critical to test hypotheses from a top-down viewpoint (Bender and Koller, 2020), its theoretical foundation is poorly investigated in the literature.

In this position paper, we examine the prerequisites for benchmarking MRC based on the following two questions: (i) What does reading comprehension involve? (ii) How can we evaluate it? Our motivation is to provide a theoretical basis for the creation of MRC datasets. As Gilpin et al. (2018) indicate, interpreting the internals of a system is closely related only to the system’s architecture and is insufficient for explaining how the task is accomplished. This is because even if the internals of models can be interpreted, we cannot explain what is measured by the datasets. Therefore, our study focuses on the explainability of the task rather than the interpretability of models.

We first overview MRC and review the analytical literature that indicates that existing datasets might fail to correctly evaluate their intended behavior (Section 2). Subsequently, we present a psychological study of human reading comprehension in Section 3 for answering the what question. We argue that the concept of representation levels can serve as a conceptual hierarchy for organizing the technologies in MRC. Section 4 focuses on answering the how question. Here, we draw on psychometrics to analyze the prerequisites for the task design of MRC. Furthermore, we introduce the concept of construct validity, which emphasizes validating the interpretation of the task’s outcome. Finally, in Section 5, we explain the application of the proposed concepts to practical approaches, highlighting potential future directions toward the advancement of MRC. Regarding the what question, we indicate that datasets should evaluate the capability of the situation model, which refers to the construction
Question: What is reading comprehension?
    Foundation: Representation levels in human reading comprehension: (A) surface structure, (B) textbase, and (C) situation model.
    Requirements: (A) Linguistic-level sentence understanding, (B) comprehensiveness of skills for inter-sentence understanding, and (C) evaluation of coherent representation grounded to non-textual information.
    Future direction: (C) Dependence of context on defeasibility and novelty, and grounding to non-textual information with a long passage.

Question: How can we evaluate reading comprehension?
    Foundation: Construct validity in psychometrics: (1) content, (2) substantive, (3) structural, (4) generalizability, (5) external, and (6) consequential aspects.
    Requirements: (1) Wide coverage of skills, (2) evaluation of the internal process, (3) structured metrics, (4) reliability of metrics, (5) comparison with external variables, and (6) robustness to adversarial attacks and social biases.
    Future direction: (2) Creating shortcut-proof questions by filtering and ablation, and designing a task for validating the internal process.

Table 1: Overview of theoretical foundations, requirements, and future directions of MRC discussed in this paper.

of a coherent and grounded representation of text based on human understanding. Regarding the how question, we argue that among the important aspects of construct validity, substantive validity must be ensured, which requires the verification of the internal mechanism of comprehension.

Table 1 provides an overview of the perspectives taken in this paper. Our answers and suggestions to the what and how questions are summarized as follows: (1) Reading comprehension is the process of creating a situation model that best explains given texts and the reader’s background knowledge. The situation model should be the next focal point in future datasets for benchmarking human-level reading comprehension. (2) To evaluate reading comprehension correctly, the task needs to provide a rubric (scoring guide) for sufficiently covering the aspects of construct validity. In particular, substantive validity should be ensured by creating shortcut-proof questions and by designing a task formulation that is explanatory in itself.

2 Task Overview

2.1 Task Variations and Existing Datasets

MRC is a task in which a machine is given a document (context) and answers questions based on that context. Burges (2013) provides a general definition of MRC, i.e., a machine comprehends a passage of text if, for any question regarding that text that can be answered correctly by a majority of native speakers, that machine can provide a string which those speakers would agree answers that question. We overview various aspects of the task along with representative datasets as follows. Existing datasets are listed in Appendix A.

Context Styles A context can be given in various forms with different lengths such as a single passage (MCTest (Richardson et al., 2013)), a set of passages (HotpotQA (Yang et al., 2018)), a longer document (CBT (Hill et al., 2016)), or open domain (Chen et al., 2017). In some datasets, a context includes non-textual information such as images (RecipeQA (Yagcioglu et al., 2018)).

Question Styles A question can be an interrogative sentence (in most datasets), a fill-in-the-blank sentence (cloze) (CLOTH (Xie et al., 2018)), knowledge base entries (QAngaroo (Welbl et al., 2018)), or search engine queries (MSMARCO (Nguyen et al., 2016)).

Answering Styles An answer can be (i) chosen from a text span of the given document (answer extraction) (NewsQA (Trischler et al., 2017)), (ii) chosen from a candidate set of answers (multiple choice) (MCTest (Richardson et al., 2013)), or (iii) generated as a free-form text (description) (NarrativeQA (Kočiský et al., 2018)). Some datasets optionally allow answering by a yes/no reply (BoolQ (Clark et al., 2019)).

Sourcing Methods Initially, questions in small-scale datasets were created by experts (QA4MRE (Sutcliffe et al., 2013)). Later, fueling the development of neural models, most published datasets have more than a hundred thousand questions that are automatically created (CNN/Daily Mail (Hermann et al., 2015)), crowdsourced (SQuAD v1.1 (Rajpurkar et al., 2016)), or collected from examinations (RACE (Lai et al., 2017)).

Domains The most popular domain is Wikipedia articles (Natural Questions (Kwiatkowski et al., 2019)), but news articles are also used (Who-did-What (Onishi et al., 2016)). CliCR (Suster and Daelemans, 2018) and emrQA (Pampari et al., 2018) are datasets in the clinical domain. DuoRC
(Saha et al., 2018) uses movie scripts.

Specific Skills Several recently proposed datasets require specific skills including unanswerable questions (SQuAD v2.0 (Rajpurkar et al., 2018)), dialogues (CoQA (Reddy et al., 2019), DREAM (Sun et al., 2019)), multiple-sentence reasoning (MultiRC (Khashabi et al., 2018)), multi-hop reasoning (HotpotQA (Yang et al., 2018)), mathematical and set reasoning (DROP (Dua et al., 2019)), commonsense reasoning (CosmosQA (Huang et al., 2019)), coreference resolution (QuoRef (Dasigi et al., 2019)), and logical reasoning (ReClor (Yu et al., 2020)).

2.2 Benchmarking Issues

In some datasets, the performance of machines has already reached human-level performance. However, Jia and Liang (2017) indicate that models can easily be fooled by manual injection of distracting sentences. Their study revealed that questions simply gathered by crowdsourcing without careful guidelines or constraints are insufficient to evaluate precise language understanding.

This argument is supported by further studies across a variety of datasets. For example, Min et al. (2018) find that more than 90% of the questions in SQuAD (Rajpurkar et al., 2016) require obtaining an answer from only a single sentence, even though a whole passage is provided. Sugawara et al. (2018) show that large parts of twelve datasets are easily solved only by looking at the first few question tokens and attending to the similarity between the given questions and the context. Similarly, Feng et al. (2018) and Mudrakarta et al. (2018) demonstrate that models trained on SQuAD do not change their predictions even when the question tokens are partly dropped. Kaushik and Lipton (2018) also observe that question- and passage-only models perform well for some popular datasets. Min et al. (2019) and Chen and Durrett (2019) concurrently indicate that for multi-hop reasoning datasets, the questions are solvable only with a single paragraph and thus do not require multi-hop reasoning over multiple paragraphs. Zellers et al. (2019b) report that their dataset unintentionally contains stylistic biases in the answer options which are embedded by a language-based model.

Overall, these investigations highlight a grave issue in the task design, i.e., even if the models achieve human-level accuracies, we cannot prove that they successfully perform reading comprehension. This issue may be attributed to the low interpretability of black-box neural network models. However, a problem is that we cannot explain what is measured by the datasets even if we can interpret the internals of models. We speculate that this benchmarking issue in MRC can be attributed to the following two points: (i) we do not have a comprehensive theoretical basis of reading comprehension for specifying what we should ask (Section 3) and (ii) we do not have a well-established methodology for creating a dataset and for analyzing a model based on it (Section 4).¹ In the remainder of this paper, we argue that these issues can be addressed by using insights from the psychological study of reading comprehension and by implementing psychometric means of validation.

¹ These two issues loosely correspond to the plausibility and faithfulness of explanation (Jacovi and Goldberg, 2020). The plausibility is linked to what we expect as an explanation, whereas the faithfulness refers to how accurately we explain models’ reasoning process.

3 Reading Comprehension from Psychology to MRC

3.1 Computational Model in Psychology

Human text comprehension has been studied in psychology for a long time (Kintsch and Rawson, 2005; Graesser et al., 1994; Kintsch, 1988). Connectionist and computational architectures have been proposed for such comprehension including a mechanism pertinent to knowledge activation and memory storage. Among the computational models, the construction–integration (CI) model is the most influential and provides a strong foundation for the field (McNamara and Magliano, 2009).

The CI model assumes three different representation levels as follows:

    • Surface structure is the linguistic information of particular words, phrases, and syntax obtained by decoding the raw textual input.

    • Textbase is a set of propositions in the text, where the propositions are locally connected by inferences (microstructure).

    • Situation model is a situational and coherent mental representation in which the propositions are globally connected (macrostructure), and it is often grounded not only to texts but also to sounds, images, and background information.

The CI model first decodes textual information (i.e., the surface structure) from the raw textual
input, then creates the propositions (i.e., textbase) and their local connections occasionally using the reader’s knowledge (construction), and finally constructs a coherent representation (i.e., situation model) that is organized according to five dimensions, namely time, space, causation, intentionality, and objects (Zwaan and Radvansky, 1998), which provides a global description of the events (integration). These steps are not exclusive, i.e., propositions are iteratively updated in accordance with the surrounding ones with which they are linked. Although the definition of successful text comprehension can vary, Hernández-Orallo (2017) indicates that comprehension implies the process of creating (or searching for) a situation model that best explains the given text and the reader’s background knowledge (Zwaan and Radvansky, 1998). We use this definition to highlight that the creation of a situation model plays a vital role in human reading comprehension.

Our aim in this section is to provide a basis for explaining what reading comprehension is, which requires terms for explanation. In the computational model above, the representation levels appear to be useful for organizing such terms. We ground existing NLP technologies and tasks to different representation levels in the next section.

3.2 Skill Hierarchy for MRC

Here, we associate the existing NLP tasks with the three representation levels introduced above. The biggest advantage of MRC is its general formulation, which makes it the most general task for evaluating NLU. This emphasizes the importance of the requirement of various skills in MRC, which can serve as the units for the explanation of reading comprehension. Therefore, our motivation is to provide an overview of the skills as a hierarchical taxonomy and to highlight the missing aspects in existing MRC datasets that are required for comprehensively covering the representation levels.

Existing Taxonomies We first provide a brief overview of the existing taxonomies of skills in NLU tasks. For recognizing textual entailment (Dagan et al., 2006), several studies present a classification of reasoning and commonsense knowledge (Bentivogli et al., 2010; Sammons et al., 2010; LoBue and Yates, 2011). For scientific question answering, Jansen et al. (2016) categorize knowledge and inference for an elementary-level dataset. Similarly, Boratko et al. (2018) propose types of knowledge and reasoning for scientific questions in MRC (Clark et al., 2018). A limitation of both of these studies is that the proposed sets of knowledge and inference are limited to the domain of elementary-level science. Although some existing datasets for MRC have their own classifications of skills, they are coarse and only cover a limited extent of typical NLP tasks (e.g., word matching and paraphrasing). In contrast, for a more generalizable definition, Sugawara et al. (2017) propose a set of 13 skills for MRC. Rogers et al. (2020) pursue this direction by proposing a set of questions with eight question types. In addition, Schlegel et al. (2020) propose an annotation schema to investigate requisite knowledge and reasoning. Dunietz et al. (2020) propose a template of understanding that consists of spatial, temporal, causal, and motivational questions to evaluate precise understanding of narratives with reference to human text comprehension.

In what follows, we describe the three representation levels that basically follow the three representations of the CI model but are modified for MRC. The three levels are shown in Figure 1. We emphasize that we do not intend to create exhaustive and rigid definitions of skills. Rather, we aim to place them in a hierarchical organization, which can serve as a foundation to highlight the missing aspects in current MRC.

Surface Structure This level broadly covers the linguistic information and its semantic meaning, which can be based on the raw textual input. Although these features form a proposition according to psychology, they should be viewed as a sentence-level semantic representation in computational linguistics. This level includes part-of-speech tagging, syntactic parsing, dependency parsing, punctuation recognition, named entity recognition (NER), and semantic role labeling (SRL). Although these basic tasks can be accomplished by some recent pretraining-based neural language models (Liu et al., 2019), they are rarely required in NLU tasks including MRC. In the natural language inference task, McCoy et al. (2019) indicate that existing datasets (e.g., Bowman et al. (2015)) may fail to elucidate the syntactic understanding of given sentences. Although it is not obvious that these basic tasks should be included in MRC and it is not easy to circumscribe linguistic knowledge from concrete and abstract knowledge (Zaenen et al., 2005; Manning, 2006), we should always care about the capabilities of basic tasks (e.g., use of checklists
Situation model: Construct the global structure of propositions.
    Skills: creating a coherent representation and grounding it to other media.

Textbase: Construct the local relations of propositions.
    Skills: recognizing relations between sentences such as coreference resolution, knowledge reasoning, and understanding discourse relations.

Surface structure: Creating propositions from the textual input.
    Skills: syntactic and dependency parsing, POS tagging, SRL, and NER.

            Figure 1: Representation levels and corresponding natural language understanding skills.

(Ribeiro et al., 2020)) when the performance of a model is being assessed.

Textbase This level covers local relations of propositions in the computational model of reading comprehension. In the context of NLP, it refers to various types of relations between sentences. These relations include not only the typical relations between sentences (discourse relations) but also the links between entities. Consequently, this level includes coreference resolution, causality, temporal relations, spatial relations, text structuring relations, logical reasoning, knowledge reasoning, commonsense reasoning, and mathematical reasoning. We also include multi-hop reasoning (Welbl et al., 2018) at this level because it does not necessarily require a coherent global representation over a given context. For studying the generalizability of MRC, Fisch et al. (2019) propose a shared task featuring training and testing on multiple domains. Talmor and Berant (2019) and Khashabi et al. (2020) also find that training on multiple datasets leads to robust generalization. However, unless we make sure that datasets require various skills with sufficient coverage, it might remain unclear whether we evaluate the transferability of a model’s reading comprehension ability.

Situation Model This level targets the global structure of propositions in human reading comprehension. It includes a coherent and situational representation of a given context and its grounding to non-textual information. A coherent representation has well-organized sentence-to-sentence transitions (Barzilay and Lapata, 2008), which are vital for using procedural and script knowledge (Schank and Abelson, 1977). This level also includes characters’ goals and plans, meta perspectives including the author’s intent and attitude, thematic understanding, and grounding to other media. Most existing MRC datasets seem to struggle to target the situation model. We discuss this further in Section 5.1.

    Passage: The princess climbed out the window of the high tower and climbed down the south wall when her mother was sleeping. She wandered out a good way. Finally, she went into the forest where there are no electric poles.

    Q1: Who climbed out of the castle? A: Princess
    Q2: Where did the princess wander after escaping? A: Forest
    Q3: What would happen if her mother was not sleeping? A: the princess would be caught soon (multiple choice)

Figure 2: Example questions of the different representation levels. The passage is taken from MCTest.

Example The representation levels in the example shown in Figure 2 are described as follows. Q1 is at the surface-structure level where a reader only needs to understand the subject of the first event. We expect that Q2 requires understanding of relations among described entities and events at the textbase level; the reader may need to understand who “she” means using coreference resolution. “Escaping” in Q2 also requires the reader’s commonsense to associate it with the first event. However, the reader might be able to answer this question only by looking for a place (specified by “where”) described in the passage, which makes it necessary to validate that the question correctly evaluates the understanding of the described events. Q3 is an example that requires imagining a different situation at the situation-model level, which could be further associated with a grounding question such as “which figure best depicts the given passage?”

In summary, we indicate that the following features might be missing in existing datasets:

    • Considering the capability to acquire basic understanding of the linguistic-level information.

    • Ensuring that the questions comprehensively specify and evaluate textbase-level skills.

    • Evaluating the capability of the situation model in which propositions are coherently organized
      and are grounded to non-textual information.

Should MRC models mimic human text comprehension? In this paper, we do not argue that MRC models should mimic human text comprehension. However, when we design an NLU task and create datasets for testing human-like linguistic generalization, we can refer to the aforementioned features to frame the intended behavior to evaluate in the task. As Linzen (2020) discusses, the task design is orthogonal to how the intended behavior is realized at the implementation level (Marr, 1982).

4 MRC on Psychometrics

In this section, we provide a theoretical foundation for the evaluation of MRC models. When MRC measures the capability of reading comprehension, validation of the measurement is crucial to obtain a reliable and useful explanation. Therefore, we focus on psychometrics, a field of study concerned with the assessment of the quality of psychological measurement (Furr, 2018). We expect that the insights obtained from psychometrics can facilitate a better task design. In Section 4.1, we first review the concept of validity in psychometrics. Subsequently, in Section 4.2, we examine the aspects that correspond to construct validity in MRC and then indicate the prerequisites for verifying the intended explanation of MRC in its task design.

4.1 Construct Validity in Psychometrics

According to psychometrics, construct validity is necessary to validate the interpretation of outcomes of psychological experiments.² Messick (1995) reports that construct validity consists of the six aspects shown in Table 2.

In the design of educational and psychological measurement, these aspects collectively provide verification questions that need to be answered for justifying the interpretation and use of test scores.

4.2 Construct Validity in MRC

Table 2 also lists MRC features corresponding to the six aspects of construct validity. In what follows, we elaborate on these correspondences and discuss the missing aspects that are needed to achieve the construct validity of current MRC.

Content Aspect As discussed in Section 3, sufficiently covering the skills across all the representation levels is an important requirement of MRC. It may be desirable that an MRC model is simultaneously evaluated on various skill-oriented examples.

Substantive Aspect This aspect appraises the evidence for the consistency of model behavior. We consider that this is the most important aspect for explaining reading comprehension, a process that subsumes various implicit and complex steps. To obtain a consistent response from an MRC system, it is necessary to ensure that the questions correctly assess the internal steps in the process of reading comprehension. However, as stated in Section 2.2, most existing datasets fail to verify that a question is solved by using an intended skill, which implies that it cannot be proved that a successful system can actually perform intended comprehension.

Structural Aspect Another issue in current MRC datasets is that they provide only simple accuracy as a metric. Given that the substantive aspect necessitates the evaluation of the internal process of reading comprehension, the structure of metrics needs to reflect it. However, only a few studies have attempted to provide a dataset with multiple metrics. For example, Yang et al. (2018) not only ask for the answers to questions but also provide sentence-level supporting facts. This metric can also evaluate the process of multi-hop reasoning whenever the supporting sentences need to be understood for answering a question. Therefore, we need to consider both substantive and structural aspects.
In this sense, the construct validation can be viewed
as an empirical evaluation of the meaning and con-              Generalizability Aspect The generalizability of
sequence of measurement. Given that MRC is in-                  MRC can be understood from the reliability of met-
tended to capture the reading comprehension abil-               rics and the reproducibility of findings. For the
ity, the task designers need to be aware of these               reliability of metrics, we need to take care of the re-
validity aspects. Otherwise, users of the task can-             liability of gold answers and model predictions. Re-
not justify the score interpretation, i.e., it cannot be        garding the accuracy of answers, the performance
confirmed that successful systems actually perform              of the model becomes unreliable when the answers
intended reading comprehension.                                 are unintentionally ambiguous or impractical. Be-
    2
                                                                cause the gold answers in most datasets are only
     In psychology, a construct is an abstract concept, which
facilitates the understanding of human behavior such as vo-     decided by the majority vote of crowd workers,
cabulary, skills, and comprehension.                            the ambiguity of the answers is not considered. It
Validity aspects      Definition in psychometrics                             Correspondence in MRC

 1. Content            Evidence of content relevance, representativeness,      Questions require reading comprehension skills
                       and technical quality.                                  with sufficient coverage and representativeness
                                                                               over the representation levels.
 2. Substantive        Theoretical rationales for the observed consisten-      Questions correctly evaluate the intended inter-
                       cies in the test responses including task perfor-       mediate process of reading comprehension and
                       mance of models.                                        provide rationales to the interpreters.
 3. Structural         Fidelity of the scoring structure to the structure of   Correspondence between the task structure and
                       the construct domain at issue.                          the score structure.
 4. Generalizability   Extent to which score properties and interpretations    Reliability of test scores in correct answers and
                       can be generalized to and across population groups,     model predictions, and applicability to other
                       settings, and tasks.                                    datasets and models.
 5. External           Extent to which the assessment scores’ relationship     Comparison of the performance of MRC with
                       with other measures and non-assessment behaviors        that of other NLU tasks and measurements.
                       reflect the expected relations.
 6. Consequential      Value implications of score interpretation as a basis   Considering the model vulnerabilities to adver-
                       for the consequences of test use, especially regard-    sarial attacks and social biases of models and
                       ing the sources of invalidity related to issues of      datasets to ensure the fairness of model outputs.
                       bias, fairness, and distributive justice.

         Table 2: Aspects of the construct validity in psychometrics and corresponding features in MRC.

may be useful if such ambiguity can be reflected in the evaluation (e.g., using item
response theory (Lalor et al., 2016)). As for model predictions, an issue may be the
reproducibility of results (Bouthillier et al., 2019), which implies that the
reimplementation of a system generates statistically similar predictions. For the
reproducibility of models, Dror et al. (2018) emphasize statistical testing methods to
evaluate models. For the reproducibility of findings, Bouthillier et al. (2019) frame it
as the transferability of findings from one dataset/task to another dataset/task. In
open-domain question answering, Lewis et al. (2021) point out that successful models
might only memorize dataset-specific knowledge. To facilitate this transferability, we
need to have units of explanation that can be used in different datasets (Doshi-Velez
and Kim, 2018).

External Aspect. This aspect refers to the relationship between a model's scores on
different tasks. Yogatama et al. (2019) point out that current models struggle to
transfer their ability from the task they were originally trained on (e.g., MRC) to
different unseen tasks (e.g., SRL). To develop a general NLU model, one would expect that
a successful MRC model should show sufficient performance on other NLU tasks as well. To
this end, Wang et al. (2019) propose an evaluation framework with ten different NLU tasks
in the same format.

Consequential Aspect. This aspect refers to the actual and potential consequences of test
use. In MRC, this refers to the use of a successful model in practical situations outside
the tasks themselves, where we need to ensure the robustness of a model to adversarial
attacks and the accountability for unintended model behaviors. Wallace et al. (2019)
highlight this aspect by showing that existing NLP models are vulnerable to adversarial
examples, thereby generating egregious outputs.

Summary: Design of Rubric. Given the validity aspects, our suggestion is to design a
rubric (a scoring guide used in education) of what reading comprehension we expect to be
evaluated in a dataset; this helps to inspect detailed strengths and weaknesses of models
that cannot be obtained by simple accuracy alone. The rubric should not only cover
various linguistic phenomena (the content aspect) but also involve different levels of
intermediate evaluation in the reading comprehension process (the substantive and
structural aspects) as well as stress testing against adversarial attacks (the
consequential aspect). The rubric is similar in motivation to dataset statements (Bender
and Friedman, 2018; Gebru et al., 2018); however, taking the validity aspects into
account would improve its substance.

5 Future Directions

This section discusses potential future directions toward answering the what and how
questions in Sections 3 and 4. In particular, we infer that the situation model and
substantive validity are critical for benchmarking human-level MRC.

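
As a rough illustration of the rubric suggested in the summary of Section 4, the sketch
below tags each test question with the representation level and skill it is meant to
assess, plus a flag for adversarial stress tests, and reports accuracy per rubric cell
instead of a single aggregate number. The names RubricItem and rubric_report, the level
and skill labels, and the toy predictions are all hypothetical; they are assumptions made
for illustration and are not taken from any existing dataset or from a specific proposal
in this paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One test question with its intended target (hypothetical schema)."""
    question_id: str
    level: str                 # representation level, e.g. "situation model"
    skill: str                 # intended skill; the label set is an assumption
    adversarial: bool = False  # True for stress-test items (consequential aspect)


def rubric_report(items, is_correct):
    """Return accuracy per (level, skill) cell and per stress-test condition.

    `is_correct` maps question_id -> bool for one model's predictions.
    """
    totals = defaultdict(lambda: [0, 0])  # cell -> [num_correct, num_items]
    for item in items:
        hit = int(bool(is_correct[item.question_id]))
        cells = [
            ("skill", item.level, item.skill),
            ("stress", "adversarial" if item.adversarial else "standard"),
        ]
        for cell in cells:
            totals[cell][0] += hit
            totals[cell][1] += 1
    return {cell: correct / count for cell, (correct, count) in totals.items()}


# Toy example with made-up items and predictions.
items = [
    RubricItem("q1", "surface structure", "coreference"),
    RubricItem("q2", "surface structure", "coreference", adversarial=True),
    RubricItem("q3", "situation model", "defeasible inference"),
]
report = rubric_report(items, {"q1": True, "q2": False, "q3": True})
for cell, accuracy in sorted(report.items()):
    print(cell, round(accuracy, 2))
```

In this toy run the overall accuracy is 2/3, while the per-cell view shows that the only
failure is the adversarial coreference item, which is the kind of detail a single
accuracy score would hide.
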
5.1 What Question: Situation Model

As mentioned in Section 3, existing datasets fail to fully assess the ability to create
the situation model. As a future direction, we suggest that the task should deal with two
features of the situation model: context dependency and grounding.

5.1.1 Context-dependent Situations

A vital feature of the situation model is that it is conditioned on a given text, i.e., a
representation is constructed distinctively depending on the given context. We elaborate
on it by discussing two key features: defeasibility and novelty.

Defeasibility. The defeasibility of a constructed representation implies that a reader
can modify and revise it according to newly acquired information (Davis and Marcus, 2015;
Schubert, 2015). The defeasibility of NLU has been tackled in tasks such as if-then
reasoning (Sap et al., 2019a), abductive reasoning (Bhagavatula et al., 2020),
counterfactual reasoning (Qin et al., 2019), and contrast sets (Gardner et al., 2020). A
possible approach in MRC is to ask questions about a set of modified passages that
describe slightly different situations, where the same question can lead to different
conclusions.

Novelty. An example showing the importance of contextual novelty is "Could a crocodile
run a steeplechase?" by Levesque (2014). This question poses a novel situation where the
solver needs to combine multiple pieces of commonsense knowledge to derive the correct
answer. If only non-fiction documents, such as newspaper and Wikipedia articles, are
used, some questions require only reasoning over facts already known in web-based
corpora. Fictional narratives may be a better source for creating questions on novel
situations.

5.1.2 Grounding to Other Media

In MRC, grounding texts to non-textual information has not yet been fully explored.
Kembhavi et al. (2017) propose a dataset based on science textbooks, which contains
questions with passages, diagrams, and images. Kahou et al. (2018) propose a figure-based
question answering dataset that requires the understanding of figures including line
plots and bar charts. Although another approach could be vision-based question answering
tasks (Antol et al., 2015; Zellers et al., 2019a), we cannot directly use them for
evaluating NLU because they focus on understanding of images rather than texts. Similarly
to the textbook questions (Kembhavi et al., 2017), a possible approach would be to create
questions for understanding of texts by showing figures. We might also need to account
for the scope of grounding (Bisk et al., 2020), i.e., ultimately understanding human
language in a social context beyond simply associating texts with perceptual information.

5.2 How Question: Substantive Validity

Substantive validity requires us to ensure that the questions correctly assess the
internal steps of reading comprehension. We discuss two approaches for this challenge:
creating shortcut-proof questions and ensuring explanation by design.

5.2.1 Shortcut-proof Questions

Gururangan et al. (2018) reveal that NLU datasets can contain unintended dataset biases
embedded by annotators. If machine learning models exploit such biases for answering
questions, we cannot evaluate the precise NLU of models. Following Geirhos et al. (2020),
we define shortcut-proof questions as ones that prevent models from exploiting dataset
biases and learning decision rules (shortcuts) that perform well only on i.i.d. test
examples with regard to their training examples. Gardner et al. (2019) also point out the
importance of mitigating shortcuts in MRC. In this section, we consider two different
approaches for this challenge.

Removing Unintended Biases by Filtering. Zellers et al. (2018) propose a model-based
adversarial filtering method that iteratively trains an ensemble of stylistic classifiers
and uses them to filter out the questions. Sakaguchi et al. (2020) also propose filtering
methods based on both machines and humans to alleviate dataset-specific and
word-association biases. However, a major issue is the inability to discern knowledge
from bias in a closed domain. When the domain is equal to a dataset, patterns that are
valid only in the domain are called dataset-specific biases (or annotation artifacts in
the labeled data). When the domain covers larger corpora, the patterns (e.g., frequency)
are called word-association biases. When the domain includes everyday experience,
patterns are called commonsense. However, as mentioned in Section 5.1, commonsense
knowledge can be defeasible, which implies that the knowledge can be false in unusual
situations. In contrast, when the domain is our real world, indefeasible patterns are
called factual knowledge. Therefore, the distinction between bias and knowledge depends
on where the pattern is recognized. This means that a dataset should be created so that
it can evaluate reasoning on the intended knowledge. For example, to test defeasible
reasoning, we must filter out questions that are solvable by ordinary commonsense alone.
If we want to investigate the reading comprehension ability without depending on factual
knowledge, we can consider counterfactual or fictional situations.

Identifying Requisite Skills by Ablating Input Features. Another approach is to verify
shortcut-proof questions by analyzing the human answerability of questions regarding
their key features. We speculate that if a question is still answerable by humans even
after removing the intended features, the question does not require understanding of the
ablated features (e.g., checking the necessity of resolving pronoun coreference after
replacing pronouns with dummy nouns). Even if we cannot accurately identify all such
necessary features, by identifying some of them in a sufficient number of questions, we
could expect that the questions evaluate the corresponding intended skill. In a similar
vein, Geirhos et al. (2020) argue that a dataset is useful only if it is a good proxy for
the underlying ability one is actually interested in.

5.2.2 Explanation by Design

Another approach for ensuring the substantive validity is to include explicit explanation
in the task formulation. Although gathering human explanations is costly, the following
approaches can facilitate the explicit verification of a model's understanding using a
few test examples.

Generating Introspective Explanation. Inoue et al. (2020) distinguish two types of
explanation in text comprehension: justification explanation and introspective
explanation. The justification explanation only provides a collection of supporting facts
for making a certain decision, whereas the introspective explanation provides the
derivation of the answer for making the decision, which can cover linguistic phenomena
and commonsense knowledge not explicitly mentioned in the text. They annotate multi-hop
reasoning questions with introspective explanations and propose a task that requires the
derivation of the correct answer to a given question to improve the explainability.
Rajani et al. (2019) collect human explanations for commonsense reasoning and improve the
system's performance by modeling the generation of the explanation. Although we must take
into account the faithfulness of explanation, asking for introspective explanations could
be useful in inspecting the internal reasoning process, e.g., by extending the task
formulation so that it includes auxiliary questions that consider the intermediate facts
in a reasoning process. For example, before answering Q2 in Figure 2, a reader should be
able to answer "who escaped?" and "where did she escape from?" at the surface-structure
level.

Creating Dependency Between Questions. Another approach for improving the substantive
validity is to create dependencies between questions, such that answering one question
correctly involves answering some other questions correctly. For example, Dalvi et al.
(2018) propose a dataset that requires a procedural understanding of scientific facts. In
their dataset, a set of questions corresponds to the steps of the entire process of a
scientific phenomenon. Therefore, this set can be viewed as a single question that
requires a complete understanding of the scientific phenomenon. In CoQA (Reddy et al.,
2019), it is noted that questions often have pronouns that refer back to nouns appearing
in previous questions. These mutually-dependent questions can probably facilitate the
explicit validation of the models' understanding of given texts.

6 Conclusion

In this paper, we outlined current issues and future directions for benchmarking machine
reading comprehension. We drew on psychological studies to analyze what we should ask of
reading comprehension and on construct validity in psychometrics to analyze how we should
correctly evaluate it. We deduced that future datasets should evaluate the capability of
the situation model for understanding context-dependent situations and for grounding to
non-textual information, and should ensure the substantive validity by creating
shortcut-proof questions and designing an explanatory task formulation.

Acknowledgments

The authors would like to thank Xanh Ho for helping create the dataset list and the
anonymous reviewers for their insightful comments. This work was supported by JSPS
KAKENHI Grant Number 18H03297, JST ACT-X Grant Number JPMJAX190G, and JST PRESTO Grant
Number JPMJPR20C4.

References

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra,
  C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In
  Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.

Max Bartolo, Alastair Roberts, Johannes Welbl, Sebastian Riedel, and Pontus Stenetorp.
  2020. Beat the AI: Investigating adversarial human annotation for reading
  comprehension. Transactions of the Association for Computational Linguistics,
  8:662–678.

Regina Barzilay and Mirella Lapata. 2008. Modeling local coherence: An entity-based
  approach. Computational Linguistics, 34(1):1–34.

Emily M. Bender and Batya Friedman. 2018. Data statements for natural language
  processing: Toward mitigating system bias and enabling better science. Transactions of
  the Association for Computational Linguistics, 6:587–604.

Emily M. Bender and Alexander Koller. 2020. Climbing towards NLU: On meaning, form, and
  understanding in the age of data. In Proceedings of the 58th Annual Meeting of the
  Association for Computational Linguistics, pages 5185–5198, Online. Association for
  Computational Linguistics.

Luisa Bentivogli, Elena Cabrio, Ido Dagan, Danilo Giampiccolo, Medea Lo Leggio, and
  Bernardo Magnini. 2010. Building textual entailment specialized data sets: a
  methodology for isolating linguistic phenomena relevant to inference. In Proceedings
  of the Seventh International Conference on Language Resources and Evaluation
  (LREC’10), Valletta, Malta. European Language Resources Association (ELRA).

Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman,
  Hannah Rashkin, Doug Downey, Wen-tau Yih, and Yejin Choi. 2020. Abductive commonsense
  reasoning. In International Conference on Learning Representations.

Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai,
  Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, Nicolas Pinto,
  and Joseph Turian. 2020. Experience grounds language. In Proceedings of the 2020
  Conference on Empirical Methods in Natural Language Processing (EMNLP), pages
  8718–8735, Online. Association for Computational Linguistics.

Michael Boratko, Xiang Li, Tim O’Gorman, Rajarshi Das, Dan Le, and Andrew McCallum.
  2020. ProtoQA: A question answering dataset for prototypical common-sense reasoning.
  In Proceedings of the 2020 Conference on Empirical Methods in Natural Language
  Processing (EMNLP), pages 1122–1136, Online. Association for Computational
  Linguistics.

Michael Boratko, Harshit Padigela, Divyendra Mikkilineni, Pritish Yuvraj, Rajarshi Das,
  Andrew McCallum, Maria Chang, Achille Fokoue-Nkoutche, Pavan Kapanipathi, Nicholas
  Mattei, Ryan Musa, Kartik Talamadupula, and Michael Witbrock. 2018. A systematic
  classification of knowledge, reasoning, and context within the ARC dataset. In
  Proceedings of the Workshop on Machine Reading for Question Answering, pages 60–70.
  Association for Computational Linguistics.

Xavier Bouthillier, César Laurent, and Pascal Vincent. 2019. Unreproducible research is
  reproducible. In Proceedings of the 36th International Conference on Machine Learning,
  volume 97 of Proceedings of Machine Learning Research, pages 725–734, Long Beach,
  California, USA. PMLR.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A
  large annotated corpus for learning natural language inference. In Proceedings of the
  2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642.
  Association for Computational Linguistics.

Christopher J.C. Burges. 2013. Towards the machine comprehension of text: An essay.
  Technical report, Microsoft Research Technical Report MSR-TR-2013-125.

Vittorio Castelli, Rishav Chakravarti, Saswati Dana, Anthony Ferritto, Radu Florian,
  Martin Franz, Dinesh Garg, Dinesh Khandelwal, Scott McCarley, Michael McCawley,
  Mohamed Nasr, Lin Pan, Cezar Pendus, John Pitrelli, Saurabh Pujar, Salim Roukos,
  Andrzej Sakrajda, Avi Sil, Rosario Uceda-Sosa, Todd Ward, and Rong Zhang. 2020. The
  TechQA dataset. In Proceedings of the 58th Annual Meeting of the Association for
  Computational Linguistics, pages 1269–1278, Online. Association for Computational
  Linguistics.

Danqi Chen. 2018. Neural Reading Comprehension and Beyond. Ph.D. thesis, Stanford
  University.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to
  answer open-domain questions. In Proceedings of the 55th Annual Meeting of the
  Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879.
  Association for Computational Linguistics.

Jifan Chen and Greg Durrett. 2019. Understanding dataset design choices for multi-hop
  reasoning. In Proceedings of the 2019 Conference of the North American Chapter of the
  Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long
  and Short Papers), pages 4026–4032, Minneapolis, Minnesota. Association for
  Computational Linguistics.

Michael Chen, Mike D’Arcy, Alisa Liu, Jared Fernandez, and Doug Downey. 2019. CODAH: An
  adversarially-authored question answering dataset for common sense. In Proceedings of
  the 3rd Workshop on Evaluating Vector Space Representations for NLP, pages 63–69,
  Minneapolis, USA. Association for Computational Linguistics.

Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan Xiong, Hong Wang, and William Yang Wang.
  2020. HybridQA: A dataset of multi-hop question answering over tabular and textual
  data. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages
  1026–1036, Online. Association for Computational Linguistics.

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and
  Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of the
  2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184,
  Brussels, Belgium. Association for Computational Linguistics.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and
  Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no
  questions. In Proceedings of the 2019 Conference of the North American Chapter of the
  Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long
  and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for
  Computational Linguistics.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa
  Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try
  ARC, the AI2 reasoning challenge. CoRR, abs/1803.05457.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual
  entailment challenge. In Machine Learning Challenges Workshop, pages 177–190.
  Springer.

Bhavana Dalvi, Lifu Huang, Niket Tandon, Wen-tau Yih, and Peter Clark. 2018. Tracking
  state changes in procedural text: a challenge dataset and models for process paragraph
  comprehension. In Proceedings of the 2018 Conference of the North American Chapter of
  the Association for Computational Linguistics: Human Language Technologies, Volume 1
  (Long Papers), pages 1595–1604. Association for Computational Linguistics.

Pradeep Dasigi, Nelson F. Liu, Ana Marasovic, Noah A. Smith, and Matt Gardner. 2019.
  Quoref: A reading comprehension dataset with questions requiring coreferential
  reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural
  Language Processing and the 9th International Joint Conference on Natural Language
  Processing (EMNLP-IJCNLP), pages 5927–5934, Hong Kong, China. Association for
  Computational Linguistics.

Ernest Davis and Gary Marcus. 2015. Commonsense reasoning and commonsense knowledge in
  artificial intelligence. Commun. ACM, 58(9):92–103.

Bhuwan Dhingra, Kathryn Mazaitis, and William W. Cohen. 2017. Quasar: Datasets for
  question answering by search and reading.

Finale Doshi-Velez and Been Kim. 2018. Considerations for Evaluation and Generalization
  in Interpretable Machine Learning, 1st edition. Springer International Publishing.

Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The hitchhiker’s guide
  to testing statistical significance in natural language processing. In Proceedings of
  the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1:
  Long Papers), pages 1383–1392, Melbourne, Australia. Association for Computational
  Linguistics.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt
  Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning
  over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter
  of the Association for Computational Linguistics: Human Language Technologies, Volume
  1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota. Association for
  Computational Linguistics.

Jesse Dunietz, Greg Burnham, Akash Bharadwaj, Owen Rambow, Jennifer Chu-Carroll, and
  Dave Ferrucci. 2020. To test machine comprehension, start by defining comprehension.
  In Proceedings of the 58th Annual Meeting of the Association for Computational
  Linguistics, pages 7839–7859, Online. Association for Computational Linguistics.

Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Guney, Volkan Cirik, and Kyunghyun
  Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine.

Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, and Jordan
  Boyd-Graber. 2018. Pathologies of neural models make interpretations difficult. In
  Proceedings of the 2018 Conference on Empirical Methods in Natural Language
  Processing, pages 3719–3728. Association for Computational Linguistics.

James Ferguson, Matt Gardner, Hannaneh Hajishirzi, Tushar Khot, and Pradeep Dasigi.
  2020. IIRC: A dataset of incomplete information reading comprehension questions. In
  Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing
  (EMNLP), pages 1137–1147, Online. Association for Computational Linguistics.

Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA
  2019 shared task: Evaluating generalization in reading comprehension. In Proceedings
  of the 2nd Workshop on Machine Reading for Question Answering, pages 1–13, Hong Kong,
  China. Association for Computational Linguistics.

R Michael Furr. 2018. Psychometrics: an introduction. Sage Publications.

Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao
  Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish
  Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin,
  Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh,
  Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang,
  and Ben Zhou. 2020. Evaluating models’ local decision boundaries via
  contrast sets. In Findings of the Association for Computational
  Linguistics: EMNLP 2020, pages 1307–1323, Online. Association for
  Computational Linguistics.

Matt Gardner, Jonathan Berant, Hannaneh Hajishirzi, Alon Talmor, and Sewon
  Min. 2019. On making reading comprehension more comprehensive. In
  Proceedings of the 2nd Workshop on Machine Reading for Question Answering,
  pages 105–112, Hong Kong, China. Association for Computational Linguistics.

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan,
  Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for
  datasets. ArXiv preprint 1803.09010.

Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel,
  Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. 2020. Shortcut
  learning in deep neural networks. Nature Machine Intelligence,
  2(11):665–673.

Leilani H Gilpin, David Bau, Ben Z Yuan, Ayesha Bajwa, Michael Specter, and
  Lalana Kagal. 2018. Explaining explanations: An overview of
  interpretability of machine learning. In 2018 IEEE 5th International
  Conference on data science and advanced analytics (DSAA), pages 80–89.
  IEEE.

Arthur C. Graesser, Murray Singer, and Tom Trabasso. 1994. Constructing
  inferences during narrative text comprehension. Psychological review,
  101(3):371.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel
  Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language
  inference data. In Proceedings of the 2018 Conference of the North American
  Chapter of the Association for Computational Linguistics: Human Language
  Technologies, Volume 2 (Short Papers), pages 107–112. Association for
  Computational Linguistics.

Ivan Habernal, Henning Wachsmuth, Iryna Gurevych, and Benno Stein. 2018. The
  argument reasoning comprehension task: Identification and reconstruction of
  implicit warrants. In Proceedings of the 2018 Conference of the North
  American Chapter of the Association for Computational Linguistics: Human
  Language Technologies, Volume 1 (Long Papers), pages 1930–1940, New
  Orleans, Louisiana. Association for Computational Linguistics.

Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu,
  Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, Tian Wu, and Haifeng Wang.
  2018. DuReader: a Chinese machine reading comprehension dataset from
  real-world applications. In Proceedings of the Workshop on Machine Reading
  for Question Answering, pages 37–46, Melbourne, Australia. Association for
  Computational Linguistics.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will
  Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read
  and comprehend. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and
  R. Garnett, editors, Advances in Neural Information Processing Systems 28,
  pages 1693–1701. Curran Associates, Inc.

José Hernández-Orallo. 2017. The measure of all minds: evaluating natural and
  artificial intelligence. Cambridge University Press.

Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew
  Fandrianto, Jay Han, Matthew Kelcey, and David Berthelot. 2016.
  WikiReading: A novel large-scale language understanding task over
  wikipedia. In Proceedings of the 54th Annual Meeting of the Association for
  Computational Linguistics (Volume 1: Long Papers), pages 1535–1545, Berlin,
  Germany. Association for Computational Linguistics.

Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2016. The
  goldilocks principle: Reading children’s books with explicit memory
  representations. In International Conference on Learning Representations.

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020.
  Constructing a multi-hop QA dataset for comprehensive evaluation of
  reasoning steps. In Proceedings of the 28th International Conference on
  Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online).
  International Committee on Computational Linguistics.

Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos
  QA: Machine reading comprehension with contextual commonsense reasoning. In
  Proceedings of the 2019 Conference on Empirical Methods in Natural Language
  Processing and the 9th International Joint Conference on Natural Language
  Processing (EMNLP-IJCNLP), pages 2391–2401, Hong Kong, China. Association
  for Computational Linguistics.

Naoya Inoue, Pontus Stenetorp, and Kentaro Inui. 2020. R4C: A benchmark for
  evaluating RC systems to get the right answer for the right reason. In
  Proceedings of the 58th Annual Meeting of the Association for Computational
  Linguistics, pages 6740–6750, Online. Association for Computational
  Linguistics.

Alon Jacovi and Yoav Goldberg. 2020. Towards faithfully interpretable NLP
  systems: How should we define and evaluate faithfulness? In Proceedings of
  the