Faithful Reasoning Using Large Language Models
Antonia Creswell and Murray Shanahan
Although contemporary large language models (LMs) demonstrate impressive question-answering capabilities, their answers are typically the product of a single call to the model. This entails an unwelcome degree of opacity and compromises performance, especially on problems that are inherently multi-step. To address these limitations, we show how LMs can be made to perform faithful multi-step reasoning via a process whose causal structure mirrors the underlying logical structure of the problem. Our approach works by chaining together reasoning steps, where each step results from calls to two fine-tuned LMs, one for selection and one for inference, to produce a valid reasoning trace. Our method carries out a beam search through the space of reasoning traces to improve reasoning quality. We demonstrate the effectiveness of our model on multi-step logical deduction and scientific question-answering, showing that it outperforms baselines on final answer accuracy, and generates humanly interpretable reasoning traces whose validity can be checked by the user.
Figure 1 | Example input and output from our Faithful Reasoning model.
Two further fine-tuned language models complete our architecture. First, the halter is used to terminate the reasoning process and return an answer in the required format. If the trace does not terminate within a specified number of steps then the answer is considered to be 'Unknown', allowing us to filter model answers and increase answer precision. Second, a learned value function, which assesses the quality of the current reasoning step, is deployed to guide a beam search over reasoning traces to enhance their quality and further boost overall performance.

We evaluate our model on two datasets, Proof Writer (Tafjord et al., 2021) and a question-answering version of EntailmentBank (Dalvi et al., 2021). We show that our model outperforms baseline models on final answer accuracy and that our proposed halter and search methods also lead to compounding boosts in performance (Tables 1 and 2). We show that in most cases SI produces higher quality reasoning traces than baseline models (Figures 8 and 9). It is less likely to "hallucinate" facts (Table 5), is better able to utilise the context (Table 3) and is more likely to use its trace to answer questions (Table 4). Finally, our model can accurately predict when it knows the answer (Figure 6).

2. Defining a Valid Reasoning Trace

In this section, we formally define the concept of valid forward reasoning in the context of our framework, adhering closely to textbook definitions from formal logic (e.g. Hamilton (1988)).

Definition 1. A reasoning step is a pair ⟨s, i⟩, where s (the selection) is a set of statements and i (the inference) is a statement.

Definition 2. A reasoning trace is a pair ⟨C, T⟩ where C (the context) is a set of statements and T is a sequence of reasoning steps.

Definition 3. A reasoning trace ⟨C, T⟩, where T = ⟨s₀, i₀⟩, ⟨s₁, i₁⟩, . . . , ⟨sₙ, iₙ⟩, is connected iff for every reasoning step ⟨sₖ, iₖ⟩, for every statement q in the set sₖ, either q ∈ C or q = iⱼ for some j < k.

Definition 4. A reasoning trace ⟨C, T⟩, where T = r₀, r₁, . . . , rₙ, is valid if it is connected and each reasoning step rₖ = ⟨s, i⟩ is correct (in the sense that i logically follows from s).
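To make the division of labour in these definitions concrete, here is a minimal Python sketch (ours, not from the paper; the names are illustrative) that checks connectedness mechanically and delegates the logical correctness of each step to a caller-supplied predicate:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    selection: frozenset  # s: the set of statements used by this step
    inference: str        # i: the statement concluded from the selection

def is_connected(context: set, trace: list) -> bool:
    """Defn. 3: every statement in each step's selection is either in the
    context or the inference of an earlier step (no hallucinated facts)."""
    derived = set(context)
    for step in trace:
        if not step.selection <= derived:
            return False
        derived.add(step.inference)
    return True

def is_valid(context: set, trace: list, step_is_correct) -> bool:
    """Defn. 4: the trace is connected and every inference logically
    follows from its selection (checked by the supplied predicate)."""
    return is_connected(context, trace) and all(
        step_is_correct(step.selection, step.inference) for step in trace
    )
```

Connectedness is purely structural and therefore checkable by machine; step correctness (the `step_is_correct` callback) requires a logic checker or human judgement, which is exactly the distinction the paper exploits.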
lead to compounding boosts in performance (Ta-
bles 2 and 1). We show that in most cases SI In the next section, we introduce the compo-
produces higher quality reasoning traces than nents of our architecture and show how it satisfies
baseline models (Figures 9 and 8). It is less likely the requirements of faithful reasoning, under cer-
to “hallucinate” facts (Table 5), is better able to tain assumptions.
utilise the context (Table 3) and is more likely to
use its trace to answer questions (Table 4). Fi-
nally, our model can accurately predict when it 3. Components of a Faithful Reason-
knows the answer (Figure 6). ing Model
We begin by introducing Selection-Inference (SI),
2. Defining a Valid Reasoning Trace the step-wise forward reasoning backbone whose
causal structure (see Figure 2) satisfies the re-
In this Section, we formally define the concept quirements for producing valid reasoning traces.
of valid forward reasoning in the context of our We then describe a component for halting, which
Figure 2 | Comparing dependencies between inputs and outputs for SI and related models. Inputs are shown as blue circles and LM outputs as purple circles. The order of the letters indicates the order in which the values are predicted. Arrows indicate the dependencies between the inputs (the context, C, and question, Q), the intermediate outputs (the selection, S, and inference, I), and the final answer, A. SI is the only model where the answer does not have a direct dependency on the question. Note that EntailmentWriter takes the hypothesis and context as input, where the hypothesis depends on the question and answer.
Figure 4 | The Selection model. The role of the Selection model is to take the context and question and select a number of statements from the context to feed to the Inference model. It is crucial that the Selection model is not able to 'hallucinate' facts. To achieve this we fine-tune a LM to predict sentence labels, as shown in (i). We then extract only the sentence labels (ii) and compose a sentence (iii). The statements from the context are then substituted back in (iv), resulting in a sentence composed of statements from the context.
ded in its weights. For example, if we are answering a multiple-choice question, we may provide the choices alongside the final inference, and use the model to output the choice that most closely matches that inference.

To determine if the system is ready to answer the question, we provide the Halter with a sentence of the following form: 'Question:{question} Given {inference}. Do you know the answer?'. The output of the Halter LM is then either 'Yes' or 'No'. If the output is 'Yes', the Halter LM is then prompted again to answer the question with a prompt of the following form: 'Given {inference}. Which of the following most closely matches: {choices}? Answer:'. The output is one of the choices.

The Halter is applied after each step of SI to the resulting inference. If the output of the Halter is 'Unknown' then we proceed to another iteration of SI. If the output of the Halter is an answer, then the process is terminated and the answer is returned. If, after a pre-specified number of SI iterations, the system has not halted, it returns the answer 'Unknown' (Alg. 2). An additional benefit of this is that it allows the model to say that it cannot answer the question, rather than making up an answer. We see a notable increase in performance when we remove questions that the model "thinks" it cannot answer (Figure 6). This has significant implications for trust, safety and the deployment of systems in the real world, where precision (rather than recall) is a priority.
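Putting these pieces together, the following Python sketch shows one way the SI + Halter loop could be wired up. The callables lm_selection, lm_inference and lm_halter, and the step budget, are illustrative assumptions standing in for the fine-tuned LMs; the prompt templates are the ones quoted in the text:

```python
MAX_STEPS = 10  # pre-specified number of SI iterations (assumed value)

def answer_question(context, question, choices,
                    lm_selection, lm_inference, lm_halter):
    trace = []
    for _ in range(MAX_STEPS):
        selection = lm_selection(context, question, trace)  # one SI step:
        inference = lm_inference(selection)                 # i_k = LM_Inference(s_k)
        trace.append((selection, inference))
        # Stage 1 of the Halter: can we answer yet?
        ready = lm_halter(f"Question:{question} Given {inference}. "
                          "Do you know the answer?")
        if ready == "Yes":
            # Stage 2: select the choice that most closely matches.
            return lm_halter(f"Given {inference}. Which of the following "
                             f"most closely matches: {choices}? Answer:")
    return "Unknown"  # did not halt within the step budget
```

Note that the answer is produced only from the final inference and the choices, never from the question and context directly, which is what prevents the cheating discussed later.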
3.3. Search: Finding the Best Trace

The selection module is non-deterministic, in the sense that it samples from multiple candidate statements, and this induces a tree of potential reasoning traces. We use beam search to explore this tree in order to find high quality traces. To enable this, we introduce a value function which computes the value of adding a reasoning step to the current trace. The value function is a language model, LM_value, fine-tuned on examples of partial reasoning traces that culminate in a "correct" or "incorrect" next step. A step is considered "correct" if it is both logically valid and is on the ground truth (shortest) reasoning path. A step is otherwise considered "incorrect".

Assuming that the sum of probabilities for "correct" and "incorrect" is close to one, we can use log p_value("correct" | reasoning trace) to score reasoning traces as they are being constructed, where p_value denotes the distribution over tokens learned by the language model, LM_value.

We use the value function to guide a beam search. Starting from a single empty trace we use SI to produce P candidate steps. We evaluate each of these steps using the value function and keep the top B ≤ P. We use SI again to generate P candidate next steps for each of the B traces, resulting in B × P traces. These are evaluated using the value function and the best B traces are kept. We continue this process until all the traces have halted.
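As a rough illustration of this procedure, the sketch below implements the described beam search assuming two helper callables: si_candidates, which proposes P candidate next steps via Selection-Inference, and log_p_correct, which returns log p_value("correct" | context, question, partial trace). Neither name comes from the paper.

```python
def beam_search(context, question, si_candidates, log_p_correct,
                has_halted, P=4, B=2):
    """Value-guided beam search over reasoning traces (requires B <= P)."""
    beams = [[]]  # start from a single empty trace
    while not all(has_halted(trace) for trace in beams):
        expanded = []
        for trace in beams:
            if has_halted(trace):
                expanded.append(trace)  # finished traces carry over as-is
                continue
            for step in si_candidates(context, question, trace, P):
                expanded.append(trace + [step])  # P candidates per trace
        # score every candidate trace with the value function, keep best B
        beams = sorted(expanded,
                       key=lambda t: log_p_correct(context, question, t),
                       reverse=True)[:B]
    return beams
```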
4. Experimental Setup and Evaluation of Components

In this section, we detail how each component in our faithful reasoning model is trained, and evaluate each component in isolation, where possible. We use two challenging reasoning datasets, Proof Writer (Tafjord et al., 2021) and a modified, more challenging, question-answering version of EntailmentBank (Dalvi et al., 2021). We use a 7B parameter Chinchilla language model in each of our components (Hoffmann et al., 2022).

4.1. Datasets

We fine-tune language models on examples of ground truth reasoning traces. Two datasets that provide reasoning traces are EntailmentBank (Dalvi et al., 2021) and Proof Writer (PW) (Tafjord et al., 2021) (see Section B.1.1 for details). Proof Writer is a dataset of logical reasoning problems that ask a question whose answer is True or False given a context, and provide step-by-step reasoning traces. Problems require 1, 2, 3 or 5 steps of reasoning. EntailmentBank is derived from the ARC (Clark et al., 2018) dataset of grade school science questions. Dalvi et al. (2021) provide a dataset of ⟨context, hypothesis, entailment tree⟩ triples. Dalvi et al. (2021) propose three tasks: Task 1, where the context consists of facts
Figure 5 | The two-stage Halter. First, the model determines if a question is answerable given the current inference. If it is, the model combines minimal additional information (that could not be used on its own to answer the question) and predicts the answer.
Unknown’ while the final inference has the target and 78.1% final answer accuracy on PW and EB
‘ True’ or ‘ False’. respectively significantly outperforming baseline
models (Table 1 and 2). We also perform an ab-
To evaluate each halter independently of the
lation to demonstrate the key role of search in
Proof Only baseline or SI model, it is applied
our model (Table 1 and 2). Compared to base-
it to the ground truth proofs from the test split.
line models, we show that our model often has
Tables 1 and 2 show results for PW and EB respec-
higher reasoning trace accuracy; this is most evi-
tively. We see that the PW halter performs almost
dent on the more challenging tasks, for example
perfectly while the EB halter achieves 88.8% ac-
PW depth-5 and EB Task 2 (Figure 8 and 9). Fi-
curacy.
nally, we evaluate reasoning trace validity (Sec-
The Halter endows our model with the desir- tion 5.4) showing that baseline model are less
able property of predicting when it does not know likely to leverage the context when answering
the answer. Figure 6 shows that our halter model questions (Table 4 and 3) and are more likely
can reliably predict when the answer is known. to “hallucinate” statements than SI (Table 6 and
When we filter out the problems where the model 5). All results in this paper were obtained using
does not know the answer, we obtain nearly per- 7B parameter Chinchilla language model models
fect accuracy on the PW dataset for all depths (Hoffmann et al., 2022).
and 87.5% & 83.7% accuracy on Task 1 and 2
of EB dataset respectively. This has significant
implications for the deployment of such models 5.1. Baselines
in scenarios where precision matters. We consider three baseline models. A Proof +
Answer baseline where the LM is trained to pre-
4.4. Search dict the whole proof followed by the answer. A
Proof Only baseline where the model is trained
The Value LM is trained to predict whether the to predict only the proof. We use the Proof Only
current step of a reasoning trace is ‘ correct’ or ‘ in- baseline to ablate the SI model by pairing it with
correct’. Again, we use the ground truth reasoning our halter and search methods (see Tables 2 and
traces to construct examples of correct and incor- 1). Finally, we include EntailmentWriter + An-
rect partial reasoning traces. Constructing the cor- swer. This is the entailment model of Dalvi et al.
rect examples is simple; we take a ground truth (2021), which is fine-tuned to predict an entail-
trace with 𝑁 steps and construct the following ment tree alone, extended for question-answering
input for all 𝑛 ∈ [1, 2, ..., 𝑁 ], ‘Context:{context} by training the model to predict the answer after
Question:{question} Reason:{reason[1:n]} The the final conclusion.
above reasoning steps are’. The target is ‘ correct’
for all of these examples. To create the negative While EntailmentWriter + Answer and Proof
examples we take each positive example and re- + Answer tend to be very good at predicting the
place one of the correct supporting statements intermediate inferences (See Figures 17a and 9b)
with a different, randomly chosen statement from they tend to be less good at selecting the correct
the context and use our Inference LM to predict statements (see Figures 8a and 9a) and overall
the entailment. These training examples have the they perform less well on final answer accuracy
target ‘ incorrect’. Examples for both Proof Writer (see Table 1 and 2). This suggests that the models
and EntailmentBank are shown in Figures 15 and are predicting the correct intermediate outputs,
14. without selecting the correct supporting state-
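This construction is mechanical, so a short sketch may help. The helper render and the representation of steps as (selection, inference) pairs are our assumptions; the prompt template and the ' correct' / ' incorrect' targets are the ones described above:

```python
import random

def render(steps):
    # hypothetical pretty-printer for a partial trace
    return " ".join(f"{' and '.join(sel)}. Therefore, {inf}."
                    for sel, inf in steps)

def value_examples(context, question, steps, lm_inference):
    """context: list of statements; steps: ground-truth trace given as
    (selection: list[str], inference: str) pairs."""
    examples = []
    for n in range(1, len(steps) + 1):
        good = steps[:n]
        prompt = (f"Context:{' '.join(context)} Question:{question} "
                  f"Reason:{render(good)} The above reasoning steps are")
        examples.append((prompt, " correct"))
        # negative example: swap one supporting statement of the last step
        # for a random context statement, then re-infer with the Inference LM
        sel, _ = good[-1]
        sel = list(sel)
        sel[random.randrange(len(sel))] = random.choice(context)
        bad = good[:-1] + [(sel, lm_inference(sel))]
        prompt = (f"Context:{' '.join(context)} Question:{question} "
                  f"Reason:{render(bad)} The above reasoning steps are")
        examples.append((prompt, " incorrect"))
    return examples
```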
5. Experiments and Results

We present results on both Proof Writer (PW) (Tafjord et al., 2021) and EntailmentBankQA (EB). We show that our model achieves 88.1% and 78.1% final answer accuracy on PW and EB respectively, significantly outperforming baseline models (Tables 1 and 2). We also perform an ablation to demonstrate the key role of search in our model (Tables 1 and 2). Compared to baseline models, we show that our model often has higher reasoning trace accuracy; this is most evident on the more challenging tasks, for example PW depth-5 and EB Task 2 (Figures 8 and 9). Finally, we evaluate reasoning trace validity (Section 5.4), showing that baseline models are less likely to leverage the context when answering questions (Tables 3 and 4) and are more likely to "hallucinate" statements than SI (Tables 5 and 6). All results in this paper were obtained using 7B parameter Chinchilla language models (Hoffmann et al., 2022).

5.1. Baselines

We consider three baseline models. A Proof + Answer baseline where the LM is trained to predict the whole proof followed by the answer. A Proof Only baseline where the model is trained to predict only the proof. We use the Proof Only baseline to ablate the SI model by pairing it with our halter and search methods (see Tables 1 and 2). Finally, we include EntailmentWriter + Answer. This is the entailment model of Dalvi et al. (2021), which is fine-tuned to predict an entailment tree alone, extended for question-answering by training the model to predict the answer after the final conclusion.

While EntailmentWriter + Answer and Proof + Answer tend to be very good at predicting the intermediate inferences (see Figures 17a and 9b), they tend to be less good at selecting the correct statements (see Figures 8a and 9a), and overall they perform less well on final answer accuracy (see Tables 1 and 2). This suggests that the models are predicting the correct intermediate outputs without selecting the correct supporting statements, and that the models are unable to use the reasoning trace to answer the question. We also see that baseline models, with the exception of EntailmentWriter, often make up facts when reasoning (see Table 5), suggesting that their traces are not connected and therefore are not valid (Defn. 3). Finally, baseline models leverage information in the context less well than our model (see Table 3), and Table 4 suggests that on Proof Writer, SI is the only model to consistently leverage the reasoning trace to answer questions.
Figure 6 | Our model accurately predicts when it ‘knows’ the answer. The ‘known only’ accuracy
is computed after filtering out the answers that are ‘Unknown’. The ‘all’ accuracy is computed on all
problems. This property is beneficial for applications that require high precision.
Figure 7 | The value function. Given the context, question and a partial reasoning trace, the model predicts the log probability that the current step is correct.
On inspection of EntailmentWriter (Dalvi et al., 2021) outputs on the Proof Writer dataset, we see that the model often 'cheats': the final inference helps to answer the question, but does not follow from the previously selected statements (see Section E.1). Our inference model does not have access to the question and therefore does not have the ability to cheat in this way.

5.2. Final Answer Accuracy

Tables 1 and 2 show final answer accuracy on the Proof Writer (PW) and EntailmentBankQA (EB) datasets respectively. Each table shows a comparison to baselines as well as an ablation, comparing both SI + Halter and the Proof Only + Halter baseline model with and without search. We see that SI outperforms the EntailmentWriter + Answer and Proof + Answer baseline models on all PW and EB tasks. We also show that search improves both baseline and SI performance, providing the most significant improvement for problems that require more reasoning steps (PW, depth-5) and on problems with distractors in the context (EB, Task 2).

On the EB dataset we see that SI + Halter + Search yields similar performance to Proof Only + Halter + Search while also providing faithful reasoning traces, which the Proof Only models do not. In fact, Table 5 shows that the Proof Only models are prone to hallucinating facts in up to 40% of problems, while SI has made up facts in only 1% of problems². In the next section we look at reasoning trace accuracy.

² This is likely a failure of the Selection model to produce an output with the correct syntax and could be filtered for.
Table 1 | Proof Writer ablation and comparison to baselines. Note that the baseline model does not produce faithful reasoning traces and has access to the question when answering. By contrast, in SI the reasoning is faithful and the answer depends on the reasoning trace. We show results using search with a value function trained on Proof Writer, PW Search, and with a value function trained on EntailmentBank, EB Search.

Model                        | Task 1 | Task 2
Ground truth proof + Halter  | 88.8%  | 88.8%
Proof + Answer               | 64.6%  | 7.8%
EntailmentWriter* + Answer   | 50.0%  | 35.0%
Proof Only + Halter          | 78.5%  | 60.3%
Proof Only + Halter + Search | 82.9%  | 76.2%
SI model + Halter            | 72.4%  | 55.9%
SI model + Halter + Search   | 83.2%  | 72.9%

Table 2 | EntailmentBankQA ablation and comparison to baselines. Note that the baseline models are not causal. We use 7B parameter LMs for all models. *(Dalvi et al., 2021)
text. Figure 9 shows the Jaccard similarity be-
tween predicted and ground-truth leaves (i.e. the
selection) and the as well as the rouge scores be-
5.3. Evaluating Reasoning Trace Accuracy tween predicted and target intermediate outputs
Here we evaluate the reasoning trace accuracy on the EB dataset (additional results in Figure
of each model on the PW and EB datasets, see 18).
Figures 8, 17 and 9. Note that high baseline performance on the
Evaluating reasoning trace accuracy on PW is intermediate outputs (Figures 17 and 9) also sug-
straightforward since we are able to use exact gests that the baseline models have been trained
string match to check whether two strings are well and means that their poor final answer ac-
the same. We show the Jaccard similarity be- curacy cannot be attributed to poor training but
tween predicted and ground truth leaves (i.e the rather to the baseline models’ inability to use the
selection, Figure 8a), intermediate outputs (i.e. reasoning trace to answer the question.
the inferences, Figure 17a) and steps (i.e. selec-
tion and inference, Figure 8b). Results show that 5.4. Trace Validity
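For reference, the Jaccard similarity used here is the standard set overlap; a minimal sketch, assuming selections are represented as sets of exact-match strings (which suffices for Proof Writer):

```python
def jaccard(predicted: set, target: set) -> float:
    """|intersection| / |union| of two sets of statements."""
    if not predicted and not target:
        return 1.0  # two empty selections are identical
    return len(predicted & target) / len(predicted | target)

# e.g. jaccard({"the rabbit sees the mouse"},
#              {"the rabbit sees the mouse", "the rabbit is blue"}) == 0.5
```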
5.4. Trace Validity

While the requirements of Definitions 1-3 are satisfied by the causal structure of our underlying model (Figure 2), the requirement of correctness, for Definition 4, is less strongly enforced.
(a) Jaccard similarity between the predicted and ground-truth selection, referred to as leaves, used to reason. We see that the SI models perform better than baseline models.
(b) Jaccard similarity between the predicted and ground-truth reasoning steps. We see that the SI models perform better than baseline models on the more challenging Task 2.
Figure 8 | Evaluating proof steps for Proof Writer. We compute the above values only on problems where the model predicts that the answer is not "Unknown". Additional analysis in Figure 17.
Nevertheless, we show below that, unlike baseline models, our model is not able to cheat and therefore the correctness assumption is more likely to hold. First, however, we demonstrate that while SI satisfies the requirement of being connected, other baseline models fail to do so.

5.4.1. SI produces connected traces

For a reasoning trace to be connected it must not hallucinate facts (Defn. 3). Tables 5 and 6 show that some baseline models fail to satisfy this requirement and often hallucinate facts. For example, the Proof + Answer baseline makes up facts to solve 60% of EntailmentBankQA problems. On the other hand, SI makes up facts less than 1% of the time, suggesting that more than 99% of traces produced by SI are connected reasoning traces.

5.4.2. SI produces correct inferences

Following Defn. 4, for a trace to be valid it must be connected (as above) and the steps must be correct; the inference must follow from the selection. Table 7 shows that when fed with a valid selection the inference model reliably produces the correct inference. It is harder to evaluate inference accuracy on EntailmentBankQA; however, Table 8 suggests that the inference model is accurate, with a RougeL score of 0.69.

5.4.3. SI uses its reasoning trace to answer the question

Unlike baseline models, SI's causal structure (see Figure 2) forces it to use the reasoning trace to answer the question. On the other hand, some baseline models are able to 'cheat', answering questions without reasoning properly over the context. In other words, they depend more on the knowledge embedded in their weights than on the context provided and the reasoning trace constructed. To investigate this, we evaluate the performance of a model that is given an incorrect context (a context different from the one needed to solve the problem) and compare this to performance when the model is given the correct context. If a model's answer depends on careful reasoning over the context, then it should be unable to answer the question when provided with a random context.

On the EntailmentBankQA dataset, we use a random context sampled from another problem in the dataset. Table 3 shows that both the Proof + Answer and EntailmentWriter + Answer models are still able to answer 30% and 23% of questions respectively, while SI + Halter is only able to answer 9%. We also see that while almost half of the final accuracy could be accounted for by 'cheating' or chance in the baseline models, less than 12.5% of SI + Halter final accuracy could be attributed to 'cheating' or chance.
(a) Jaccard similarity between the ground truth leaves (e.g. selection) and those used by the model. We see that SI outperforms all of the baseline models on the more challenging task, Task 2.
(b) Rouge score on the intermediate outputs (or inferences) from each step (ignoring order). The baseline models that do not use search or the halter perform poorly on Task 2.
Figure 9 | Evaluating reasoning steps for EntailmentBankQA. We compute the above values only on problems where the model predicts that the answer is not "Unknown". Note, none of these metrics account for order of the reasoning steps.
et al., 2021; Tafjord et al., 2021). In Proof Writer, Tafjord et al. (2021) train an LM to enumerate implications (and corresponding reasoning steps) given a hypothesis. A valid reasoning trace can be constructed from these outputs. However, this approach is limited to answering questions whose answer is True, False or Unknown, and a reasoning trace must be constructed post-hoc.

Finally, while several works have informally introduced the notion of faithful reasoning (Bostrom et al., 2022; Gupta et al., 2022; Kumar and Talukdar, 2020), we have related this more precisely to the definition of valid reasoning in logic.

6.3. Using Search for Reasoning Problems

The notion of valid and invalid reasoning traces has also been explored in the context of search. Jhamtani and Clark (2020) develop datasets of valid and invalid reasoning traces for grade school science questions. These can be used to train models to detect valid reasoning traces. However, it can be expensive to collect both valid and invalid reasoning traces; hence they collect only shallow traces, and their traces do not include intermediate inferences. Instead, we show how, given a valid reasoning trace, we can generate many invalid reasoning traces that can be used to fine-tune a value function and used to guide search. Also, rather than learning a verifier that evaluates a whole trace (Cobbe et al., 2021; Jhamtani and Clark, 2020; Nye et al., 2022), we train a model on partial reasoning traces, resulting in a model more similar to a value function which assesses the "value" of the current reasoning step, which can be used for step-level search.

Bostrom et al. (2022) also use step-level search to determine whether a hypothesis is entailed by a set of statements. While we perform a beam search, using a learned value function, to find high-quality reasoning traces, Bostrom et al. (2022) depend on exhaustive search to evaluate all possible pairs of statements to use for selection. Unlike Bostrom et al. (2022), our selection step is not limited to selecting just two statements. This allows us to more efficiently solve Proof Writer tasks whose rules may be conditioned on multiple statements.

6.4. The Problem of When to Stop Reasoning

The problem of when to "stop" rarely features in the deep learning literature because our models typically answer problems in a single step. However, there are some exceptions. A simple example is text synthesis with large language models, where the model has to determine when to stop producing tokens. This is often handled by a special 'End Of Sequence' token (Graves, 2013). Other examples in the deep learning literature draw random variables from a parameterised distribution
to predict when to stop reasoning (Banino et al., 2021; Graves, 2016).

Related work by Kadavath et al. (2022) also investigates when LMs "know" the answer. Their model proposes a number of candidates, and predicts whether each candidate is the answer to the question or not. Additionally, Bostrom et al. (2022) tackle the less challenging problem of determining whether an inference matches a goal state.

In summary, current work focuses on True/False/NLI tasks (Bostrom et al., 2022; Dalvi et al., 2021; Tafjord et al., 2021) while our work tackles question-answering. This is not a trivial difference. In question-answering, there is less information with which to construct the reasoning trace, since the "goal" is not known, and learning when to terminate is also more challenging. Moreover, current work leverages reasoning traces to boost performance, rather than to aid explainability or build trust, allowing for hallucination of "knowledge" during reasoning (Kojima et al., 2022; Wei et al., 2022). Furthermore, some existing approaches still allow the opportunity for "cheating" (Dalvi et al., 2021; Wei et al., 2022) by providing the answering part of the model with direct access to the question³. Finally, unlike most other models (Dalvi et al., 2021; Wei et al., 2022), the causal structure of our model (see Figure 10) mirrors the requirements for validity (see Table 10). Other approaches that do satisfy validity have their own limitations, as detailed above.

³ Specifically, in these cases the question itself contains sufficient information to supply the answer, unlike in Proof Writer where the question alone is not sufficient for answering correctly.

7. Limitations

The causal structure of our model mirrors the requirements for producing a valid trace (Defn. 4). Requirements for a connected reasoning trace (Defn. 3) are guaranteed by design (Section 3.1.1). Unavoidably, given our use of LMs, we cannot guarantee that all reasoning steps will be logically correct (Defn. 4). However, our architecture is designed to encourage logical correctness by preventing models from 'cheating'. For example, if the Selection model selects two unrelated statements, then the Inference model may draw a nonsensical conclusion. We also mitigate this by introducing a learned value function (Section 3.3) that filters out poor reasoning traces, although this still cannot guarantee the correctness of every step. Examples of both correct and incorrect reasoning traces, along with their value (according to the value function), are shown in Section D.

In this paper we have focused on developing models that answer questions using valid reasoning. For now we have assumed access to a context over which to reason. However, while there are some settings where such a context may be provided, in most real-world settings this is unlikely. In this paper we have chosen to focus on the challenging problem of multi-step reasoning within a given context. However, in future work we hope to incorporate retrieval to populate the context, and there is already interesting research in this direction (Dalvi et al., 2021; Ribeiro et al., 2022; Xie et al., 2020).

8. Discussion

Language models are being applied, with great success, to many different problems (Alayrac et al., 2022; Nakano et al., 2021; Nye et al., 2022; Rae et al., 2021; Zeng et al., 2022). However, they largely remain black boxes; we do not know how the models produce their responses. One solution to this is to develop models that can produce faithful reasoning traces. We characterise faithful reasoning in terms of logical validity (Section 2), and propose Selection-Inference, a model that mirrors the structure of this definition, and is guaranteed to produce valid reasoning traces under the assumption that individual steps are correct (Defn. 4). By fine-tuning an Inference model specifically for this task and preventing it from "cheating", we increase the likelihood that this assumption holds (Tables 7 and 8). Finally, to find high-quality reasoning traces, we introduce a value function, and use it to guide a beam search through the tree of potential traces induced by the non-determinism of selection.
The resulting model achieves higher final answer accuracy than baseline models on both Proof Writer (Tafjord et al., 2021) and EntailmentBankQA (Dalvi et al., 2021) tasks. We see that both Proof Only and SI benefit from search (Tables 1 and 2). When compared to baseline models, our model is less likely to hallucinate facts while reasoning (Tables 5 and 6). We see that the SI + Halter model is far more likely than baseline models to pay attention to the context (Table 3) and to leverage the reasoning trace (Table 4). Overall, we see that SI + Halter (+ Search) models achieve superior reasoning trace accuracy, especially on the more challenging tasks (Figures 8 and 9).

Our approach exemplifies a trend towards algorithmic prompting, a form of automated prompt engineering in which querying a language model becomes a computational primitive. The responses of the language model can be manipulated to construct new prompts that are then used to make further queries. Model queries and prompt construction are composed into algorithms with the usual computational constructs: sequence, choice, and iteration. Algorithmic prompting can be used to elicit more sophisticated and nuanced behaviour from a language model than would otherwise be possible. For example, as our work shows, this approach can be used to develop models capable of faithful reasoning, without compromising performance. In future work we aim to leverage advancements in retrieval to populate the context, rather than relying on the context being provided in the question.
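As a toy illustration (ours, not the paper's implementation) of composing LM calls with ordinary control flow, where lm is an assumed black-box completion function:

```python
def algorithmic_answer(lm, context, question, max_steps=5):
    """Compose LM queries with sequence, choice and iteration."""
    inference = ""
    for _ in range(max_steps):                              # iteration
        selection = lm(f"{context} {question} Select:")     # query, then...
        inference = lm(f"{selection} Therefore:")           # ...a new prompt
        context = f"{context} {inference}"
        if lm(f"{question} Given {inference}. Done?") == "Yes":  # choice
            break
    return lm(f"Given {inference}. Answer the question: {question}")
```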
References

E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 610-623, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383097. doi: 10.1145/3442188.3445922. URL https://doi.org/10.1145/3442188.3445922.

G. Betz, C. Voigt, and K. Richardson. Critical thinking for language models. In Proceedings of the 14th International Conference on Computational Semantics (IWCS), pages 63-75, Groningen, The Netherlands (online), June 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.iwcs-1.7.

K. Bostrom, Z. Sprague, S. Chaudhuri, and G. Durrett. Natural language deduction through search over statement compositions. arXiv preprint arXiv:2201.06028, 2022.

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

K. Cobbe, V. Kosaraju, M. Bavarian, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

A. Creswell, M. Shanahan, and I. Higgins. Selection-inference: Exploiting large language models for interpretable logical reasoning. arXiv preprint arXiv:2205.09712, 2022.
J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models, 2022.

Acknowledgements

The authors would like to thank Angeliki Lazaridou, Charles Blundell and Christopher Summerfield for feedback on our paper, as well as Jonathan Uesato, Jordi Grau-Moya, Ramana Kumar and Irina Higgins for insightful discussions.
i_k = LM_Inference(s_k)    (2)
B.1. Datasets

B.1.1. Proof Writer

The Proof Writer dataset (Tafjord et al., 2021) contains both a Closed and an Open World Assumption version (CWA and OWA respectively). We use a subset of the OWA dataset. This is because for the CWA dataset everything that cannot be proven is considered False. This means that problems whose answer is False do not have reasoning traces. On the other hand, the OWA dataset contains proofs for problems whose answers are True and False. Those without proofs are "Unknown". Since we need proofs for training and evaluation, we use the problems from the OWA dataset that

B.3. Halter

Figure 13 shows how training data is constructed for training the halting model.

B.4. Search

Figures 14 and 15 show examples of data points used to train the value function. The targets for the value function are either ' correct' or ' incorrect'.
(a) Example of ⟨input, target⟩ pairs used to train the Selection LLM.
(b) Example of ⟨input, target⟩ pairs used to train the Inference LLM.
Figure 11 | Examples of Proof Writer training pairs for Selection and Inference LLMs.

C. Additional Results

C.1. Halter

Figure 16 shows qualitative results from the Halter model trained on EntailmentBankQA.

C.2. Reasoning Trace Accuracy

Figure 17 shows additional evaluation of reasoning traces on the Proof Writer dataset.

In Table 5 we saw that baseline models, with the exception of EntailmentWriter, were more likely to hallucinate facts while reasoning on the EntailmentBank dataset than SI. Interestingly, Table 6 shows that the Proof + Answer and Proof Only baseline models have learned not to make up facts while reasoning on the Proof Writer dataset. Note that both EntailmentWriter and SI (ours) are designed not to make up facts.

Figure 18 shows the Rouge 1 scores between the predicted and ground truth, ordered, intermediate inferences. We see that EntailmentWriter is very good at single-step inference on Task 1 problems, but performance quickly declines for problems requiring multiple steps of reasoning. In general, SI models and models using halting and search outperform the baseline models.

Tables 7 and 8 show the accuracy of the Inference LM when fed valid selections.
Figure 13 | Example of how data is generated for the halter. Above are examples of four training data points. The first three, 'Do you know the answer?' examples, are used to train the halting model to learn when to halt. The final datum is used to train the halter to select an answer from the available choices.
Model                        | depth-1 | depth-2 | depth-3 | depth-5
Proof + Answer               | 0%      | 0%      | 1%      | 1%
EntailmentWriter + Answer    | 0%      | 0%      | 0%      | 0%
Proof Only + Halter          | 4%      | 1%      | 1%      | 0%
Proof Only + Halter + Search | 0%      | 0%      | 0%      | 0%
SI + Halter                  | 0%      | 0%      | 0%      | 0%
SI + Halter + Search         | 0%      | 0%      | 0%      | 0%

Table 6 | Proof Writer: Proportion of problems on which models made up facts that were not in the context. We see that the Proof + Answer and Proof Only baseline models have learned not to make up facts, while EntailmentWriter and SI are designed not to make up facts.
Metric        | Inference Accuracy
Rouge1        | 0.69
Rouge2        | 0.55
RougeL        | 0.69
BLEURT > 0.28 | 64%

Table 8 | EntailmentBankQA inference accuracy. We report Rouge scores as well as accuracy using BLEURT (Sellam et al., 2020) with a threshold of 0.28 to determine accuracy (as done by Dalvi et al. (2021)).

If something needs the rabbit and it likes the bald eagle then the bald eagle is blue.
If something likes the mouse then it is round.
If something is rough and it does not see the mouse then it does not need the bald eagle.
If something sees the mouse then it needs the rabbit.
the rabbit likes the mouse.
the bald eagle likes the rabbit.
the rabbit likes the bald eagle.
the rabbit sees the bald eagle.
the rabbit is blue.
the rabbit needs the bald eagle.
the bald eagle needs the mouse.
the rabbit needs the mouse.
the mouse needs the bald eagle.
the rabbit is not green.
the bald eagle likes the mouse.
the rabbit sees the mouse.
the bald eagle is round.
the bald eagle does not see the mouse.
the rabbit is red.
the rabbit is rough.
Question: Does it imply that the statement "The bald eagle does not see the rabbit" is True?
Proof:
If something sees the mouse then it needs the rabbit. We know that the rabbit sees the mouse. Therefore, the rabbit needs the rabbit.
If something sees the mouse then it needs the rabbit. We know that the rabbit sees the mouse. Therefore, the rabbit needs the rabbit.
If something sees the mouse then it needs the rabbit. We know that the rabbit sees the mouse. Therefore, the rabbit needs the rabbit.
If something needs the rabbit and it likes the bald eagle then the bald eagle is blue. We know that the rabbit needs the rabbit and the rabbit likes the bald eagle. Therefore, the bald eagle is blue.
If something is blue then it sees the rabbit. We know that the bald eagle is blue. Therefore, the bald eagle sees the rabbit.
Answer: False
Target Proof:
If something sees the mouse then it needs the rabbit. We know that the rabbit sees the mouse. Therefore, the rabbit needs the rabbit.
If something needs the rabbit and it likes the bald eagle then the bald eagle is blue. We know that the rabbit needs the rabbit and the rabbit likes the bald eagle. Therefore, the bald eagle is blue.
If something is blue then it sees the rabbit. We know that the bald eagle is blue. Therefore, the bald eagle sees the rabbit.
Target Answer: False

Example 2 (value: -8.6e-06)

Context:
If someone needs the bald eagle and the bald eagle needs the squirrel then the bald eagle likes the cow.
If someone needs the squirrel then they are green.
If someone needs the cow and they need the dog then the cow is green.
If someone likes the squirrel and they like the dog then the dog is nice.
If someone sees the cow then they need the dog.
If someone is cold then they see the cow.
If someone needs the dog then the dog sees the cow.
If someone likes the dog and the dog is blue then the dog sees the bald eagle.
the dog is nice.
the cow is nice.
the dog likes the squirrel.
the squirrel likes the dog.
the dog likes the cow.
the squirrel sees the dog.
the bald eagle is cold.
the bald eagle sees the dog.
the squirrel sees the bald eagle.
the dog sees the bald eagle.
the squirrel needs the cow.
the squirrel is cold.
Question: Does it imply that the statement "The cow is not green" is True?
Proof:
If someone is cold then they see the cow. We know that the squirrel is cold. Therefore, the squirrel sees the cow.
If someone is cold then they see the cow. We know that the bald eagle is cold. Therefore, the bald eagle sees the cow.
If someone sees the cow then they need the dog. We know that the squirrel sees the cow. Therefore, the squirrel needs the dog.
If someone needs the cow and they need the dog then the cow is green. We know that the squirrel needs the cow and the squirrel needs the dog. Therefore, the cow is green.
Answer: False
Target Proof:
If someone is cold then they see the cow. We know that the squirrel is cold. Therefore, the squirrel sees the cow.
If someone sees the cow then they need the dog. We know that the squirrel sees the cow. Therefore, the squirrel needs the dog.
If someone needs the cow and they need the dog then the cow is green. We know that the squirrel needs the cow and the squirrel needs the dog. Therefore, the cow is green.
Target Answer: False

Example 3 (value: -9.5e-06)

Context:
If something is red and it likes the tiger then it needs the dog.
If the cat likes the tiger and the cat needs the dog then the dog is not cold.
If something eats the dog and the dog needs the bald eagle then the dog is nice.
If something eats the cat and it needs the dog then the cat is cold.
If the tiger needs the dog and the dog is green then the dog needs the tiger.
If something eats the cat then the cat needs the bald eagle.
All cold things are nice.
If the bald eagle needs the dog and the dog eats the bald eagle then the dog is not cold.
If something needs the bald eagle then it eats the cat.
the tiger needs the bald eagle.
the bald eagle is nice.
the bald eagle is round.
the cat likes the bald eagle.
the bald eagle does not eat the cat.
the cat likes the tiger.
the cat needs the tiger.
the dog likes the bald eagle.
the tiger eats the dog.
the cat eats the tiger.
the bald eagle is green.
the bald eagle likes the tiger.
the cat is red.
the dog is not round.
Question: Does it imply that the statement "The cat is nice" is True?
Proof:
If something is red and it likes the tiger then it needs the dog. We know that the cat is red and the cat likes the tiger. Therefore, the cat needs the dog.
If something needs the bald eagle then it eats the cat. We know that the tiger needs the bald eagle. Therefore, the tiger eats the cat.
If something eats the cat then the cat needs the bald eagle. We know that the tiger eats the cat. Therefore, the cat needs the bald eagle.
If something eats the cat and it needs the dog then the cat is cold. We know that the tiger eats the cat and the cat needs the dog. Therefore, the cat is cold.
All cold things are nice. We know that the cat is cold. Therefore, the cat is nice.
Answer: True
Target Proof:
If something needs the bald eagle then it eats the cat. We know that the tiger needs the bald eagle. Therefore, the tiger eats the cat.
If something eats the cat then the cat needs the bald eagle. We know that the tiger eats the cat. Therefore, the cat needs the bald eagle.
If something needs the bald eagle then it eats the cat. We know that the cat needs the bald eagle. Therefore, the cat eats the cat.
If something is red and it likes the tiger then it needs the dog. We know that the cat is red and the cat likes the tiger. Therefore, the cat needs the dog.
If something eats the cat and it needs the dog then the cat is cold. We know that the cat eats the cat and the cat needs the dog. Therefore, the cat is cold.
All cold things are nice. We know that the cat is cold. Therefore, the cat is nice.
Target Answer: True

Example 4 (value: -1.0e-05)

Context:
If something is rough and it does not see the mouse then it does not need the bald eagle.
If something needs the rabbit and it likes the bald eagle then the bald eagle is blue.
If something sees the mouse then it needs the rabbit.
If something likes the rabbit and it sees the bald eagle then the bald eagle needs the rabbit.
If something is blue then it sees the rabbit.
If something likes the mouse then it is round.
the rabbit needs the bald eagle.
the rabbit likes the bald eagle.
the bald eagle does not see the mouse.
the bald eagle needs the mouse.
the bald eagle likes the mouse.
the mouse needs the bald eagle.
the bald eagle likes the rabbit.
the rabbit is red.
the rabbit is rough.
the rabbit is not green.
the rabbit is blue.
the rabbit likes the mouse.
the rabbit sees the mouse.
the bald eagle is round.
the rabbit sees the bald eagle.
the rabbit needs the mouse.
Question: Does it imply that the statement "The bald eagle sees the rabbit" is True?
Proof:
If something sees the mouse then it needs the rabbit. We know that the rabbit sees the mouse. Therefore, the rabbit needs the rabbit.
If something sees the mouse then it needs the rabbit. We know that the rabbit sees the mouse. Therefore, the rabbit needs the rabbit.
If something needs the rabbit and it likes the bald eagle then the bald eagle is blue. We know that the rabbit needs the rabbit and the rabbit likes the bald eagle. Therefore, the bald eagle is blue.
If something is blue then it sees the rabbit. We know that the bald eagle is blue. Therefore, the bald eagle sees the rabbit.
Answer: True
Target Proof:
If something sees the mouse then it needs the rabbit. We know that the rabbit sees the mouse. Therefore, the rabbit needs the rabbit.
If something needs the rabbit and it likes the bald eagle then the bald eagle is blue. We know that the rabbit needs the rabbit and the rabbit likes the bald eagle. Therefore, the bald eagle is blue.
If something is blue then it sees the rabbit. We know that the bald eagle is blue. Therefore, the bald eagle sees the rabbit.
Target Answer: True

Example 5 (value: -1.0e-05)

Context:
If the mouse is young and the mouse likes the cat then the mouse chases the cat.
If something likes the mouse and the mouse likes the tiger then the tiger likes the mouse.
If the tiger is cold then the tiger sees the mouse.
If something sees the mouse then it is young.
If something chases the cat and it likes the cat then the cat sees the mouse.
If something is young and it chases the mouse then it likes the cat.
If something chases the tiger and the tiger likes the cat then the cat likes the mouse.
If something sees the cat and it chases the cat then it likes the cat.
the tiger is nice.
the tiger sees the cat.
the mouse chases the tiger.
the tiger is red.
the mouse is young.
the mouse is round.
the tiger likes the mouse.
rabbit. Therefore, the squirrel likes the mouse.
If someone likes the mouse and they like the rabbit then the mouse is red. We know that the squirrel likes the mouse and the mouse likes the rabbit. Therefore, the mouse is red.
Answer: False
Target Proof:
If the squirrel likes the mouse and the mouse eats the squirrel then the mouse likes the squirrel. We know that the squirrel likes the mouse and the mouse eats the squirrel. Therefore, the mouse likes the squirrel.
If someone likes the squirrel then the squirrel likes the rabbit. We know that the mouse likes the squirrel. Therefore, the squirrel likes the rabbit.
If someone likes the mouse and they like the rabbit then the mouse is red. We know that the squirrel likes the mouse and the squirrel likes the rabbit. Therefore, the mouse is red.
Target Answer: False

Example 10 (value: -1.2e-05)

Context:
If someone likes the mouse and the mouse sees the cow then they like the dog.
If someone is green and they chase the lion then the lion chases the mouse.
If the mouse does not like the lion then the mouse does not chase the lion.
If the dog is young then the dog is round.
If someone likes the lion then the lion likes the cow.
If someone is green then they like the cow.
If someone likes the cow and they are rough then the cow does not like the lion.
If someone likes the cow then they chase the cow.
the lion likes the cow.
the cow sees the lion.
the dog sees the mouse.
the mouse chases the cow.
the dog is green.
the mouse does not chase the dog.
the mouse is rough.
Question: Does it imply that the statement "The dog does not like the cow" is True?
Proof:
If someone is green then they like the cow. We know that the dog is green. Therefore, the dog likes the cow.
Answer: False
Target Proof:
If someone is green then they like the cow. We know that the dog is green. Therefore, the dog likes the cow.
Target Answer: False

Below are examples of the worst five reasoning traces from SI + Halter + Search according to the value function. Errors are highlighted in red.

Example 1 (value: -10.9)

Context:
If something is kind then it likes the cow.
If something likes the cow then the cow is kind.
the cow is big.
the mouse eats the bear.
the tiger is kind.
the bear visits the tiger.
Question: Does it imply that the statement "The cow likes the cow" is True?
Proof:
If something is kind then it likes the cow. We know that the tiger is kind. Therefore, the tiger likes the cow.
If something likes the cow then the cow is kind. We know that the tiger likes the cow. Therefore, the cow is kind.
If something is kind then it likes the cow. We know that the cow is kind. Therefore, the cow likes the cow.
Answer: True
Target Proof:
If something is kind then it likes the cow. We know that the tiger is kind. Therefore, the tiger likes the cow.
If something likes the cow then the cow is kind. We know that the tiger likes the cow. Therefore, the cow is kind.
If something is kind then it likes the cow. We know that the cow is kind. Therefore, the cow likes the cow.
Target Answer: True
Example 3 (value: -0.045)

Context:
the surface of the moon contains mountains
the surface of the earth contains mountains
Question: Which features can be found on the surface of both Earth and the Moon? plants OR oceans OR animals OR mountains.
Proof:
the surface of the moon contains mountains. We know that the surface of the earth contains mountains. Therefore, both the surface of the moon and the surface of the earth contain mountains.
Answer: mountains
Target Proof:

Context:
drinking straws are usually made of plastic
plastic is flexible
Question: Which of these is MOST flexible? Broom handle OR Wooden ruler OR Drinking straw OR Sewing needle.
Proof:
drinking straws are usually made of plastic. We know that plastic is flexible. Therefore, drinking straws are flexible.
Answer: Drinking straw
Target Proof:
drinking straws are usually made of plastic. We know that plastic is flexible. Therefore, a drinking straw is flexible.
Target Answer: Drinking straw
the gene from each parent
Target Proof:
light eyes are a recessive trait for human. We know that jessica is a human with light eyes. Therefore, jessica has a recessive trait of light eyes.
jessica has a recessive trait of light eyes. We know that receiving the recessive gene for a trait from each parent causes the offspring to express that recessive trait. Therefore, jessica received the recessive gene from each of her parent.
Target Answer: Jessica inherited one recessive form of the gene from each parent.

D.4. EntailmentBankQA Task 1 - depth 3: SI + Halter + Search

Below are the top five examples of reasoning traces on EntailmentBankQA problems that require three steps of reasoning, ranked by their value according to our learned value function, and are otherwise not cherry picked.

temperature. Therefore, the air mass formed close to the ocean would increase in temperature.
the air mass formed close to the ocean would increase in temperature. We know that if something increases in temperature, then that something will be warm. Therefore, the air mass formed close to the ocean would be warm.
Answer: moist and warm
Target Proof:
if something receives sunlight, then it will increase in temperature. We know that the air around the equator receives direct sunlight. Therefore, the air around the equator will increase in temperature.
the air around the equator will increase in temperature. We know that if something increases in temperature, then that something will be warm. Therefore, the air is warm around the equator.
the air is warm around the equator. We know that the air mass formed close to the ocean is humid in moisture. Therefore, air masses formed over the ocean near the equator are moist / warm.
Target Answer: moist and warm
that friction occurs when two objects move against each other and a rock is a kind of object and a frozen pond is a kind of object. Therefore, friction occurs when the rock is slid across the frozen pond.
friction occurs when the rock is slid across the frozen pond. We know that friction causes the speed of an object to decrease and a rock is a kind of object. Therefore, friction will cause the speed of the rock to decrease when it is slid across the frozen pond.
Answer: The force of friction between the ice and the rock
Target Proof:
sliding is a kind of movement. We know that a rock is slided across the frozen pond. Therefore, a rock is moving on the frozen pond.
a rock is moving on the frozen pond. We know that friction occurs when two objects move against each other and a rock is a kind of object and a frozen pond is a kind of object. Therefore, friction will occur between the rock and the frozen pond.
friction will occur between the rock and the frozen pond. We know that a rock is a kind of object and friction causes the speed of an object to decrease. Therefore, friction will cause the speed of the rock to decrease.
Target Answer: The force of friction between the ice and the rock

Example 3 (value: -0.027)

Context:
the parent cats have white fur
white fur is white in color
inheriting is when an inherited characteristic is passed from parent to offspring
a cat is a kind of animal
the color of / coloration of fur is an inherited characteristic
animals produce offspring
Question: Which trait would a cat most likely inherit from its parents? having a scar OR chasing a toy OR having white fur OR knowing its name.
Proof:
inheriting is when an inherited characteristic is passed from parent to offspring. We know that the color of / coloration of fur is an inherited characteristic. Therefore, offspring can inherit the color of fur from parents.
offspring can inherit the color of fur from parents. We know that the parent cats have white fur and white fur is white in color. Therefore, the offspring of the cats will inherit white fur from their parents.
Answer: having white fur
Target Proof:
inheriting is when an inherited characteristic is passed from parent to offspring. We know that the color of / coloration of fur is an inherited characteristic. Therefore, the offspring will inherit the color of the fur of its parent.
the offspring will inherit the color of the fur of its parent. We know that a cat is a kind of animal and animals produce offspring. Therefore, a cat will inherit the color of the fur of its parents.
a cat will inherit the color of the fur of its parents. We know that the parent cats have white fur and white fur is white in color. Therefore, the cat will inherit the white colored fur from its parents.
Target Answer: having white fur

Example 4 (value: -0.027)

Context:
carbon is a kind of element
a compound is made of two or more elements chemically combined
calcium carbonate is made of calcium, carbon, and oxygen
calcium is a kind of element
oxygen is a kind of element
Question: Calcium carbonate (CaCO3) is considered a compound because it is made up of two or more molecules. OR elements. OR gases. OR solids..
Proof:
carbon is a kind of element. We know that calcium is a kind of element and oxygen is a kind of element. Therefore, calcium, oxygen, and carbon are all elements.
calcium, oxygen, and carbon are all elements. We know that calcium carbonate is made of calcium, carbon, and oxygen. Therefore, calcium
gull, tree frog, grasshopper OR coyote, antelope, bison OR squirrel, deer, woodpecker.
Proof:
a squirrel is a kind of animal. We know that a deer is a kind of animal. Therefore, a deer and a squirrel are kinds of animals.
a deer and a squirrel are kinds of animals. We know that animals adapt to their environment. Therefore, a deer and a squirrel adapt to their environment.
Answer: squirrel, deer, woodpecker
Target Proof:
a squirrel is a kind of animal. We know that squirrels live in forests. Therefore, squirrels are animals live in forests.
a deer lives in a forest. We know that a deer is a kind of animal. Therefore, deer is an animal lives in a forest.
woodpeckers live in forests. We know that a woodpecker is a kind of bird. Therefore, woodpeckers are animals live in forests.
squirrels are animals live in forests. We know that deer is an animal lives in a forest and woodpeckers are animals live in forests. Therefore, squirrels and deer and woodpeckers are animals that live in forests.
squirrels and deer and woodpeckers are animals that live in forests. We know that animals adapt to their environment and a forest is a kind of environment. Therefore, squirrels and deer and woodpeckers are adapted to living in forests.
Target Answer: squirrel, deer, woodpecker

Example 2 (value: -0.039)

Context:
toxic chemicals are dumped into a pond
poisonous things can cause harm to organisms
toxic chemicals are poisonous
harming an organism can cause that organism to die
a fish is a kind of organism
a pond is a kind of body of water
a fish lives in water
Question: Dumping toxic chemicals into a pond would most likely result in an increase in oxygen levels in the pond. OR plants near the pond growing more quickly. OR the toxic chemicals having no effect on the pond. OR fish in the pond being harmed or dying off..
Proof:
a fish is a kind of organism. We know that harming an organism can cause that organism to die. Therefore, harming a fish can cause that fish to die.
harming a fish can cause that fish to die. We know that poisonous things can cause harm to organisms. Therefore, poisonous things can cause fish to die.
harming a fish can cause that fish to die. We know that poisonous things can cause fish to die. Therefore, poisonous things can harm fish.
poisonous things can cause fish to die. We know that toxic chemicals are poisonous. Therefore, toxic chemicals can cause fish to die.
Answer: fish in the pond being harmed or dying off
Target Proof:
toxic chemicals are dumped into a pond. We know that toxic chemicals are poisonous. Therefore, the pond is poisonous because of toxic chemicals.
poisonous things can cause harm to organisms. We know that harming an organism can cause that organism to die. Therefore, poisonous things can cause organisms to die.
a pond is a kind of body of water. We know that a fish lives in water. Therefore, fish live in ponds.
fish live in ponds. We know that a fish is a kind of organism. Therefore, a fish is a kind of organism that lives in ponds.
the pond is poisonous because of toxic chemicals. We know that poisonous things can cause organisms to die and a fish is a kind of organism that lives in ponds. Therefore, the pond with toxic chemicals can cause fish in the pond to die.
Target Answer: fish in the pond being harmed or dying off.

Example 3 (value: -0.049)

In this example, the model makes a mistake on the first step of reasoning but is able to recover and does not use the incorrect inference in later steps.
Answer: creating recessive genes in the microbes
Target Proof:
a source of something produces that something. We know that specialized cells are a source of chemicals that destroy disease-causing microbes. Therefore, specialized cells produce chemicals that destroy disease-causing microbes.
decreasing something negative has a positive impact on a thing. We know that destroying something decreases the amount of that something. Therefore, destroying something negative has a positive impact on a thing.
destroying something negative has a positive impact on a thing. We know that disease-causing microbes have a negative impact on the body. Therefore, destroying disease-causing microbes has a positive impact on the body.
destroying disease-causing microbes has a positive impact on the body. We know that protecting a living thing has a positive impact on that living thing 's survival / health. Therefore, destroying disease-causing microbes is a kind of method for protecting the body.
specialized cells produce chemicals that destroy disease-causing microbes. We know that destroying disease-causing microbes is a kind of method for protecting the body. Therefore, specialized cells protect the body be producing chemicals that destroy disease-causing microbes.
Target Answer: producing chemicals that destroy the microbes

wood boards can be used to build houses. We know that wood is a renewable resource. Therefore, wood boards are a renewable resource that can be used to build houses.
Answer: wood boards
Target Proof:
building materials are used to build buildings. We know that a house is a kind of building. Therefore, building materials can be used to build houses.
building materials can be used to build houses. We know that wood boards can be used to build houses. Therefore, wood boards are a kind of building material.
wood is a renewable resource. We know that a renewable resource is a kind of natural resource. Therefore, wood is a kind of renewable natural resource.
wood is a kind of renewable natural resource. We know that wood boards are made of wood. Therefore, wood boards are made of a renewable natural resource.
wood boards are a kind of building material. We know that wood boards are made of a renewable natural resource. Therefore, wood boards are a kind of building material that is made of a renewable natural resource.
Target Answer: wood boards
D.6. EntailmentBankQA Task 2 - depth 1: SI + Halter + Search
increases, the fire will burn more
heat is a kind of energy
burning converts chemical energy into thermal energy
combustion means fire converts chemical energy into light energy and heat energy
thermal energy is a kind of energy
Question:
Fanning can make a wood fire burn hotter because the fanning makes the food hot enough to burn OR adds more oxygen needed for burning OR increases the amount of wood there is to burn OR provides the energy needed to keep the fire going.
Proof:
fanning increases the oxygen near the fire. We know that as the amount of oxygen exposed to a fire increases, the fire will burn more. Therefore, as the amount of oxygen exposed to a fire increases, the fire will burn more quickly.
as the amount of oxygen exposed to a fire increases, the fire will burn more. We know that fanning increases the oxygen near the fire. Therefore, fanning the fire increases the amount of burning woodchips.
as the amount of oxygen exposed to a fire increases, the fire will burn more quickly. We know that fanning the fire increases the amount of burning woodchips. Therefore, fanning the fire increases the amount of oxygen exposed to the fire.
as the amount of oxygen exposed to a fire increases, the fire will burn more quickly. We know that fanning the fire increases the amount of burning woodchips. Therefore, fanning the fire increases the amount of oxygen exposed to the fire.
fanning the fire increases the amount of oxygen exposed to the fire. We know that as the amount of oxygen exposed to a fire increases, the fire will burn more quickly. Therefore, fanning the fire will cause the fire to burn more quickly.
Answer: provides the energy needed to keep the fire going
Target Proof:
fanning increases the oxygen near the fire. We know that as the amount of oxygen exposed to a fire increases, the fire will burn more. Therefore, fanning can cause the fire to burn hotter.
Target Answer: adds more oxygen needed for burning

Example 3 (value: -0.023)

Context:
melting point is a property of a substance / material
ice is cold in temperature
ice crystals means ice
definite is the opposite of variable
solid is a kind of state of matter
melting is when solids are heated above their melting point
an ice cube is a kind of solid
matter in the solid phase has definite shape
melting point means temperature at which a solid melts / above which a solid melts
state of matter means physical state
state means form
water is in the liquid state , called liquid water , for temperatures between 273 / 32 / 0 and 373 / 212 / 100 k / f / c
to be found in means to be contained in
freezing causes a solid to form
melting is a kind of phase change
state of matter at room temperature is a kind of physical property
a physical property is a kind of property
temperature / heat energy is a property of objects / weather and includes ordered values of cold / cool / warm / hot
liquid has a a lower melting point than solid
ice is colder in temperature than water
temperature is a measure of heat energy
matter in the solid phase has definite volume
to be made of something means to be formed of that something
matter in the liquid phase has definite volume
the properties of something can be used to identify / used to describe that something
Question:
Which word best describes the physical state of an ice cube? gas OR solid OR liquid OR plasma.
Proof:
an ice cube is a kind of solid. We know that solid is a kind of state of matter. Therefore, an ice cube is in solid state.
classifying is when one sorts something by kind / type
visible light is a part of the electromagnetic spectrum
both means two
if an object reflects a light of a certain color then the object appears to be that color
category means kind
made up of means contains / made of
a type is synonymous with a kind
electromagnetic energy travels as electromagnetic waves
knowing the properties of something means knowing information about that something
spectrum means a range of colors
all waves transfer / transmit energy from one place to another
the properties of something are used for describing that something
radiation means radiant energy
light means light energy
electromagnetic radiation means electromagnetic energy
light rays means light
light is a kind of nonliving thing
the properties of something can be used to identify / used to describe that something
uv light is a kind of light
the color of an object can be discovered by looking at that object
amount is a property of something and includes ordered values of none / least / little / some / half / much / many / most / all
visible light is made of lights of different colors including violet, blue, green, yellow, orange, red light
if something is a part of something else then that something else contains that something
color is a property of an object / a material
Question:
The electromagnetic spectrum is the complete range of electromagnetic radiation ranging from radio waves to gamma rays. Which type of radiation in this spectrum is associated with observable colors such as red, green, and blue? gamma rays OR X-rays OR visible light OR radio waves.
Proof:
visible light is made of lights of different colors including violet, blue, green, yellow, orange, red light. We know that visible light is a part of the electromagnetic spectrum. Therefore, visible light is a part of the electromagnetic spectrum that includes violet blue green yellow orange and red light.
Answer: visible light
Target Proof:
visible light is made of lights of different colors including violet, blue, green, yellow, orange, red light. We know that visible light is a part of the electromagnetic spectrum. Therefore, visible light is the part of electromagnetic spectrum that has red, green, and blue light.
Target Answer: visible light is the part of electromagnetic spectrum that has red, green, and blue light

Example 4

Context:
to have a function is similar to to be responsible for
to enable is to help
providing support is a kind of function
a function is a kind of characteristic
a trunk is a part of a tree for supporting the tree
if an organism can do something , then that organism is able to do that something
bark is a protective covering around the trunk of / branches of a tree
purpose means role
role means function
if something has the ability to do something else , then that something can do that something else
to be used for something means to be required by that something
to be used for something means to help to do that something
a function is an activity
the function of something is what that something is used to do
a component of something means a part of that something
essential means required
ability is a property of things and includes ordered values of able / unable / can / cannot
to anchor means to provide support
role is similar to importance
a cell wall provides support / protection / shape
metal is flexible. We know that a wire is usually made of metal. Therefore, a metal wire is flexible.
Target Answer: a metal wire is flexible

E. Baseline Model Outputs

E.1. EntailmentWriter on Proof Writer

Below are example EntailmentWriter + Answer proofs on the Proof Writer dataset. We have replaced the sentence labels with the corresponding statements from the context and have not shown the context for brevity. In each case the final conclusion answers the question, but the reasoning used to reach it is incorrect: the model often makes up inferences that do not follow, but that help it solve the problem. Unlike SI, whose inference step does not have access to the question, EntailmentWriter has direct access to the question and therefore has an incentive to make up inferences that help to answer the question even if they do not follow logically.
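For readability, this label substitution can be done mechanically. The sketch below is a minimal illustration in Python, not the code used for the paper; the "sent1"-style label format and the function name are assumptions:

import re

def substitute_labels(proof: str, context: dict) -> str:
    # context maps sentence labels (e.g. "sent1") to the statements they
    # name; any label not present in the mapping is left unchanged.
    return re.sub(r"sent\d+", lambda m: context.get(m.group(0), m.group(0)), proof)

# Example usage on a proof fragment with two labelled statements.
print(substitute_labels(
    "sent2 & sent1 -> the bear needs the lion",
    {"sent1": "the dog eats the bear",
     "sent2": "if something eats the bear then the bear needs the lion"}))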
Example 1

Question: Does it imply that the statement "The dog is big" is True?
Reason: If something eats the bear then the bear needs the lion. We know that the dog eats the bear. Therefore, the bear needs the lion.
If something needs the lion then the lion eats the bear. We know that the bear needs the lion. Therefore, the lion eats the bear.
If the lion eats the bear then the bear does not eat the dog. We know that the lion eats the bear. Therefore, the bear does not eat the dog.
If something is red and it does not eat the dog then the dog sees the rabbit. We know that the rabbit needs the dog and the bear does not eat the dog. Therefore, the dog sees the rabbit.
If something sees the rabbit then it is not big. We know that the dog sees the rabbit. Therefore, the dog is not big.

Example 2

Reason: If something chases the mouse then it is big. We know that the cow chases the mouse. Therefore, the cow chases the cow.
If something chases the cow and it is big then it does not chase the dog. We know that the cow chases the mouse and the cow chases the cow. Therefore, the cow needs the dog.
If something chases the cow then it needs the dog. We know that the cow chases the mouse. Therefore, the dog is nice.
If something needs the dog and it is nice then the dog is nice. We know that the dog is nice and the cow is nice. Therefore, the dog chases the cow.
If something is nice then it chases the cow. We know that the dog chases the cow. Therefore, the dog does not chase the dog.

Example 3

Question: Does it imply that the statement "The cat sees the cat" is True?
Reason: All nice things are young. We know that The cow is nice. Therefore, the cat is nice.
If something is young and nice then it sees the cat. We know that the cat is nice and the cow is nice. Therefore, the cat is young.
If something sees the cat and the cat needs the cow then cat is nice. We know that the cat is young and the cat needs the cow. Therefore, the cat sees the cat.
Figure 14 | Examples of correct (top) and incorrect (bottom) training data samples used to train the Value LM. Targets are shown in red. The underlined statement in the last line of the incorrect reasoning trace is the one that is substituted in. The Inference LM is used to compute the inference. It is very easy to see here that the second statement is incorrect because it contains a rule rather than a fact.
Figure 15 | Examples of correct (top) and incorrect (bottom) training data samples used to train the Value LM. Targets are shown in red. The underlined statement in the last line of the correct reasoning trace is replaced with a random, incorrect statement from the context. The Inference LM is used to compute the inference.
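Taken together, these two captions describe how negative training samples for the value function are generated. The following is a minimal sketch, assuming a trace represented as a list of (selection, inference) steps and an inference_lm callable; both are assumptions for illustration, not the authors' released code:

import random

def corrupt_last_step(context, correct_trace, inference_lm):
    # Corrupt the final step of a correct trace to build an incorrect
    # training sample for the Value LM.
    selection, _ = correct_trace[-1]
    # Choose a random statement from the context that was not selected
    # (assumes at least one such statement exists).
    unused = [s for s in context if s not in selection]
    bad_statement = random.choice(unused)
    # Substitute it into the selection and let the Inference LM compute
    # the inference for the corrupted selection.
    corrupted_selection = selection[:-1] + [bad_statement]
    return corrupted_selection, inference_lm(corrupted_selection)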
(c) Intermediate inference accuracy where order of the inferences matters. (d) Exact string match between the ground truth proof and the predicted proof.
Figure 17 | Evaluating reasoning traces on Proof Writer. For exact string match we remove all non-alphabetic characters and compare characters in lower case.
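The normalisation used for exact string match amounts to the following sketch (a plausible reading of the caption, not the paper's exact code):

def exact_match(predicted_proof: str, target_proof: str) -> bool:
    # Keep only alphabetic characters and compare in lower case, so
    # punctuation, digits and whitespace differences are ignored.
    normalise = lambda s: "".join(ch for ch in s.lower() if ch.isalpha())
    return normalise(predicted_proof) == normalise(target_proof)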