2310.07289v1

Beyond Factuality: A Comprehensive Evaluation of Large Language
Models as Knowledge Generators

Liang Chen1 , Yang Deng2 , Yatao Bian3†, Zeyu Qin4 ,
Bingzhe Wu3 , Tat-Seng Chua2 , Kam-Fai Wong1†
1
The Chinese University of Hong Kong, 2 National University of Singapore, 3 Tencent AI Lab
4
The Hong Kong University of Science and Technology
[email protected]
Abstract
Large language models (LLMs) outperform
information retrieval techniques for down-
stream knowledge-intensive tasks when being
arXiv:2310.07289v1 [cs.CL] 11 Oct 2023
prompted to generate world knowledge. Yet,

community concerns abound regarding the fac-
tuality and potential implications of using this
uncensored knowledge. In light of this, we in-
troduce CONNER, a COmpreheNsive kNowledge
Evaluation fRamework, designed to system-
atically and automatically evaluate generated
knowledge from six important perspectives –
Factuality, Relevance, Coherence, Informative- Figure 1: The CONNER Framework: Intrinsic evaluations
ness, Helpfulness and Validity. We conduct an probe the internal properties of acquired knowledge,
extensive empirical analysis of the generated while extrinsic evaluations assess its downstream im-
knowledge from three different types of LLMs pacts. This framework applies universally to two-stage
on two widely-studied knowledge-intensive processes in knowledge-intensive tasks.
tasks, i.e., open-domain question answering
and knowledge-grounded dialogue. Surpris- than information retrieval (IR) models (Karpukhin
ingly, our study reveals that the factuality of et al., 2020) when it comes to generating world
generated knowledge, even if lower, does not knowledge (Yu et al., 2023; Liu et al., 2022) for the
significantly hinder downstream tasks. Instead, downstream tasks. However, the knowledge gener-
the relevance and coherence of the outputs are ated may contain inherent issues, such as false state-
more important than small factual mistakes. ments or off-topic information. Therefore, the lack
Further, we show how to use CONNER to im-
of extensive evaluation of this knowledge raises
prove knowledge-intensive tasks by designing
two strategies: Prompt Engineering and Knowl- concerns about its use in downstream tasks.
edge Selection. Our evaluation code and LLM- To this end, four lines of research emerge.
generated knowledge with human annotations Firstly, human evaluations are conducted to as-
will be released1 to facilitate future research. sess the generated knowledge from diverse per-
spectives (Li et al., 2022; Yu et al., 2023; Liu et al.,
1 Introduction 2023a). However, their time-consuming nature
and subjectivity often encounter issues of scalabil-
The exceptional success of large language models ity and reproducibility. Secondly, datasets have
(LLMs) like ChatGPT and GPT4 (Ouyang et al., been constructed to evaluate open-domain gener-
2022; OpenAI, 2023) has fueled a growing interest ation with the aid of references (Honovich et al.,
in substituting traditional models with LLMs to 2021; Glover et al., 2022a; Lee et al., 2023; Li
attain superior performance across various NLP et al., 2023). These methods, while more objec-
tasks (Liu et al., 2023b; Jagerman et al., 2023; tive, are limited by their dependence on human-
Wang et al., 2023). In open-domain question an- labelled references, impacting their real-world ap-
swering (QA) and knowledge-grounded dialogue, plicability and generalizability to dynamically gen-
LLMs have demonstrated superior performance erated content. Thirdly, self-evaluation methods
†
Corresponding author. (Kadavath et al., 2022b; Manakul et al., 2023) esti-
1
https://github.com/ChanLiang/CONNER mate a model’s uncertainty in its generated content.
Evaluation Taxonomy Definition
Factuality whether the information in the knowledge can be verified by external evidence.
Relevance whether the knowledge is relevant to the user query.
Intrinsic
Coherence whether the knowledge is coherent at the sentence and paragraph levels.
Informativeness whether the knowledge is new or unexpected against the model’s existing knowledge.
Helpfulness whether the knowledge can improve the downstream tasks.
Extrinsic
Validity whether the results of downstream tasks using the knowledge are factually accurate.
Table 1: Taxonomy of evaluation metrics of acquired knowledge.
Despite simplicity, they lack interpretability and (§ 4.4). 3) In addition to assessing and analyzing
are less effective for long-form answers. Lastly, the generated knowledge from different LLMs, the
contemporary studies (Pan et al., 2023; Min et al., evaluation outcome of CONNER can be exploited to
2023) apply fact-checking principles to spot factual enhance knowledge generation and further improve
inaccuracies. However, these evaluation methods the performance of downstream tasks (§ 5).
mainly assess a single aspect of the intrinsic quality Our main contributions are as follows:
of generated knowledge, overlooking other facets • We conduct the first empirical analysis focusing
and their extrinsic impact on downstream tasks, on both intrinsic quality and extrinsic reliability
thereby limiting a comprehensive understanding of of the generated knowledge from LLMs.
LLM-generated content.
• We propose CONNER, a COmpreheNsive kNowl-
In light of these limitations, we propose CONNER, edge Evaluation fRamework that enables the au-
a COmpreheNsive kNowledge Evaluation fRame- tomatic evaluation of LLMs as knowledge gener-
work, as illustrated in Figure 1. CONNER is designed ators from diverse perspectives, eliminating the
to be a reference-free framework that can system- need for human-labelled references.
atically and automatically evaluate the generated
• The extensive evaluation and analysis yield pro-
knowledge from six fine-grained perspectives, in-
found insights and valuable practical experience
cluding diverse intrinsic evaluation of its internal
for leveraging LLMs as knowledge generators.
properties, as well as uniform extrinsic evaluation
of its impact on specific downstream tasks. The • We collect a new set of multi-perspective hu-
taxonomy of evaluation metrics is presented in Ta- man judgments of LLM-generated knowledge
ble 1. Based on CONNER, we conduct empirical eval- for two knowledge-intensive generation datasets.
uations on three different types of LLMs, including We demonstrate that CONNER aligns well with hu-
LLaMA (Wei et al., 2022) (a base LLM), FLAN-T5 man judgments. The human annotations will be
(Wei et al., 2022) (an instruction-tuned LLM), Chat- released to facilitate future research.
GPT (Ouyang et al., 2022) (a commercial LLM
trained with human feedbacks). We evaluate them 2 Related Work
on two widely-studied knowledge-intensive tasks: Knowledge-intensive tasks rely heavily on ac-
open-domain QA (Kwiatkowski et al., 2019) and cess to external knowledge sources, such as open-
knowledge-grounded dialogue (Dinan et al., 2018). domain dialogue and QA (Dinan et al., 2018;
Our detailed investigations yield several valu- Kwiatkowski et al., 2019; Petroni et al., 2021).
able insights about the LLM-generated knowledge: The main-streamed methods (Karpukhin et al.,
1) LLM-generated knowledge surpasses retrieved 2020; Lewis et al., 2020; Izacard and Grave, 2021;
knowledge in most evaluation perspectives, while Deng et al., 2023b) typically employ IR tech-
it actually suffers from the factuality issue as ex- niques to first retrieve the relevant knowledge from
pected. Notably, the factuality of downstream tasks Wikipedia and then produce the answer or response
is found to be less affected by this issue, when conditioned on the knowledge. Nowadays, with the
compared to the impact of lower relevancy and powerful capabilities of LLMs (OpenAI, 2023; Ka-
coherency observed in the retrieved knowledge (§ davath et al., 2022a), a new trending approach is
4.3). 2) Several critical factors are identified to in- to leverage LLMs to directly generate the relevant
fluence the factuality of the generated knowledge, knowledge for a given query and then apply the
such as their frequency and length, while few-shot model-generated knowledge to complete the down-
in-context learning and larger size of models do not stream tasks (Liu et al., 2022; Li et al., 2022; Yu
necessarily guarantee higher quality and reliability et al., 2023). Despite the better performance than
retrieval-based methods, there is a lack of rigorous orous evaluation of the quality and dependability
evaluation of the quality and reliability of the gener- of knowledge used in knowledge-intensive tasks.
ated knowledge, which may contain misleading or CONNER is rooted in in-depth error analysis, paving
even plausible false information, e.g., hallucination the way for the construction of an evaluation tax-
and factual inconsistency. onomy, which integrates six unique perspectives
These issues are prevalent across various NLP into two coherent categories, as delineated in Ta-
tasks (Ji et al., 2023). However, most studies target ble 1. Capitalizing on the advantages of unsuper-
specific downstream tasks, such as text summa- vised metrics, our framework eliminates the need
rization (Maynez et al., 2020; Wang et al., 2020; for human-labeled reference knowledge and stan-
Kryscinski et al., 2020a; Pagnoni et al., 2021), dia- dardizes scores within an intuitive range of [0, 1],
logue generation (Shuster et al., 2021; Dziri et al., simplifying comparison and interpretation.
2022; Chen et al., 2023; Deng et al., 2023a), and The subsequent subsections provide a detailed
fact verification (Thorne et al., 2018; Wadden et al., examination of the framework’s design, commenc-
2020; Schuster et al., 2021; Pan et al., 2023). These ing with the formulation of knowledge-intensive
tasks are designed to examine consistency either tasks and the identification of associated error pat-
between the input and output or between the input terns. These insights direct the design of our met-
and a human-labeled reference, e.g., the source doc- rics. Through comprehensive intrinsic and extrinsic
ument and its summary, the grounded knowledge evaluations, we aim to gain a holistic understanding
and the generated response, or a human-written of the LLMs-generated knowledge.
claim and pre-annotated references.
The success of LLMs and generative search en- 3.1 Tasks Formulation
gines have brought hallucinations in LLM outputs
(Zhang et al., 2023) into focus. Research typically Formally, we define the knowledge-intensive task
falls into four categories. (Lee et al., 2023; Li et al., as follows: given a user query q, the goal is to pro-
2023) aim to assess the factuality of open-domain duce an answer with access to knowledge resources.
generation automatically using specially designed Specifically, the system first obtains the relevant
datasets, but their reliance on references may limit knowledge k that can help answer the query q from
real-world applicability. Another stream of work knowledge resources K, then generate an answer a
(Li et al., 2022; Yu et al., 2023; Liu et al., 2023a) using the acquired knowledge k. Specifically, the
uses human evaluation to measure output quality, knowledge resource K can be either a knowledge
which is difficult to scale. A third approach (Kada- base for knowledge retrieval or language models
vath et al., 2022b; Manakul et al., 2023) detects hal- for knowledge generation. Detailed formulations
lucinations by examining the model’s uncertainty of these two settings are presented in Appendix A.
or confidence, which can be inaccurate for long an-
swers. Lastly, recent studies (Peng et al., 2023; Pan 3.2 From Error Patterns to Metrics Design
et al., 2023; Min et al., 2023) apply fact-checking
principles to spot factual inaccuracies. To identify common errors by LLMs in knowledge-
Different from previous studies, we propose a intensive tasks and create a more targeted evalua-
comprehensive framework for evaluating knowl- tion framework, we used thematic analysis (Braun
edge generated by LLMs. Our goal is to automati- and Clarke, 2012). We began by extracting and
cally test the intrinsic quality and extrinsic impact consolidating patterns from subtle errors in knowl-
of generated information in knowledge-intensive edge and answers in responses from LLaMA to
tasks, without requiring knowledge labelling or hu- 160 samples from NQ (Kwiatkowski et al., 2019)
man involvement. Through extensive testing with and WoW (Dinan et al., 2018) datasets. To ensure
this framework, we aim to deepen and broaden our the breadth of the error spectrum was adequately
understanding of LLM-generated knowledge and represented, we further substantiated these patterns
provide valuable insights for future research. using additional questions from NQ and WoW. As
a result, we discerned four primary error categories
3 The Evaluation Framework in knowledge generation and two in answer genera-
tion. In response, we devised four intrinsic metrics
We introduce CONNER, a comprehensive and inno- for knowledge evaluation and two extrinsic metrics
vative framework, specifically designed for the rig- for answer evaluation, as outlined in Table 1.
3.3 Intrinsic Evaluation Relevance To assess the relevance between a
Intrinsic evaluation refers to the assessment of the given query q and the acquired knowledge k, we
acquired knowledge based on its internal properties compute the relevance score as follows:
and performance, without considering its impact Srel (k, q) = Matching(k, q) (2)
on downstream tasks or applications. In specific,
we implement four model-based metrics for evalu- The Matching(·) function denotes a fine-grained
ating the acquired knowledge in terms of factuality, matching model specifically designed for assessing
relevance, informativeness, and coherence. the relevance between the query and knowledge.
Factuality The core of factuality assessment is In our study, we employ the BERT ranking model
validating the acquired knowledge by external ev- (Nogueira et al., 2019) for this purpose.
idence 2 . Given an acquired knowledge k = This methodology addresses the limitations that
{s1 , . . . , sm } composed of m sentences, we can arise when traditional relevance metrics are applied
use a dense retrieval model (Santhanam et al., 2021) within knowledge generation scenarios. Traditional
or search engine API to recall the li most relevant relevance metrics (Karpukhin et al., 2020; Shuster
evidence Ei = {ei,1 , . . . , ei,li } for each sentence si et al., 2021; Komeili et al., 2021), which typically
from the expert knowledge base or the internet. Af- rely on word overlap or similarity with human-
ter collecting all the evidence E = {E1 , . . . , Em }, written references, face two significant challenges.
the factuality score is computed as follows: First, these traditional metrics do not correspond
Sfact (k, E) = min f (si , Ei )
well with scenarios where LLMs serve as genera-
i=1..m
(1) tive search engines, as evidenced by the unsatisfac-
= min max NLI(si , ei,j )
i=1..m j=1..li tory results in Table 10. Second, the reliance on
where f (·) is a function to compute sentence-level reference knowledge constitutes a substantial chal-
factuality, NLI(·) is a natural language inference lenge, especially when such references are scarce
model processing a premise-hypothesis pair to out- or absent in real-world applications. Contrarily,
put a R3 vector, indicating whether a hypothesis our BERT ranking model, trained on manually an-
(si ) is entailed by, neutral to or refuted by the notated Bing search data, excels at comparing the
given premise (ei,j ). Following these computa- relevance of different knowledge to a given query.
tions, sentence-level results are aggregated along Coherence As the acquired knowledge is typi-
the entailment dimension using one of three oper- cally long-form texts composed of multiple sen-
ations: min, mean, or max to match the desired tences, we propose to measure sentence-level co-
error tolerance level. In this instance, we exem- hesion and paragraph-level coherence: the for-
plify the process using min. Finally, we obtain mer measures the cohesion of individual sentences,
a three-dimensional factuality score Sfact (k, E). and the latter measures the coherence between
From each dimension of this vector, we can de- sentences. The sentence-level cohesion score
rive three fine-grained scores. We denote those Scoh_sent (k) is computed as follows:
scores as factual-consistent, non-verified,
and factual-inconsistent, respectively. 1 Xm
Scoh_sent (k) = 1/PPL(si ) (3)
This strategy seeks to address the shortcomings m i=1
of traditional factuality metrics (Wang et al., 2020;

where PPL(·) is computed by a GPT-based model
Honovich et al., 2021; Glover et al., 2022a; Lee
(Radford et al., 2019; Black et al., 2021), measuring
et al., 2023) that mainly depend on consistency
the perplexity for each sentence.
with human-annotated references. These metrics
On the other hand, the paragraph-level coherence
often fail in emerging knowledge generation sce-
score is determined by the normalized score of
narios (Table 10), as they struggle with model-
a discourse coherence model (Jwalapuram et al.,
generated content beyond reference knowledge
2021), denoted as Scoh_para (k):
scope and face difficulties when references are un-
available in real-world applications. Our method Scoh_para (k) = Scorerpara (s1 , ..., sm ) (4)
of evidence collection and results aggregation ef-
fectively tackles these issues. By considering both sentence-level cohesion and
2
We empirically demonstrate ground-truth knowledge is paragraph-level coherence, we gain insights into
dispensable for the factuality evaluation in Appendix B. the overall coherence of the acquired knowledge.
Informativeness To assess the informativeness where L(q, k, a) and L(q, ki− , a) are cross entropy
of the procured knowledge—defined as the degree losses of answer generation using k and ki− re-
to which the knowledge is novel or unexpected in spectively. Ideally, the generated knowledge k
relation to the model’s existing knowledge about can provide enough information and reduce the
the query—we calculate the informativeness score L(q, k, a) to zero and then the helpfulness score
of the acquired knowledge k given q as follows: equals one. The worst case is the generated
M
! knowledge is noPbetter than random knowledge
1 X (L(q, k, a) ≥ u1 ui=1 L(q, ki− , a)), and the help-
Sinfo (k, q) = 1 − exp ln Pθ (kt |k1:t−1 , q) (5)
M i=1
fulness score is naturally zero.
Assuming the unbiased benchmark model θ encap- Validity To measure how the reliability of the ac-
sulates world knowledge from general pretraining quired knowledge affects the factuality of the gener-
data, we thus select the GPT-2 series models. ated answer a on downstream tasks, we define the
To grasp the expected behaviour of this metric, validity metric for two types of downstream tasks:
consider a simple query: "What is the capital of span-based answers (e.g., open-domain QA) and
the United States?" The knowledge acquired here open-ended answers (e.g., knowledge-grounded di-
is "Washington". In this situation, the model’s aver- alogue). As for span-based answers, the generated
age probability of generating "Washington" is high, answers cannot form a complete sentence for fac-
as it already knows this fact. Consequently, our tuality measurement. To this end, we concatenate
informativeness score for this knowledge would be (q, a∗ ) as the premise and (q, a) as the hypothesis
low. Conversely, if the acquired knowledge was for deriving the factual-consistent score of the
"Chicago", the model’s probability of generating it NLI(·) model as the validity score:
would be low. This knowledge is surprising com-
pared to its existing knowledge, resulting in a high Sval (q, a∗ , a) = NLIfact ((q, a), (q, a∗ )) (7)
informativeness score. On the other hand, for a
tough query where the model is clueless, any pro- where a∗ denotes the ground-truth answer for
vided knowledge would score high on informative- downstream tasks and the NLI(·) model is the same
ness due to the model’s low output probabilities. as that of Eq. (1).
We demonstrate this measure outperforms tra-
3.4 Extrinsic Evaluation ditional metrics like Exact Match and F1 score as
Extrinsic evaluation, in contrast to intrinsic evalu- shown in Table 10, which rely on literal matches,
ation, focuses on uniformly assessing the perfor- and often yield low recall. For instance, an entity
mance of the acquired knowledge within the con- pair like ’PRC’ and ’China’ would receive a zero
text of different downstream tasks. Specifically, score due to their differing literal presentations.
we measure how well the acquired knowledge con- As for open-ended answers, we collect l evi-
tributes to the downstream task on two types of dence E = {e1 , . . . , el } and adjust Eq. (1) to be:
metrics (helpfulness and validity). Extrinsic evalua- Sval (a, E) = f (a, E) = max NLIfact (a, ei ) (8)
i=1..l
tion provides a more comprehensive understanding
of the practical value of the acquired knowledge. 4 Evaluation
Helpfulness Given a query and answer pair (q, In this section, we will first validate our proposed
a), we assess to what extent the acquired knowl- metrics, and then leverage them to comprehen-
edge k can help answer the query. As we assume sively evaluate three different types of LLMs across
no pre-annotated ground-truth knowledge, we use two knowledge-intensive tasks, followed by an in-
irrelevant knowledge as the baseline. Specifically, depth analysis of the results.
we randomly sampled u knowledge {k1− , · · · , ku− }
to reduce the variance of baseline estimation. Then 4.1 Metrics Efficacy Validation
the helpfulness score is computed as follows: To validate the effectiveness of our proposed met-
Shelp (q, a, k, k1− , · · · , ku− ) rics, we conducted manual evaluations and com-
L(q, k, a) pared the results with baseline metrics. Specifi-
= max(0, 1 − 1
Pu − ) cally, we developed specific annotation guidelines
i=1 L(q, ki , a)
u
(6)
log P (a|q, k) for each metric, detailed in Appendix J, and per-
= max(0, 1 − Pu − ) formed manual annotations accordingly. These
i=1 log P (a|q, ki )
Factuality Coherence
Model Setting Relevance Inform. Helpful. Validity
Fact-cons. Non-verif. Fact-incon. Coh-sent. Coh-para.
DPR Supervised 97.78% 2.23% 0.00% 0.7514 0.0301 0.7194 0.8965 0.1236 36.86%
FLAN-T5 58.40% 27.80% 13.80% 0.6848 0.1249 0.7776 0.6727 0.0000 32.47%
LL A MA Zero-shot 94.20% 4.80% 1.00% 0.7316 0.1183 0.8240 0.7572 0.2191 42.00%
C HAT GPT 83.63% 13.6% 2.77% 0.8491 0.0909 0.9033 0.7330 0.1461 43.35%
FLAN-T5 20.75% 62.40% 25.40% 0.6787 0.0416 0.8110 0.6899 0.0000 34.65%
LL A MA Few-shot 89.00% 9.20% 1.80% 0.6966 0.0776 0.8550 0.8545 0.2528 40.49%
C HAT GPT 86.07% 10.97% 2.96% 0.9205 0.0653 0.8837 0.7700 0.1966 42.36%
Table 2: Automatic evaluation results of different LLMs in the Natural Question test set. Underlined and Bold
results denote the best results among each setting and among all settings, respectively.
Factuality Coherence
Model Setting Relevance Inform. Helpful. Validity
Fact-cons. Non-verif. Fact-incon. Coh-sent. Coh-para.
DPR Supervised 91.96% 5.18% 2.87% 0.0907 0.0223 0.6569 0.9357 0.0000 61.52%
FLAN-T5 77.90% 17.28% 4.82% 0.3776 0.1203 0.8331 0.7239 0.0904 56.97%
LL A MA Zero-shot 89.46% 8.89% 1.65% 0.5041 0.0548 0.8389 0.7889 0.1178 63.50%
C HAT GPT 88.51% 10.38% 1.11% 0.5283 0.1028 0.9250 0.7448 0.1023 59.76%
FLAN-T5 76.50% 17.20% 6.30% 0.4463 0.1523 0.7988 0.6983 0.0934 57.18%
LL A MA Few-shot 85.07% 12.05% 2.88% 0.3930 0.1088 0.7947 0.7855 0.1132 63.79%
C HAT GPT 85.75% 12.01% 2.24% 0.4618 0.0979 0.8632 0.7922 0.1164 60.27%
Table 3: Automatic evaluation results of different LLMs in the Wizard of Wikipedia test set.
annotations allowed us to calculate the correlation Evaluation Setting Following (Yu et al., 2023),
between each metric and human evaluations. Sub- we evaluate the knowledge generation of LLMs
sequently, we compared these correlations with under both zero-shot and few-shot settings. Af-
baseline metrics (Table 10). Our metrics demon- ter the knowledge acquisition, we perform QA or
strated a strong correlation with human evaluations, dialogue generation under the few-shot setting to
significantly outperforming the baseline metrics. further investigate the impact of different knowl-
Details are presented in Chapter 6 and Appendix J. edge acquisition methods on downstream tasks.
1) Zero-shot Evaluation: We test with varied
4.2 Experimental Setups prompts and report peak performance. A prompt
could be “Generate Wikipedia knowledge for the
Baselines Compared with a popular retrieval-
query. {query}”. Prompts tried are in Appendix F.
based model, DPR (Karpukhin et al., 2020), we
2) Few-shot Evaluation: We construct the
evaluate knowledge generation with the three differ-
prompt with k randomly chosen samples from the
ent types of LLMs, including FLAN-T5 (Wei et al.,
training set. The example templates used for knowl-
2022), LLaMA (Touvron et al., 2023), and Chat-
edge generation are listed in Appendix F and G.
GPT (Ouyang et al., 2022). By default, we report
the results with the largest size of each LLM and
4.3 Overall Evaluation
adopt greedy decoding in our experiments for re-
producibility. Details are presented in Appendix C. Table 2 and Table 3 summarize the evaluation re-
sults of DPR and three LLM-based knowledge
Datasets We evaluate the generated knowledge generators on NQ and WoW datasets, respectively.
on two widely-studied benchmark datasets, includ- There are several notable observations as follows:
ing 1) Natural Questions (NQ) (Kwiatkowski
et al., 2019), an open-domain QA dataset; and 2) Generated knowledge exceeds retrieved knowl-
Wizard of Wikipedia (WoW) (Dinan et al., 2018) edge in most evaluation perspectives, except the
a knowledge-grounded dialogue dataset. During factuality and informativeness. In both NQ and
experiments, we randomly sample 500 examples WoW scenarios, LLMs show remarkable capabil-
from the NQ and WoW test sets respectively for ities in generating highly relevant and coherent
evaluation. Details are presented in Appendix D. knowledge. Moreover, the knowledge generated
by LLMs also proves to be more beneficial for
Implementation Details All the adopted models downstream tasks, regarding both helpfulness and
in CONNER are introduced in Appendix E. validity. These results highlight the significant ad-
Model Query Acquired Knowledge Answer
LLaMA Who wrote the song "The Glory of Love" is a song written by Billy Hill, recorded by Billy Hill ✓
the glory of love? American singer Benny Goodman in 1936. The song has been covered
by many other artists, including a version by American singer Peggy
Lee with Bing Crosby and the Johnny Mann Singers in 1959.
FLAN-T5 Who wrote the song The Glory of Love is a song written by the American composer and Irving Berlin ✗
the glory of love? lyricist Irving Berlin.
DPR Rain sleet or snow ... Ice pellets are a form of precipitation consisting of small, translucent icy rain ✗
that contains a high balls of ice. This form of precipitation is also referred to as ""sleet""
concentration of by the United States National Weather Service. (In British English
acids is called? ""sleet"" refers to a mixture of rain and snow)...
Table 4: Factuality of acquired knowledge may not influence the validity of the answer. Red words represent factual
errors in critical information, while blue words represent factual errors in non-critical information.
Model Extrinsic
Instrinsic ble 4, which intuitively shows that the presence of
Fact. Rel. Coh-sent. Coh-para. Info. factual errors in non-critical information has mini-
helpful. 0.10 0.24† 0.07 -0.03 -0.14†
mal impact on downstream tasks, while it is highly
DPR impossible to derive the correct answer from the
validity 0.04 0.19† 0.04 -0.06 -0.09
helpful. 0.14 -0.05 0.10 -0.09 -0.05 irrelevant retrieved knowledge. While LLaMA and
LLM S
validity 0.15† -0.02 0.07 -0.03 -0.03 ChatGPT generate knowledge with slightly lower
factuality than DPR, it is shown to be adequate for
Table 5: The Somers’ correlation between intrinsic and downstream tasks. At this point, the relevance of
extrinsic metrics on NQ. Scores with p-value < 0.05 the acquired knowledge is more critical. Hence,
are marked with † . Bold results denote the most corre-
relying solely on the factuality of the knowledge
lated intrinsic metric to the concerned extrinsic metric.
The breakdowns of all correlations are in Appendix H. itself is an unreliable means of assessing its impact
on the factuality of downstream tasks. Motivated
Instrinsic by this finding, we investigate approaches to guid-
Model Extrinsic
Fact. Rel. Coh-sent. Coh-para. Info. ing the generated knowledge selection with the
multi-perspective evaluation outcome of CONNER
helpful. 0.01 0.27† 0.10† -0.03 -0.14†
DPR
validity -0.01 -0.06 0.13† -0.12† -0.13† for improving the downstream performance in § 5.
helpful. 0.06 0.05 0.10 0.00 -0.16 DPR falls short of retrieving relevant and help-
LLM S
validity 0.24† 0.09 0.05 -0.02 -0.07 ful knowledge for knowledge-grounded dia-
Table 6: The Somers’ correlation between intrinsic and
logues. As the DPR model is finetuned on QA
extrinsic metrics on WoW. datasets to match a question to Wikipedia knowl-
edge, the DPR model struggles to match dialogue
vantages of utilizing LLMs as knowledge genera- utterances with the necessary knowledge. Also,
tors in terms of knowledge quality and applicability, the candidate Wikipedia passages in DPR (100 to-
rendering them a valuable knowledge resource for kens) are much longer than the knowledge needed
various knowledge-intensive applications. in WoW, containing much redundant information.
Despite obtaining lower factuality than retrieved This reveals the shortcomings of supervised dense
knowledge, generated knowledge contributes retrieval models, such as limited transferability and
more to the factuality of downstream tasks (i.e., being constrained by knowledge bases.
higher validity). To investigate the underlying Few-shot in-context learning for LLMs gener-
reason, we analyze the correlation between differ- ally harms the factuality of generated knowl-
ent intrinsic metrics and extrinsic metrics on two edge. We observe that the length of knowledge
tasks. As shown in Tables 5 and 6, the perfor- generated by few-shot ICL is generally longer than
mance of downstream tasks is indeed hindered by that of zero-shot prompting since the ground-truth
the issue of factuality in the generated knowledge knowledge for demonstrations is relatively long.
from LLMs. However, for retrieval models (e.g., Consequently, LLM is more error-prone (see the
DPR), limitations may arise from the relevance analysis of long-form generation in § 4.4). This in-
and coherence of the retrieved knowledge, while dicates that few-shot ICL is not always better than
its high factuality fails to ensure the performance zero-shot ICL in knowledge generation, and the
of downstream tasks. We present a case study in Ta- selected demonstrations attach great importance.
(a) Analysis of Long-tail Knowledge (b) Analysis of Long-form Generation (a) Model scale of FLAN-T5 (b) Model scale of LLaMA
1.0 0.6 Fact. Fact.
0.8 0.9
0.8 0.5 0.7 0.8
0.7
Rel. 0.6 Val. Rel. Val.
Avg probability
0.5 0.6
0.6 Factual-con. 0.4
0.4
0.5
Non-verified 0.4
0.3 0.3
0.4 Factual-incon. 0.3 65B 11B
33B 3B
0.2 0.2 7B 0.8B
0.0 0.1
Coh. Help. Coh. Help.
4 5 6 7 8 9 1 2 3 4 5
Pageview of Wikipedia knowledge (10^) # of sentences in generated knowledge
Info. Info.
Figure 2: The impact of knowledge frequency and
length on the factuality of the generated knowledge. Figure 3: Performance on NQ with different sizes of
FLAN-T5 and LLaMA as the knowledge generator
Inspired by this, we investigate approaches to guid- (Help. and Val. scores are linearly scaled).
ing the few-shot demonstration selection with the Figure 2(b) displays the factuality performance
evaluation outcome of CONNER for improving the based on the number of sentences in the gener-
performance of few-shot ICL in § 5. ated knowledge. The results show that LLaMA
FLAN-T5 fails to be a qualified knowledge gen- exhibits higher error rates when generating long-
erator since its generated knowledge is poorly form knowledge. Therefore, prompting the LLMs
factual and rarely helpful to downstream tasks. to generate the required knowledge in a concise
Although FLAN-T5 (11B) significantly surpasses rather than lengthy manner can benefit factuality.
many models of the same scale through instruction Impact of Model Size Figures 3 depicts the per-
tuning on numerous tasks, it falls short of being formance scaling with the model size, including
a qualified knowledge generator. As shown in Ta- LLaMA-65B/33B/7B and FLAN-T5-11B/3B/780M.
ble 4, such a low factuality leads to frequent oc- The results are reported on the NQ dataset using
currences of factual errors in critical information, zero-shot prompting. We observe that larger mod-
thereby harming downstream tasks. To this end, els do not necessarily outperform smaller models in
we study the scaling of performance w.r.t different
terms of intrinsic evaluation (particularly when pa-
perspectives by varying the model size in § 4.4.
rameter magnitudes are similar). However, larger
4.4 Further Analysis models consistently outperform smaller models in
terms of extrinsic evaluation(helpfulness and valid-
We further analyze how different factors affect the
ity). Detailed tables are presented in Appendix I.
quality and reliability of the generated knowledge
and discuss our findings below. 5 Two Use Cases of CONNER
Long-tail Knowledge We investigate the impact To explore how our framework can guide the future
of the knowledge frequency on the factuality perfor- design of utilizing LLMs as a knowledge generator,
mance of LLaMA on the WoW dataset. Each data we design two strategies to employ CONNER as a
entry in WoW comprises a topic, query, knowledge, measurement for guiding the Prompt Engineering
and answer. The topic indicates the corresponding and Knowledge Selection for knowledge-intensive
Wikipedia page linked to the knowledge. We as- tasks. We define the overall quality of knowledge
sess this knowledge’s frequency using Wikipedia k given the query q as follows:
pageviews from 2015 to 20213 . This enables us to
differentiate between common and long-tail knowl- Qknow (q, k) = γ ⊺ · Sintr γ ∈ R4
(9)
edge in WoW. Our findings reveal that LLaMA Sintr = [Sfact , Srel , Scoh_para , Sinfo ]⊺
exhibits lower reliability when it is expected to gen- where Qknow is the linear combination of four in-
erate rare/long-tail knowledge compared to com- stinct metrics Sintr and γ is the coefficient vector.
mon knowledge, as depicted in Figure 2(a).
Prompt Engineering We show how to use
Long-form Generation We investigate the im-
CONNER to improve knowledge generation by per-
pact of generation length on the factuality of the
forming prompt engineering for few-shot ICL. We
generated knowledge. Specifically, we consider
random sample a small set of m samples from the
knowledge over 40 tokens and take sentences as
training set, then use Qknow (q, k) as the scoring
evaluation units aligned with factuality evaluation.
function to select the top n samples to compose
3
https://wikimedia.org/api/rest_v1 the few-shot prompt. As shown in Table 7, the
Model Fact. Rel. Coh. Info. Metric DPR FLAN-T5 LLaMA ChatGPT
ChatGPT 85.8% 0.462 0.863 0.792 Factuality 0.65† 0.66† 0.66† 0.63†
ChatGPTselect prompt 87.7% 0.503 0.899 0.775
Relevance 0.69† 0.37† 0.55† 0.54†
Table 7: CONNER-guided demonstration selection im- Coherence 0.53† 0.58† 0.44† 0.49†
proves the intrinsic quality of generated knowledge.
Informative 0.30† 0.17 0.35 0.32†
Model Helpfulness Validity Helpfulness 0.75† 0.45† 0.81† 0.69†
ChatGPT 0.1461 43.45% Validity 0.83† 0.73† 0.85† 0.82†
ChatGPTselect knowledge 0.2090 44.28%
Table 9: Somer’s D correlation of metrics with the
Table 8: CONNER-guided knowledge selection improves human annotation on NQ (The results on WoW are
extrinsic (downstream) performance. presented in Appendix J.2). Correlation scores with
p-value < 0.05 are marked with † .
knowledge generated by CONNER-enhanced few-
shot prompting outperforms that with random Metric DPR FLAN-T5 LLaMA ChatGPT
demonstrations on 3 out of 4 perspectives, under Factuality 0.65† 0.66† 0.66† 0.63†
the setting of m = 30 and n = 8. HE -0.24 0.15 -0.03 0.29†
NLI 0.23 0.47† 0.27† 0.38†
Knowledge Selection We employ CONNER to im- NLI-Multitask 0.18† 0.51† 0.26† 0.32†
prove downstream tasks by selecting high-quality NLI-Decompose. 0.23† 0.47† 0.27† 0.38†
generated knowledge. Specifically, we generate r Relevance 0.69† 0.37† 0.55† 0.54†
different knowledge H = {k˜1 , ..., k˜r } from LLMs F1 0.45† 0.21 0.41† 0.47†
with top-p sampling, then select the generated Validity 0.83† 0.73† 0.85† 0.82†
knowledge for the downstream task, according to EM 0.59† 0.51† 0.54† 0.61†
k = argmaxk̃∈H Qknow (q, k̃). As shown in Table 8, F1 0.74† 0.67† 0.76† 0.77†
we achieve a relative improvement of 43.15% in Table 10: Comparing CONNER with reference-reliant
helpfulness on ChatGPT with p = 0.9 and r = 5. baseline metrics on the NQ dataset. Details of baseline
metrics are presented in Appendix J.3.
6 Human Evaluation
knowledge and human knowledge. (2) CONNER met-
We conducted a human evaluation by randomly se- rics consistently outperform all other reference-
lecting 400 samples from the NQ and WoW test reliant metrics, indicating the effectiveness of our
sets. Our three annotators provided ratings for the framework in the knowledge evaluation scenarios.
intrinsic and extrinsic metrics for the four mod-
els. Additionally, for FLAN-T5 and LLaMA, we 7 Conclusion
annotated the specific locations of factual errors
in the generated knowledge, aiming to facilitate In this work, we introduce CONNER, a comprehen-
future research on fine-grained fallacy detection. sive evaluation framework designed to automati-
Detailed annotation instructions and the statistics cally assess both the intrinsic quality and extrinsic
of our labelled data can be found in Appendix J.1. reliability of the knowledge generated by LLMs.
To evaluate how well CONNER matches human Notably, CONNER is reference-free but demonstrates
evaluation of knowledge and compares with several a better correlation with human judgement com-
baseline metrics, we measure the Somers’ D cor- pared with previous reference-reliant metrics.
relation (Somers, 1962) between the human rating Through extensive evaluation and in-depth analy-
0, 1, 2 of the knowledge quality and corresponding sis, we identify several key factors affecting the fac-
metric scores. Table 9 and Table 10 illustrate the re- tuality of generated knowledge. We find although
sults of four models on the NQ dataset. We observe the generated knowledge is less factual than the
that: (1) CONNER yields consistently good correla- retrieved knowledge, it remarkably enhances the
tions with human evaluation w.r.t different eval- factuality of downstream tasks over the retrieved
uation perspectives (except for informativeness), ones. Furthermore, we propose two approaches
which indicates that the quality of knowledge can to improve knowledge generation and downstream
be more effectively evaluated with CONNER. The task performance with the guidance of CONNER. We
inconsistency between informativeness and human believe our framework and findings will facilitate
judgment is attributed to the differences in model the future research of trustworthy AIGC.
Limitations data that targets fine-grained factual errors.
Despite these limitations, we believe our work
In this section, we discuss the limitations in this serves as a significant catalyst for the automated
work from three perspectives. evaluation of knowledge generated by large lan-
Firstly, the knowledge we evaluate primarily re- guage models, contributing positively to the ad-
lies on information sourced from Wikipedia. This vancement of more trustworthy AI systems.
choice is driven by two considerations: (1) Large
language models (LLMs) are trained on diverse Acknowledgements
corpora, which may include undisclosed domain-
specific or task-specific data. To ensure fairness We extend our sincerest gratitude to Professor Jing
in our evaluations and enable meaningful compar- Ma, whose insightful discussions and suggestions
isons, we focus on the common data sources that on factuality evaluation have significantly inspired
all models have learned from, with Wikipedia be- our design. We are particularly grateful to our
ing a prevalent pre-training corpus for different three anonymous reviewers, whose thorough and
LLMs. (2) Wikipedia is renowned for its high- meticulous reviews have considerably improved
quality knowledge, providing us with authoritative the quality of our work. Their constructive discus-
evidence to validate the generated knowledge. Ad- sions and insights have undoubtedly enhanced our
ditionally, leveraging such authoritative evidence revisions. This research work is partially supported
enhances the interpretability of our factual judg- by CUHK under Project No. 3230377 (Ref. No.
ments. In future work, we aim to expand our KPF23GW20).
evaluations to include a broader range of world
knowledge, thus further enhancing the scope and
References
generalizability of our findings.
Secondly, while our work primarily aims to pro- Sid Black, Leo Gao, Phil Wang, Connor Leahy, and
pose a general framework that can be applied to Stella Biderman. 2021. GPT-Neo: Large scale autore-
gressive language modeling with Mesh-Tensorflow.
any language, our evaluation framework presents
potential generalization challenges for non-English Virginia Braun and Victoria Clarke. 2012. Thematic
languages. This is due to its reliance on several analysis., pages 57–71.
common NLP components, a limitation echoed
Liang Chen, Hongru Wang, Yang Deng, Wai Chung
across many NLP methodologies. Encouragingly, Kwan, Zezhong Wang, and Kam-Fai Wong. 2023.
the development of model variants in other lan- Towards robust personalized dialogue generation via
guages, such as Chinese (Hu et al., 2020; Xie et al., order-insensitive representation regularization. In
2023; Huang et al., 2017), indicates the potential Findings of the Association for Computational Lin-
guistics: ACL 2023, pages 7337–7345, Toronto,
for broader applications. Nonetheless, the reality Canada. Association for Computational Linguistics.
remains that for very low-resource languages with-
out existing NLP models, these components may Yang Deng, Wenqiang Lei, Minlie Huang, and Tat-Seng
need to be developed from scratch. This issue rep- Chua. 2023a. Goal awareness for conversational
AI: proactivity, non-collaborativity, and beyond. In
resents a challenge that the community needs to Proceedings of the 61st Annual Meeting of the As-
address in the future. sociation for Computational Linguistics: Tutorial
A third limitation is that our assessment of Abstracts, ACL 2023, Toronto, Canada, July 9-14,
factuality is limited to sentence-level granularity. 2023, pages 1–10. Association for Computational
Linguistics.
Through analysis and manual annotation, we have
identified that large language models (LLMs) tend Yang Deng, Wenxuan Zhang, Yifei Yuan, and Wai Lam.
to exhibit errors at a more detailed level, particu- 2023b. Knowledge-enhanced mixed-initiative dia-
larly concerning numbers, time, and the generation logue system for emotional support conversations. In
Proceedings of the 61st Annual Meeting of the Associ-
of misleading or fabricated concepts (e.g., key char- ation for Computational Linguistics (Volume 1: Long
acters, identities, and locations), particularly within Papers), ACL 2023, pages 4079–4095. Association
parallel structures. To address this limitation, fu- for Computational Linguistics.
ture research will concentrate on developing more
Emily Dinan, Stephen Roller, Kurt Shuster, Angela
fine-grained methods for detecting hallucinations Fan, Michael Auli, and Jason Weston. 2018. Wizard
and assessing factual accuracy. To facilitate such of wikipedia: Knowledge-powered conversational
research, we have annotated a specific subset of agents. arXiv preprint arXiv:1811.01241.
Nouha Dziri, Ehsan Kamalloo, Sivan Milton, Os- Tran-Johnson, Scott Johnston, Sheer El Showk, Andy
mar Zaiane, Mo Yu, Edoardo M. Ponti, and Siva Jones, Nelson Elhage, Tristan Hume, Anna Chen,
Reddy. 2022. Faithdial: A faithful benchmark for Yuntao Bai, Sam Bowman, Stanislav Fort, Deep
information-seeking dialogue. Ganguli, Danny Hernandez, Josh Jacobson, Jack-
son Kernion, Shauna Kravec, Liane Lovitt, Ka-
John Glover, Federico Fancellu, Vasudevan Jagan- mal Ndousse, Catherine Olsson, Sam Ringer, Dario
nathan, Matthew R. Gormley, and Thomas Schaaf. Amodei, Tom Brown, Jack Clark, Nicholas Joseph,
2022a. Revisiting text decomposition methods for Ben Mann, Sam McCandlish, Chris Olah, and Jared
nli-based factuality scoring of summaries. CoRR, Kaplan. 2022b. Language models (mostly) know
abs/2211.16853. what they know. CoRR, abs/2207.05221.
John Glover, Federico Fancellu, Vasudevan Jagan- Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick
nathan, Matthew R. Gormley, and Thomas Schaaf. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and
2022b. Revisiting text decomposition methods for Wen-tau Yih. 2020. Dense passage retrieval for open-
nli-based factuality scoring of summaries. domain question answering. In Proceedings of the
2020 Conference on Empirical Methods in Natural
Or Honovich, Leshem Choshen, Roee Aharoni, Ella Language Processing (EMNLP), pages 6769–6781.
Neeman, Idan Szpektor, and Omri Abend. 2021.
q 2 : Evaluating factual consistency in knowledge- Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2021.
grounded dialogues via question generation and ques- Internet-augmented dialogue generation.
tion answering.
Wojciech Kryscinski, Bryan McCann, Caiming Xiong,
Hai Hu, Kyle Richardson, Liang Xu, Lu Li, Sandra and Richard Socher. 2020a. Evaluating the factual
Kübler, and Lawrence Moss. 2020. OCNLI: Orig- consistency of abstractive text summarization. In
inal Chinese Natural Language Inference. In Find- Proceedings of the 2020 Conference on Empirical
ings of the Association for Computational Linguistics: Methods in Natural Language Processing (EMNLP),
EMNLP 2020, pages 3512–3526, Online. Association pages 9332–9346, Online. Association for Computa-
for Computational Linguistics. tional Linguistics.
Guimin Huang, Min Tan, Sirui Huang, Ruyu Mo, and Wojciech Kryscinski, Bryan McCann, Caiming Xiong,
Ya Zhou. 2017. A discourse coherence model for and Richard Socher. 2020b. Evaluating the factual
analyzing chinese students’ essay. In 2017 Interna- consistency of abstractive text summarization. In
tional Conference on Progress in Informatics and Proceedings of the 2020 Conference on Empirical
Computing (PIC), pages 430–434. Methods in Natural Language Processing, EMNLP
Gautier Izacard and Edouard Grave. 2021. Leveraging 2020, pages 9332–9346.
passage retrieval with generative models for open
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Red-
domain question answering. In EACL 2021, pages
field, Michael Collins, Ankur Parikh, Chris Alberti,
874–880.
Danielle Epstein, Illia Polosukhin, Jacob Devlin, Ken-
Rolf Jagerman, Honglei Zhuang, Zhen Qin, Xuanhui ton Lee, Kristina Toutanova, Llion Jones, Matthew
Wang, and Michael Bendersky. 2023. Query expan- Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob
sion by prompting large language models. CoRR, Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natu-
abs/2305.03653. ral questions: A benchmark for question answering
research. Transactions of the Association for Compu-
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan tational Linguistics, 7:452–466.
Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea
Madotto, and Pascale Fung. 2023. Survey of halluci- Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pas-
nation in natural language generation. ACM Comput. cale Fung, Mohammad Shoeybi, and Bryan Catan-
Surv., 55(12). zaro. 2023. Factuality enhanced language models for
open-ended text generation.
Prathyusha Jwalapuram, Shafiq R. Joty, and Xiang
Lin. 2021. Rethinking self-supervision objectives Patrick S. H. Lewis, Ethan Perez, Aleksandra Pik-
for generalizable coherence modeling. CoRR, tus, Fabio Petroni, Vladimir Karpukhin, Naman
abs/2110.07198. Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih,
Tim Rocktäschel, Sebastian Riedel, and Douwe
Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Kiela. 2020. Retrieval-augmented generation for
Henighan, Dawn Drain, Ethan Perez, Nicholas knowledge-intensive NLP tasks. In Advances in Neu-
Schiefer, Zac Hatfield Dodds, Nova DasSarma, Eli ral Information Processing Systems 33: Annual Con-
Tran-Johnson, et al. 2022a. Language models ference on Neural Information Processing Systems
(mostly) know what they know. arXiv preprint 2020, NeurIPS 2020.
arXiv:2207.05221.
Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun
Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Nie, and Ji-Rong Wen. 2023. Halueval: A large-
Henighan, Dawn Drain, Ethan Perez, Nicholas scale hallucination evaluation benchmark for large
Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli language models. CoRR, abs/2305.11747.
Yanyang Li, Jianqiao Zhao, Michael R. Lyu, and Li- Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng,
wei Wang. 2022. Eliciting knowledge from large Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou
pre-trained models for unsupervised knowledge- Yu, Weizhu Chen, and Jianfeng Gao. 2023. Check
grounded conversation. In Proceedings of the 2022 your facts and try again: Improving large language
Conference on Empirical Methods in Natural Lan- models with external knowledge and automated feed-
guage Processing, EMNLP 2022, pages 10551– back.
10564.
Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick
Nelson F. Liu, Tianyi Zhang, and Percy Liang. 2023a. Lewis, Majid Yazdani, Nicola De Cao, James Thorne,
Evaluating verifiability in generative search engines. Yacine Jernite, Vladimir Karpukhin, Jean Maillard,
et al. 2021. Kilt: a benchmark for knowledge in-
Yixin Liu, Alexander R. Fabbri, Pengfei Liu, Dragomir tensive language tasks. In Proceedings of the 2021
Radev, and Arman Cohan. 2023b. On learning to Conference of the North American Chapter of the
summarize with large language models as references. Association for Computational Linguistics: Human
CoRR, abs/2305.14239. Language Technologies, pages 2523–2544.
Zihan Liu, Mostofa Patwary, Ryan Prenger, Shrimai
Prabhumoye, Wei Ping, Mohammad Shoeybi, and Alec Radford, Jeffrey Wu, Rewon Child, David Luan,
Bryan Catanzaro. 2022. Multi-stage prompting for Dario Amodei, Ilya Sutskever, et al. 2019. Language
knowledgeable dialogue generation. models are unsupervised multitask learners. OpenAI
blog, 1(8):9.
Potsawee Manakul, Adian Liusie, and Mark J. F. Gales.
2023. Selfcheckgpt: Zero-resource black-box hal- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine
lucination detection for generative large language Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
models. CoRR, abs/2303.08896. Wei Li, and Peter J. Liu. 2020. Exploring the limits
of transfer learning with a unified text-to-text trans-
Joshua Maynez, Shashi Narayan, Bernd Bohnet, and former.
Ryan T. McDonald. 2020. On faithfulness and factu-
ality in abstractive summarization. In Proceedings of Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and
the 58th Annual Meeting of the Association for Com- Percy Liang. 2016. Squad: 100,000+ questions for
putational Linguistics, ACL 2020, pages 1906–1919. machine comprehension of text.
Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Keshav Santhanam, Omar Khattab, Jon Saad-Falcon,
Lewis, Wen tau Yih, Pang Wei Koh, Mohit Iyyer, Christopher Potts, and Matei Zaharia. 2021. Col-
Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. bertv2: Effective and efficient retrieval via
Factscore: Fine-grained atomic evaluation of factual lightweight late interaction. CoRR, abs/2112.01488.
precision in long form text generation.
Tal Schuster, Adam Fisch, and Regina Barzilay. 2021.
Rodrigo Frassetto Nogueira, Wei Yang, Kyunghyun Get your vitamin C! robust fact verification with
Cho, and Jimmy Lin. 2019. Multi-stage document contrastive evidence. In Proceedings of the 2021
ranking with BERT. CoRR, abs/1910.14424. Conference of the North American Chapter of the
Association for Computational Linguistics: Human
OpenAI. 2023. GPT-4 technical report. CoRR,
Language Technologies, pages 624–643, Online. As-
abs/2303.08774.
sociation for Computational Linguistics.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Car-
roll L. Wainwright, Pamela Mishkin, Chong Zhang, Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela,
Sandhini Agarwal, Katarina Slama, Alex Ray, John and Jason Weston. 2021. Retrieval augmentation
Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, reduces hallucination in conversation. In Findings
Maddie Simens, Amanda Askell, Peter Welinder, of the Association for Computational Linguistics:
Paul Christiano, Jan Leike, and Ryan Lowe. 2022. EMNLP 2021, pages 3784–3803, Punta Cana, Do-
Training language models to follow instructions with minican Republic. Association for Computational
human feedback. Linguistics.
Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Robert H Somers. 1962. A new asymmetric measure of
Tsvetkov. 2021. Understanding factuality in abstrac- association for ordinal variables. American sociolog-
tive summarization with FRANK: A benchmark for ical review, pages 799–811.
factuality metrics. In Proceedings of the 2021 Con-
ference of the North American Chapter of the Asso- James Thorne, Andreas Vlachos, Christos
ciation for Computational Linguistics: Human Lan- Christodoulopoulos, and Arpit Mittal. 2018.
guage Technologies, pages 4812–4829, Online. As- FEVER: a large-scale dataset for fact extraction
sociation for Computational Linguistics. and VERification. In Proceedings of the 2018
Conference of the North American Chapter of
Liangming Pan, Xiaobao Wu, Xinyuan Lu, Anh Tuan the Association for Computational Linguistics:
Luu, William Yang Wang, Min-Yen Kan, and Preslav Human Language Technologies, Volume 1 (Long
Nakov. 2023. Fact-checking complex claims with Papers), pages 809–819, New Orleans, Louisiana.
program-guided reasoning. Association for Computational Linguistics.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier
Martinet, Marie-Anne Lachaux, Timothée Lacroix, 70
Prob of factual-consistent (%)

Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal 60
Azhar, Aurelien Rodriguez, Armand Joulin, Edouard
Grave, and Guillaume Lample. 2023. Llama: Open 50
and efficient foundation language models. 40
David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu 30
Wang, Madeleine van Zuylen, Arman Cohan, and
Hannaneh Hajishirzi. 2020. Fact or fiction: Verifying 20
scientific claims. In Proceedings of the 2020 Con- 10
ference on Empirical Methods in Natural Language w reference knowledge
Processing (EMNLP), pages 7534–7550, Online. As- 0 wo reference knowledge
sociation for Computational Linguistics. 0 2 4 6 8 10

Num of retrieved evidence
Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020.
Asking and answering questions to evaluate the fac- Figure 4: The influence of reference knowledge in fac-
tual consistency of summaries. In Proceedings of the tuality evaluation weakens as the amount of retrieved
58th Annual Meeting of the Association for Compu- evidence increases.
tational Linguistics, pages 5008–5020, Online. Asso-
ciation for Computational Linguistics.
the autoregressive answer generation process based
Shuhe Wang, Xiaofei Sun, Xiaoya Li, Rongbin Ouyang,
Fei Wu, Tianwei Zhang, Jiwei Li, and Guoyin Wang. on the acquired knowledge.
2023. GPT-NER: named entity recognition via large Retrieval-based knowledge acquisition methods
language models. CoRR, abs/2304.10428. use a retrieval model to retrieve the most rel-
Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin evant knowledge from the knowledge resource
Guu, Adams Wei Yu, Brian Lester, Nan Du, An- K = {d1 , d2 , . . . , dK } composed of K documents:
drew M. Dai, and Quoc V. Le. 2022. Finetuned
language models are zero-shot learners. esim(q,di )
P (k = di |q, K) = PK sim(q,d ) (11)
j=1 e
j
Xiaohui Xie, Qian Dong, Bingning Wang, Feiyang Lv,

Ting Yao, Weinan Gan, Zhijing Wu, Xiangsheng Li, where sim(·) function is used to measure the sim-
Haitao Li, Yiqun Liu, and Jin Ma. 2023. T2ranking:
A large-scale chinese benchmark for passage ranking.
ilarity, e.g., cosine similarity, between the query
and the knowledge document.
Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Generation-based knowledge acquisition meth-
Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, ods prompt a large language model to directly gen-
Michael Zeng, and Meng Jiang. 2023. Generate
rather than retrieve: Large language models are erate the required knowledge:
strong context generators. YM
P (k|q, K) = PK (kt |k1:t−1 , q, prompt) (12)
t=1
Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu,
Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, where prompt denotes the zero-shot or few-shot
Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei prompt and the LLM is regarded as the knowledge
Bi, Freda Shi, and Shuming Shi. 2023. Siren’s song
in the ai ocean: A survey on hallucination in large resource K and PK stands for the distribution in-
language models. duced by the LLM.
Appendix B Analysis of Reference Knowledge

A Details of Problem Formulation We investigated the importance of reference knowl-
edge in evaluating the factuality of generated
We provide a formulation of the two-step process knowledge. Specifically, we conducted FLAN-
for knowledge-intensive tasks, as illustrated in T5 experiments on the WoW dataset using a zero-
Fig.1. Formally, the knowledge-intensive gener- shot approach. Two sets of experiments were per-
ation problem can be formulated as the following formed: one included reference knowledge in the
chain rule: retrieved evidence pool, while the other did not.
Figure 4 illustrates our findings, indicating that the
X
P (a|q, K) = P (k|q, K)P (a|q, k) (10)
k
group with reference knowledge exhibits a clear
where P (k|q, K) is the knowledge acquisition pro- advantage when the number of retrieved evidence
cess and P (a|q, k) = N
Q
t=1 P (a t |a1:t−1 , q, k) is is limited. However, as the number of retrieved
Dataset Prompts Best
Topic: {topic} \n Query: {query} \n Related wikipedia knowledge:
Topic: {topic} \n Generate a background document from Wikipedia to answer the given question. \n {query} \n
NQ
Topic: {topic} \n Generate a Wikipedia knowledge to answer the given question.\n Question: {query} \n Wikipedia knowledge:
Topic: {topic} \n Generate a Wikipedia to answer the given question.\n Question: {query} \n Wikipedia: ✓
Topic: {topic} \n Query: {utterance} \n Related Wikipedia knowledge:
Topic: {topic} \n Generate a background document from Wikipedia to reply to the utterance. \n {utterance} \n
W OW
Topic: {topic} \n Generate a Wikipedia knowledge to answer the given question.\n Utterance: {utterance} \n Wikipedia knowledge:
Topic: {topic} \n Generate a Wikipedia to answer the given question.\n Question: {utterance} \n Wikipedia: ✓
Table 11: List of human prompts we tried for zero-shot knowledge generation, evaluated on the validation set of
NQ, WoW. {} represents placeholder, and ’utterance’ denotes the last utterance of the dialogue partner. We use ✓to
denote the prompt achieving the best performance.

Topic: {topic} \n Query: {query} \n Related Wikipedia knowledge: {knowledge} ✓
Topic: {topic} \n Query: {query} \n Knowledge: {knowledge}
NQ
Topic: {topic} \n Query: {query} \n Document: {knowledge}
Topic: {topic} \n Generate a background document from Wikipedia to answer the given question. \n {query} \n {knowledge}
Topic: {topic} \n Generate a Wikipedia to answer the given question.\n Question: {query} \n Wikipedia: {knowledge}
Topic: {topic} \n Query: {utterance} \n Related Wikipedia knowledge: {knowledge} ✓
Topic: {topic} \n Query: {utterance} \n Knowledge: {knowledge}
W OW
Topic: {topic} \n Query: {utterance} \n Document: {knowledge}
Topic: {topic} \n Generate a background document from Wikipedia to reply to the utterance. \n {utterance} \n {knowledge}
Topic: {topic} \n Generate a Wikipedia to answer the given question.\n Question: {utterance} \n Wikipedia: {knowledge}
Table 12: List of example templates we tried for few-shot knowledge generation.

Topic: {topic} \n Passage: {knowledge} \n Query: {query} \n Answer: {answer} ✓
Topic: {topic} \n Read the passage and answer the question below:\n Passage: {knowledge} \n Question: {query} \n Answer: {answer}
NQ
Topic: {topic} \n Using the knowledge from the passage to answer the question below:\n Passage: {knowledge} \n Question: {query} \n Answer: {answer}
Topic: {topic} \n Passage: {knowledge} \n Speaker 1: {utterance} \n Speaker 2: {response} ✓
Topic: {topic} \n Knowledge: {knowledge} \n Speaker 1: {utterance} \n Speaker 2: {response}
W OW
Topic: {topic} \n Grounding document: {knowledge} \n Speaker 1: {utterance} \n Speaker 2: {response}
Passage: {knowledge} \n Query: {utterance} \n Answer: {response}
Topic: {topic} \n Using the knowledge from the passage, complete the dialogue below: {knowledge} \n Speaker 1: {utterance} \n Speaker 2: {response}
Table 13: List of example templates we tried for few-shot answer generation.
evidence increases, the performance of both groups foundation language model trained on publicly
converges. These results suggest that reference available datasets and shows competitive perfor-
knowledge is dispensable, particularly when a sig- mance with the best models, including GPT-3
nificant amount of evidence is available. When (175B) and PaLM-540B.
the number of retrieved evidence surpasses ten, the ChatGPT is a sibling model to InstructGPT
impact of reference knowledge becomes negligible. (Ouyang et al., 2022) that is trained to follow in-
We hope this will provide valuable insights for fu- structions in a prompt and provide a detailed re-
ture designs of factuality assessment for generated sponse. We adopt text-davinci-003 version for
knowledge. evaluation.
C Details of Baselines D Details of Datasets

DPR (Karpukhin et al., 2020) is a supervised dense Natural Questions (NQ) (Kwiatkowski et al.,
retrieval model trained on several QA datasets (in- 2019) is an open-domain QA dataset, where the
cluding NQ) to retrieve the most relevant Wikipedia questions are mined from real Google search
passages given a query. queries. The corresponding ground truth knowl-
FLAN-T5 (Wei et al., 2022) is an enhanced ver- edge and the answers to the questions are para-
sion of T5 (Raffel et al., 2020) that is instruction- graphs and short spans in the Wikipedia pages.
finetuned in 1.8k NLP datasets to acquire the gen- Wizard of Wikipedia (WoW) (Dinan et al., 2018)
eralization ability to unseen tasks. is a knowledge-grounded dialogue dataset de-
LLaMA (Touvron et al., 2023) is an open-source signed for information-seeking scenarios, where
one speaker introduces knowledge related to a topic J Details of Human Evaluation
to the other speaker by grounding his/her responses
in a specific sentence from a Wikipedia page. J.1 Human Annotation
We conducted a human evaluation with 400 sam-
E Implementation Details ples from the NQ and WoW test set. Among these,
320 samples were from the zero-shot setting in
All the metrics we designed are model-based met- the NQ dataset, involving all four models, while 80
rics, utilizing solely off-the-shelf models. We samples were from the few-shot setting in the WoW
present the models we used in Table 14. dataset, involving one model (ChatGPT). Three ex-
pert annotators, who were familiar with the tasks,
F Prompts for Knowledge Generation were employed to rate the acquired knowledge and
generated answers based on four intrinsic perspec-
F.1 Zero-shot Prompts
tives and two extrinsic perspectives. Each perspec-
In our experiments, we observed that zero-shot tive was scored on a scale of 0, 1, or 2, representing
prompting was highly unstable. Therefore, we con- unacceptable, acceptable, and excellent, respec-
duct experiments using multiple human prompts tively. The average kappa value of the annotation is
and select the most effective ones for the WoW and 0.612 on 20% cross-annotation data. The detailed
NQ datasets. The human prompts we evaluate are instructions for the human annotation can be found
listed in Table 11. in Table 17.
Note for factuality assessment, the reliable evi-
F.2 Few-shot Prompts dence for the generated knowledge k is acquired
In the few-shot setting, our prompt is constructed by the following process: For each sentence in k,
using k randomly chosen examples from the train- we use it as the query to search Google, and regard
ing set: the top1 Wikipedia webpage as a reliable knowl-
edge source to verify the factuality of this sentence.
prompt = (example1 \n ... examplek \n exampletest ) Another point worth noting is that for the evalua-
tion of validity in WoW, we reused the factuality
The example templates utilized for knowledge gen- evaluation process since the responses in WoW are
eration are provided in Table 12. Please note that open-ended.
exampletest differs from examplei as it does not
contain {knowledge} in the placeholder. J.2 Human Evaluation Results on WoW
Based on the provided annotations, we assessed the
G Prompts for Answer Generation correlation between ChatGPT’s automatic metrics
and human judgement on the WoW dataset. The
We adopt few-shot prompting on the LLaMA results are presented in Table 18.
model in answer generation and the example tem-
plates used for answer generation are provided in J.3 Baseline Metrics
Table 13.
We compared it with three reference-reliant metrics
in knowledge evaluation. Their definitions and
H Detailed Correlations between
calculation methods are as follows:
Intrinsic and Extrinsic Metrics
Hallucinated NE Ratio (HE) (Lee et al., 2023)
We listed the detailed correlations between intrinsic proposed a NE-based metric that is designed with
and extrinsic metrics for LLaMA, FLAN-T5, and an intuition that a model is hallucinating (making
ChatGPT on the NQ dataset in Table 15. factual errors) if it generates an NE that does not
appear in the reference knowledge source. The
I Table of Model Size Impact NE-based metric can be calculated as: HNE =
|H ALLUNE | / |A LLNE | where A LLNE is the set of
We list the specific numerical values of perfor- all the NEs detected in the LM generation, and
mance scaling with the model size in Table 16, H ALLUNE is a subset of NEAll that does not appear
including LLaMA-65B/33B/7B and FLAN-T5- in the reference Wikipedia knowledge. Note that
11B/3B/780M. evaluating NE E R requires the existence of refer-
Metric Model Link
NLI-RoBERTa-large https://huggingface.co/sentence-transformers/nli-roberta-large
Factuality
ColBERTv2 https://github.com/stanford-futuredata/ColBERT
Relevance BERT-ranking-large https://github.com/nyu-dl/dl4marco-bert
GPT-neo-2.7B https://huggingface.co/EleutherAI/gpt-neo-2.7B
Coherence
Coherence-Momentum https://huggingface.co/aisingapore/coherence-momentum
Informativeness GPT-neo-2.7B https://huggingface.co/EleutherAI/gpt-neo-2.7B
Helpfulness LLaMA-65B https://github.com/facebookresearch/llama/tree/main
NLI-RoBERTa-large https://huggingface.co/sentence-transformers/nli-roberta-large
Validity
ColBERTv2 https://github.com/stanford-futuredata/ColBERT
Table 14: List of all models that we use in designing our framework.
Model Extrinsic
Instrinsic probability of each example as its ER score.
Fact. Rel. Coh-sent. Coh-para. Info. F1 of knowledge (F1) (Liu et al., 2022) employs a
helpful. 0.15† -0.21† 0.20† -0.21† 0.02
unigram F1 score to evaluate the quality of gener-
FLAN-T5 ated knowledge. This metric measures the overlap
validity 0.23† -0.16† 0.14† -0.10† 0.07
helpful. 0.03 0.05 0.06 -0.09† -0.01
between the generated knowledge and the reference
LL A MA knowledge by evaluating word-level matches. By
validity 0.09† 0.07 0.05 -0.06 -0.03
helpful. 0.16† 0.03 0.08 0.02 -0.04†
assessing the degree of agreement, the F1 metric
C HAT GPT provides an estimation of the knowledge quality,
validity 0.22† 0.13† 0.02† 0.09† 0.03
specifically from a relevance perspective.
Table 15: The Somers’ correlation between intrinsic NLI-weak-supervised (Kryscinski et al., 2020b)
and extrinsic metrics in zero-shot setting on NQ. Corre- train a classification model on constructed data
lation scores with p-value < 0.05 are marked with † . to perform consistency checking on (document,
sentence) pairs. We chose the factCC version as
Model Size Fact. Rel. Coh. Info Help. Val.
our baseline.
65B 0.942 0.732 0.824 0.757 0.219 0.420
LLaMA 33B 0.656 0.633 0.734 0.608 0.203 0.402 NLI-decompose-claim (Glover et al., 2022b)
7B 0.773 0.626 0.805 0.662 0.154 0.375 found that in general, sentence-level decomposi-
11B 0.584 0.685 0.778 0.673 -0.146 0.325 tion is preferable for the hypothesis side of the
FLAN-T5 3B 0.657 0.663 0.816 0.708 -0.155 0.324 NLI input. So we also decompose the generated
780M 0.506 0.729 0.793 0.729 -0.162 0.252
knowledge into sentences and then aggregate the
sentence-level scores to produce a document-level
Table 16: Performance on NQ with varying sizes of
FLAN-T5 and LLaMA as knowledge generators. The score.
max(0, .) operation in Eq.6 has been excluded to empha- NLI-multitask fine-tunes the DeBERTa-v3-large
size the sequential relationship among different sizes model on FEVER and two NLI datasets.
of FLAN-T5. Bold and Underlined results represent Exact Match (EM) (Rajpurkar et al., 2016) use Ex-
the best and second-best performances for each model, act Match to measure the percentage of predictions
respectively. that match its ground truth answers exactly.
ence knowledge. We adopt −HNE when comput-

ing the correlation with human judgement.
Entailment Ratio (ER) (Lee et al., 2023) also in-
troduces an NLI-based approach to assess factual
knowledge by measuring its entailing relationship
with ground-truth/reference knowledge. The en-
tailment ratio is computed as follows: EntailR =
|E NTAILgen | / |A LLgen |, where A LLgen is a set of
all generated knowledge, and E NTAILgen is the set
of generated knowledge that can be entailed by the
NLI model. Specifically, we use the entailment
Dimension Value Description
2 All sentences in k are factually accurate and the information in them can be verified with reliable evidence.
Factuality 1 k contains at least one sentence with non-verified information, while others are factually accurate.
0 k contains at least one sentence with at least one factual error that is inconsistent with reliable knowledge.
2 k is highly relevant to the topic and query/utterance.
Relevance 1 k is relevant to the topic but less relevant to the query/utterance.
0 k is irrelevant to both the topic and query/utterance.
2 k is very coherent and fluent (do not consider the truncation at the end due to the maximum generation length).
Coherence 1 k has some minor incoherence or lack of fluency, e.g., phrase or sentence repetition, but it does not affect understanding.
0 k has significant coherence and fluency issues that are hard to understand.
2 k contains informative content that you don’t know before.
Informative 1 k contains limited or trivial information against your knowledge.
0 k fails to provide any meaningful information.
2 k directly provides or contains the correct answer.
Helpfulness 1 k indirectly help in generating the correct answer.
0 k does not contain any useful information for the correct answer.
2 The answer generated based on k is correct.
Validity 1 The correctness of the generated answer cannot be determined.
0 The answer generated based on k is completely incorrect.
Table 17: Annotation guideline of LLM generated knowledge.
Model Fact. Rel. Coh. Info. Help. Val.

ChatGPT 0.71 0.57 0.52 0.40 0.79 0.54
Table 18: Somer’s D correlation of metrics with the

human annotation on WoW ). p-value for all results are
< 0.05. We report the maximum for coherence.

2310.07289v1

Uploaded by

2310.07289v1

Uploaded by

Beyond Factuality: A Comprehensive Evaluation of Large Language

Models as Knowledge Generators

prompted to generate world knowledge. Yet,

Table 1: Taxonomy of evaluation metrics of acquired knowledge.

of traditional factuality metrics (Wang et al., 2020;

Prob of factual-consistent (%)

sociation for Computational Linguistics. 0 2 4 6 8 10

Xiaohui Xie, Qian Dong, Bingning Wang, Feiyang Lv,

Appendix B Analysis of Reference Knowledge

Dataset Prompts Best

Dataset Prompts Best

C Details of Baselines D Details of Datasets

ence knowledge. We adopt −HNE when comput-

Table 17: Annotation guideline of LLM generated knowledge.

Model Fact. Rel. Coh. Info. Help. Val.

Table 18: Somer’s D correlation of metrics with the

You might also like