arXiv:2310.07289v1 [cs.CL] 11 Oct 2023
Abstract
Large language models (LLMs) outperform information retrieval techniques for downstream knowledge-intensive tasks when being
Despite their simplicity, they lack interpretability and are less effective for long-form answers. Lastly, contemporary studies (Pan et al., 2023; Min et al., 2023) apply fact-checking principles to spot factual inaccuracies. However, these evaluation methods mainly assess a single aspect of the intrinsic quality of generated knowledge, overlooking other facets and their extrinsic impact on downstream tasks, thereby limiting a comprehensive understanding of LLM-generated content.

In light of these limitations, we propose CONNER, a COmpreheNsive kNowledge Evaluation fRamework, as illustrated in Figure 1. CONNER is designed to be a reference-free framework that can systematically and automatically evaluate the generated knowledge from six fine-grained perspectives, including diverse intrinsic evaluation of its internal properties, as well as uniform extrinsic evaluation of its impact on specific downstream tasks. The taxonomy of evaluation metrics is presented in Table 1. Based on CONNER, we conduct empirical evaluations on three different types of LLMs: LLaMA (Touvron et al., 2023), a base LLM; FLAN-T5 (Wei et al., 2022), an instruction-tuned LLM; and ChatGPT (Ouyang et al., 2022), a commercial LLM trained with human feedback. We evaluate them on two widely-studied knowledge-intensive tasks: open-domain QA (Kwiatkowski et al., 2019) and knowledge-grounded dialogue (Dinan et al., 2018).

Our detailed investigations yield several valuable insights about the LLM-generated knowledge: 1) LLM-generated knowledge surpasses retrieved knowledge in most evaluation perspectives, while it indeed suffers from the factuality issue as expected. Notably, the factuality of downstream tasks is found to be less affected by this issue than by the lower relevancy and coherency observed in the retrieved knowledge (§ 4.3). 2) Several critical factors are identified to influence the factuality of the generated knowledge, such as its frequency and length, while few-shot in-context learning and larger model sizes do not necessarily guarantee higher quality and reliability (§ 4.4). 3) In addition to assessing and analyzing the generated knowledge from different LLMs, the evaluation outcome of CONNER can be exploited to enhance knowledge generation and further improve the performance of downstream tasks (§ 5).

Our main contributions are as follows:

• We conduct the first empirical analysis focusing on both the intrinsic quality and the extrinsic reliability of the generated knowledge from LLMs.

• We propose CONNER, a COmpreheNsive kNowledge Evaluation fRamework that enables the automatic evaluation of LLMs as knowledge generators from diverse perspectives, eliminating the need for human-labelled references.

• The extensive evaluation and analysis yield profound insights and valuable practical experience for leveraging LLMs as knowledge generators.

• We collect a new set of multi-perspective human judgments of LLM-generated knowledge for two knowledge-intensive generation datasets. We demonstrate that CONNER aligns well with human judgments. The human annotations will be released to facilitate future research.

2 Related Work

Knowledge-intensive tasks, such as open-domain dialogue and QA (Dinan et al., 2018; Kwiatkowski et al., 2019; Petroni et al., 2021), rely heavily on access to external knowledge sources. The mainstream methods (Karpukhin et al., 2020; Lewis et al., 2020; Izacard and Grave, 2021; Deng et al., 2023b) typically employ IR techniques to first retrieve the relevant knowledge from Wikipedia and then produce the answer or response conditioned on that knowledge. Nowadays, with the powerful capabilities of LLMs (OpenAI, 2023; Kadavath et al., 2022a), a new trending approach is to leverage LLMs to directly generate the relevant knowledge for a given query and then apply the model-generated knowledge to complete the downstream tasks (Liu et al., 2022; Li et al., 2022; Yu et al., 2023).
Despite their better performance than retrieval-based methods, there is a lack of rigorous evaluation of the quality and reliability of the generated knowledge, which may contain misleading or even plausible false information, e.g., hallucination and factual inconsistency.

These issues are prevalent across various NLP tasks (Ji et al., 2023). However, most studies target specific downstream tasks, such as text summarization (Maynez et al., 2020; Wang et al., 2020; Kryscinski et al., 2020a; Pagnoni et al., 2021), dialogue generation (Shuster et al., 2021; Dziri et al., 2022; Chen et al., 2023; Deng et al., 2023a), and fact verification (Thorne et al., 2018; Wadden et al., 2020; Schuster et al., 2021; Pan et al., 2023). These tasks are designed to examine consistency either between the input and output or between the input and a human-labeled reference, e.g., the source document and its summary, the grounded knowledge and the generated response, or a human-written claim and pre-annotated references.

The success of LLMs and generative search engines has brought hallucinations in LLM outputs (Zhang et al., 2023) into focus. Research typically falls into four categories. One line of work (Lee et al., 2023; Li et al., 2023) aims to assess the factuality of open-domain generation automatically using specially designed datasets, but its reliance on references may limit real-world applicability. Another stream of work (Li et al., 2022; Yu et al., 2023; Liu et al., 2023a) uses human evaluation to measure output quality, which is difficult to scale. A third approach (Kadavath et al., 2022b; Manakul et al., 2023) detects hallucinations by examining the model's uncertainty or confidence, which can be inaccurate for long answers. Lastly, recent studies (Peng et al., 2023; Pan et al., 2023; Min et al., 2023) apply fact-checking principles to spot factual inaccuracies.

Different from previous studies, we propose a comprehensive framework for evaluating knowledge generated by LLMs. Our goal is to automatically test the intrinsic quality and extrinsic impact of generated information in knowledge-intensive tasks, without requiring knowledge labelling or human involvement. Through extensive testing with this framework, we aim to deepen and broaden our understanding of LLM-generated knowledge and provide valuable insights for future research.

3 The Evaluation Framework

We introduce CONNER, a comprehensive and innovative framework specifically designed for the rigorous evaluation of the quality and dependability of knowledge used in knowledge-intensive tasks. CONNER is rooted in in-depth error analysis, paving the way for the construction of an evaluation taxonomy, which integrates six unique perspectives into two coherent categories, as delineated in Table 1. Capitalizing on the advantages of unsupervised metrics, our framework eliminates the need for human-labeled reference knowledge and standardizes scores within an intuitive range of [0, 1], simplifying comparison and interpretation.

The subsequent subsections provide a detailed examination of the framework's design, commencing with the formulation of knowledge-intensive tasks and the identification of associated error patterns. These insights direct the design of our metrics. Through comprehensive intrinsic and extrinsic evaluations, we aim to gain a holistic understanding of the LLM-generated knowledge.

3.1 Task Formulation

Formally, we define the knowledge-intensive task as follows: given a user query q, the goal is to produce an answer with access to knowledge resources. Specifically, the system first obtains the relevant knowledge k that can help answer the query q from the knowledge resource K, and then generates an answer a using the acquired knowledge k. The knowledge resource K can be either a knowledge base for knowledge retrieval or a language model for knowledge generation. Detailed formulations of these two settings are presented in Appendix A.

3.2 From Error Patterns to Metrics Design

To identify common errors made by LLMs in knowledge-intensive tasks and create a more targeted evaluation framework, we used thematic analysis (Braun and Clarke, 2012). We began by extracting and consolidating patterns from subtle errors in the knowledge and answers produced by LLaMA for 160 samples from the NQ (Kwiatkowski et al., 2019) and WoW (Dinan et al., 2018) datasets. To ensure the breadth of the error spectrum was adequately represented, we further substantiated these patterns using additional questions from NQ and WoW. As a result, we discerned four primary error categories in knowledge generation and two in answer generation. In response, we devised four intrinsic metrics for knowledge evaluation and two extrinsic metrics for answer evaluation, as outlined in Table 1.
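Before turning to the individual metrics, the two-stage formulation of § 3.1 can be summarized with a minimal sketch. This is only an illustration, not the paper's implementation: `acquire_knowledge` and `generate_answer` are hypothetical callables standing in for either a retriever over K or an LLM-based knowledge generator, and for the downstream answer model.

```python
from typing import Callable

def answer_knowledge_intensive_query(
    query: str,
    acquire_knowledge: Callable[[str], str],     # k from K: a retriever or an LLM generator
    generate_answer: Callable[[str, str], str],  # a conditioned on (q, k)
) -> str:
    """Two-stage pipeline: obtain relevant knowledge k for query q, then answer with it."""
    knowledge = acquire_knowledge(query)
    return generate_answer(query, knowledge)
```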
3.3 Intrinsic Evaluation

Intrinsic evaluation refers to the assessment of the acquired knowledge based on its internal properties and performance, without considering its impact on downstream tasks or applications. Specifically, we implement four model-based metrics for evaluating the acquired knowledge in terms of factuality, relevance, informativeness, and coherence.

Factuality The core of factuality assessment is validating the acquired knowledge against external evidence. Given an acquired knowledge k = {s_1, ..., s_m} composed of m sentences, we can use a dense retrieval model (Santhanam et al., 2021) or a search engine API to recall the l_i most relevant pieces of evidence E_i = {e_{i,1}, ..., e_{i,l_i}} for each sentence s_i from the expert knowledge base or the internet. After collecting all the evidence E = {E_1, ..., E_m}, the factuality score is computed as follows:

S_fact(k, E) = \min_{i=1..m} f(s_i, E_i) = \min_{i=1..m} \max_{j=1..l_i} NLI(s_i, e_{i,j})    (1)

where f(·) is a function to compute sentence-level factuality and NLI(·) is a natural language inference model that processes a premise-hypothesis pair and outputs a vector in R^3, indicating whether the hypothesis (s_i) is entailed by, neutral to, or refuted by the given premise (e_{i,j}). Following these computations, sentence-level results are aggregated along the entailment dimension using one of three operations, min, mean, or max, to match the desired error tolerance level. In this instance, we exemplify the process using min. Finally, we obtain a three-dimensional factuality score S_fact(k, E). From each dimension of this vector, we derive three fine-grained scores, which we denote as factual-consistent, non-verified, and factual-inconsistent, respectively. This strategy seeks to address the shortcomings

Relevance To assess the relevance between a given query q and the acquired knowledge k, we compute the relevance score as follows:

S_rel(k, q) = Matching(k, q)    (2)

The Matching(·) function denotes a fine-grained matching model specifically designed for assessing the relevance between the query and the knowledge. In our study, we employ the BERT ranking model (Nogueira et al., 2019) for this purpose.

This methodology addresses the limitations that arise when traditional relevance metrics are applied within knowledge generation scenarios. Traditional relevance metrics (Karpukhin et al., 2020; Shuster et al., 2021; Komeili et al., 2021), which typically rely on word overlap or similarity with human-written references, face two significant challenges. First, these traditional metrics do not correspond well with scenarios where LLMs serve as generative search engines, as evidenced by the unsatisfactory results in Table 10. Second, the reliance on reference knowledge constitutes a substantial challenge, especially when such references are scarce or absent in real-world applications. In contrast, our BERT ranking model, trained on manually annotated Bing search data, excels at comparing the relevance of different knowledge to a given query.

Coherence As the acquired knowledge is typically a long-form text composed of multiple sentences, we propose to measure sentence-level cohesion and paragraph-level coherence: the former measures the cohesion of individual sentences, and the latter measures the coherence between sentences. The sentence-level cohesion score S_coh_sent(k) is computed as follows:

S_coh_sent(k) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{PPL(s_i)}    (3)
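The aggregations in Eqs. (1) and (3) can be sketched as follows. This is a simplified illustration rather than the released implementation: `nli_probs` is assumed to already hold the NLI entailment probabilities NLI(s_i, e_{i,j}) for each sentence against its retrieved evidence (only the entailment dimension of the R^3 output is shown), and `perplexities` holds PPL(s_i) from any language model.

```python
import numpy as np

def factuality_score(nli_probs: list, agg=np.min) -> float:
    """Eq. (1), entailment dimension only: for each sentence s_i keep the
    best-supporting evidence (max over j), then aggregate over sentences with
    min / mean / max to match the desired error tolerance."""
    per_sentence = np.array([np.max(probs) for probs in nli_probs])
    return float(agg(per_sentence))

def sentence_cohesion_score(perplexities: list) -> float:
    """Eq. (3): average inverse perplexity 1/PPL(s_i) over the m sentences."""
    return float(np.mean([1.0 / ppl for ppl in perplexities]))

# Toy example: three sentences, each with two pieces of evidence.
nli_probs = [[0.92, 0.40], [0.75, 0.81], [0.10, 0.05]]
print(factuality_score(nli_probs))               # 0.10 -> dragged down by the unsupported sentence
print(sentence_cohesion_score([12.5, 9.8, 20.1]))
```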
Table 2: Automatic evaluation results of different LLMs on the Natural Questions test set. Underlined and bold results denote the best results within each setting and among all settings, respectively.

| Model | Setting | Fact-cons. | Non-verif. | Fact-incon. | Relevance | Coh-sent. | Coh-para. | Inform. | Helpful. | Validity |
|---|---|---|---|---|---|---|---|---|---|---|
| DPR | Supervised | 91.96% | 5.18% | 2.87% | 0.0907 | 0.0223 | 0.6569 | 0.9357 | 0.0000 | 61.52% |
| FLAN-T5 | Zero-shot | 77.90% | 17.28% | 4.82% | 0.3776 | 0.1203 | 0.8331 | 0.7239 | 0.0904 | 56.97% |
| LLaMA | Zero-shot | 89.46% | 8.89% | 1.65% | 0.5041 | 0.0548 | 0.8389 | 0.7889 | 0.1178 | 63.50% |
| ChatGPT | Zero-shot | 88.51% | 10.38% | 1.11% | 0.5283 | 0.1028 | 0.9250 | 0.7448 | 0.1023 | 59.76% |
| FLAN-T5 | Few-shot | 76.50% | 17.20% | 6.30% | 0.4463 | 0.1523 | 0.7988 | 0.6983 | 0.0934 | 57.18% |
| LLaMA | Few-shot | 85.07% | 12.05% | 2.88% | 0.3930 | 0.1088 | 0.7947 | 0.7855 | 0.1132 | 63.79% |
| ChatGPT | Few-shot | 85.75% | 12.01% | 2.24% | 0.4618 | 0.0979 | 0.8632 | 0.7922 | 0.1164 | 60.27% |
Table 3: Automatic evaluation results of different LLMs in the Wizard of Wikipedia test set.
annotations allowed us to calculate the correlation between each metric and human evaluations. Subsequently, we compared these correlations with baseline metrics (Table 10). Our metrics demonstrated a strong correlation with human evaluations, significantly outperforming the baseline metrics. Details are presented in Section 6 and Appendix J.

4.2 Experimental Setups

Baselines In comparison with a popular retrieval-based model, DPR (Karpukhin et al., 2020), we evaluate knowledge generation with three different types of LLMs, including FLAN-T5 (Wei et al., 2022), LLaMA (Touvron et al., 2023), and ChatGPT (Ouyang et al., 2022). By default, we report the results with the largest size of each LLM and adopt greedy decoding in our experiments for reproducibility. Details are presented in Appendix C.

Datasets We evaluate the generated knowledge on two widely-studied benchmark datasets: 1) Natural Questions (NQ) (Kwiatkowski et al., 2019), an open-domain QA dataset; and 2) Wizard of Wikipedia (WoW) (Dinan et al., 2018), a knowledge-grounded dialogue dataset. During experiments, we randomly sample 500 examples from the NQ and WoW test sets respectively for evaluation. Details are presented in Appendix D.

Implementation Details All the adopted models in CONNER are introduced in Appendix E.

Evaluation Setting Following Yu et al. (2023), we evaluate the knowledge generation of LLMs under both zero-shot and few-shot settings. After the knowledge acquisition, we perform QA or dialogue generation under the few-shot setting to further investigate the impact of different knowledge acquisition methods on downstream tasks. 1) Zero-shot Evaluation: We test with varied prompts and report peak performance. A prompt could be "Generate Wikipedia knowledge for the query. {query}". The prompts we tried are listed in Appendix F. 2) Few-shot Evaluation: We construct the prompt with k randomly chosen samples from the training set. The example templates used for knowledge generation are listed in Appendices F and G.

4.3 Overall Evaluation

Table 2 and Table 3 summarize the evaluation results of DPR and the three LLM-based knowledge generators on the NQ and WoW datasets, respectively. There are several notable observations, as follows:

Generated knowledge exceeds retrieved knowledge in most evaluation perspectives, except factuality and informativeness. In both the NQ and WoW scenarios, LLMs show remarkable capabilities in generating highly relevant and coherent knowledge. Moreover, the knowledge generated by LLMs also proves to be more beneficial for downstream tasks, regarding both helpfulness and validity.
Table 4: Factuality of acquired knowledge may not influence the validity of the answer. Red words represent factual errors in critical information, while blue words represent factual errors in non-critical information.

| Model | Query | Acquired Knowledge | Answer |
|---|---|---|---|
| LLaMA | Who wrote the song the glory of love? | "The Glory of Love" is a song written by Billy Hill, recorded by American singer Benny Goodman in 1936. The song has been covered by many other artists, including a version by American singer Peggy Lee with Bing Crosby and the Johnny Mann Singers in 1959. | Billy Hill ✓ |
| FLAN-T5 | Who wrote the song the glory of love? | The Glory of Love is a song written by the American composer and lyricist Irving Berlin. | Irving Berlin ✗ |
| DPR | Rain sleet or snow that contains a high concentration of acids is called? | ... Ice pellets are a form of precipitation consisting of small, translucent balls of ice. This form of precipitation is also referred to as "sleet" by the United States National Weather Service. (In British English "sleet" refers to a mixture of rain and snow) ... | icy rain ✗ |
These results highlight the significant advantages of utilizing LLMs as knowledge generators in terms of knowledge quality and applicability, rendering them a valuable knowledge resource for various knowledge-intensive applications.

Despite obtaining lower factuality than retrieved knowledge, generated knowledge contributes more to the factuality of downstream tasks (i.e., higher validity). To investigate the underlying reason, we analyze the correlation between different intrinsic metrics and extrinsic metrics on the two tasks. As shown in Tables 5 and 6, the performance of downstream tasks is indeed hindered by the issue of factuality in the generated knowledge from LLMs. However, for retrieval models (e.g., DPR), limitations may arise from the relevance and coherence of the retrieved knowledge, while its high factuality fails to ensure the performance of downstream tasks. We present a case study in Table 4, which intuitively shows that the presence of factual errors in non-critical information has minimal impact on downstream tasks, while it is nearly impossible to derive the correct answer from the irrelevant retrieved knowledge. While LLaMA and ChatGPT generate knowledge with slightly lower factuality than DPR, it is shown to be adequate for downstream tasks. At this point, the relevance of the acquired knowledge is more critical. Hence, relying solely on the factuality of the knowledge itself is an unreliable means of assessing its impact on the factuality of downstream tasks. Motivated by this finding, we investigate approaches to guiding the generated knowledge selection with the multi-perspective evaluation outcome of CONNER for improving the downstream performance in § 5.

DPR falls short of retrieving relevant and helpful knowledge for knowledge-grounded dialogues. As the DPR model is finetuned on QA datasets to match a question to Wikipedia knowledge, it struggles to match dialogue utterances with the necessary knowledge. Also, the candidate Wikipedia passages in DPR (100 tokens) are much longer than the knowledge needed in WoW, containing much redundant information. This reveals the shortcomings of supervised dense retrieval models, such as limited transferability and being constrained by knowledge bases.

Few-shot in-context learning for LLMs generally harms the factuality of generated knowledge. We observe that the length of knowledge generated by few-shot ICL is generally longer than that of zero-shot prompting, since the ground-truth knowledge used for demonstrations is relatively long. Consequently, the LLM is more error-prone (see the analysis of long-form generation in § 4.4). This indicates that few-shot ICL is not always better than zero-shot prompting in knowledge generation, and the selection of demonstrations is of great importance.

Table 5: The Somers' correlation between intrinsic and extrinsic metrics on NQ. Scores with p-value < 0.05 are marked with †. Bold results denote the most correlated intrinsic metric to the concerned extrinsic metric. The breakdowns of all correlations are in Appendix H.

| Model | Extrinsic | Fact. | Rel. | Coh-sent. | Coh-para. | Info. |
|---|---|---|---|---|---|---|
| DPR | helpful. | 0.10 | 0.24† | 0.07 | -0.03 | -0.14† |
| DPR | validity | 0.04 | 0.19† | 0.04 | -0.06 | -0.09 |
| LLMs | helpful. | 0.14 | -0.05 | 0.10 | -0.09 | -0.05 |
| LLMs | validity | 0.15† | -0.02 | 0.07 | -0.03 | -0.03 |

Table 6: The Somers' correlation between intrinsic and extrinsic metrics on WoW.

| Model | Extrinsic | Fact. | Rel. | Coh-sent. | Coh-para. | Info. |
|---|---|---|---|---|---|---|
| DPR | helpful. | 0.01 | 0.27† | 0.10† | -0.03 | -0.14† |
| DPR | validity | -0.01 | -0.06 | 0.13† | -0.12† | -0.13† |
| LLMs | helpful. | 0.06 | 0.05 | 0.10 | 0.00 | -0.16 |
| LLMs | validity | 0.24† | 0.09 | 0.05 | -0.02 | -0.07 |
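The correlations reported in Tables 5 and 6 (and later in Tables 9 and 10) can be reproduced with an off-the-shelf implementation; below is a small sketch, assuming SciPy >= 1.7 (which provides scipy.stats.somersd) and toy data rather than the paper's annotations.

```python
from scipy.stats import somersd

def somers_correlation(x_scores, y_scores):
    """Somers' D between two ordinal score lists (e.g., an intrinsic metric and an
    extrinsic metric, or human ratings and a metric). somersd(x, y) is asymmetric
    and conditions on its first argument; swap the arguments for the other direction."""
    result = somersd(x_scores, y_scores)
    return result.statistic, result.pvalue

# Toy example.
d, p = somers_correlation([0, 1, 2, 2, 1, 0], [0.1, 0.4, 0.9, 0.7, 0.5, 0.2])
print(f"Somers' D = {d:.2f} (p = {p:.3f})")
```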
Inspired by this, we investigate approaches to guiding the few-shot demonstration selection with the evaluation outcome of CONNER for improving the performance of few-shot ICL in § 5.

FLAN-T5 fails to be a qualified knowledge generator, since its generated knowledge is poorly factual and rarely helpful to downstream tasks. Although FLAN-T5 (11B) significantly surpasses many models of the same scale through instruction tuning on numerous tasks, it falls short of being a qualified knowledge generator. As shown in Table 4, such low factuality leads to frequent occurrences of factual errors in critical information, thereby harming downstream tasks. To this end, we study the scaling of performance w.r.t. different perspectives by varying the model size in § 4.4.

[Figure 2 omitted: (a) Analysis of Long-tail Knowledge — avg. probability of Factual-con./Non-verified/Factual-incon. vs. pageview of Wikipedia knowledge (10^x); (b) Analysis of Long-form Generation — vs. # of sentences in generated knowledge.]
Figure 2: The impact of knowledge frequency and length on the factuality of the generated knowledge.

[Figure 3 omitted: radar plots over Fact., Rel., Coh., Info., Help., and Val. for (a) model scale of FLAN-T5 (11B/3B/0.8B) and (b) model scale of LLaMA (65B/33B/7B).]
Figure 3: Performance on NQ with different sizes of FLAN-T5 and LLaMA as the knowledge generator (Help. and Val. scores are linearly scaled).

4.4 Further Analysis

We further analyze how different factors affect the quality and reliability of the generated knowledge and discuss our findings below.

Long-tail Knowledge We investigate the impact of knowledge frequency on the factuality performance of LLaMA on the WoW dataset. Each data entry in WoW comprises a topic, query, knowledge, and answer. The topic indicates the corresponding Wikipedia page linked to the knowledge. We assess this knowledge's frequency using Wikipedia pageviews from 2015 to 2021 (via https://wikimedia.org/api/rest_v1). This enables us to differentiate between common and long-tail knowledge in WoW. Our findings reveal that LLaMA exhibits lower reliability when it is expected to generate rare/long-tail knowledge compared to common knowledge, as depicted in Figure 2(a).

Long-form Generation We investigate the impact of generation length on the factuality of the generated knowledge. Specifically, we consider knowledge over 40 tokens and take sentences as evaluation units, aligned with the factuality evaluation. Figure 2(b) displays the factuality performance based on the number of sentences in the generated knowledge. The results show that LLaMA exhibits higher error rates when generating long-form knowledge. Therefore, prompting the LLMs to generate the required knowledge in a concise rather than lengthy manner can benefit factuality.

Impact of Model Size Figure 3 depicts the performance scaling with the model size, including LLaMA-65B/33B/7B and FLAN-T5-11B/3B/780M. The results are reported on the NQ dataset using zero-shot prompting. We observe that larger models do not necessarily outperform smaller models in terms of intrinsic evaluation (particularly when parameter magnitudes are similar). However, larger models consistently outperform smaller models in terms of extrinsic evaluation (helpfulness and validity). Detailed tables are presented in Appendix I.

5 Two Use Cases of CONNER

To explore how our framework can guide the future design of utilizing LLMs as knowledge generators, we design two strategies to employ CONNER as a measurement for guiding Prompt Engineering and Knowledge Selection for knowledge-intensive tasks. We define the overall quality of knowledge k given the query q as follows:

Q_know(q, k) = \gamma^\top \cdot S_intr, \quad \gamma \in \mathbb{R}^4, \quad S_intr = [S_fact, S_rel, S_coh_para, S_info]^\top    (9)

where Q_know is a linear combination of the four intrinsic metrics S_intr and \gamma is the coefficient vector.

Prompt Engineering We show how to use CONNER to improve knowledge generation by performing prompt engineering for few-shot ICL. We randomly sample a small set of m samples from the training set, then use Q_know(q, k) as the scoring function to select the top n samples to compose the few-shot prompt.
As shown in Table 7, the knowledge generated by CONNER-enhanced few-shot prompting outperforms that with random demonstrations on 3 out of 4 perspectives, under the setting of m = 30 and n = 8.

Knowledge Selection We employ CONNER to improve downstream tasks by selecting high-quality generated knowledge. Specifically, we generate r different knowledge candidates H = {k̃_1, ..., k̃_r} from LLMs with top-p sampling, then select the generated knowledge for the downstream task according to k = argmax_{k̃ ∈ H} Q_know(q, k̃). As shown in Table 8, we achieve a relative improvement of 43.15% in helpfulness on ChatGPT with p = 0.9 and r = 5.

Table 7: CONNER-guided demonstration selection improves the intrinsic quality of generated knowledge.

| Model | Fact. | Rel. | Coh. | Info. |
|---|---|---|---|---|
| ChatGPT | 85.8% | 0.462 | 0.863 | 0.792 |
| ChatGPT (select prompt) | 87.7% | 0.503 | 0.899 | 0.775 |

Table 8: CONNER-guided knowledge selection improves extrinsic (downstream) performance.

| Model | Helpfulness | Validity |
|---|---|---|
| ChatGPT | 0.1461 | 43.45% |
| ChatGPT (select knowledge) | 0.2090 | 44.28% |

Table 9: Somers' D correlation of metrics with the human annotation on NQ (the results on WoW are presented in Appendix J.2). Correlation scores with p-value < 0.05 are marked with †.

| Metric | DPR | FLAN-T5 | LLaMA | ChatGPT |
|---|---|---|---|---|
| Factuality | 0.65† | 0.66† | 0.66† | 0.63† |
| Relevance | 0.69† | 0.37† | 0.55† | 0.54† |
| Coherence | 0.53† | 0.58† | 0.44† | 0.49† |
| Informative | 0.30† | 0.17 | 0.35 | 0.32† |
| Helpfulness | 0.75† | 0.45† | 0.81† | 0.69† |
| Validity | 0.83† | 0.73† | 0.85† | 0.82† |

Table 10: Comparing CONNER with reference-reliant baseline metrics on the NQ dataset. Details of the baseline metrics are presented in Appendix J.3.

| Metric | DPR | FLAN-T5 | LLaMA | ChatGPT |
|---|---|---|---|---|
| Factuality | 0.65† | 0.66† | 0.66† | 0.63† |
| HE | -0.24 | 0.15 | -0.03 | 0.29† |
| NLI | 0.23 | 0.47† | 0.27† | 0.38† |
| NLI-Multitask | 0.18† | 0.51† | 0.26† | 0.32† |
| NLI-Decompose. | 0.23† | 0.47† | 0.27† | 0.38† |
| Relevance | 0.69† | 0.37† | 0.55† | 0.54† |
| F1 | 0.45† | 0.21 | 0.41† | 0.47† |
| Validity | 0.83† | 0.73† | 0.85† | 0.82† |
| EM | 0.59† | 0.51† | 0.54† | 0.61† |
| F1 | 0.74† | 0.67† | 0.76† | 0.77† |
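A minimal sketch of the two CONNER-guided strategies in § 5, under the assumption that an `intrinsic_scores` function returning [S_fact, S_rel, S_coh_para, S_info] and a stochastic `sample_knowledge` generator are available (both are hypothetical stand-ins, not the released code):

```python
import numpy as np

def q_know(s_intr: np.ndarray, gamma: np.ndarray) -> float:
    """Eq. (9): Q_know(q, k) = gamma^T · S_intr, with
    S_intr = [S_fact, S_rel, S_coh_para, S_info] and gamma in R^4."""
    return float(gamma @ s_intr)

def select_demonstrations(samples, intrinsic_scores, gamma, n=8):
    """Prompt engineering: score each of the m randomly drawn training samples
    with Q_know and keep the top n for the few-shot prompt (paper: m = 30, n = 8)."""
    ranked = sorted(samples, key=lambda s: q_know(intrinsic_scores(s), gamma), reverse=True)
    return ranked[:n]

def select_knowledge(query, sample_knowledge, intrinsic_scores, gamma, r=5):
    """Knowledge selection: draw r candidates with top-p sampling and keep the
    candidate maximising Q_know(q, k) (paper: p = 0.9, r = 5)."""
    candidates = [sample_knowledge(query) for _ in range(r)]
    return max(candidates, key=lambda k: q_know(intrinsic_scores(k), gamma))
```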
6 Human Evaluation

We conducted a human evaluation by randomly selecting 400 samples from the NQ and WoW test sets. Our three annotators provided ratings for the intrinsic and extrinsic metrics for the four models. Additionally, for FLAN-T5 and LLaMA, we annotated the specific locations of factual errors in the generated knowledge, aiming to facilitate future research on fine-grained fallacy detection. Detailed annotation instructions and the statistics of our labelled data can be found in Appendix J.1.

To evaluate how well CONNER matches human evaluation of knowledge and how it compares with several baseline metrics, we measure the Somers' D correlation (Somers, 1962) between the human ratings (0, 1, 2) of the knowledge quality and the corresponding metric scores. Table 9 and Table 10 illustrate the results of the four models on the NQ dataset. We observe that: (1) CONNER yields consistently good correlations with human evaluation w.r.t. different evaluation perspectives (except for informativeness), which indicates that the quality of knowledge can be more effectively evaluated with CONNER. The inconsistency between informativeness and human judgment is attributed to the differences in model knowledge and human knowledge. (2) CONNER metrics consistently outperform all other reference-reliant metrics, indicating the effectiveness of our framework in knowledge evaluation scenarios.

7 Conclusion

In this work, we introduce CONNER, a comprehensive evaluation framework designed to automatically assess both the intrinsic quality and extrinsic reliability of the knowledge generated by LLMs. Notably, CONNER is reference-free but demonstrates a better correlation with human judgement compared with previous reference-reliant metrics. Through extensive evaluation and in-depth analysis, we identify several key factors affecting the factuality of generated knowledge. We find that although the generated knowledge is less factual than the retrieved knowledge, it remarkably enhances the factuality of downstream tasks over the retrieved one. Furthermore, we propose two approaches to improve knowledge generation and downstream task performance with the guidance of CONNER. We believe our framework and findings will facilitate future research on trustworthy AIGC.
Limitations

In this section, we discuss the limitations of this work from three perspectives.

Firstly, the knowledge we evaluate primarily relies on information sourced from Wikipedia. This choice is driven by two considerations: (1) Large language models (LLMs) are trained on diverse corpora, which may include undisclosed domain-specific or task-specific data. To ensure fairness in our evaluations and enable meaningful comparisons, we focus on the common data sources that all models have learned from, with Wikipedia being a prevalent pre-training corpus for different LLMs. (2) Wikipedia is renowned for its high-quality knowledge, providing us with authoritative evidence to validate the generated knowledge. Additionally, leveraging such authoritative evidence enhances the interpretability of our factual judgments. In future work, we aim to expand our evaluations to include a broader range of world knowledge, thus further enhancing the scope and generalizability of our findings.

Secondly, while our work primarily aims to propose a general framework that can be applied to any language, our evaluation framework presents potential generalization challenges for non-English languages. This is due to its reliance on several common NLP components, a limitation echoed across many NLP methodologies. Encouragingly, the development of model variants in other languages, such as Chinese (Hu et al., 2020; Xie et al., 2023; Huang et al., 2017), indicates the potential for broader applications. Nonetheless, the reality remains that for very low-resource languages without existing NLP models, these components may need to be developed from scratch. This issue represents a challenge that the community needs to address in the future.

A third limitation is that our assessment of factuality is limited to sentence-level granularity. Through analysis and manual annotation, we have identified that large language models (LLMs) tend to exhibit errors at a more detailed level, particularly concerning numbers, time, and the generation of misleading or fabricated concepts (e.g., key characters, identities, and locations), particularly within parallel structures. To address this limitation, future research will concentrate on developing more fine-grained methods for detecting hallucinations and assessing factual accuracy. To facilitate such research, we have annotated a specific subset of data that targets fine-grained factual errors.

Despite these limitations, we believe our work serves as a significant catalyst for the automated evaluation of knowledge generated by large language models, contributing positively to the advancement of more trustworthy AI systems.

Acknowledgements

We extend our sincerest gratitude to Professor Jing Ma, whose insightful discussions and suggestions on factuality evaluation have significantly inspired our design. We are particularly grateful to our three anonymous reviewers, whose thorough and meticulous reviews have considerably improved the quality of our work. Their constructive discussions and insights have undoubtedly enhanced our revisions. This research work is partially supported by CUHK under Project No. 3230377 (Ref. No. KPF23GW20).

References

Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large scale autoregressive language modeling with Mesh-Tensorflow.

Virginia Braun and Victoria Clarke. 2012. Thematic analysis, pages 57–71.

Liang Chen, Hongru Wang, Yang Deng, Wai Chung Kwan, Zezhong Wang, and Kam-Fai Wong. 2023. Towards robust personalized dialogue generation via order-insensitive representation regularization. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7337–7345, Toronto, Canada. Association for Computational Linguistics.

Yang Deng, Wenqiang Lei, Minlie Huang, and Tat-Seng Chua. 2023a. Goal awareness for conversational AI: proactivity, non-collaborativity, and beyond. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, ACL 2023, Toronto, Canada, July 9-14, 2023, pages 1–10. Association for Computational Linguistics.

Yang Deng, Wenxuan Zhang, Yifei Yuan, and Wai Lam. 2023b. Knowledge-enhanced mixed-initiative dialogue system for emotional support conversations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, pages 4079–4095. Association for Computational Linguistics.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of Wikipedia: Knowledge-powered conversational agents. arXiv preprint arXiv:1811.01241.
Nouha Dziri, Ehsan Kamalloo, Sivan Milton, Osmar Zaiane, Mo Yu, Edoardo M. Ponti, and Siva Reddy. 2022. FaithDial: A faithful benchmark for information-seeking dialogue.

John Glover, Federico Fancellu, Vasudevan Jagannathan, Matthew R. Gormley, and Thomas Schaaf. 2022a. Revisiting text decomposition methods for NLI-based factuality scoring of summaries. CoRR, abs/2211.16853.

John Glover, Federico Fancellu, Vasudevan Jagannathan, Matthew R. Gormley, and Thomas Schaaf. 2022b. Revisiting text decomposition methods for NLI-based factuality scoring of summaries.

Or Honovich, Leshem Choshen, Roee Aharoni, Ella Neeman, Idan Szpektor, and Omri Abend. 2021. Q²: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering.

Hai Hu, Kyle Richardson, Liang Xu, Lu Li, Sandra Kübler, and Lawrence Moss. 2020. OCNLI: Original Chinese Natural Language Inference. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3512–3526, Online. Association for Computational Linguistics.

Guimin Huang, Min Tan, Sirui Huang, Ruyu Mo, and Ya Zhou. 2017. A discourse coherence model for analyzing Chinese students' essay. In 2017 International Conference on Progress in Informatics and Computing (PIC), pages 430–434.

Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. In EACL 2021, pages 874–880.

Rolf Jagerman, Honglei Zhuang, Zhen Qin, Xuanhui Wang, and Michael Bendersky. 2023. Query expansion by prompting large language models. CoRR, abs/2305.03653.

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Comput. Surv., 55(12).

Prathyusha Jwalapuram, Shafiq R. Joty, and Xiang Lin. 2021. Rethinking self-supervision objectives for generalizable coherence modeling. CoRR, abs/2110.07198.

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield Dodds, Nova DasSarma, Eli Tran-Johnson, et al. 2022a. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, Tom Brown, Jack Clark, Nicholas Joseph, Ben Mann, Sam McCandlish, Chris Olah, and Jared Kaplan. 2022b. Language models (mostly) know what they know. CoRR, abs/2207.05221.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781.

Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2021. Internet-augmented dialogue generation.

Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. 2020a. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332–9346, Online. Association for Computational Linguistics.

Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. 2020b. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, pages 9332–9346.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.

Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pascale Fung, Mohammad Shoeybi, and Bryan Catanzaro. 2023. Factuality enhanced language models for open-ended text generation.

Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020.

Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023. HaluEval: A large-scale hallucination evaluation benchmark for large language models. CoRR, abs/2305.11747.

Yanyang Li, Jianqiao Zhao, Michael R. Lyu, and Liwei Wang. 2022. Eliciting knowledge from large pre-trained models for unsupervised knowledge-grounded conversation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, pages 10551–10564.

Nelson F. Liu, Tianyi Zhang, and Percy Liang. 2023a. Evaluating verifiability in generative search engines.

Yixin Liu, Alexander R. Fabbri, Pengfei Liu, Dragomir Radev, and Arman Cohan. 2023b. On learning to summarize with large language models as references. CoRR, abs/2305.14239.

Zihan Liu, Mostofa Patwary, Ryan Prenger, Shrimai Prabhumoye, Wei Ping, Mohammad Shoeybi, and Bryan Catanzaro. 2022. Multi-stage prompting for knowledgeable dialogue generation.

Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. 2023. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. CoRR, abs/2303.08896.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan T. McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, pages 1906–1919.

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation.

Rodrigo Frassetto Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019. Multi-stage document ranking with BERT. CoRR, abs/1910.14424.

OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.

Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. 2021. Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4812–4829, Online. Association for Computational Linguistics.

Liangming Pan, Xiaobao Wu, Xinyuan Lu, Anh Tuan Luu, William Yang Wang, Min-Yen Kan, and Preslav Nakov. 2023. Fact-checking complex claims with program-guided reasoning.

Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, and Jianfeng Gao. 2023. Check your facts and try again: Improving large language models with external knowledge and automated feedback.

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, et al. 2021. KILT: A benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2523–2544.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text.

Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2021. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. CoRR, abs/2112.01488.

Tal Schuster, Adam Fisch, and Regina Barzilay. 2021. Get your vitamin C! Robust fact verification with contrastive evidence. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 624–643, Online. Association for Computational Linguistics.

Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3784–3803, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Robert H. Somers. 1962. A new asymmetric measure of association for ordinal variables. American Sociological Review, pages 799–811.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana. Association for Computational Linguistics.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Table 11: List of human prompts we tried for zero-shot knowledge generation, evaluated on the validation sets of NQ and WoW. {} represents a placeholder, and 'utterance' denotes the last utterance of the dialogue partner. We use ✓ to denote the prompt achieving the best performance.
Table 12: List of example templates we tried for few-shot knowledge generation.
Table 13: List of example templates we tried for few-shot answer generation.
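As an illustration of how such templates are assembled into a few-shot prompt, here is a hypothetical sketch; the template string and function name are placeholders, and the actual templates are the ones listed in Tables 11–13.

```python
import random

def build_few_shot_prompt(instruction, demonstrations, query, k=8,
                          template="Query: {q}\nKnowledge: {knw}\n"):
    """Assemble a few-shot knowledge-generation prompt from k randomly chosen
    (query, knowledge) training pairs, followed by the test query."""
    demos = random.sample(demonstrations, k)
    body = "".join(template.format(q=q, knw=knw) for q, knw in demos)
    return f"{instruction}\n{body}Query: {query}\nKnowledge:"
```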
evidence increases, the performance of both groups converges. These results suggest that reference knowledge is dispensable, particularly when a significant amount of evidence is available. When the number of retrieved evidence passages surpasses ten, the impact of reference knowledge becomes negligible. We hope this will provide valuable insights for future designs of factuality assessment for generated knowledge.

foundation language model trained on publicly available datasets and shows competitive performance with the best models, including GPT-3 (175B) and PaLM-540B.

ChatGPT is a sibling model to InstructGPT (Ouyang et al., 2022) that is trained to follow instructions in a prompt and provide a detailed response. We adopt the text-davinci-003 version for evaluation.
Table 14: List of all models that we use in designing our framework.
Table 15: The Somers' correlation between intrinsic and extrinsic metrics in the zero-shot setting on NQ. Correlation scores with p-value < 0.05 are marked with †.

| Model | Extrinsic | Fact. | Rel. | Coh-sent. | Coh-para. | Info. |
|---|---|---|---|---|---|---|
| FLAN-T5 | helpful. | 0.15† | -0.21† | 0.20† | -0.21† | 0.02 |
| FLAN-T5 | validity | 0.23† | -0.16† | 0.14† | -0.10† | 0.07 |
| LLaMA | helpful. | 0.03 | 0.05 | 0.06 | -0.09† | -0.01 |
| LLaMA | validity | 0.09† | 0.07 | 0.05 | -0.06 | -0.03 |
| ChatGPT | helpful. | 0.16† | 0.03 | 0.08 | 0.02 | -0.04† |
| ChatGPT | validity | 0.22† | 0.13† | 0.02† | 0.09† | 0.03 |

Table 16: Performance on NQ with varying sizes of FLAN-T5 and LLaMA as knowledge generators. The max(0, .) operation in Eq. 6 has been excluded to emphasize the sequential relationship among different sizes of FLAN-T5. Bold and underlined results represent the best and second-best performances for each model, respectively.

| Model | Size | Fact. | Rel. | Coh. | Info. | Help. | Val. |
|---|---|---|---|---|---|---|---|
| LLaMA | 65B | 0.942 | 0.732 | 0.824 | 0.757 | 0.219 | 0.420 |
| LLaMA | 33B | 0.656 | 0.633 | 0.734 | 0.608 | 0.203 | 0.402 |
| LLaMA | 7B | 0.773 | 0.626 | 0.805 | 0.662 | 0.154 | 0.375 |
| FLAN-T5 | 11B | 0.584 | 0.685 | 0.778 | 0.673 | -0.146 | 0.325 |
| FLAN-T5 | 3B | 0.657 | 0.663 | 0.816 | 0.708 | -0.155 | 0.324 |
| FLAN-T5 | 780M | 0.506 | 0.729 | 0.793 | 0.729 | -0.162 | 0.252 |

probability of each example as its ER score.

F1 of knowledge (F1) (Liu et al., 2022) employs a unigram F1 score to evaluate the quality of generated knowledge. This metric measures the overlap between the generated knowledge and the reference knowledge by evaluating word-level matches. By assessing the degree of agreement, the F1 metric provides an estimation of the knowledge quality, specifically from a relevance perspective.

NLI-weak-supervised (Kryscinski et al., 2020b) trains a classification model on constructed data to perform consistency checking on (document, sentence) pairs. We chose the FactCC version as our baseline.

NLI-decompose-claim (Glover et al., 2022b) found that, in general, sentence-level decomposition is preferable for the hypothesis side of the NLI input. So we also decompose the generated knowledge into sentences and then aggregate the sentence-level scores to produce a document-level score.

NLI-multitask fine-tunes the DeBERTa-v3-large model on FEVER and two NLI datasets.

Exact Match (EM) (Rajpurkar et al., 2016) uses Exact Match to measure the percentage of predictions that match the ground-truth answers exactly.
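For concreteness, a minimal sketch of the unigram F1 and Exact Match baselines described above (simplified: SQuAD-style normalisation of articles and punctuation is omitted).

```python
from collections import Counter

def unigram_f1(prediction: str, reference: str) -> float:
    """Word-level F1 between a generated text and a reference, as used by the
    F1-of-knowledge and answer-F1 baselines."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction: str, reference: str) -> bool:
    """Exact Match: prediction equals the ground-truth answer (after lower-casing)."""
    return prediction.strip().lower() == reference.strip().lower()
```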