
LLM-based NLG Evaluation: Current Status and Challenges

Mingqi Gao*, Xinyu Hu*, Jie Ruan, Xiao Pu, Xiaojun Wan (*Equal contribution)


Peking University
{gaomingqi, huxinyu, wanxiaojun}@pku.edu.cn, {ruanjie, puxiao}@stu.pku.edu.cn
arXiv:2402.01383v2 [cs.CL] 26 Feb 2024

Abstract

Evaluating natural language generation (NLG) is a vital but challenging problem in artificial intelligence. Traditional evaluation metrics mainly capturing content (e.g. n-gram) overlap between system outputs and references are far from satisfactory, and large language models (LLMs) such as ChatGPT have demonstrated great potential in NLG evaluation in recent years. Various automatic evaluation methods based on LLMs have been proposed, including metrics derived from LLMs, prompting LLMs, and fine-tuning LLMs with labeled evaluation data. In this survey, we first give a taxonomy of LLM-based NLG evaluation methods, and discuss their pros and cons, respectively. We also discuss human-LLM collaboration for NLG evaluation. Lastly, we discuss several open problems in this area and point out future research directions.

[Figure 1: Schematic representation of the four categories of LLM-based NLG evaluation: LLM-derived metrics (embedding-based and probability-based), prompting LLMs (task instructions, evaluation modes, input content, evaluation criteria, etc.), fine-tuning specialized evaluator LLMs, and human-LLM collaboration.]

1 Introduction

Evaluating natural language generation (NLG) is a key but challenging issue. Traditional evaluation metrics like BLEU [Papineni et al., 2002] and ROUGE [Lin, 2004] rely on the n-gram overlap between model outputs and references to measure their quality. They have been criticized for low correlation with human judgments [Sulem et al., 2018], as surface-level matching cannot reliably evaluate text. After the rise of deep learning, model-based evaluation metrics like BERTScore [Zhang et al., 2020] and BARTScore [Yuan et al., 2021] have been continuously proposed and gradually adopted to evaluate the overall quality or various specific aspects of generated outputs (e.g., fluency, coherence, coverage, faithfulness, etc.). Although better than traditional metrics, their performance is still not satisfactory, and their application scope is very limited. For example, BERTScore is reference-based and cannot be used without a reference. With the emergence of large language models (LLMs) like InstructGPT [Ouyang et al., 2022], they have achieved unprecedented effectiveness in following instructions, understanding content, and generating text. This inspired researchers to use LLMs for NLG evaluation. Although this is a research direction that only emerged in 2023, the past year has seen an enormous amount of research work. It is no exaggeration to say that NLG evaluation has been revolutionized by LLMs. This article will review the existing literature and provide suggestions for future research in this field.

This article mainly focuses on research that uses language models with over one billion parameters for NLG evaluation, with necessary references to earlier model-based evaluation metrics like BERTScore. To maintain focus, other types of generation like code generation and tasks involving images are not included in the scope of this article. As shown in Figure 1, according to how we utilize LLMs for NLG evaluation, we categorize the research work into four types:

• LLM-derived Metrics (§2): developing or deriving evaluation metrics from the embeddings or generation probabilities of LLMs.

• Prompting LLMs (§3): directly querying existing LLMs via designed prompts which involve different elements for evaluation.

• Fine-tuning LLMs (§4): using labeled evaluation data to fine-tune existing LLMs and improve their NLG evaluation capabilities.

• Human-LLM Collaborative Evaluation (§5): leveraging the distinctive strengths of both human evaluators and LLMs to achieve robust and nuanced evaluations in challenging domains through human-LLM collaboration.
We will review each type of evaluation method and discuss its pros and cons, respectively. Lastly, we will discuss future directions in this area (§6).

2 LLM-derived Metrics

2.1 Overview
Early model-based NLG evaluation methods and metrics, such as BERTScore and BARTScore, were primarily motivated by the capability of traditional pre-trained language models to generate high-quality texts. Recently, the advent of LLMs has prompted some research to adapt these ideas to stronger LLMs. Their inherent, powerful linguistic abilities are leveraged for direct NLG evaluation, which is expected to result in better performance. Such works can be categorized into two main types: embedding-based metrics and probability-based metrics.

2.2 Embedding-based Metrics
The embedding-based methods, like BERTScore, generally utilize the representations of language models and thus compute the semantic similarity between the reference and the target text to evaluate, with different possible ways of implementation. Recent work [ES et al., 2023] similarly obtains embeddings for the target text using the text-embedding-ada-002 model through the OpenAI API. The higher the corresponding similarity, the closer the target text aligns with the relevant requirements, indicating a higher quality.
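To make the embedding-based idea concrete, the sketch below scores a candidate text by the cosine similarity between its embedding and that of a reference. It is a minimal illustration of the general recipe rather than the implementation of any specific metric; the encoder choice (here a small sentence-transformers model) is an assumption, and an embedding API such as text-embedding-ada-002 could be substituted.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any sentence encoder works here; all-MiniLM-L6-v2 is just a small, common choice.
_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_similarity_score(candidate: str, reference: str) -> float:
    """Score a candidate text by the cosine similarity between its embedding
    and the reference embedding; higher similarity is read as higher quality."""
    cand_vec, ref_vec = _encoder.encode([candidate, reference])
    return float(np.dot(cand_vec, ref_vec) /
                 (np.linalg.norm(cand_vec) * np.linalg.norm(ref_vec)))

# Example:
# embedding_similarity_score("The cat sat on the mat.", "A cat is sitting on a mat.")
```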
2.3 Probability-based Metrics
To better utilize the knowledge inherent in language models, probability-based methods like BARTScore formulate text generation evaluation as conditional probability comparison, positing that the better the quality of the target text, the higher the likelihood that models should be able to generate it. Recently, GPTScore [Fu et al., 2023a] has established tailored evaluation templates for each aspect to effectively guide multiple LLMs for NLG evaluation, including GPT3 [Brown et al., 2020], OPT [Zhang et al., 2022], and FLAN [Chung et al., 2022]. The generation probability is calculated conditioned on a source input designed with customized prompts and evaluation aspects, which endows GPTScore with better flexibility in evaluation. Similar methods have also been applied to hallucination detection in LLM-generated text [Varshney et al., 2023], with three different attempts for calculating the probability score.
On the other hand, some works leverage the variation in probabilities under changed conditions as the evaluation metric. FFLM [Jia et al., 2023] proposes to evaluate the faithfulness of the target text by calculating a combination of probability changes, based on the intuition that the generation probability of a given text segment increases when more consistent information is provided, and vice versa. Similarly, DELTASCORE [Xie et al., 2023] measures the quality of different story aspects according to the likelihood difference between pre- and post-perturbation states with LLMs that provide logits, including GPT-3.5 (text-davinci-003). They believe that the sensitivity to specific perturbations indicates the quality of related aspects, and their experiments demonstrate the effectiveness of their approach.
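The sketch below illustrates the core computation behind such probability-based metrics: the average log-probability of the target text conditioned on an aspect-specific prompt, in the spirit of BARTScore/GPTScore. It is a simplified sketch, not the official implementation of either metric; gpt2 is only a small stand-in for an LLM with accessible logits, and the template is illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM exposing logits works; gpt2 is a tiny stand-in for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def conditional_log_prob(prompt: str, target: str) -> float:
    """Average log-probability of `target` tokens conditioned on `prompt`.
    A higher value means the LM finds the target more plausible given the
    prompt, which probability-based metrics read as higher quality.
    (Tokenizing prompt and target separately is a simplification.)"""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict token i+1; keep only the target positions.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    positions = range(prompt_ids.shape[1] - 1, input_ids.shape[1] - 1)
    tokens = input_ids[0, prompt_ids.shape[1]:]
    scores = [log_probs[pos, tok].item() for pos, tok in zip(positions, tokens)]
    return sum(scores) / len(scores)

# GPTScore-style usage: condition on an aspect-specific template, e.g.
# conditional_log_prob("Generate a fluent summary for the article: <article> Summary:",
#                      " <candidate summary>")
```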
2.4 Pros and Cons
Traditional NLG evaluation approaches always fall short, due to their reliance on surface-form similarity, when the target text and reference convey the same meaning but use different expressions. In contrast, LLM-derived metrics offer a remedy for this limitation and demonstrate stronger correlations with human judgments, benefiting from the evolving modeling techniques. However, the flaws within LLMs can lead to some issues, as introduced in the following:
Robustness. Some research has investigated the robustness of LLM-derived metrics and found that they lack robustness in different attack scenarios. Specifically, [He et al., 2023b] develop a set of stress tests to assess the robustness of various model-based metrics on some common NLG tasks, presenting a catalogue of blind spots and potential errors that these metrics fail to detect.
Efficiency. Compared to traditional metrics, LLM-derived evaluation methods are more time-consuming and require more computational resources, especially when adopting LLMs with quite large parameter scales. To address this, [Eddine et al., 2022] propose an approach to learning a lightweight version of LLM-derived metrics, and some fast LLM inference and serving tools like vLLM [Kwon et al., 2023] have been launched. However, closed-source LLMs often do not make their parameters, representations, or logits publicly available, thus making it impossible to apply LLM-derived methods to them.
Fairness. [Sun et al., 2022] assess the social bias across various metrics for NLG evaluation on six sensitive attributes: race, gender, religion, physical appearance, age, and socioeconomic status. Their findings reveal that model-based metrics carry noticeably more social bias than traditional metrics. Relevant biases can be categorized into two types: intrinsic bias encoded within pre-trained language models and extrinsic bias injected during the computation of similarity. Therefore, current LLM-derived methods may have similar issues.

3 Prompting LLMs

3.1 Overview
LLMs have demonstrated unprecedented instruction understanding and text generation abilities, which broadens researchers' imagination of automatic evaluation for NLG. For a long time, human evaluation has been viewed as the gold standard for NLG evaluation. Recently, some studies claim that LLMs are on par with crowdsourcing annotators in several tasks [Törnberg, 2023; Gilardi et al., 2023; Ostyakova et al., 2023; Cegin et al., 2023]. So, can LLMs simulate or even be an alternative to humans in the human evaluation of NLG? Or, do the practices in human evaluation or other evaluative tasks (e.g. competitions, paper review, etc.) inspire us to better NLG evaluation with LLMs? A large body of research and attempts based on prompting LLMs are guided by these ideas. In these studies, the instructions and the text to be evaluated are completely expressed in the prompt given to LLMs, and the evaluation results are generated by them.
Human evaluation typically includes the following elements:

• Evaluation Methods: The way the preferences of annotators are obtained, such as scoring and comparison.

• Task Instructions: How the annotators should read or manipulate different parts to complete the annotation.

• Input Content: The target text to be evaluated and other required content. Other required content, including source documents, references, and external knowledge, is provided as needed.

• Evaluation Criteria: The general definition of how good or bad the text to be evaluated is in a particular aspect of quality, e.g. fluency, faithfulness.

• Role and Interaction: The roles annotators play in the evaluation and the interactions between them.

The focus of the existing research can always be mapped analogously to one or more of these elements, and we organize them according to the elements they address. An example of prompting LLMs is shown in Figure 2.

Prompt:
You will be given a news article and a summary written for it. Your task is to rate the summary on one metric. Please make sure you read and understand these instructions carefully.
Evaluation Criteria: Consistency (1-5) - the factual alignment between the summary and the summarized source. A factually consistent summary contains only statements that are entailed by the source document. Annotators were also asked to penalize summaries that contained hallucinated facts.
Source Text: Paul Merson has restarted his row with Andros Townsend after the Tottenham midfielder was brought on with only seven minutes remaining in his team's 0-0 draw with Burnley on Sunday. …
Summary: Paul Merson was brought on with only seven minutes remaining ….
Evaluation Form: Answer by starting with "Rating:" and then give the explanation of the rating on the next line by "Rationale:".

Response:
Rating: 2
Rationale: The summary incorrectly states that Andros Townsend ….

Figure 2: An example of prompting LLMs to evaluate the aspect of consistency of a summary. There are task instructions, evaluation criteria, input content, and evaluation methods in the prompt, as well as the evaluation results, including the rating and explanation generated by LLMs.

3.2 Evaluation Methods
Scoring. Scoring is the most commonly used evaluation method in human evaluation for NLG, and it is naturally applied to NLG evaluation by prompting LLMs. [Chiang and Lee, 2023a] conducted relevant studies early, using a Likert scale from 1 to 5 to evaluate story generation and adversarial attacks with InstructGPT [Ouyang et al., 2022] and ChatGPT (https://openai.com/blog/chatgpt/), showing that the evaluation results of LLMs are consistent with expert human evaluators. [Kocmi and Federmann, 2023b] discover that GPT-3.5 and GPT-4 achieve state-of-the-art accuracy in evaluating translation quality compared to human labels, outperforming all the results from the metrics shared task of WMT22 [Freitag et al., 2022]. [Wang et al., 2023a] experiment on five datasets across summarization, story generation, and data-to-text, and ChatGPT evaluators with a rating scale from 1 to 5 or 1 to 100 have state-of-the-art or comparable correlations with human judgments in most settings, compared with prior metrics. Similar conclusions are also observed in open-domain dialogue response generation [Lin and Chen, 2023]. Besides English, [Mendonça et al., 2023] show that ChatGPT with simple rating prompts is a strong evaluator for multilingual dialogue evaluation, surpassing prior metrics based on encoders.
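The sketch below shows how the scoring mode in Figure 2 might be operationalized: the prompt combines the task instruction, evaluation criterion, and input content, and the LLM's reply is parsed for the rating and rationale. It is a minimal sketch under stated assumptions; the model name, prompt wording, and parsing logic are illustrative rather than those of any specific paper.

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes an API key is set in the environment

PROMPT_TEMPLATE = """You will be given a news article and a summary written for it. \
Your task is to rate the summary on one metric.

Evaluation Criteria:
{criterion}

Source Text:
{source}

Summary:
{summary}

Evaluation Form:
Answer by starting with "Rating:" and then give the explanation of the rating on the next line by "Rationale:"."""

def score_summary(source: str, summary: str, criterion: str,
                  model: str = "gpt-4") -> tuple[int, str]:
    """Prompt an LLM with instructions, criteria, and input content (as in Figure 2)
    and parse the returned rating and rationale."""
    prompt = PROMPT_TEMPLATE.format(criterion=criterion, source=source, summary=summary)
    reply = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}], temperature=0
    ).choices[0].message.content
    rating = int(re.search(r"Rating:\s*(\d+)", reply).group(1))
    rationale = re.search(r"Rationale:\s*(.*)", reply, re.S)
    return rating, rationale.group(1).strip() if rationale else ""
```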
Comparison. Different from absolute scoring, comparison refers to choosing the better of two texts. [Luo et al., 2023; Gao et al., 2023] use ChatGPT to compare the factual consistency of two summaries. AuPEL [Wang et al., 2023d] evaluates personalized text generation from three aspects in the form of comparison with the PaLM 2 family [Anil et al., 2023]. According to [Liusie et al., 2023], pairwise comparison is better than scoring when medium-sized LLMs (e.g. FlanT5 [Chung et al., 2022] and Llama2 [Touvron et al., 2023]) are adopted as evaluators.
Ranking. Ranking can be viewed as an extended form of comparison. In comparison, only two examples are involved at a time, whereas in ranking, the order of more than two examples needs to be decided at once. [Ji et al., 2023] use ChatGPT to rank five model-generated responses across several use cases at once, indicating that the ranking preferences of ChatGPT align with those of humans to some degree. Similarly, GPTRank is a method to rank summaries in a list-wise manner [Liu et al., 2023c]. Moreover, [Liu et al., 2023b] compare different evaluation methods in LLM-based summarization evaluation, including scoring, comparison, and ranking, showing that the optimal evaluation method for each backbone LLM may vary.
Boolean QA. Boolean QA requires LLMs to answer "Yes" or "No" to a question. It is adopted more in scenarios where human annotations are binary, such as grammaticality [Hu et al., 2023], faithfulness of summaries and statements [Luo et al., 2023; Gao et al., 2023; ES et al., 2023; Hu et al., 2023], factuality of generated text [Fu et al., 2023b; Guan et al., 2023; Manakul et al., 2023], and answerability of generated questions [Wang et al., 2023f].
Error Analysis. Error analysis refers to the evaluation of a text by looking for errors that occur in the text according to a set of predefined error categories. Multidimensional Quality Metrics (MQM) [Jain et al., 2023] is an error analysis framework prevalent in machine translation evaluation. According to MQM, [Lu et al., 2023; Kocmi and Federmann, 2023a] use ChatGPT or GPT-4 to automatically detect translation quality error spans. BOOOOKSCORE [Chang et al., 2023], an LLM-based evaluation metric, assesses the coherence of book summaries by identifying eight types of errors.
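As a rough illustration of the error-analysis mode, the sketch below asks an LLM to list error spans with categories and severities and then turns them into a single penalty-based score, in the spirit of MQM-style prompting such as GEMBA-MQM. The prompt wording, category list, and severity weights are assumptions for illustration, not the scheme used by any cited work.

```python
import json
from openai import OpenAI

client = OpenAI()

ERROR_PROMPT = """Identify all translation errors in the candidate below.
Return a JSON list, where each item has "span", "category"
(accuracy, fluency, terminology, style, or other) and "severity" (major or minor).

Source: {source}
Candidate translation: {candidate}
JSON:"""

SEVERITY_PENALTY = {"major": 5, "minor": 1}

def mqm_style_score(source: str, candidate: str, model: str = "gpt-4") -> float:
    """Ask an LLM to list error spans, then convert them into a penalty-based
    score (higher is better). Assumes the model returns bare JSON; production
    code would validate the output before parsing."""
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": ERROR_PROMPT.format(source=source, candidate=candidate)}],
        temperature=0,
    ).choices[0].message.content
    errors = json.loads(reply)
    penalty = sum(SEVERITY_PENALTY.get(e.get("severity", "minor"), 1) for e in errors)
    return -float(penalty)
```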
3.3 Task Instructions
In human evaluation, this part usually comes in the form of a task description or evaluation steps. They can also exist at the same time. The task description states the annotation in a more general way, and the evaluation steps, which can be considered a form of Chain-of-Thought, explicitly describe what to do at each step. In the context of prompting LLMs for NLG evaluation, the demonstrations used in few-shot prompting are also included.
Form and requirements. Several studies from the Eval4NLP 2023 shared task [Leiter et al., 2023] have explored task instructions in various settings. [Kim et al., 2023a] conduct experiments on different templates and lengths of task descriptions and evaluation steps. [Kotonya et al., 2023] generate task instructions with LLMs or improve existing task instructions with LLMs. Additionally, [He et al., 2023a] evaluate generative reasoning using LLMs by asking them first to generate their own answers, and then conduct a quantitative analysis of the text to be evaluated.
Analysis and explanations. LLMs can include analysis or explanation in their evaluations, which is a key point that distinguishes them from previous automatic evaluation metrics. Early explorations into prompting LLMs for NLG evaluation mostly do not examine the impact on evaluation results of whether LLMs are required to analyze and explain. However, [Chiang and Lee, 2023b] explore different types of evaluation instructions in summarization evaluation and dialogue evaluation, finding that explicitly asking large models to provide analysis or explanation achieves higher correlation with human judgments. Besides, the quality of the analysis and explanation generated by LLMs itself requires additional manual evaluation [Leiter et al., 2023]. [Naismith et al., 2023] compare the explanations written by humans and generated by GPT-4 and conduct a simple corpus analysis on the generated explanations.
In-context examples. Similarly to other fields, sometimes demonstrations are needed when prompting LLMs for NLG evaluation. Specifically, [Jain et al., 2023] use only in-context examples as task instructions, relying on LLMs to evaluate the quality of summaries. In scenarios where task descriptions or evaluation steps are included, [Kotonya et al., 2023] compare the performance of LLMs as evaluators in both zero-shot and one-shot settings, finding that one-shot prompting does not bring improvements. Moreover, [Hasanbeig et al., 2023] improve the performance of LLM evaluators by updating the in-context examples iteratively.

3.4 Input Content
The types of input content mainly depend on the evaluation criteria and are relatively fixed. For most task-specific evaluation criteria, such as the faithfulness of a summary [Luo et al., 2023; Gao et al., 2023], the source document is needed in addition to the target text to be evaluated. For task-independent criteria, such as fluency [Hu et al., 2023; Chiang and Lee, 2023b], only the text to be evaluated needs to be provided, though many works also provide the source document [Wang et al., 2023a; Liusie et al., 2023]. Other types of input content can be provided as required by the specific task. [Kocmi and Federmann, 2023b] use two different settings when evaluating machine translation: providing references and not providing references. [Guan et al., 2023] provide relevant facts and context when evaluating whether a text conforms to the facts. Exceptionally, [Shu et al., 2023] add the output of other automatic evaluation metrics to the input of the LLM.

3.5 Evaluation Criteria
Evaluation targeting specific aspects is used in numerous studies of human evaluation for NLG, such as text summarization, story generation, dialogue, and text simplification. Evaluation criteria, i.e., the definitions of aspects, are key in this context. Most evaluation criteria in LLM-based evaluation are directly derived from human evaluation. However, a few studies have attempted to let LLMs generate or improve evaluation criteria. [Liu et al., 2023e] use a few human-rated examples as seeds to let LLMs draft some candidate evaluation criteria, and then further filter them based on the performance of LLMs using these criteria on a validation set, to obtain the final evaluation criteria. [Kim et al., 2023c] design an LLM-based interactive evaluation system, which involves using LLMs to review the evaluation criteria provided by users, including eliminating ambiguities in criteria, merging criteria with overlapping meanings, and decomposing overly broad criteria. Additionally, [Ye et al., 2023a] propose a hierarchical aspect classification system with 12 subcategories, demonstrating that under the proposed fine-grained aspect definitions, human evaluation and LLM-based evaluation are highly correlated. Besides, the chain-of-aspects approach improves LLMs' ability to evaluate a specific aspect by having LLMs score some related aspects before generating the final score [Gong and Mao, 2023].

3.6 Role and Interaction
We include in this section the evaluation strategies that either use the same LLMs in different ways or involve different LLMs. The former can be further divided into chain-style and network-style interactions.
Chain-style interaction. Inspired by human evaluators, [Yuan et al., 2024] have LLMs score a batch of examples to be evaluated each time. Specifically, the evaluation process is divided into three stages: analysis, ranking, and scoring. Similar to QA-based evaluation metrics [Durmus et al., 2020], [Fu et al., 2023b] assess the faithfulness of summaries in two stages: treating LLMs as question generators to generate a question from the summary, then having LLMs answer the question using the source document. Differently, when [Hu et al., 2023] use GPT-4 to evaluate the faithfulness of summaries, it first asks GPT-4 to extract event units from the summary, then verifies whether these event units meet the requirements, and finally judges whether the event units are faithful to the source document.
Network-style interaction. Unlike chain-style interactions, network-style interactions involve the dispersion and aggregation of information. In network-style interactions, LLMs on the same layer play similar roles. ChatEval [Chan et al., 2023] is a framework for evaluating content through debates among multiple LLMs, with three communication strategies designed among the LLMs: One-By-One, Simultaneous-Talk, and Simultaneous-Talk-with-Summarizer. [Zhang et al., 2023b] find that under certain conditions, widening and deepening the network of LLMs can better align its evaluation with human judgments. [Saha et al., 2023] propose a branch-solve-merge strategy, assigning LLMs the roles of decomposing problems, solving them, and aggregating answers, thereby improving the accuracy and reliability of evaluations. [Wu et al., 2023] assume that different people, such as politicians and the general public, have different concerns about the quality of news summaries, use LLMs to play different roles in evaluation accordingly, and finally aggregate the results.
Different LLMs. Different from having the same LLM play different roles, some research has used different LLMs (such as GPT-4 and Claude) in their studies. In pairwise comparisons, previous work mostly used a single LLM as the evaluator, which may not be fair. In light of this, [Bai et al., 2023] design a decentralized peer-examination method, using different LLMs as evaluators and then aggregating the results. Further, [Li et al., 2023c] let different LLMs serve as evaluators in pairwise comparisons and then have them go through a round of discussion to reach the final result. Additionally, [Cohen et al., 2023] evaluate the factuality of texts through the interaction of two LLMs, where the LLM that generated the text acts as the examinee and the other LLM as the examiner.

3.7 Pros and Cons
The benefits of prompting LLMs for NLG evaluation are exciting. First, for the first time, people can express evaluation criteria and evaluation methods in natural language within the prompts given to LLMs, providing great flexibility. Where previously people needed to design specific evaluation metrics for different NLG tasks or even different aspects of a single task, now they only need to modify the prompts for LLMs. Secondly, surprisingly, LLMs have the ability to generate explanations while assessing texts, making this approach somewhat interpretable. Furthermore, in many NLG tasks, prompting LLMs for evaluation has achieved state-of-the-art correlations with human judgments.
However, as many studies have pointed out, this type of approach still has many limitations. [Wang et al., 2023b] note that when using ChatGPT and GPT-4 for pairwise comparisons, the order of the two texts can affect the evaluation results, which is known as position bias. To alleviate this issue, [Li et al., 2023d] propose a strategy of splitting, aligning, and then merging the two texts to be evaluated into the prompt. Also, LLM evaluators tend to favor longer, more verbose responses [Zheng et al., 2023] and responses generated by themselves [Liu et al., 2023a]. [Wu and Aji, 2023] show that compared to answers that are too short or grammatically incorrect, answers with factual errors are considered better by LLMs. [Liu et al., 2023d] demonstrate through adversarial meta-evaluation that LLMs without references are not suitable for evaluating dialogue responses in closed-ended scenarios: they tend to score highly on responses that conflict with the facts in the dialogue history. [Zhang et al., 2023a] also present the robustness issues of LLMs in dialogue evaluation through adversarial perturbations. [Shen et al., 2023] indicate that LLM evaluators have a lower correlation with human assessments when scoring high-quality summaries. In addition, [Hada et al., 2023] state that LLM-based evaluators have a bias towards high scores, especially in non-Latin languages like Chinese and Japanese. Beyond these shortcomings of performance, both ChatGPT and GPT-4 are proprietary models, and their opacity could lead to irreproducible evaluation results.
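The sketch below ties the comparison mode of §3.2 to the position bias discussed above: the same pairwise prompt is issued in both orders, and a verdict is only accepted when it survives the swap. This is a minimal sketch of one common mitigation, not the splitting-aligning-merging strategy of [Li et al., 2023d]; the model name and prompt wording are assumptions.

```python
from openai import OpenAI

client = OpenAI()

COMPARE_PROMPT = """Given the source document and two candidate summaries,
answer with exactly "A" or "B" to indicate which summary is more factually consistent.

Source: {source}
Summary A: {a}
Summary B: {b}
Answer:"""

def compare_once(source: str, a: str, b: str, model: str = "gpt-4") -> str:
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": COMPARE_PROMPT.format(source=source, a=a, b=b)}],
        temperature=0,
    ).choices[0].message.content.strip()
    return "A" if reply.upper().startswith("A") else "B"

def compare_debiased(source: str, cand1: str, cand2: str, model: str = "gpt-4") -> str:
    """Run the comparison in both orders; if the verdicts disagree after swapping,
    report a tie instead of trusting a position-dependent answer."""
    first = compare_once(source, cand1, cand2, model)    # cand1 shown as A
    second = compare_once(source, cand2, cand1, model)   # order swapped
    if first == "A" and second == "B":
        return "cand1"
    if first == "B" and second == "A":
        return "cand2"
    return "tie"
```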
4 Fine-tuning LLMs

4.1 Overview
As mentioned above, despite the exciting performance of prompting LLMs like ChatGPT and GPT-4 for NLG evaluation, several shortcomings in practice are inevitable, such as high costs, possibly irreproducible results, and potential biases in LLMs. In response, recent research has shifted towards fine-tuning smaller, open-source LLMs specifically for evaluation purposes, aiming to achieve performance close to GPT-4 in NLG evaluation. Representative works of this type include PandaLM [Wang et al., 2023e], Prometheus [Kim et al., 2023b], Shepherd [Wang et al., 2023c], TIGERScore [Jiang et al., 2023], INSTRUCTSCORE [Xu et al., 2023], Auto-J [Li et al., 2023a], CritiqueLLM [Ke et al., 2023] and JudgeLM [Zhu et al., 2023]. Their main ideas are similar, involving the elaborate construction of high-quality evaluation data, followed by accordingly fine-tuning open-source base LLMs. Nevertheless, there are certain discrepancies in the designs across different works, such as the usage of references and evaluation criteria. We have summarized the key differing components of these methods in Table 1 for comparison, which will be elaborated on next.

4.2 Data Construction
Diverse data with high-quality annotations is crucial for the fine-tuning of evaluation models, which mainly involves task scenarios, inputs, target texts to evaluate, and evaluation results. Early NLG evaluation research primarily focused on conventional NLG tasks, such as summarization and dialogue generation. Thus, the task scenarios, inputs, and target texts refer to the corresponding NLP task, the source inputs of the task, and the outputs generated by specialized systems based on task requirements, respectively. Mainstream datasets for these tasks predominantly employ human annotators to provide evaluation results, which are often considered reliable. With the recent rise of LLMs, the spectrum of NLG tasks has been broadened to scenarios of instruction and response that are more aligned with human needs. Traditional tasks like summarization with corresponding source inputs can be viewed as kinds of instructions and requirements. Meanwhile, responses generated by various general LLMs generally serve as the target texts now and require more flexible evaluation so that the performance of different LLMs can be compared, promoting further developments. Therefore, to keep pace with the current advancement of modeling techniques, most evaluation methods have adopted a similar instruction-response scenario.
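To make the instruction-response data format concrete, the sketch below shows how one constructed evaluation record might be turned into a prompt/target pair for supervised fine-tuning of an evaluator LLM. The field names, template, and rating scale are hypothetical illustrations and are not taken from any of the datasets described above.

```python
# A hypothetical instruction-response evaluation record; field names are
# illustrative only, not those of PandaLM, Prometheus, JudgeLM, etc.
record = {
    "instruction": "Summarize the following article in two sentences: <article text>",
    "response": "<model-generated summary to be judged>",
    "reference": "<optional human-written reference>",
    "rating": 3,
    "critique": "The summary omits the main finding and adds an unsupported claim.",
}

EVAL_TEMPLATE = """Below is an instruction and a model response. \
Evaluate the response, give a short critique, and end with "Rating: <1-5>".

Instruction: {instruction}
Response: {response}
Reference (may be empty): {reference}

Evaluation:"""

def to_training_pair(rec: dict) -> dict:
    """Format one record into the prompt/target pair used when fine-tuning an
    open-source evaluator LLM on constructed evaluation data."""
    prompt = EVAL_TEMPLATE.format(**{k: rec.get(k, "") for k in
                                     ("instruction", "response", "reference")})
    target = f"{rec['critique']}\nRating: {rec['rating']}"
    return {"prompt": prompt, "completion": target}
```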
Method | Instruction Source | Annotator | Scale | Result Mode | Details | Specific Criteria | Reference Required | Base LLM
PandaLM | Alpaca 52K | GPT-3.5 | 300K | Comparison | Reason & Reference | Unified | No | LLaMA 7B
Prometheus | GPT-4 Construction | GPT-4 | 100K | Scoring | Reason | Explicit | Yes | LLaMA-2-Chat 7B & 13B
Shepherd | Community Critique Data & 9 NLP Tasks | Human | 1317 | Overall Judgement | Error Identifying & Refinement | Unified | No | LLaMA 7B
TIGERScore | 23 Distinctive Text Generation Datasets | GPT-4 | 48K | MQM | Error Analysis | Implicit | No | LLaMA-2 7B & 13B
INSTRUCTSCORE | GPT-4 Construction | GPT-4 | 40K | MQM | Error Analysis | Implicit | Yes | LLaMA 7B
AUTO-J | Real-world User Queries from Preference Datasets | GPT-4 | 4396 | Scoring & Comparison | Reason | Implicit | No | LLaMA-2-Chat 13B
CritiqueLLM | AlignBench & ChatGPT Augmentation | GPT-4 | 9332 | Scoring | Reason | Unified | Flexible | ChatGLM-2 6B, 12B & 66B
JudgeLM | GPT4All-LAION, ShareGPT, Alpaca-GPT4 & Dolly-15K | GPT-4 | 100K | Scoring & Comparison | Reason | Unified | Flexible | Vicuna 7B, 13B & 33B

Table 1: Comparison of the different key components among the representative methods of fine-tuning LLMs. Instruction Source, Annotator, and Scale describe data construction; Result Mode, Details, and Specific Criteria describe the evaluation method.

The primary differences in these works actually lie in the construction of instructions, with the purpose of improving either diversity or reliability for the better generalization ability of the fine-tuned model. PandaLM and JudgeLM entirely sample from common instruction datasets, such as Alpaca 52K, while CritiqueLLM adopts small-scale sampling followed by ChatGPT augmentation. In contrast, Prometheus and INSTRUCTSCORE rely on GPT-4 to generate all the instructions based on seed data, whereas Auto-J and Shepherd use real-world data. Moreover, since large-scale human annotation is impractical, most works utilize GPT-4 as the powerful annotator, except for PandaLM and Shepherd, which use GPT-3.5 and human annotation on small-scale community data, respectively. During the construction, they basically all design detailed prompts or guidance and apply heuristic filtering strategies and post-processing methods to mitigate noise. Overall, despite the possibly higher quality of human annotation, the corresponding drawback is the difficulty of constructing large-scale datasets, which in turn may hinder adequate model training, while using LLMs for construction is the opposite situation.

4.3 Evaluation Method
As with prompting LLMs, the evaluation methods adopted in these works are highly diversified, involving different evaluation criteria, result modes, and usages of the reference. Given that current instruction-response scenarios encompass different types of tasks, it is unsuitable to specify unified evaluation criteria as in traditional NLG tasks. However, some works still do it this way, while some other methods let LLM annotators adaptively and implicitly reflect the required criteria in their evaluations, like PandaLM, TIGERScore, and AUTO-J. In particular, AUTO-J has meticulously crafted 332 evaluation criteria, matched to different tasks. Furthermore, Prometheus explicitly incorporates evaluation criteria into the inputs of the model, expecting flexible evaluation based on various customized criteria.
More details about the evaluation methods are shown in Table 1. All the works require models to provide detailed information, such as reasons for their evaluation results. The MQM mode can achieve more informative error analysis, offering stronger interpretability. Moreover, some works do not necessarily require references and thus have greater practical value. A more optimal approach is to concurrently support both reference-based and reference-free evaluations, as JudgeLM and CritiqueLLM do.

4.4 Fine-tuning Implementation
The fine-tuning process is uniformly implemented by the different works on their selected open-source LLMs, like LLaMA, and their respective constructed data, with some targeted settings. Specifically, Prometheus maintains balanced data distributions during fine-tuning, including length and label. JudgeLM eliminates potential biases by randomly swapping sample pairs to be compared and randomly removing references. INSTRUCTSCORE utilizes GPT-4 to provide error annotations for the intermediate outputs of the fine-tuned model for further supervised reinforcement. Based on some preliminary experiments and manual analysis, TIGERScore determines appropriate ratios of different types of data during fine-tuning, which are claimed to be crucial. Moreover, CritiqueLLM is implemented separately with and without references, and explores the effects of data and model scale. Compared to the vanilla fine-tuning setting, these methods improve the efficiency of model training and the robustness of evaluations.

4.5 Pros and Cons
The shortcomings of prompting LLMs for NLG evaluation can be significantly alleviated due to the customized implementation of data construction and model fine-tuning here. For instance, most fine-tuned models range between 7B and 13B in parameter scale, facilitating low-cost inference and good reproducibility, with performance close to GPT-4 in NLG evaluation.
Moreover, specific measures can be adopted to prevent related biases found in GPT-4 during different stages. Furthermore, this type of approach allows for continuous iteration and improvement of the model to address potential deficiencies or emerging issues discovered in future applications.
However, some biases associated with GPT-4 may still persist, as the data construction of most methods employs GPT-4 for critical evaluation annotation. On the other hand, the base open-source LLMs selected by existing works are primarily from the LLaMA series. With the rapid updates and improvements of open-source large models recently, it is intuitive that employing a more powerful base LLM should lead to better evaluation performance. However, this means repeating the fine-tuning process and incurring computational expenses from scratch, since directly migrating existing fine-tuned models to a new base LLM is difficult.
Additionally, although many existing methods aspire to more flexible and comprehensive evaluation through fine-tuning, demanding excessive evaluation settings may ultimately lead to poor performance or failure in model training, as found by AUTO-J and CritiqueLLM on criteria and references, respectively. However, there are some disagreements here, since Prometheus and JudgeLM show different results. Moreover, considering the different evaluation settings in existing works, it is challenging to conduct a horizontal comparison among them. These issues require further exploration in future research.

5 Human-LLM Collaborative Evaluation

5.1 Overview
While LLMs demonstrate robust evaluation capabilities, there exists a need for further enhancement in terms of their reliability, particularly in establishing a stronger correlation with human evaluation outcomes. Although human evaluation is the gold-standard evaluation approach in NLG, it is recognized for its associated high costs and susceptibility to subjective biases [van der Lee et al., 2021; Deriu et al., 2021; Li et al., 2023b]. The robust and comprehensive capabilities exhibited by LLMs underscore considerable potential for the development of collaborative evaluation methodologies that integrate both humans and LLMs. In recent investigations, researchers have initiated the exploration of collaborative evaluation paradigms, which include traditional NLG evaluation methods such as scoring and explaining [Zhang et al., 2021; Li et al., 2023b], broader evaluation methods such as testing and debugging [Ribeiro and Lundberg, 2022], and auditing NLG models to ensure fairness [Rastogi et al., 2023]. Furthermore, scholars [Saunders et al., 2022] are actively engaging in efforts to address the intricate challenge of scalable oversight [Amodei et al., 2016] through the collaboration of humans and LLMs. The objective is to devise strategies for effectively evaluating models on tasks that pose inherent difficulties for human assessors. This collaborative approach seeks to leverage the distinctive strengths of both human evaluators and sophisticated language models to achieve robust and nuanced evaluations in challenging domains.

5.2 Scoring and Explaining
Automated evaluation frequently exhibits a limited correlation with human judgments, while human evaluation, though reliable, is labor-intensive. [Zhang et al., 2021] present a human-machine collaborative framework (HMCEval), which conceptualizes dialogue evaluation as a sample assignment problem to ensure the reliability of evaluation outcomes while minimizing human effort, and achieves 99% accuracy with half the human effort. Recently, LLMs have emerged as a cost-effective alternative to human evaluation. However, both humans and LLMs have limitations, including inherent subjectivity and unreliable judgments, especially in open-ended tasks with diverse requirements.
To address challenges associated with inconsistent evaluation criteria in open-ended tasks and to explore synergy between humans and LLM-based evaluators, [Li et al., 2023b] propose a Collaborative Evaluation pipeline (COEVAL), which involves designing a checklist of task-specific criteria and conducting detailed evaluations where LLMs generate initial ideation and humans engage in scrutiny. Depending solely on score predictions is insufficient for ensuring reliable evaluation and error detection, particularly when specific criteria demand nuanced analysis beyond straightforward scoring. Building upon recent developments in explainable NLP [Yin and Neubig, 2022; Jung et al., 2022; Ribeiro and Lundberg, 2022; Ye et al., 2023b], COEVAL is assigned the additional task of generating explanations to elucidate evaluation outcomes, to facilitate a trustworthy collaborative evaluation process. Results indicate COEVAL effectively evaluates lengthy texts by utilizing LLMs, saving significant time and reducing human evaluation outliers. Despite the involvement of LLMs, human scrutiny remains essential, contributing to the revision of around 20% of LLM evaluation scores for enhanced reliability.

5.3 Broader Evaluation Tasks
The broader evaluation of NLG models involves testing and debugging the models. Current methods often rely on highly variable human creativity and extensive manual effort, or are limited to addressing a very specific class of bugs. [Ribeiro and Lundberg, 2022] introduce AdaTest, a process that uses LLMs in collaboration with human feedback to automatically generate unit tests that highlight bugs in a target model, which proves to make users 5-10 times more effective at identifying bugs and assists users in effectively fixing bugs without introducing new ones. Moreover, LLMs have shown biases and irresponsible behavior, necessitating thorough auditing before deployment [Blodgett et al., 2020; Jones and Steinhardt, 2022]. AdaTest++ [Rastogi et al., 2023] draws on insights from the literature on human-AI collaboration and sensemaking, and engages with research experts in safe and fair AI, emphasizing the importance of sensemaking and effective communication between humans and AI to capitalize on their complementary strengths in collaborative auditing. AdaTest++ successfully leverages human strengths, such as schematization and hypothesis testing. Moreover, users identified a range of failure modes across 26 different topics, covering both issues revealed in formal audits and issues that were previously under-reported.
Additionally, ensuring trustworthiness in LLMs for challenging tasks [Chen et al., 2021; Nakano et al., 2021; Li et al., 2022; Menick et al., 2022] poses a crucial challenge. Scalable oversight [Amodei et al., 2016] aims to effectively evaluate models on tasks challenging for humans and suggests the use of AI for assistance. [Saunders et al., 2022] explored providing critiques of model outputs as a form of assistance, demonstrating that model-generated critiques assist humans in identifying overlooked flaws.

5.4 Pros and Cons
The advantages of human-AI collaborative evaluation lie in achieving a balance between efficiency and cost, as demonstrated by COEVAL [Li et al., 2023b] achieving this equilibrium. Additionally, there are complementary strengths between humans and AI. For instance, AdaTest++ [Rastogi et al., 2023] empowers users to consistently utilize their strengths throughout the auditing process, benefiting significantly from the LLM. Users who generate the most topics rely heavily on LLM suggestions while employing their contextual reasoning and semantic understanding to vigilantly update their mental models and identify model failures.
However, there are drawbacks. The evaluation results of LLMs may be sensitive to the formats used to query the model and might require additional support for prompt writing [Li et al., 2023b; Rastogi et al., 2023]. Furthermore, the current capability to assess confidence levels is not strong enough, making it challenging to determine when to trust the LLM. Moreover, a certain level of human supervision is still necessary, making it less convenient and cost-effective compared to fully automated evaluation.

6 Conclusions and Future Trends
Through the above review of studies on NLG evaluation based on LLMs, we find that these four categories of approaches have their respective strengths and weaknesses, and most of the existing work is concentrated on prompting LLMs. In view of this, we offer some suggestions for future directions in this field.
Unified benchmarks for LLM-based NLG evaluation approaches. As mentioned above, each of the studies that fine-tuned LLMs to construct specialized evaluation models uses different settings and data during testing, making them incomparable. In the research on prompting LLMs for NLG evaluation, there are some publicly available human judgments on the same NLG task, such as SummEval for summarization. However, the existing human judgments have many problems. Firstly, most of the existing data only involve one type of NLG task and a single human evaluation method (e.g., scoring), making it difficult to evaluate LLMs' performance on different tasks, as well as to use different evaluation methods on the same task. Secondly, many of the texts in these human judgments are generated by outdated models (such as the Pointer Network) and do not include texts generated by more advanced LLMs. Lastly, many human evaluation datasets are too small in scale. There is an urgent need for large-scale, high-quality human evaluation data covering various NLG tasks and evaluation methods as a benchmark.
NLG evaluation for low-resource languages and new task scenarios. Almost all existing research focuses on English data. However, it is doubtful whether LLMs have similar levels of NLG evaluation capability for texts in other languages, especially low-resource languages. As [Zhang et al., 2023a] point out, we should be more cautious about using LLMs to evaluate texts in non-Latin languages. Additionally, existing research mainly focuses on more traditional NLG tasks such as translation, summarization, and dialogue. However, there are many new scenarios in reality with different requirements and evaluation criteria. Research on low-resource languages and new task scenarios will provide a more comprehensive understanding of LLMs' evaluation capabilities.
Diverse forms of human-LLM collaborative NLG evaluation. According to the literature reviewed above, there is little research on collaborative evaluation between humans and LLMs. Neither humans nor LLMs are perfect, and each has its strengths. Since the ultimate goal of NLG research is to evaluate text quality more accurately and efficiently, we believe that collaboration between humans and LLMs can achieve better results than pure human evaluation or automatic evaluation. In the collaboration between humans and LLMs, technologies from the field of human-computer interaction may bring new implementation methods to the collaboration. In addition, what roles humans and LLMs should play in the evaluation and how they can better complement each other are still worth researching.

References

[Amodei et al., 2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. CoRR, abs/1606.06565, 2016.
[Anil et al., 2023] Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari, Steven Hand, Hadi Hashemi, Le Hou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Ittycheriah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose Slone, Daniel Smilkov, David R. So, Daniel Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu. Palm 2 technical report. Computing Research Repository, arxiv:2305.10403, 2023.

[Bai et al., 2023] Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, and Lei Hou. Benchmarking foundation models with language-model-as-an-examiner. CoRR, abs/2306.04181, 2023.

[Blodgett et al., 2020] Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna M. Wallach. Language (technology) is power: A critical survey of "bias" in NLP. In ACL, 2020.

[Brown et al., 2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In NeurIPS, 2020.

[Cegin et al., 2023] Jan Cegin, Jakub Simko, and Peter Brusilovsky. ChatGPT to replace crowdsourcing of paraphrases for intent classification: Higher diversity and comparable model robustness. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1889–1905, Singapore, December 2023. Association for Computational Linguistics.

[Chan et al., 2023] Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate. CoRR, abs/2308.07201, 2023.

[Chang et al., 2023] Yapei Chang, Kyle Lo, Tanya Goyal, and Mohit Iyyer. Booookscore: A systematic exploration of book-length summarization in the era of llms. CoRR, abs/2310.00785, 2023.

[Chen et al., 2021] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021.

[Chiang and Lee, 2023a] David Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations? In ACL (1), 2023.

[Chiang and Lee, 2023b] David Cheng-Han Chiang and Hung-yi Lee. A closer look into using large language models for automatic evaluation. In EMNLP (Findings), 2023.

[Chung et al., 2022] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Y. Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned language models. CoRR, abs/2210.11416, 2022.

[Cohen et al., 2023] Roi Cohen, May Hamri, Mor Geva, and Amir Globerson. LM vs LM: detecting factual errors via cross examination. In EMNLP, 2023.

[Deriu et al., 2021] Jan Deriu, Álvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, and Mark Cieliebak. Survey on evaluation methods for dialogue systems. Artif. Intell. Rev., 54(1):755–810, 2021.

[Durmus et al., 2020] Esin Durmus, He He, and Mona T. Diab. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In ACL, 2020.

[Eddine et al., 2022] Moussa Kamal Eddine, Guokan Shang, Antoine J.-P. Tixier, and Michalis Vazirgiannis. Frugalscore: Learning cheaper, lighter and faster evaluation metrics for automatic text generation. In ACL (1), 2022.

[ES et al., 2023] Shahul ES, Jithin James, Luis Espinosa Anke, and Steven Schockaert. RAGAS: automated evaluation of retrieval augmented generation. CoRR, abs/2309.15217, 2023.

[Freitag et al., 2022] Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, Eleftherios Avramidis, Tom Kocmi, George F. Foster, Alon Lavie, and André F. T. Martins. Results of WMT22 metrics shared task: Stop using BLEU - neural metrics are better and more robust. In WMT, 2022.

[Fu et al., 2023a] Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. Gptscore: Evaluate as you desire. CoRR, abs/2302.04166, 2023.

[Fu et al., 2023b] Xue-Yong Fu, Md. Tahmid Rahman Laskar, Cheng Chen, and Shashi Bhushan TN. Are large language models reliable judges? A study on the factuality evaluation capabilities of llms. CoRR, abs/2311.00681, 2023.

[Gao et al., 2023] Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, and Xiaojun Wan. Human-like summarization evaluation with chatgpt. CoRR, abs/2304.02554, 2023.

[Gilardi et al., 2023] Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. Chatgpt outperforms crowd-workers for text-annotation tasks. CoRR, abs/2303.15056, 2023.

[Gong and Mao, 2023] Peiyuan Gong and Jiaxin Mao. Coascore: Chain-of-aspects prompting for NLG evaluation. CoRR, abs/2312.10355, 2023.

[Guan et al., 2023] Jian Guan, Jesse Dodge, David Wadden, Minlie Huang, and Hao Peng. Language models hallucinate, but may excel at fact verification. CoRR, abs/2310.14564, 2023.

[Hada et al., 2023] Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, and Sunayana Sitaram. Are large language model-based evaluators the solution to scaling up multilingual evaluation? CoRR, abs/2309.07462, 2023.

[Hasanbeig et al., 2023] Hosein Hasanbeig, Hiteshi Sharma, Leo Betthauser, Felipe Vieira Frujeri, and Ida Momennejad. ALLURE: auditing and improving llm-based evaluation of text using iterative in-context-learning. CoRR, abs/2309.13701, 2023.

[He et al., 2023a] Hangfeng He, Hongming Zhang, and Dan Roth. Socreval: Large language models with the socratic method for reference-free reasoning evaluation. CoRR, abs/2310.00074, 2023.

[He et al., 2023b] Tianxing He, Jingyu Zhang, Tianle Wang, Sachin Kumar, Kyunghyun Cho, James R. Glass, and Yulia Tsvetkov. On the blind spots of model-based evaluation metrics for text generation. In ACL (1), 2023.

[Hu et al., 2023] Yebowen Hu, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Hassan Foroosh, and Fei Liu. Decipherpref: Analyzing influential factors in human preference judgments via GPT-4. In EMNLP, 2023.

[Jain et al., 2023] Sameer Jain, Vaishakh Keshava, Swarnashree Mysore Sathyendra, Patrick Fernandes, Pengfei Liu, Graham Neubig, and Chunting Zhou. Multi-dimensional evaluation of text summarization with in-context learning. In ACL (Findings), 2023.

[Ji et al., 2023] Yunjie Ji, Yan Gong, Yiping Peng, Chao Ni, Peiyan Sun, Dongyu Pan, Baochang Ma, and Xiangang Li. Exploring chatgpt's ability to rank content: A preliminary study on consistency with human preferences. CoRR, abs/2303.07610, 2023.

[Jia et al., 2023] Qi Jia, Siyu Ren, Yizhu Liu, and Kenny Q. Zhu. Zero-shot faithfulness evaluation for text summarization with foundation language model. In EMNLP, 2023.

[Jiang et al., 2023] Dongfu Jiang, Yishan Li, Ge Zhang, Wenhao Huang, Bill Yuchen Lin, and Wenhu Chen. Tigerscore: Towards building explainable metric for all text generation tasks. CoRR, abs/2310.00752, 2023.

[Jones and Steinhardt, 2022] Erik Jones and Jacob Steinhardt. Capturing failures of large language models via human cognitive biases. In NeurIPS, 2022.

[Jung et al., 2022] Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, and Yejin Choi. Maieutic prompting: Logically consistent reasoning with recursive explanations. In EMNLP, 2022.

[Ke et al., 2023] Pei Ke, Bosi Wen, Zhuoer Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, Jie Tang, and Minlie Huang. Critiquellm: Scaling llm-as-critic for effective and explainable evaluation of large language model generation. CoRR, abs/2311.18702, 2023.

[Kim et al., 2023a] Joonghoon Kim, Saeran Park, Kiyoon Jeong, Sangmin Lee, Seung Hun Han, Jiyoon Lee, and Pilsung Kang. Which is better? exploring prompting strategy for llm-based metrics. CoRR, abs/2311.03754, 2023.

[Kim et al., 2023b] Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Minjoon Seo. Prometheus: Inducing fine-grained evaluation capability in language models. CoRR, abs/2310.08491, 2023.

[Kim et al., 2023c] Tae Soo Kim, Yoonjoo Lee, Jamin Shin, Young-Ho Kim, and Juho Kim. Evallm: Interactive evaluation of large language model prompts on user-defined criteria. CoRR, abs/2309.13633, 2023.

[Kocmi and Federmann, 2023a] Tom Kocmi and Christian Federmann. GEMBA-MQM: detecting translation quality error spans with GPT-4. In WMT, 2023.

[Kocmi and Federmann, 2023b] Tom Kocmi and Christian Federmann. Large language models are state-of-the-art evaluators of translation quality. In EAMT, 2023.

[Kotonya et al., 2023] Neema Kotonya, Saran Krishnasamy, Joel R. Tetreault, and Alejandro Jaimes. Little giants: Exploring the potential of small llms as evaluation metrics in summarization in the eval4nlp 2023 shared task. CoRR, abs/2311.00686, 2023.

[Kwon et al., 2023] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In SOSP, 2023.

[Leiter et al., 2023] Christoph Leiter, Juri Opitz, Daniel Deutsch, Yang Gao, Rotem Dror, and Steffen Eger. The eval4nlp 2023 shared task on prompting large language models as explainable metrics. CoRR, abs/2310.19792, 2023.
Lago, Thomas Hubert, Peter Choy, Cyprien de Mas- large language models. Computing Research Repository,
son d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen arxiv:2307.07889, 2023.
Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, [Lu et al., 2023] Qingyu Lu, Baopu Qiu, Liang Ding, Lip-
James Molloy, Daniel J. Mankowitz, Esme Sutherland
ing Xie, and Dacheng Tao. Error analysis prompting en-
Robson, Pushmeet Kohli, Nando de Freitas, Koray
ables human-like translation evaluation in large language
Kavukcuoglu, and Oriol Vinyals. Competition-level code
models: A case study on chatgpt. CoRR, abs/2303.13809,
generation with alphacode. Science, 378(6624):1092–
2023.
1097, 2022.
[Li et al., 2023a] Junlong Li, Shichao Sun, Weizhe Yuan, [Luo et al., 2023] Zheheng Luo, Qianqian Xie, and Sophia
Run-Ze Fan, Hai Zhao, and Pengfei Liu. Generative judge Ananiadou. Chatgpt as a factual inconsistency eval-
for evaluating alignment. CoRR, abs/2310.05470, 2023. uator for abstractive text summarization. CoRR,
abs/2303.15621, 2023.
[Li et al., 2023b] Qintong Li, Leyang Cui, Lingpeng Kong,
and Wei Bi. Collaborative evaluation: Exploring the syn- [Manakul et al., 2023] Potsawee Manakul, Adian Liusie,
ergy of large language models and humans for open-ended and Mark J. F. Gales. Selfcheckgpt: Zero-resource black-
generation evaluation. CoRR, abs/2310.19740, 2023. box hallucination detection for generative large language
models. In EMNLP, 2023.
[Li et al., 2023c] Ruosen Li, Teerth Patel, and Xinya Du.
PRD: peer rank and discussion improve large language [Mendonça et al., 2023] John Mendonça, Patrı́cia Pereira,
model based evaluations. CoRR, abs/2307.02762, 2023. João Paulo Carvalho, Alon Lavie, and Isabel Trancoso.
[Li et al., 2023d] Zongjie Li, Chaozheng Wang, Pingchuan Simple LLM prompting is state-of-the-art for robust and
multilingual dialogue evaluation. In Proceedings of The
Ma, Daoyuan Wu, Shuai Wang, Cuiyun Gao, and Yang
Eleventh Dialog System Technology Challenge, 2023.
Liu. Split and merge: Aligning position biases in large
language model based evaluators. CoRR, abs/2310.01432, [Menick et al., 2022] Jacob Menick, Maja Trebacz,
2023. Vladimir Mikulik, John Aslanides, H. Francis Song,
[Lin and Chen, 2023] Yen-Ting Lin and Yun-Nung Chen. Martin J. Chadwick, Mia Glaese, Susannah Young,
Llm-eval: Unified multi-dimensional automatic evaluation Lucy Campbell-Gillingham, Geoffrey Irving, and Nat
for open-domain conversations with large language mod- McAleese. Teaching language models to support answers
els. In NLP4ConvAI 2023, 2023. with verified quotes. CoRR, abs/2203.11147, 2022.
[Lin, 2004] Chin-Yew Lin. Rouge: A package for automatic [Naismith et al., 2023] Ben Naismith, Phoebe Mulcaire, and
evaluation of summaries. In Text summarization branches Jill Burstein. Automated evaluation of written discourse
out, 2004. coherence using GPT-4. In BEA@ACL, 2023.
[Liu et al., 2023a] Yang Liu, Dan Iter, Yichong Xu, Shuo- [Nakano et al., 2021] Reiichiro Nakano, Jacob Hilton,
hang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim,
NLG evaluation using gpt-4 with better human alignment. Christopher Hesse, Shantanu Jain, Vineet Kosaraju,
In EMNLP, 2023. William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloun-
[Liu et al., 2023b] Yixin Liu, Alexander R. Fabbri, Jiawen dou, Gretchen Krueger, Kevin Button, Matthew Knight,
Benjamin Chess, and John Schulman. Webgpt: Browser-
Chen, Yilun Zhao, Simeng Han, Shafiq Joty, Pengfei Liu,
assisted question-answering with human feedback. CoRR,
Dragomir Radev, Chien-Sheng Wu, and Arman Cohan.
abs/2112.09332, 2021.
Benchmarking generation and evaluation capabilities of
large language models for instruction controllable summa- [Ostyakova et al., 2023] Lidiia Ostyakova, Veronika Smilga,
rization. CoRR, abs/2311.09184, 2023. Kseniia Petukhova, Maria Molchanova, and Daniel Ko-
[Liu et al., 2023c] Yixin Liu, Alexander R. Fabbri, Pengfei rnev. ChatGPT vs. crowdsourcing vs. experts: Annotat-
Liu, Dragomir Radev, and Arman Cohan. On learning ing open-domain conversations with speech functions. In
to summarize with large language models as references. Svetlana Stoyanchev, Shafiq Joty, David Schlangen, On-
CoRR, abs/2305.14239, 2023. drej Dusek, Casey Kennington, and Malihe Alikhani, ed-
itors, Proceedings of the 24th Annual Meeting of the Spe-
[Liu et al., 2023d] Yongkang Liu, Shi Feng, Daling Wang, cial Interest Group on Discourse and Dialogue, pages
Yifei Zhang, and Hinrich Schütze. Evaluate what you 242–254, Prague, Czechia, September 2023. Association
can’t evaluate: Unassessable generated responses quality. for Computational Linguistics.
CoRR, abs/2305.14658, 2023.
[Ouyang et al., 2022] Long Ouyang, Jeffrey Wu, Xu Jiang,
[Liu et al., 2023e] Yuxuan Liu, Tianchi Yang, Shaohan Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin,
Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex
Deng, Feng Sun, and Qi Zhang. Calibrating llm-based Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke
evaluator. CoRR, abs/2309.13308, 2023. Miller, Maddie Simens, Amanda Askell, Peter Welinder,
[Liusie et al., 2023] Adian Liusie, Potsawee Manakul, and Paul F. Christiano, Jan Leike, and Ryan Lowe. Training
Mark J. F. Gales. Llm comparative assessment: Zero- language models to follow instructions with human feed-
shot nlg evaluation through pairwise comparisons using back. In NeurIPS, 2022.
[Papineni et al., 2002] Kishore Papineni, Salim Roukos, Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas
Todd Ward, and Wei-Jing Zhu. Bleu: a method for au- Scialom. Llama 2: Open foundation and fine-tuned chat
tomatic evaluation of machine translation. In ACL, 2002. models. CoRR, abs/2307.09288, 2023.
[Rastogi et al., 2023] Charvi Rastogi, Marco Túlio Ribeiro, [van der Lee et al., 2021] Chris van der Lee, Albert Gatt,
Nicholas King, Harsha Nori, and Saleema Amershi. Sup- Emiel van Miltenburg, and Emiel Krahmer. Human eval-
porting human-ai collaboration in auditing llms with llms. uation of automatically generated text: Current trends
In AIES, 2023. and best practice guidelines. Comput. Speech Lang.,
[Ribeiro and Lundberg, 2022] Marco Túlio Ribeiro and 67:101151, 2021.
Scott M. Lundberg. Adaptive testing and debugging of [Varshney et al., 2023] Neeraj Varshney, Wenlin Yao, Hong-
NLP models. In ACL (1), 2022. ming Zhang, Jianshu Chen, and Dong Yu. A stitch in
[Saha et al., 2023] Swarnadeep Saha, Omer Levy, Asli Ce- time saves nine: Detecting and mitigating hallucinations
likyilmaz, Mohit Bansal, Jason Weston, and Xian Li. of llms by validating low-confidence generation. CoRR,
Branch-solve-merge improves large language model eval- abs/2307.03987, 2023.
uation and generation. CoRR, abs/2310.15123, 2023.
[Wang et al., 2023a] Jiaan Wang, Yunlong Liang, Fandong
[Saunders et al., 2022] William Saunders, Catherine Yeh, Meng, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu,
Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and and Jie Zhou. Is ChatGPT a good NLG evaluator? a pre-
Jan Leike. Self-critiquing models for assisting human eval- liminary study. In Proceedings of the 4th New Frontiers in
uators. CoRR, abs/2206.05802, 2022. Summarization Workshop, 2023.
[Shen et al., 2023] Chenhui Shen, Liying Cheng, Xuan-Phi
[Wang et al., 2023b] Peiyi Wang, Lei Li, Liang Chen, Dawei
Nguyen, Yang You, and Lidong Bing. Large language
Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and
models are not yet human-level evaluators for abstractive
Zhifang Sui. Large language models are not fair evalua-
summarization. In EMNLP (Findings), 2023.
tors. CoRR, abs/2305.17926, 2023.
[Shu et al., 2023] Lei Shu, Nevan Wichers, Liangchen Luo,
Yun Zhu, Yinxiao Liu, Jindong Chen, and Lei Meng. [Wang et al., 2023c] Tianlu Wang, Ping Yu, Xiaoqing Ellen
Fusion-eval: Integrating evaluators with llms. CoRR, Tan, Sean O’Brien, Ramakanth Pasunuru, Jane Dwivedi-
abs/2311.09204, 2023. Yu, Olga Golovneva, Luke Zettlemoyer, Maryam Fazel-
Zarandi, and Asli Celikyilmaz. Shepherd: A critic for lan-
[Sulem et al., 2018] Elior Sulem, Omri Abend, and Ari Rap- guage model generation. CoRR, abs/2308.04592, 2023.
poport. BLEU is not suitable for the evaluation of text
simplification. In EMNLP, 2018. [Wang et al., 2023d] Yaqing Wang, Jiepu Jiang, Mingyang
[Sun et al., 2022] Tianxiang Sun, Junliang He, Xipeng Qiu, Zhang, Cheng Li, Yi Liang, Qiaozhu Mei, and Michael
Bendersky. Automated evaluation of personalized
and Xuanjing Huang. Bertscore is unfair: On social bias
text generation using large language models. CoRR,
in language model-based metrics for text generation. In
abs/2310.11593, 2023.
EMNLP, 2022.
[Törnberg, 2023] Petter Törnberg. Chatgpt-4 outperforms [Wang et al., 2023e] Yidong Wang, Zhuohao Yu, Zhengran
experts and crowd workers in annotating political twitter Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya
messages with zero-shot learning. CoRR, abs/2304.06588, Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun
2023. Zhang, and Yue Zhang. Pandalm: An automatic evalua-
tion benchmark for LLM instruction tuning optimization.
[Touvron et al., 2023] Hugo Touvron, Louis Martin, Kevin
CoRR, abs/2306.05087, 2023.
Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, [Wang et al., 2023f] Zifan Wang, Kotaro Funakoshi, and
Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Manabu Okumura. Automatic answerability evaluation for
Canton-Ferrer, Moya Chen, Guillem Cucurull, David Es- question generation. CoRR, abs/2309.12546, 2023.
iobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian [Wu and Aji, 2023] Minghao Wu and Alham Fikri Aji. Style
Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, over substance: Evaluation biases for large language mod-
Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan els. CoRR, abs/2307.03025, 2023.
Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa,
Isabel Kloumann, Artem Korenev, Punit Singh Koura, [Wu et al., 2023] Ning Wu, Ming Gong, Linjun Shou, Shin-
Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana ing Liang, and Daxin Jiang. Large language models
Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, are diverse role-players for summarization evaluation. In
Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin NLPCC (1), 2023.
Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, [Xie et al., 2023] Zhuohan Xie, Miao Li, Trevor Cohn, and
Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael
Jey Han Lau. Deltascore: Fine-grained story evaluation
Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh
with perturbations. In EMNLP (Findings), 2023.
Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan,
Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, An- [Xu et al., 2023] Wenda Xu, Danqing Wang, Liangming
gela Fan, Melanie Kambadur, Sharan Narang, Aurélien Pan, Zhenqiao Song, Markus Freitag, William Wang, and
Lei Li. INSTRUCTSCORE: towards explainable text gen-
eration evaluation with automatic feedback. In EMNLP,
2023.
[Ye et al., 2023a] Seonghyeon Ye, Doyoung Kim, Sungdong
Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo,
James Thorne, Juho Kim, and Minjoon Seo. FLASK:
fine-grained language model evaluation based on align-
ment skill sets. CoRR, abs/2307.10928, 2023.
[Ye et al., 2023b] Xi Ye, Srinivasan Iyer, Asli Celikyilmaz,
Veselin Stoyanov, Greg Durrett, and Ramakanth Pa-
sunuru. Complementary explanations for effective in-
context learning. In ACL (Findings), 2023.
[Yin and Neubig, 2022] Kayo Yin and Graham Neubig. In-
terpreting language models with contrastive explanations.
In EMNLP, 2022.
[Yuan et al., 2021] Weizhe Yuan, Graham Neubig, and
Pengfei Liu. Bartscore: Evaluating generated text as text
generation. In NeurIPS, 2021.
[Yuan et al., 2024] Peiwen Yuan, Shaoxiong Feng, Yiwei Li,
Xinglin Wang, Boyuan Pan, Heda Wang, and Kan Li.
Batcheval: Towards human-like text evaluation. CoRR,
abs/2401.00437, 2024.
[Zhang et al., 2020] Tianyi Zhang, Varsha Kishore, Felix
Wu, Kilian Q. Weinberger, and Yoav Artzi. Bertscore:
Evaluating text generation with BERT. In ICLR, 2020.
[Zhang et al., 2021] Yangjun Zhang, Pengjie Ren, and
Maarten de Rijke. A human-machine collaborative
framework for evaluating malevolence in dialogues. In
ACL/IJCNLP (1), 2021.
[Zhang et al., 2022] Susan Zhang, Stephen Roller, Naman
Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen,
Christopher Dewan, Mona T. Diab, Xian Li, Xi Vic-
toria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer,
Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali
Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT:
open pre-trained transformer language models. CoRR,
abs/2205.01068, 2022.
[Zhang et al., 2023a] Chen Zhang, Luis Fernando D’Haro,
Yiming Chen, Malu Zhang, and Haizhou Li. A com-
prehensive analysis of the effectiveness of large lan-
guage models as automatic dialogue evaluators. CoRR,
abs/2312.15407, 2023.
[Zhang et al., 2023b] Xinghua Zhang, Bowen Yu, Haiyang
Yu, Yangyu Lv, Tingwen Liu, Fei Huang, Hongbo Xu, and
Yongbin Li. Wider and deeper LLM networks are fairer
LLM evaluators. CoRR, abs/2308.01862, 2023.
[Zheng et al., 2023] Lianmin Zheng, Wei-Lin Chiang, Ying
Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang,
Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao
Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging
llm-as-a-judge with mt-bench and chatbot arena. CoRR,
abs/2306.05685, 2023.
[Zhu et al., 2023] Lianghui Zhu, Xinggang Wang, and Xin-
long Wang. Judgelm: Fine-tuned large language models
are scalable judges. CoRR, abs/2310.17631, 2023.