LLM-based NLG Evaluation: Current Status and Challenges
Table 1: Comparison of the key components of representative methods that fine-tune LLMs as evaluators.

The primary differences among these works lie in how the instructions are constructed, with the aim of improving either diversity or reliability and thus the generalization ability of the fine-tuned model. PandaLM and JudgeLM sample entirely from common instruction datasets such as Alpaca 52K, while CritiqueLLM adopts small-scale sampling followed by ChatGPT-based augmentation. In contrast, Prometheus and INSTRUCTSCORE rely on GPT-4 to generate all instructions from seed data, whereas Auto-J and Shepherd use real-world data. Moreover, since large-scale human annotation is impractical, most works use GPT-4 as the annotator; the exceptions are PandaLM and Shepherd, which rely on GPT-3.5 and on human annotation of small-scale community data, respectively. During construction, nearly all of these works design detailed prompts or guidelines and apply heuristic filtering and post-processing to reduce noise. Overall, although human annotation may be of higher quality, it is difficult to scale to large datasets, which in turn can hinder adequate model training; construction with LLMs faces the opposite trade-off.
4.3 Evaluation Method

As with prompting LLMs, the evaluation methods adopted in these works are highly diverse, involving different evaluation criteria, result modes, and usages of the reference. Given that current instruction-response scenarios encompass many different types of tasks, it is unsuitable to specify unified evaluation criteria as in traditional NLG tasks. Some works nevertheless still do so, while others, such as PandaLM, TIGERScore, and Auto-J, let the LLM annotator adaptively and implicitly reflect the required criteria in its evaluations. In particular, Auto-J meticulously crafts 332 evaluation criteria matched to different tasks. Furthermore, Prometheus explicitly incorporates the evaluation criteria into the model's input, aiming for flexible evaluation under arbitrary customized criteria.

More details about the evaluation methods are shown in Table 1. All of these works require the model to provide detailed information, such as the reasons behind its evaluation results, and the MQM mode enables more informative error analysis and thus stronger interpretability. Moreover, some works do not require references and are therefore more valuable in practice; an even better option is to support both reference-based and reference-free evaluation concurrently, as JudgeLM and CritiqueLLM do.
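To make these two design choices concrete, the sketch below shows one way an evaluation criterion and scoring rubric could be placed directly in the evaluator's input, with the reference made optional so that the same template serves both reference-based and reference-free evaluation. The template wording and field names are illustrative assumptions of ours, not the exact formats used by Prometheus, JudgeLM, or CritiqueLLM.

```python
# Minimal, hypothetical sketch of a criteria-conditioned evaluation prompt.
# The template text and field names are illustrative assumptions, not the
# exact formats used by Prometheus, JudgeLM, or CritiqueLLM.

def build_eval_prompt(instruction: str,
                      response: str,
                      criterion: str,
                      rubric: dict[int, str],
                      reference: str | None = None) -> str:
    """Assemble an evaluator input with an explicit criterion and score rubric."""
    rubric_text = "\n".join(f"Score {s}: {desc}" for s, desc in sorted(rubric.items()))
    parts = [
        "You are an evaluation assistant. Judge the response below.",
        f"### Instruction:\n{instruction}",
        f"### Response to evaluate:\n{response}",
    ]
    if reference is not None:  # reference-based mode
        parts.append(f"### Reference answer:\n{reference}")
    parts += [
        f"### Evaluation criterion:\n{criterion}",
        f"### Scoring rubric:\n{rubric_text}",
        "First give your reasoning, then output 'Score: <1-5>'.",
    ]
    return "\n\n".join(parts)


if __name__ == "__main__":
    prompt = build_eval_prompt(
        instruction="Summarize the article in two sentences.",
        response="The article says ...",
        criterion="Faithfulness: the summary must not contradict the article.",
        rubric={1: "Severely unfaithful", 3: "Minor factual slips", 5: "Fully faithful"},
        reference=None,  # reference-free evaluation
    )
    print(prompt)
```

Toggling the reference argument then switches between the two evaluation modes without changing anything else in the prompt.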
4.4 Fine-tuning Implementation

The fine-tuning process is implemented in a broadly uniform way: each work fine-tunes its selected open-source LLM, such as LLaMA, on its own constructed data, with some targeted settings. Specifically, Prometheus keeps the data distribution balanced during fine-tuning, including the lengths and labels of the training examples. JudgeLM mitigates potential biases by randomly swapping the sample pairs to be compared and by randomly removing references. INSTRUCTSCORE uses GPT-4 to provide error annotations on the intermediate outputs of the fine-tuned model, which supervise a further round of refinement. Based on preliminary experiments and manual analysis, TIGERScore sets the ratios of the different data types used during fine-tuning, which its authors claim to be crucial. Moreover, CritiqueLLM trains separate variants with and without references and explores the effects of data and model scale. Compared to a vanilla fine-tuning setting, these measures improve the efficiency of model training and the robustness of the resulting evaluations.
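The swap-and-drop augmentation can be illustrated with a short sketch; the example fields, probabilities, and label encoding below are assumptions made for illustration rather than JudgeLM's actual recipe.

```python
import random

# Hypothetical training-example fields and augmentation probabilities; this
# illustrates the idea of swap/drop augmentation, not JudgeLM's exact recipe.

def augment(example: dict, swap_p: float = 0.5, drop_ref_p: float = 0.3,
            rng: random.Random | None = None) -> dict:
    """Randomly swap the pair order and randomly drop the reference."""
    rng = rng or random.Random()
    ex = dict(example)
    if rng.random() < swap_p:
        # Swap answers A and B and flip the preference label accordingly, so
        # the model cannot rely on position to predict the winner.
        ex["answer_a"], ex["answer_b"] = ex["answer_b"], ex["answer_a"]
        ex["label"] = {"A": "B", "B": "A", "tie": "tie"}[ex["label"]]
    if rng.random() < drop_ref_p and "reference" in ex:
        # Occasionally remove the reference so the same model learns to judge
        # in both reference-based and reference-free settings.
        ex.pop("reference")
    return ex


example = {"question": "...", "answer_a": "...", "answer_b": "...",
           "reference": "...", "label": "A"}
print(augment(example, rng=random.Random(0)))
```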
4.5 Pros and Cons

The shortcomings of prompting LLMs for NLG evaluation can be significantly alleviated here, thanks to the customized data construction and model fine-tuning. For instance, most fine-tuned models have between 7B and 13B parameters, which enables low-cost inference and good reproducibility while achieving performance close to GPT-4 in NLG evaluation. Specific measures can also be taken at different stages to counteract the biases observed in GPT-4. Furthermore, this type of approach allows the model to be continuously iterated and improved to address deficiencies or new issues discovered in future applications.

However, some of the biases associated with GPT-4 may still persist, since most methods employ GPT-4 for the critical evaluation annotations during data construction. In addition, the base open-source LLMs selected by existing works come primarily from the LLaMA series. Given the rapid progress of open-source models, intuition suggests that a more powerful base LLM should yield better evaluation performance; however, this entails repeating the fine-tuning process, and paying its computational cost, from scratch, since directly migrating an existing fine-tuned model to a new base LLM is difficult.

Additionally, although many existing methods aspire to more flexible and comprehensive evaluation through fine-tuning, overly demanding evaluation settings may ultimately lead to poor performance or even failure of model training, as Auto-J and CritiqueLLM found for criteria and references, respectively. There is some disagreement on this point, though, since Prometheus and JudgeLM report different results. Moreover, given the divergent evaluation settings of existing works, it is difficult to compare them horizontally. These issues require further exploration in future research.
5 Human-LLM Collaborative Evaluation

5.1 Overview

While LLMs demonstrate strong evaluation capabilities, their reliability still needs to be improved, particularly in terms of establishing a stronger correlation with human evaluation outcomes. Although human evaluation is the gold standard in NLG, it is known for its high cost and susceptibility to subjective biases [van der Lee et al., 2021; Deriu et al., 2021; Li et al., 2023b]. The robust and comprehensive capabilities exhibited by LLMs therefore create considerable potential for collaborative evaluation methodologies that integrate humans and LLMs. Recent investigations have begun to explore such collaborative paradigms, covering traditional NLG evaluation activities such as scoring and explaining [Zhang et al., 2021; Li et al., 2023b], broader evaluation activities such as testing and debugging [Ribeiro and Lundberg, 2022], and auditing NLG models to ensure fairness [Rastogi et al., 2023]. Furthermore, researchers [Saunders et al., 2022] are actively tackling the challenge of scalable oversight [Amodei et al., 2016] through human-LLM collaboration, aiming to devise strategies for effectively evaluating models on tasks that are inherently difficult for human assessors. This collaborative approach seeks to leverage the distinctive strengths of both human evaluators and sophisticated language models to achieve robust and nuanced evaluations in challenging domains.

5.2 Scoring and Explaining

Automated evaluation frequently exhibits limited correlation with human judgments, while human evaluation, though reliable, is labor-intensive. [Zhang et al., 2021] present a human-machine collaborative framework (HMCEval) that frames dialogue evaluation as a sample-assignment problem so as to guarantee reliable evaluation outcomes while minimizing human effort; it achieves 99% accuracy with half the human effort. More recently, LLMs have emerged as a cost-effective alternative to human evaluation. However, both humans and LLMs have limitations, including inherent subjectivity and unreliable judgments, especially in open-ended tasks with diverse requirements.

To address the inconsistent evaluation criteria of open-ended tasks and to explore the synergy between humans and LLM-based evaluators, [Li et al., 2023b] propose a collaborative evaluation pipeline, COEVAL, which involves designing a checklist of task-specific criteria and then conducting detailed evaluations in which LLMs generate the initial ideation and humans engage in scrutiny. Relying solely on score predictions is insufficient for reliable evaluation and error detection, particularly when specific criteria demand nuanced analysis beyond straightforward scoring. Building on recent developments in explainable NLP [Yin and Neubig, 2022; Jung et al., 2022; Ribeiro and Lundberg, 2022; Ye et al., 2023b], COEVAL is therefore additionally tasked with generating explanations that elucidate the evaluation outcomes, facilitating a trustworthy collaborative evaluation process. The results indicate that COEVAL evaluates lengthy texts effectively by utilizing LLMs, saving significant time and reducing outliers in human evaluation. Despite the involvement of LLMs, human scrutiny remains essential, contributing to the revision of around 20% of the LLM evaluation scores and thereby enhancing reliability.
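The division of labor in such a pipeline can be sketched roughly as follows; the helper functions and the flagging rule are placeholders of ours rather than COEVAL's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    criterion: str
    score: int                      # e.g., on a 1-5 scale
    explanation: str
    revised_by_human: bool = False

def llm_judge(text: str, criterion: str) -> Judgment:
    # Stand-in for an LLM call that returns an initial score plus explanation.
    return Judgment(criterion, score=3, explanation="(LLM explanation here)")

def needs_review(judgment: Judgment) -> bool:
    # Illustrative flagging rule: send low or hedged judgments to a human.
    return judgment.score <= 2 or "uncertain" in judgment.explanation.lower()

def coevaluate(text: str, checklist: list[str], human_revise) -> list[Judgment]:
    results = []
    for criterion in checklist:
        j = llm_judge(text, criterion)   # the LLM produces the initial ideation
        if needs_review(j):              # humans scrutinize only flagged items
            j = human_revise(j)
            j.revised_by_human = True
        results.append(j)
    return results

if __name__ == "__main__":
    checklist = ["fluency", "coherence", "faithfulness"]
    print(coevaluate("Some generated text.", checklist, human_revise=lambda j: j))
```

In this arrangement the LLM supplies every initial score and explanation, while human effort is concentrated on the judgments most likely to need revision.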
5.3 Broader Evaluation Tasks

Broader evaluation of NLG models involves testing and debugging them. Current methods often rely on highly variable human creativity and extensive manual effort, or are limited to a very specific class of bugs. [Ribeiro and Lundberg, 2022] introduce AdaTest, a process that uses LLMs in collaboration with human feedback to automatically generate unit tests that highlight bugs in a target model; it makes users 5-10 times more effective at identifying bugs and helps them fix bugs without introducing new ones.
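A simplified version of this generate-test-review cycle is sketched below; the proposal, execution, and selection steps are placeholders, and the real AdaTest additionally organizes tests into a topic tree and uses the confirmed failures to drive debugging.

```python
from typing import Callable

def adaptive_test_loop(seed_tests: list[str],
                       propose_tests: Callable[[list[str]], list[str]],
                       target_model: Callable[[str], str],
                       is_failure: Callable[[str, str], bool],
                       human_keep: Callable[[list[str]], list[str]],
                       rounds: int = 3) -> list[str]:
    """Grow a suite of failing tests by alternating LLM proposals and human review."""
    failing = list(seed_tests)
    for _ in range(rounds):
        candidates = propose_tests(failing)                 # LLM suggests tests near known failures
        outputs = {t: target_model(t) for t in candidates}  # run the model under test
        flagged = [t for t, out in outputs.items() if is_failure(t, out)]
        failing += human_keep(flagged)                      # humans confirm true failures
    return failing
```

The confirmed failures can then be used to repair the target model before the loop is run again, mirroring the debugging half of the process described above.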
Moreover, LLMs have been shown to exhibit biases and irresponsible behavior, necessitating thorough auditing before deployment [Blodgett et al., 2020; Jones and Steinhardt, 2022]. AdaTest++ [Rastogi et al., 2023] draws on the literature on human-AI collaboration and sensemaking and on engagement with research experts in safe and fair AI, emphasizing the importance of sensemaking and effective communication between humans and AI in order to capitalize on their complementary strengths in collaborative auditing. AdaTest++ successfully leverages human strengths such as schematization and hypothesis testing, and its users identified a range of failure modes across 26 different topics, covering both issues revealed in formal audits and issues that were previously under-reported. Additionally, ensuring the trustworthiness of LLMs on challenging tasks [Chen et al., 2021; Nakano et al., 2021; Li et al., 2022; Menick et al., 2022] poses a crucial challenge. Scalable oversight [Amodei et al., 2016] aims to evaluate models effectively on tasks that are difficult for humans and suggests using AI for assistance. [Saunders et al., 2022] explore providing critiques of model outputs as a form of assistance, demonstrating that model-generated critiques help humans identify flaws they would otherwise overlook.
5.4 Pros and Cons

The advantage of human-LLM collaborative evaluation lies in striking a balance between efficiency and cost, as demonstrated by COEVAL [Li et al., 2023b]. Humans and AI also have complementary strengths. For instance, AdaTest++ [Rastogi et al., 2023] empowers users to consistently exercise their own strengths throughout the auditing process while benefiting substantially from the LLM: the users who generated the most topics relied heavily on LLM suggestions while employing their contextual reasoning and semantic understanding to vigilantly update their mental models and identify model failures.

There are, however, drawbacks. The evaluation results of LLMs may be sensitive to the format used to query the model, and users might require additional support for prompt writing [Li et al., 2023b; Rastogi et al., 2023]. Furthermore, the current ability to assess confidence is not strong enough, making it hard to determine when to trust the LLM. Finally, a certain level of human supervision is still necessary, which makes this approach less convenient and less cost-effective than fully automated evaluation.
6 Conclusions and Future Trends

From the above review of studies on LLM-based NLG evaluation, we find that the four categories of approaches have their respective strengths and weaknesses, and that most existing work concentrates on prompting LLMs. In view of this, we offer some suggestions for future directions in this field.

Unified benchmarks for LLM-based NLG evaluation approaches. As mentioned above, the studies that fine-tune LLMs into specialized evaluation models each use different settings and data during testing, which makes them incomparable. For prompting LLMs, some human judgments on shared NLG tasks are publicly available, such as SummEval for summarization, but the existing human judgments have many problems. First, most existing datasets involve only one type of NLG task and a single human evaluation method (e.g., scoring), making it difficult to assess LLMs' performance across different tasks, or with different evaluation methods on the same task. Second, many of the texts in these datasets were generated by outdated models (such as pointer networks) and do not include texts generated by more advanced LLMs. Lastly, many human evaluation datasets are too small. There is an urgent need for a benchmark of large-scale, high-quality human evaluation data covering various NLG tasks and evaluation methods.

NLG evaluation for low-resource languages and new task scenarios. Almost all existing research focuses on English data, and it is doubtful whether LLMs have a similar level of NLG evaluation capability for texts in other languages, especially low-resource languages; as [Zhang et al., 2023a] point out, we should be more cautious about using LLMs to evaluate texts in non-Latin languages. In addition, existing research mainly targets traditional NLG tasks such as translation, summarization, and dialogue, whereas many new real-world scenarios come with different requirements and evaluation criteria. Research on low-resource languages and new task scenarios will provide a more comprehensive understanding of LLMs' evaluation capabilities.

Diverse forms of human-LLM collaborative NLG evaluation. As the literature reviewed above shows, there is still little research on collaborative evaluation between humans and LLMs. Neither humans nor LLMs are perfect, and each has its own strengths. Since the ultimate goal of NLG evaluation research is to assess text quality more accurately and efficiently, we believe that collaboration between humans and LLMs can achieve better results than purely human or purely automatic evaluation. Techniques from the field of human-computer interaction may bring new ways of implementing such collaboration. Moreover, what roles humans and LLMs should play in the evaluation, and how they can best complement each other, remain open research questions.

References
[Amodei et al., 2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. CoRR, abs/1606.06565, 2016.
[Anil et al., 2023] Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari, Steven Hand, Hadi Hashemi, Le Hou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Ittycheriah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose Slone, Daniel Smilkov, David R. So, Daniel Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu. Palm 2 technical report. CoRR, abs/2305.10403, 2023.
[Bai et al., 2023] Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, and Lei Hou. Benchmarking foundation models with language-model-as-an-examiner. CoRR, abs/2306.04181, 2023.
[Blodgett et al., 2020] Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna M. Wallach. Language (technology) is power: A critical survey of "bias" in NLP. In ACL, 2020.
[Brown et al., 2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In NeurIPS, 2020.
[Cegin et al., 2023] Jan Cegin, Jakub Simko, and Peter Brusilovsky. ChatGPT to replace crowdsourcing of paraphrases for intent classification: Higher diversity and comparable model robustness. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1889–1905, Singapore, 2023. Association for Computational Linguistics.
[Chan et al., 2023] Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate. CoRR, abs/2308.07201, 2023.
[Chang et al., 2023] Yapei Chang, Kyle Lo, Tanya Goyal, and Mohit Iyyer. Booookscore: A systematic exploration of book-length summarization in the era of llms. CoRR, abs/2310.00785, 2023.
[Chen et al., 2021] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021.
[Chiang and Lee, 2023a] David Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations? In ACL (1), 2023.
[Chiang and Lee, 2023b] David Cheng-Han Chiang and Hung-yi Lee. A closer look into using large language models for automatic evaluation. In EMNLP (Findings), 2023.
[Chung et al., 2022] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Y. Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned language models. CoRR, abs/2210.11416, 2022.
[Cohen et al., 2023] Roi Cohen, May Hamri, Mor Geva, and Amir Globerson. LM vs LM: detecting factual errors via cross examination. In EMNLP, 2023.
[Deriu et al., 2021] Jan Deriu, Álvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, and Mark Cieliebak. Survey on evaluation methods for dialogue systems. Artif. Intell. Rev., 54(1):755–810, 2021.
[Durmus et al., 2020] Esin Durmus, He He, and Mona T. Diab. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In ACL, 2020.
[Eddine et al., 2022] Moussa Kamal Eddine, Guokan Shang, Antoine J.-P. Tixier, and Michalis Vazirgiannis. Frugalscore: Learning cheaper, lighter and faster evaluation metrics for automatic text generation. In ACL (1), 2022.
[ES et al., 2023] Shahul ES, Jithin James, Luis Espinosa Anke, and Steven Schockaert. RAGAS: automated evaluation of retrieval augmented generation. CoRR, abs/2309.15217, 2023.
[Freitag et al., 2022] Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, Eleftherios Avramidis, Tom Kocmi, George F. Foster, Alon Lavie, and André F. T. Martins. Results of WMT22 metrics shared task: Stop using BLEU - neural metrics are better and more robust. In WMT, 2022.
[Fu et al., 2023a] Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. Gptscore: Evaluate as you desire. CoRR, abs/2302.04166, 2023.
[Fu et al., 2023b] Xue-Yong Fu, Md. Tahmid Rahman Laskar, Cheng Chen, and Shashi Bhushan TN. Are large language models reliable judges? A study on the factuality evaluation capabilities of llms. CoRR, abs/2311.00681, 2023.
[Gao et al., 2023] Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, and Xiaojun Wan. Human-like summarization evaluation with chatgpt. CoRR, abs/2304.02554, 2023.
[Gilardi et al., 2023] Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. Chatgpt outperforms crowd-workers for text-annotation tasks. CoRR, abs/2303.15056, 2023.
[Gong and Mao, 2023] Peiyuan Gong and Jiaxin Mao. Coascore: Chain-of-aspects prompting for NLG evaluation. CoRR, abs/2312.10355, 2023.
[Guan et al., 2023] Jian Guan, Jesse Dodge, David Wadden, Minlie Huang, and Hao Peng. Language models hallucinate, but may excel at fact verification. CoRR, abs/2310.14564, 2023.
[Hada et al., 2023] Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, and Sunayana Sitaram. Are large language model-based evaluators the solution to scaling up multilingual evaluation? CoRR, abs/2309.07462, 2023.
[Hasanbeig et al., 2023] Hosein Hasanbeig, Hiteshi Sharma, Leo Betthauser, Felipe Vieira Frujeri, and Ida Momennejad. ALLURE: auditing and improving llm-based evaluation of text using iterative in-context-learning. CoRR, abs/2309.13701, 2023.
[He et al., 2023a] Hangfeng He, Hongming Zhang, and Dan Roth. Socreval: Large language models with the socratic method for reference-free reasoning evaluation. CoRR, abs/2310.00074, 2023.
[He et al., 2023b] Tianxing He, Jingyu Zhang, Tianle Wang, Sachin Kumar, Kyunghyun Cho, James R. Glass, and Yulia Tsvetkov. On the blind spots of model-based evaluation metrics for text generation. In ACL (1), 2023.
[Hu et al., 2023] Yebowen Hu, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Hassan Foroosh, and Fei Liu. Decipherpref: Analyzing influential factors in human preference judgments via GPT-4. In EMNLP, 2023.
[Jain et al., 2023] Sameer Jain, Vaishakh Keshava, Swarnashree Mysore Sathyendra, Patrick Fernandes, Pengfei Liu, Graham Neubig, and Chunting Zhou. Multi-dimensional evaluation of text summarization with in-context learning. In ACL (Findings), 2023.
[Ji et al., 2023] Yunjie Ji, Yan Gong, Yiping Peng, Chao Ni, Peiyan Sun, Dongyu Pan, Baochang Ma, and Xiangang Li. Exploring chatgpt's ability to rank content: A preliminary study on consistency with human preferences. CoRR, abs/2303.07610, 2023.
[Jia et al., 2023] Qi Jia, Siyu Ren, Yizhu Liu, and Kenny Q. Zhu. Zero-shot faithfulness evaluation for text summarization with foundation language model. In EMNLP, 2023.
[Jiang et al., 2023] Dongfu Jiang, Yishan Li, Ge Zhang, Wenhao Huang, Bill Yuchen Lin, and Wenhu Chen. Tigerscore: Towards building explainable metric for all text generation tasks. CoRR, abs/2310.00752, 2023.
[Jones and Steinhardt, 2022] Erik Jones and Jacob Steinhardt. Capturing failures of large language models via human cognitive biases. In NeurIPS, 2022.
[Jung et al., 2022] Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, and Yejin Choi. Maieutic prompting: Logically consistent reasoning with recursive explanations. In EMNLP, 2022.
[Ke et al., 2023] Pei Ke, Bosi Wen, Zhuoer Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, Jie Tang, and Minlie Huang. Critiquellm: Scaling llm-as-critic for effective and explainable evaluation of large language model generation. CoRR, abs/2311.18702, 2023.
[Kim et al., 2023a] Joonghoon Kim, Saeran Park, Kiyoon Jeong, Sangmin Lee, Seung Hun Han, Jiyoon Lee, and Pilsung Kang. Which is better? exploring prompting strategy for llm-based metrics. CoRR, abs/2311.03754, 2023.
[Kim et al., 2023b] Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Minjoon Seo. Prometheus: Inducing fine-grained evaluation capability in language models. CoRR, abs/2310.08491, 2023.
[Kim et al., 2023c] Tae Soo Kim, Yoonjoo Lee, Jamin Shin, Young-Ho Kim, and Juho Kim. Evallm: Interactive evaluation of large language model prompts on user-defined criteria. CoRR, abs/2309.13633, 2023.
[Kocmi and Federmann, 2023a] Tom Kocmi and Christian Federmann. GEMBA-MQM: detecting translation quality error spans with GPT-4. In WMT, 2023.
[Kocmi and Federmann, 2023b] Tom Kocmi and Christian Federmann. Large language models are state-of-the-art evaluators of translation quality. In EAMT, 2023.
[Kotonya et al., 2023] Neema Kotonya, Saran Krishnasamy, Joel R. Tetreault, and Alejandro Jaimes. Little giants: Exploring the potential of small llms as evaluation metrics in summarization in the eval4nlp 2023 shared task. CoRR, abs/2311.00686, 2023.
[Kwon et al., 2023] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In SOSP, 2023.
[Leiter et al., 2023] Christoph Leiter, Juri Opitz, Daniel Deutsch, Yang Gao, Rotem Dror, and Steffen Eger. The eval4nlp 2023 shared task on prompting large language models as explainable metrics. CoRR, abs/2310.19792, 2023.
[Li et al., 2022] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022.
[Li et al., 2023a] Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, and Pengfei Liu. Generative judge for evaluating alignment. CoRR, abs/2310.05470, 2023.
[Li et al., 2023b] Qintong Li, Leyang Cui, Lingpeng Kong, and Wei Bi. Collaborative evaluation: Exploring the synergy of large language models and humans for open-ended generation evaluation. CoRR, abs/2310.19740, 2023.
[Li et al., 2023c] Ruosen Li, Teerth Patel, and Xinya Du. PRD: peer rank and discussion improve large language model based evaluations. CoRR, abs/2307.02762, 2023.
[Li et al., 2023d] Zongjie Li, Chaozheng Wang, Pingchuan Ma, Daoyuan Wu, Shuai Wang, Cuiyun Gao, and Yang Liu. Split and merge: Aligning position biases in large language model based evaluators. CoRR, abs/2310.01432, 2023.
[Lin and Chen, 2023] Yen-Ting Lin and Yun-Nung Chen. Llm-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models. In NLP4ConvAI, 2023.
[Lin, 2004] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, 2004.
[Liu et al., 2023a] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: NLG evaluation using gpt-4 with better human alignment. In EMNLP, 2023.
[Liu et al., 2023b] Yixin Liu, Alexander R. Fabbri, Jiawen Chen, Yilun Zhao, Simeng Han, Shafiq Joty, Pengfei Liu, Dragomir Radev, Chien-Sheng Wu, and Arman Cohan. Benchmarking generation and evaluation capabilities of large language models for instruction controllable summarization. CoRR, abs/2311.09184, 2023.
[Liu et al., 2023c] Yixin Liu, Alexander R. Fabbri, Pengfei Liu, Dragomir Radev, and Arman Cohan. On learning to summarize with large language models as references. CoRR, abs/2305.14239, 2023.
[Liu et al., 2023d] Yongkang Liu, Shi Feng, Daling Wang, Yifei Zhang, and Hinrich Schütze. Evaluate what you can't evaluate: Unassessable generated responses quality. CoRR, abs/2305.14658, 2023.
[Liu et al., 2023e] Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, and Qi Zhang. Calibrating llm-based evaluator. CoRR, abs/2309.13308, 2023.
[Liusie et al., 2023] Adian Liusie, Potsawee Manakul, and Mark J. F. Gales. Llm comparative assessment: Zero-shot nlg evaluation through pairwise comparisons using large language models. CoRR, abs/2307.07889, 2023.
[Lu et al., 2023] Qingyu Lu, Baopu Qiu, Liang Ding, Liping Xie, and Dacheng Tao. Error analysis prompting enables human-like translation evaluation in large language models: A case study on chatgpt. CoRR, abs/2303.13809, 2023.
[Luo et al., 2023] Zheheng Luo, Qianqian Xie, and Sophia Ananiadou. Chatgpt as a factual inconsistency evaluator for abstractive text summarization. CoRR, abs/2303.15621, 2023.
[Manakul et al., 2023] Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In EMNLP, 2023.
[Mendonça et al., 2023] John Mendonça, Patrícia Pereira, João Paulo Carvalho, Alon Lavie, and Isabel Trancoso. Simple LLM prompting is state-of-the-art for robust and multilingual dialogue evaluation. In Proceedings of The Eleventh Dialog System Technology Challenge, 2023.
[Menick et al., 2022] Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, H. Francis Song, Martin J. Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, and Nat McAleese. Teaching language models to support answers with verified quotes. CoRR, abs/2203.11147, 2022.
[Naismith et al., 2023] Ben Naismith, Phoebe Mulcaire, and Jill Burstein. Automated evaluation of written discourse coherence using GPT-4. In BEA@ACL, 2023.
[Nakano et al., 2021] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback. CoRR, abs/2112.09332, 2021.
[Ostyakova et al., 2023] Lidiia Ostyakova, Veronika Smilga, Kseniia Petukhova, Maria Molchanova, and Daniel Kornev. ChatGPT vs. crowdsourcing vs. experts: Annotating open-domain conversations with speech functions. In Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 242–254, Prague, Czechia, 2023. Association for Computational Linguistics.
[Ouyang et al., 2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In NeurIPS, 2022.
[Papineni et al., 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.
[Rastogi et al., 2023] Charvi Rastogi, Marco Túlio Ribeiro, Nicholas King, Harsha Nori, and Saleema Amershi. Supporting human-ai collaboration in auditing llms with llms. In AIES, 2023.
[Ribeiro and Lundberg, 2022] Marco Túlio Ribeiro and Scott M. Lundberg. Adaptive testing and debugging of NLP models. In ACL (1), 2022.
[Saha et al., 2023] Swarnadeep Saha, Omer Levy, Asli Celikyilmaz, Mohit Bansal, Jason Weston, and Xian Li. Branch-solve-merge improves large language model evaluation and generation. CoRR, abs/2310.15123, 2023.
[Saunders et al., 2022] William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. Self-critiquing models for assisting human evaluators. CoRR, abs/2206.05802, 2022.
[Shen et al., 2023] Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen, Yang You, and Lidong Bing. Large language models are not yet human-level evaluators for abstractive summarization. In EMNLP (Findings), 2023.
[Shu et al., 2023] Lei Shu, Nevan Wichers, Liangchen Luo, Yun Zhu, Yinxiao Liu, Jindong Chen, and Lei Meng. Fusion-eval: Integrating evaluators with llms. CoRR, abs/2311.09204, 2023.
[Sulem et al., 2018] Elior Sulem, Omri Abend, and Ari Rappoport. BLEU is not suitable for the evaluation of text simplification. In EMNLP, 2018.
[Sun et al., 2022] Tianxiang Sun, Junliang He, Xipeng Qiu, and Xuanjing Huang. Bertscore is unfair: On social bias in language model-based metrics for text generation. In EMNLP, 2022.
[Törnberg, 2023] Petter Törnberg. Chatgpt-4 outperforms experts and crowd workers in annotating political twitter messages with zero-shot learning. CoRR, abs/2304.06588, 2023.
[Touvron et al., 2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023.
[van der Lee et al., 2021] Chris van der Lee, Albert Gatt, Emiel van Miltenburg, and Emiel Krahmer. Human evaluation of automatically generated text: Current trends and best practice guidelines. Comput. Speech Lang., 67:101151, 2021.
[Varshney et al., 2023] Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jianshu Chen, and Dong Yu. A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation. CoRR, abs/2307.03987, 2023.
[Wang et al., 2023a] Jiaan Wang, Yunlong Liang, Fandong Meng, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. Is ChatGPT a good NLG evaluator? a preliminary study. In Proceedings of the 4th New Frontiers in Summarization Workshop, 2023.
[Wang et al., 2023b] Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. CoRR, abs/2305.17926, 2023.
[Wang et al., 2023c] Tianlu Wang, Ping Yu, Xiaoqing Ellen Tan, Sean O'Brien, Ramakanth Pasunuru, Jane Dwivedi-Yu, Olga Golovneva, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. Shepherd: A critic for language model generation. CoRR, abs/2308.04592, 2023.
[Wang et al., 2023d] Yaqing Wang, Jiepu Jiang, Mingyang Zhang, Cheng Li, Yi Liang, Qiaozhu Mei, and Michael Bendersky. Automated evaluation of personalized text generation using large language models. CoRR, abs/2310.11593, 2023.
[Wang et al., 2023e] Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, and Yue Zhang. Pandalm: An automatic evaluation benchmark for LLM instruction tuning optimization. CoRR, abs/2306.05087, 2023.
[Wang et al., 2023f] Zifan Wang, Kotaro Funakoshi, and Manabu Okumura. Automatic answerability evaluation for question generation. CoRR, abs/2309.12546, 2023.
[Wu and Aji, 2023] Minghao Wu and Alham Fikri Aji. Style over substance: Evaluation biases for large language models. CoRR, abs/2307.03025, 2023.
[Wu et al., 2023] Ning Wu, Ming Gong, Linjun Shou, Shining Liang, and Daxin Jiang. Large language models are diverse role-players for summarization evaluation. In NLPCC (1), 2023.
[Xie et al., 2023] Zhuohan Xie, Miao Li, Trevor Cohn, and Jey Han Lau. Deltascore: Fine-grained story evaluation with perturbations. In EMNLP (Findings), 2023.
[Xu et al., 2023] Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao Song, Markus Freitag, William Wang, and Lei Li. INSTRUCTSCORE: towards explainable text generation evaluation with automatic feedback. In EMNLP, 2023.
[Ye et al., 2023a] Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, and Minjoon Seo. FLASK: fine-grained language model evaluation based on alignment skill sets. CoRR, abs/2307.10928, 2023.
[Ye et al., 2023b] Xi Ye, Srinivasan Iyer, Asli Celikyilmaz, Veselin Stoyanov, Greg Durrett, and Ramakanth Pasunuru. Complementary explanations for effective in-context learning. In ACL (Findings), 2023.
[Yin and Neubig, 2022] Kayo Yin and Graham Neubig. Interpreting language models with contrastive explanations. In EMNLP, 2022.
[Yuan et al., 2021] Weizhe Yuan, Graham Neubig, and Pengfei Liu. Bartscore: Evaluating generated text as text generation. In NeurIPS, 2021.
[Yuan et al., 2024] Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Boyuan Pan, Heda Wang, and Kan Li. Batcheval: Towards human-like text evaluation. CoRR, abs/2401.00437, 2024.
[Zhang et al., 2020] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with BERT. In ICLR, 2020.
[Zhang et al., 2021] Yangjun Zhang, Pengjie Ren, and Maarten de Rijke. A human-machine collaborative framework for evaluating malevolence in dialogues. In ACL/IJCNLP (1), 2021.
[Zhang et al., 2022] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: open pre-trained transformer language models. CoRR, abs/2205.01068, 2022.
[Zhang et al., 2023a] Chen Zhang, Luis Fernando D'Haro, Yiming Chen, Malu Zhang, and Haizhou Li. A comprehensive analysis of the effectiveness of large language models as automatic dialogue evaluators. CoRR, abs/2312.15407, 2023.
[Zhang et al., 2023b] Xinghua Zhang, Bowen Yu, Haiyang Yu, Yangyu Lv, Tingwen Liu, Fei Huang, Hongbo Xu, and Yongbin Li. Wider and deeper LLM networks are fairer LLM evaluators. CoRR, abs/2308.01862, 2023.
[Zheng et al., 2023] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. CoRR, abs/2306.05685, 2023.
[Zhu et al., 2023] Lianghui Zhu, Xinggang Wang, and Xinlong Wang. Judgelm: Fine-tuned large language models are scalable judges. CoRR, abs/2310.17631, 2023.