[orcid=0000-0001-7571-6722] \cormark[1] \credit Conceptualization, Formal analysis, Investigation, Methodology, Writing - original draft

[orcid=0009-0004-8612-4078] \cormark[1] \credit Conceptualization, Formal analysis, Software, Methodology

[orcid=0000-0002-6299-5467] \cormark[1] \credit Resources, Formal analysis, Validation, Visualization, Writing - original draft

[orcid=0000-0002-1991-6698] \credit Investigation, Data curation

[orcid=0000-0002-3619-675X] \credit Writing - review & editing

[orcid=0000-0001-6075-4224] \credit Writing - review & editing

[orcid=0000-0002-0795-366X] \cormark[2] \credit Project administration, Writing - review & editing

[orcid=0000-0001-5128-5649] \cormark[2] \credit Funding acquisition, Writing - review & editing

[orcid=0000-0002-3491-5968] \credit Writing - review & editing

[1]organization=Tsinghua Shenzhen International Graduate School, Tsinghua University, city=Shenzhen, postcode=518055, state=Guangdong, country=China

[2]organization=School of Computer Science and Engineering, Central South University, city=Changsha, postcode=410083, state=Hunan, country=China

[3]organization=The Hong Kong University of Science and Technology (Guangzhou), city=Guangzhou, postcode=511453, state=Guangdong, country=China

[4]organization=Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), city=Shenzhen, postcode=518132, state=Guangdong, country=China

[5]organization=University of Illinois Chicago, city=Chicago, postcode=60607, state=Illinois, country=USA

\cortext[cor1]Indicates equal contribution.

\cortext[cor2]Corresponding authors.

Rethinking the Roles of Large Language Models in Chinese Grammatical Error Correction

Yinghui Li, Shang Qin, Haojing Huang, Yangning Li, Libo Qin, Xuming Hu, Wenhao Jiang, Hai-Tao Zheng, Philip S. Yu

Abstract

Recently, Large Language Models (LLMs) have been widely studied for their roles in various downstream NLP tasks. As a fundamental task in the NLP field, Chinese Grammatical Error Correction (CGEC) aims to correct all potential grammatical errors in the input sentences. Previous studies have shown that LLMs’ performance as correctors on CGEC remains unsatisfactory due to the challenging nature of the task. To help the CGEC field better adapt to the era of LLMs, we rethink the roles of LLMs in the CGEC task so that they can be better utilized and explored. Considering the rich grammatical knowledge stored in LLMs and their powerful semantic understanding capabilities, we utilize LLMs as explainers that provide explanation information to small CGEC models during error correction, aiming to enhance performance. We also use LLMs as evaluators to produce more reasonable CGEC evaluations, thus alleviating the problems caused by the subjectivity of the CGEC task. Our work is also an active exploration of how LLMs and small models can better collaborate in downstream tasks. Extensive experiments and detailed analyses on widely used datasets verify the effectiveness of our intuition and the proposed methods. (Footnote 1: Our code will be made public after peer review.)

keywords:

Natural Language Processing \sep Large Language Models \sep Chinese Grammatical Error Correction

{highlights}

LLMs are utilized as explainers to provide auxiliary information during CGEC.

We employ LLMs as evaluators to achieve more objective CGEC assessments.

Our work actively explores how LLMs and small models collaborate in downstream tasks.

Extensive experiments demonstrate the effectiveness of our methods on CGEC datasets.

1 Introduction

Large Language Models (LLMs) are currently among the most actively studied topics in the AI and NLP community. Owing to their unified paradigm for various tasks and their emergent abilities, an increasing number of works focus on how to better apply LLMs to downstream scenarios, such as sequence understanding Yu et al. (2023), financial analysis Wu et al. (2023), and medical healthcare Wang et al. (2023). In Chinese NLP research, Chinese Grammatical Error Correction (CGEC) has long been regarded as a fundamental task Ma et al. (2022). The CGEC task aims to correct all possible grammatical errors in the input sentence, which is challenging because it requires models to comprehensively understand the complex semantics of the text. In the era of LLMs, some works have explored the use of LLMs for CGEC Fang et al. (2023); Li et al. (2023). Their consensus is that, even with supervised fine-tuning on CGEC data, the performance of LLMs on the CGEC task is still unsatisfactory. The main reason is that the relatively free generation paradigm often makes the sentences generated by LLMs violate the minimum change principle pursued by CGEC. As a result, efforts to adapt and apply LLMs in the CGEC field have stalled.

To address this dilemma, our work rethinks how LLMs should be used to promote the development of the CGEC field. In recent GEC research, the subjectivity and explainability of GEC have received great attention Ye et al. (2023); Song et al. (2023); Kaneko and Okazaki (2023). As illustrated in Figure 1, a grammatically incorrect sentence can often be corrected in several different ways that keep its meaning unchanged and its grammar correct. Therefore, enabling evaluators to judge corrections comprehensively and flexibly remains an unsolved challenge. In addition, Figure 1 shows that the explanation of the incorrect sentence contains instructive information and knowledge for error correction. If we can obtain high-quality explanations of incorrect sentences, they should improve CGEC performance. The basis for high-quality explanations of ungrammatical sentences is rich grammatical knowledge, while flexible CGEC evaluation requires the evaluator to have comprehensive semantic understanding capabilities. Intuitively, for LLMs, the massive training corpus gives them sufficient grammatical knowledge, and the emergence phenomenon gives them excellent semantic understanding capabilities. More importantly, the two processes of explanation and evaluation are not restricted by the minimum change principle, so they leave enough room for the free generation paradigm of LLMs.

Figure 1: An example of the subjectivity and explainability of CGEC. The explanation is produced by ChatGPT.

Motivated by the above intuitions, we believe that LLMs can be leveraged to provide high-quality explanations and accurate evaluations for small CGEC models. Therefore, we propose an EXplanation-AugMented training framework (EXAM) and a SEmantic-incorporated Evaluation framework (SEE) for CGEC based on LLMs. Specifically, (1) EXAM mines broad explanation information (including error types, reference corrections, and error explanations) related to grammatically incorrect sentences from LLMs, and then uses the mined information to enhance the training of small models, thereby improving their CGEC performance. (2) SEE requires LLMs to weigh the edits annotated in the golden data against the evaluated model’s edits, checking that the latter do not alter the original semantics of the input sentence. This yields more accurate and comprehensive evaluation results that consider both grammar and semantics. Extensive experiments and detailed analyses demonstrate the effectiveness and competitiveness of our proposed methods. In summary, our technical contributions and impacts are fourfold:

• We propose SEE, which aims to empower the evaluation of more subjective CGEC tasks through the intervention of LLMs.

• We propose EXAM, which utilizes LLMs as explainers to enhance the training of small models. This approach enables small models not only to surpass LLMs on traditional metrics but also to demonstrate competitive performance under our proposed SEE.

• For the CGEC field, we reposition the roles of LLMs to give full play to their strengths and promote their adaptation to the CGEC task.

• For the LLM community, our work explores collaboration between LLMs and small models on downstream tasks and, to a certain extent, reveals how LLMs and small models can coexist and thrive in the future.

2 Related Work

In the era of LLMs, considering the superior performance of LLMs Liu et al. (2022); Dong et al. (2023); Liu et al. (2023); Li et al. (2023); Huang et al. (2023); Li et al. (2023), researchers have invested considerable effort in studying LLMs for GEC tasks Li et al. (2022a, b); Zhang et al. (2023); Ye et al. (2022); Ma et al. (2023).

First, some works evaluate LLMs on GEC Fang et al. (2023); Penteado and Perez (2023); Qu and Wu (2023); Li et al. (2023); Kwon et al. (2023); Ye et al. (2023); Huang et al. (2023); Davis et al. (2024). In general, GEC-related tasks are challenging for LLMs. There are many reasons for this, such as the constraint that the minimum change principle places on LLMs. To address these challenges, some researchers focus on training LLMs on GEC data Fan et al. (2023); Zhang et al. (2023); Su et al. (2023). However, even after supervised fine-tuning, the performance of LLMs does not show that they have fully adapted to the GEC field. For example, the $\text{F}_{0.5}$ scores reported by GrammarGPT Fan et al. (2023) still do not exceed 40.0. As a result, researchers have begun to ask whether LLMs can play other roles in the GEC field instead of directly acting as the corrector. Kaneko and Okazaki (2023) propose to improve GEC performance by letting LLMs predict edit spans. Östling et al. (2023) and Sottana et al. (2023) explore the potential of using LLMs as evaluators for English and Swedish GEC tasks. Song et al. (2023) and Kaneko and Okazaki (2023) propose the new task of grammar error explanation and demonstrate the ability of LLMs to explain grammatical errors. However, they do not go further and use the explanation information in training GEC models. To the best of our knowledge, our work is the first to comprehensively consider and design how to make full use of LLMs in the training and evaluation of small GEC models. More importantly, our work rethinks how LLMs and small models should coexist and progress together in the era of LLMs, contributing their respective strengths to the advancement of downstream tasks.

3 Motivation and Methodology

3.1 Motivation

Minimum Change Principle

GEC and CGEC research has long followed the “minimum change principle”: an ideal model should convert grammatically incorrect sentences into correct sentences with minimal changes or editing costs. However, with the development of deep learning and Pre-trained Language Models, growing model capabilities have come into conflict with this principle, because it limits, to a certain extent, the room in which a model can exercise those capabilities. Especially with the emergence of LLMs, the performance obtained by directly using LLMs to complete the GEC task is not satisfactory. Many observations and empirical results indicate that the key reason for the unsatisfactory performance of LLMs on CGEC is that their relatively free text generation mode is unsuited to the GEC task. For example, LLMs often produce sentences that are grammatically correct and semantically consistent with the erroneous input sentence, but whose surface form differs significantly from the input sentence. Such outputs are often penalized by traditional evaluation metrics, resulting in low measured performance for LLMs.

LLMs as Explainer

Given the limitations of directly employing LLMs as correctors due to the minimum change principle, can we adopt an alternative approach to leverage LLMs more effectively for CGEC and circumvent the constraints imposed by this principle? First, let’s consider what humans do when they encounter grammatical errors, particularly when they are unsure how to correct them. The most direct and effective solution is to turn to a teacher or grammar reference book. Then, the teacher or reference book would give specific explanations or reasons for grammatical errors to help humans make corrections successfully. Drawing inspiration from human actions, why can’t we consider LLMs as explainers similar to teachers or reference books? As mentioned in the previous paragraph, the fact that LLMs can generate grammatically correct sentences means that LLMs store rich grammatical knowledge. Therefore, we believe that if explanations related to error sentences can be obtained from LLMs and utilized in the training of small models, then these explanations embodying grammatical knowledge from LLMs can definitely enhance the performance of small models. In particular, the role of LLMs as explainers does not need to be limited by the minimum change principle, and it is a simple yet effective process for LLMs to use their own grammatical knowledge to explain incorrect sentences.

LLMs as Evaluator

Considering the subjective nature of the CGEC task, a sentence with grammatical errors often has different correction methods. We argue that the ideal evaluation that can truly reflect the CGEC performance should consider the correction results given by the model as comprehensively as possible. As long as the model provides a sentence that is consistent with the original meaning of the incorrect sentence and has no grammatical errors, its correction should be considered successful. Suppose we want to achieve this ideal evaluation from the perspective of dataset construction. In that case, we need to manually annotate the dataset with as many correct reference sentences corresponding to the incorrect sentences as possible. However, such an annotation process is expensive and time-consuming. Even though there are already multi-reference datasets such as MuCGEC Zhang et al. (2022), we still believe that automatic evaluation based on such datasets is not flexible enough because the fixed reference correct sentences of the dataset are still limited after all. Motivated by the process of teachers correcting students’ sentences with grammatical errors, why can’t we utilize LLMs as evaluators to play the role of a teacher reviewing grammatical errors? Intuitively, LLMs not only store rich grammatical knowledge but also have an excellent ability to perceive text semantics. Therefore, we believe that they are fully qualified to be flexible and excellent teachers (i.e., evaluators) who review the answers of models in the GEC task.

3.2 Explanation-Augmented Training

As introduced in the above section, we propose the EXplanation-AugMented training framework (EXAM) (as illustrated in Figure 2) to mine explanation information and grammatical knowledge from LLMs and inject them into small models, with the goal of using LLMs to enhance the performance of small models. Based on our understanding of the CGEC task, we divide the explanation information we want to obtain from LLMs into three categories (note that the “explanation” we consider here is the LLMs’ analysis of incorrect sentences in a broad sense):

Error Types

We believe that if the CGEC model knows the type of grammatical errors in the sentence to be corrected, it will help it reduce the search scope when correcting errors, thereby enabling it to make better corrections. Therefore, we ask LLMs to identify the error types based on the input sentences containing errors. Specifically, we pre-define types of common grammatical errors involving punctuation errors, spelling errors, word errors, syntax errors, etc. Then, we provide the defined error type schema along with the prompt to the LLMs, instructing them to choose only among the types we specified in the instruction prompt.
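The schema-constrained request described above can be sketched as follows. This is a minimal illustration, not the prompt used in this work (the actual prompts are in Appendix C): the schema below contains only the four categories named in the text, and the prompt wording and the helper names `build_error_type_prompt` and `parse_error_types` are assumptions.

```python
from typing import List, Sequence

# Hypothetical schema: only the four categories named in the text.
# The full list of predefined error types is not reproduced here.
ERROR_TYPES = ("punctuation error", "spelling error", "word error", "syntax error")


def build_error_type_prompt(sentence: str, schema: Sequence[str] = ERROR_TYPES) -> str:
    """Build an instruction that restricts the LLM to the predefined schema.

    The wording is illustrative only; it is not the prompt from Appendix C.
    """
    options = "; ".join(schema)
    return (
        "Identify the grammatical error types in the sentence below. "
        f"Choose only from: {options}.\n"
        f"Sentence: {sentence}\n"
        "Answer with a comma-separated list of types."
    )


def parse_error_types(llm_answer: str, schema: Sequence[str] = ERROR_TYPES) -> List[str]:
    """Keep only labels that belong to the schema, in order, without duplicates."""
    allowed = set(schema)
    kept: List[str] = []
    for raw in llm_answer.split(","):
        label = raw.strip().lower()
        if label in allowed and label not in kept:
            kept.append(label)
    return kept
```

Filtering the answer against the schema is one simple way to guard against an LLM that returns a label outside the specified types despite the instruction.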

References

We observe that LLMs have a notable ability to generate correct sentences from incorrect ones, but the sentences they produce are not highly controllable. Although the sentences corrected by LLMs cannot be used as the final result, we believe they should serve as intermediate references for small models. Using corrections from LLMs as references can provide valuable cues to the small models, thereby enhancing their performance. Therefore, we also guide LLMs to make corrections they think are reasonable for the incorrect sentences and send the corrections provided by LLMs as references to the small model.

Explanations

To obtain high-quality explanations from LLMs, we define three dimensions of criteria to constrain LLMs: (1) Fluency aims to ensure that the explanation text generated by LLMs has no grammatical errors and is fluent in expression; (2) Rationality requires LLMs to explain grammatical errors as clearly and naturally as possible; (3) Comprehensiveness is to ensure that all grammatical errors in the incorrect sentences can be explained as much as possible. Additionally, we also ask LLMs to rank multiple grammatical errors in a sentence according to error severity, that is, to generate explanations for important errors first.

After LLMs explain the samples in the dataset, we concatenate the obtained error types, references, and explanations to the front of the original input sentences. We then send the combined text to the small CGEC models for their training or inference. In summary, the design of EXAM is simple and intuitive. LLMs and small models each perform their respective duties and give full play to their advantages. The stored grammatical knowledge of LLMs is mined without additional fine-tuning. The small models take advantage of the alignment of supervised learning to downstream tasks with low training costs and obtain guidance from LLMs’ task-related knowledge.
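The concatenation step can be sketched as below. This is a minimal sketch under stated assumptions: the field labels, the `[SEP]` separator string, and the function name `build_exam_input` are hypothetical, since the text only specifies that the three kinds of information are placed in front of the original input sentence.

```python
from typing import Sequence


def build_exam_input(
    sentence: str,
    error_types: Sequence[str],
    reference: str,
    explanation: str,
    sep: str = " [SEP] ",  # hypothetical separator, not specified in the text
) -> str:
    """Prepend LLM-provided error types, reference, and explanation to the input."""
    parts = [
        "Error types: " + ", ".join(error_types),
        "Reference: " + reference,
        "Explanation: " + explanation,
        "Input: " + sentence,  # the original sentence comes last
    ]
    return sep.join(parts)
```

The same function would be applied to both training and test samples, so that the small model sees inputs in one consistent layout.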

3.3 Semantic-incorporated Evaluation

To address the issue that traditional CGEC evaluation cannot flexibly adapt to the subjective nature of CGEC because it relies entirely on dataset annotation, we design the SEmantic-incorporated Evaluation framework (SEE). This framework utilizes LLMs to comprehensively evaluate CGEC by considering complex semantics.

Specifically, we first compare and align each error sentence with its predicted sentence to obtain the predicted edits, that is, the changes the predicted text makes to the incorrect sentence. We then require LLMs to classify each predicted edit into one of three categories, based on grammatical analysis and semantic understanding of the error sentence, golden sentence, and predicted sentence: (1) Correct Edit ( $\mathbf{N}_{\text{CE}}$ ) indicates that LLMs judge the predicted edit to be effective in correcting the grammatical errors of the original sentence; (2) Wrong Edit ( $\mathbf{N}_{\text{WE}}$ ) signifies that LLMs judge the predicted edit to be invalid and unable to correct grammatical errors; (3) Reasonable Edit ( $\mathbf{N}_{\text{RE}}$ ) refers to model edits that are not included in the golden annotations but that neither introduce new grammatical errors nor affect the original semantics of the sentence. Usually, this type of edit involves intonation particles and might be classified as an incorrect edit by traditional metrics because it is not accounted for in the dataset annotations. These three categories show that, unlike traditional evaluation metrics, LLMs do not require precise text matching to determine whether the predicted edit exists in the golden edit set. Instead, the validity of the predicted edit is assessed more flexibly, taking the semantics of the text into account more comprehensively. In addition, to make LLMs’ judgments on edits more accurate, we also input the explanation information obtained in EXAM into the LLMs when SEE evaluates.
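The alignment preprocessing can be illustrated with a character-level diff. The sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in; the alignment tool actually used in this work is not named in the text, and the function name `extract_edits` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def extract_edits(source: str, prediction: str) -> List[Tuple[str, str, str]]:
    """Return character-level edits that turn `source` into `prediction`.

    Each edit is (operation, source_span, replacement), where operation is one
    of "replace", "delete", or "insert" as reported by difflib.
    """
    # autojunk=False disables difflib's heuristic for long sequences with
    # many repeated elements, so no characters are skipped as "junk".
    matcher = SequenceMatcher(a=source, b=prediction, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # unchanged spans are not edits
            edits.append((tag, source[i1:i2], prediction[j1:j2]))
    return edits
```

Each returned edit would then be passed to the evaluator LLM, together with the error, golden, and predicted sentences, to be labeled as a correct, wrong, or reasonable edit.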

Based on the above three values derived from LLMs, we can calculate Precision, Recall, and $\text{F}_{0.5}$ scores as follows:

$$\text{P}=\frac{\mathbf{N}_{\text{CE}}}{\mathbf{N}_{\text{CE}}+\mathbf{N}_{\text{WE}}}, \tag{1}$$

$$\text{R}=\frac{\mathbf{N}_{\text{CE}}}{\mathbf{N}_{\text{golden}}}, \tag{2}$$

$$\text{F}_{0.5}=\frac{(1+0.5^{2})\times\text{P}\times\text{R}}{0.5^{2}\times\text{P}+\text{R}}, \tag{3}$$

where $\mathbf{N}_{\text{golden}}$ is the length of the golden edit set for the incorrect sentence. The $\text{F}_{0.5}$ score is widely used in GEC-related studies because GEC is an application that pays more attention to precision. Furthermore, to better explain the mechanism of SEE, we provide an evaluation example in Figure 3.
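Equations (1) to (3) translate directly into a few lines of code. The sketch below is illustrative: the function name `see_scores` and the choice to return 0.0 when a denominator is zero are assumptions, as the text does not say how empty edit sets are handled.

```python
from typing import Tuple


def see_scores(n_ce: int, n_we: int, n_golden: int, beta: float = 0.5) -> Tuple[float, float, float]:
    """Compute precision, recall, and F_beta from LLM-judged edit counts.

    n_ce:     number of Correct Edits
    n_we:     number of Wrong Edits
    n_golden: size of the golden edit set
    Reasonable Edits (N_RE) appear in none of Eqs. (1)-(3), so they are
    not an argument here.
    """
    # Returning 0.0 for empty denominators is an assumption of this sketch.
    precision = n_ce / (n_ce + n_we) if (n_ce + n_we) > 0 else 0.0
    recall = n_ce / n_golden if n_golden > 0 else 0.0
    denom = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f_beta
```

For example, with 6 correct edits, 2 wrong edits, and 10 golden edits, the formulas give P = 0.75, R = 0.60, and F_0.5 = 0.5625 / 0.7875 ≈ 0.714. Because N_RE does not enter Eq. (1), a reasonable edit neither raises nor lowers precision.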

To enable LLMs to perform the tasks we designed for EXAM and SEE, we input both prompts and task demonstration examples into the LLMs to facilitate their adherence to our instructions through in-context learning. Due to the limitation of pages, the specific contents of our designed prompts for instructing LLMs to accomplish corresponding goals are presented in Appendix C.

4 Experiments

4.1 Experiment Setup

Datasets

We mainly use the HSK dataset Zhang (2009) as training data. In our experiments, there are two settings for the use of training data: (1) Full HSK data, that is, using all 156,870 samples for model training; (2) Sampled HSK data, we randomly sample approximately 10% of the HSK data, that is, 15,000 samples for model training. In terms of test data, the CGEC data can be divided into two types of test data according to the source of the grammatical error sentences, namely Chinese-as-Second-Language (CSL) and Chinese native speaker data. To ensure the breadth of our experiment, we select the NLPCC test data Zhao et al. (2018) which is the CSL data, and the NaCGEC benchmark Ma et al. (2022) which is Chinese native speaker data as the test sets of our experiment. The NLPCC test data contains 2,000 samples and NaCGEC contains 5,869 incorrect sentences.
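The sampled setting can be reproduced in spirit with a seeded random draw without replacement. This is a minimal sketch: the seed value and the function name `sample_training_subset` are hypothetical, since the text does not state how the 15,000 samples were drawn beyond "randomly".

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_training_subset(samples: Sequence[T], k: int = 15000, seed: int = 42) -> List[T]:
    """Draw k distinct training samples; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)  # local generator, does not touch global state
    return rng.sample(samples, k)
```

With 156,870 samples in total, 15,000 samples correspond to about 9.6% of the data, which matches the "approximately 10%" stated above.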

Evaluation Metrics

To ensure the comparability of our experiments with previous CGEC works, in addition to using our own designed SEE to evaluate P/R/ $\text{F}_{0.5}$ , we also report the widely used traditional word/character-level P/R/ $\text{F}_{0.5}$ . In particular, following previous work Zhang et al. (2022), we apply the MaxMatch scorer Dahlmeier and Ng (2012) and the PKUNLP word segmentation tool Zhao et al. (2018) to obtain the word-level performance. In addition, to verify the effectiveness of our designed EXAM, we conduct human evaluation experiments to measure the real performance of the models from a human perspective.

Baselines and Base Models

The current mainstream CGEC models are mainly divided into two categories, namely Seq2Seq and Seq2Edit models. Since our EXAM framework is model-agnostic, we select the representative Seq2Seq and Seq2Edit models as baselines: (1) BART-Large Katsumata and Komachi (2020) and mT5-Base Xue et al. (2021) are Seq2Seq models for text generation and can be straightforwardly trained for CGEC; (2) GECToR-Chinese Omelianchuk et al. (2020) is the most widely used Seq2Edit method for CGEC. In addition, we select GPT-3.5-Turbo OpenAI (2023) and Qwen-72B-Chat Alibaba (2023) as the explainer-LLMs respectively. As for the evaluator-LLMs in SEE, we recommend the most advanced GPT-4-Turbo OpenAI (2023).

LLMs as Correctors

We selected two LLMs as Correctors to serve as baselines for comparison with our method. Specifically, we chose Qwen-72B-Chat and GPT-3.5-Turbo as our LLMs. We crafted a detailed prompt to ensure the LLMs deeply understood the task’s significance when directly correcting Chinese grammatical errors (See Appendix A). Additionally, we experimented with in-context learning to enhance the performance of the LLMs. The experimental results and analysis of “LLMs as Correctors” are presented in Appendix B.

Training Data	Model	Word-Level P	Word-Level R	Word-Level $\textbf{F}_{0.5}$	Character-Level P	Character-Level R	Character-Level $\textbf{F}_{0.5}$	SEE P	SEE R	SEE $\textbf{F}_{0.5}$
None	GPT-3.5-Turbo	24.36	28.01	25.01	27.71	29.19	27.99	53.82	30.14	46.51
None	Qwen-72B-Chat	27.88	32.85	28.75	32.42	34.97	32.90	67.20	35.01	56.76
Sampled (15K)	mT5-Base	16.10	8.93	13.87	30.25	8.77	20.30	58.36	9.89	29.47
Full (156K)	mT5-Base	24.08	16.74	22.14	38.37	17.14	30.75	67.37	19.37	45.05
Sampled (15K)	w/ EXAM (GPT)	25.21^↑	17.76^↑	23.26^↑	39.04^↑	18.16^↑	31.74^↑	69.29^↑	20.27^↑	46.70^↑
Sampled (15K)	w/ EXAM (Qwen)	26.41^↑	20.57^↑	25.00^↑	38.76^↑	21.81^↑	33.55^↑	69.76^↑	22.63^↑	49.25^↑
Sampled (15K)	BART-Large	19.46	14.77	18.30	32.07	13.67	25.27	62.94	12.18	34.33
Full (156K)	BART-Large	28.35	22.30	26.89	39.10	22.75	34.19	63.16	17.31	41.29
Sampled (15K)	w/ EXAM (GPT)	28.33^↑	23.38^↑	27.17^↑	39.61^↑	23.87^↑	35.00^↑	68.55^↑	23.31^↑	49.38^↑
Sampled (15K)	w/ EXAM (Qwen)	27.91^↑	22.24^↑	26.55^↑	40.01^↑	22.90^↑	34.81^↑	62.94^↑	22.18^↑	46.02^↑
Sampled (15K)	GECToR-Chinese	10.85	6.40	9.53	29.49	4.65	14.26	55.60	4.41	16.74
Full (156K)	GECToR-Chinese	18.26	10.99	16.12	27.03	11.99	21.60	48.32	12.21	30.36
Sampled (15K)	w/ EXAM (GPT)	18.09^↑	12.74^↑	16.69^↑	27.53^↑	12.71^↑	22.32^↑	49.46^↑	12.05^↑	30.51^↑
Sampled (15K)	w/ EXAM (Qwen)	17.31^↑	12.06^↑	15.92^↑	25.95^↑	11.63^↑	20.82^↑	48.98^↑	11.49^↑	29.63^↑

Table 1: Performance of various models on the NLPCC test set. Note that 15K and 156K represent the amount of HSK data. ^↑ means that EXAM has improved performance compared to the baselines with the same training data.

Implementation Details

We utilize Chinese-BART-Large Shao et al. (2021), Mengzi-T5-Base (Chinese) Zhang et al. (2021), and Chinese-Struct-Bert-Large Wang et al. (2020) to initialize the small models. For open-source LLMs, we run inference on 4 NVIDIA A100 GPUs. For closed-source LLMs, we access them directly through the official APIs. It is worth noting that in all our reported experiments, EXAM provides only one set of error type/reference/explanation information for each incorrect sentence. Because our experiments are intended only as verification, researchers seeking better performance can obtain more explanation information to enhance the small models in EXAM. The specific prompts used by our method are in Appendix C, and other implementation details and hyperparameter selection are in Appendix D.

Method	Word- $\text{F}_{0.5}$	Char- $\text{F}_{0.5}$
BART-Large	18.30	25.27
+ Error Types	21.74^↑	29.12^↑
+ References	23.88^↑	33.49^↑
+ Explanations	21.52^↑	29.84
+ Error Types + References	24.21^↑	33.66^↑
+ Error Types + Explanations	23.29^↑	32.54^↑
+ References + Explanations	25.18^↑	33.74^↑
BART-Large w/ EXAM (GPT)	27.17	35.00

Table 2: Ablation results for fine-grained explanation information. The training data for all models is 15K sampled HSK data. The test data is NLPCC. Note that the BART-Large w/ EXAM (GPT) is equivalent to BART-Large+Error Types+References+Explanations.

4.2 Main Results

Our main results on NLPCC are presented in Table 1. We also provide the main results and analyses on NaCGEC in Appendix E and Table 6.

Main Results of EXAM

From Table 1, we observe that: (1) With the same amount of training data, EXAM generally brings significant improvements to all baselines under all evaluation metrics. (2) With only 10% of the labeled training data, small models enhanced by EXAM achieve performance equivalent to or better than that of training with the full amount of data. (3) The model-agnostic nature of EXAM enables it to bring stable gains regardless of which LLM is selected and for small models of both Large and Base scale.

Main Results of SEE

From Table 1, we see that: (1) The evaluation results of SEE are broadly consistent in trend with traditional metrics, which supports the validity of SEE. (2) For the results of LLMs in particular, SEE yields scores that differ substantially from those obtained by traditional metrics, which indicates that SEE is more suitable for GEC evaluation in the era of LLMs. Note that the base model of SEE is GPT-4-Turbo, which is different from the evaluated LLMs, so it does not cause unfair evaluation. Moreover, we hypothesize that our SEE evaluation metric aligns more closely with human judgment, which is discussed in Section 4.3.3. Under SEE, small models trained with EXAM exhibit performance competitive with LLMs. This could be more meaningful in real-world scenarios, as deploying a small model is more cost-effective and offers faster response times.

4.3 Analyses and Discussion

4.3.1 The Impact of Fine-grained Explanation Information on EXAM

The main results of EXAM are derived from three kinds of information from LLMs: error types, references, and explanations. Therefore, it is necessary to conduct ablation studies on the three kinds of information to assess their respective contributions to EXAM. As shown in Table 2, we conduct ablation experiments on NLPCC test data with GPT-3.5-Turbo as the base model of EXAM and BART-Large as the enhanced small model. Each type of information brings significant improvements to BART-Large when used individually, which supports our choice of information to obtain from LLMs. In particular, the references bring the greatest improvement for the small model, which shows that the corrections made by LLMs provide useful guidance and that a good reference correction brings the most direct gain to the small model. Furthermore, when the types of information are used in pairs, performance improves further compared to using a single type. This shows that the three types of information we designed are compatible and do not interfere with each other.
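The ablation grid in Table 2 covers every subset of the three information types. A small sketch of how such a grid can be enumerated is given below; the function name `ablation_settings` is hypothetical and the code only lists the settings, it does not train any model.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

INFO_TYPES = ("Error Types", "References", "Explanations")


def ablation_settings(info_types: Sequence[str] = INFO_TYPES) -> List[Tuple[str, ...]]:
    """Enumerate all subsets of the information types, smallest first."""
    settings: List[Tuple[str, ...]] = []
    for size in range(len(info_types) + 1):
        settings.extend(combinations(info_types, size))
    return settings
```

The empty subset corresponds to plain BART-Large, the three singletons and three pairs to the "+" rows, and the full set to BART-Large w/ EXAM (GPT), giving the eight rows of Table 2.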

Method	Word- $\text{F}_{0.5}$	Char- $\text{F}_{0.5}$
BART-Large	18.30	25.27
Train (No gold) / Test (No gold)	27.17^-	35.00^-
Train (Gold) / Test (No gold)	21.57^↓	28.93^↓
Train (No gold) / Test (Gold)	25.98^↓	37.56^↑
Train (Gold) / Test (Gold)	43.10^↑	60.40^↑
BART-Large w/ EXAM (GPT)	27.17	35.00

Table 3: The impact of golden annotation information. The training data is 15K sampled HSK data. The test data is NLPCC. Note that the BART-Large w/ EXAM (GPT) is equivalent to Train (No gold) / Test (No gold).

4.3.2 The Impact of Golden Annotation Information on EXAM

To further explore the performance upper bound of EXAM, we input the golden sentences annotated in the dataset into the LLMs when they generate training and test data for the small model, and observe how the performance of the small model changes. In other words, we want to observe how the quality of the explanation information generated by LLMs changes when they are provided with golden sentences as input. In Table 3, we are surprised to find that when we add golden sentences in only one of the two processes (generating training data or generating test data), the model performance generally declines compared to not adding golden sentences in either process (i.e., Train (No gold) / Test (No gold)). This is an interesting and counter-intuitive phenomenon, and we believe it highlights the gap between the generative paradigm of LLMs and the golden sentences annotated in the dataset. If LLMs see golden sentences in only one of the training and testing stages, the explanation information they generate in that stage differs significantly from what they would typically produce on their own. This discrepancy creates a gap between the training and test data of the small model, leading to performance degradation. This also explains the large performance gain when golden sentences are input to LLMs in both the training and testing processes: in this case, LLMs generate sentences similar to the golden sentences in both training data and test data.

4.3.3 Human Evaluation for SEE

The design motivation of SEE is to use LLMs to bring CGEC evaluation closer to the human perspective. Therefore, we conduct human evaluation experiments to observe whether SEE or traditional metrics are closer to human judgment. Specifically, we randomly select 200 test samples from NLPCC and have three annotators independently evaluate the models’ correction results. We calculate the average P/R/ $\text{F}_{0.5}$ scores of human evaluation based on the judgments of the three annotators. From Figure 4, we see that: (1) For various models, SEE’s evaluation is closer to human evaluation than traditional evaluation, which shows that our designed SEE measures CGEC performance more realistically than traditional evaluation. (2) SEE’s evaluation of LLMs differs very little from human evaluation, indicating that SEE is more suitable for the evaluation of LLMs. (3) Unlike the cases where evaluation results for small models fall below human evaluation, SEE’s evaluation of LLMs can slightly surpass human evaluation results. A possible reason is that SEE relies on another LLM (i.e., GPT-4-Turbo) for its evaluation process, which may indicate better mutual understanding among LLMs.

Error Sentence	这段话给我们有道理的，虽然现在黑暗，但等着、忍着，光明会到来的。
Golden Sentence	这段话是有道理的，虽然现在黑暗，但等着、忍着，光明就会到来。
Golden Sentence (English)	This paragraph is reasonable. Although it is dark now, if we wait and endure, the light will come.
Error Type (GPT-3.5-Turbo)	标点误用，句式杂糅 (Misuse of punctuation, mixed sentence structures)
Reference (GPT-3.5-Turbo)	这段话给我们的道理：虽然现在黑暗，但等着、忍着，光明会到来的。
Explanation (GPT-3.5-Turbo)	在这个句子中，动词“给我们”和“有道理”成分赘余，此处改为“给我们的道理”，并且下句为道理的内容，因此不应该使用“，”而应该使用表示承接的“：”。
Error Type (Qwen-72B-Chat)	词语冗余, 逻辑不通 (Redundant words, illogical logic)
Reference (Qwen-72B-Chat)	这段话给我们提供了有道理的启示，虽然现在处于黑暗时期，但只要坚持等待、忍受，光明总会到来的。
Explanation (Qwen-72B-Chat)	“给我们”与“有道理的”都表示“这段话”的意义，造成了词语冗余。应将“给我们”改为“提供了”。接着，“虽然现在黑暗，但等着、忍着，光明会到来的”缺乏逻辑连贯性，应补充关联词语“虽然”。

Table 4: We mark changes made by LLMs in orange and poor explanations given by LLMs in red.

4.4 Case Observation

To verify our motivation for using LLMs as explainers and to illustrate the explanation information generated by EXAM, Table 4 presents cases in which GPT-3.5-Turbo and Qwen-72B-Chat each act as the explainer. As seen in Table 4, although the two LLMs make different error-type judgments, both provide reasonable explanations for their respective judgments. Regarding the reference corrections they give, Qwen-72B-Chat leans more toward free generation than GPT-3.5-Turbo, and we find the corrected sentence generated by Qwen-72B-Chat more fluent and reasonable. As for the explanations of the grammatical errors in the incorrect sentence, both LLMs give explanations of reasonable quality. Although there are some minor flaws, overall the explanations can help humans or small models improve. Additionally, we include more cases of LLM explanations and evaluations in the data supplementary materials.

5 Conclusion

In this paper, focusing on the dilemma that LLMs cannot achieve satisfactory results as correctors on CGEC, we rethink how LLMs should be effectively utilized in the CGEC task. Our goal is to fully exploit the rich grammatical knowledge and powerful semantic understanding ability of LLMs while bypassing the main reason why LLM correctors are ill-suited to CGEC, namely the minimum change principle. To this end, we propose the training framework EXAM, which uses LLMs as explainers to enhance CGEC small models, and the novel evaluation method SEE, which uses LLMs as evaluators to give a more reasonable evaluation of the CGEC task. Extensive empirical results and analyses show that our work is a meaningful exploration of how LLMs and small models can coexist and make progress together on downstream tasks such as CGEC.

Limitations

Currently, the main limitation of our work is its language scope. GEC is practically significant in many languages, so applying our methods to other languages would be valuable. The main reason we did not apply our methods to languages such as English is that the types of grammatical errors and the grammatical rules that CGEC and English GEC (EGEC) focus on differ in many ways. Therefore, the prompts of EXAM and SEE need to be re-customized for the English scenario. The purpose of our paper is to rethink how LLMs should be appropriately utilized in the GEC field, and changing prompts to adapt to new languages is not the main technical contribution we pursue. In the future, to enhance the impact of our work and serve a wider community, we will extend EXAM and SEE to the English scenario.

Ethics Statement

The data and models (including LLMs) that we use are all publicly available academic resources. For closed-source LLMs, we paid for the required API access, so our work raises no ethical issues regarding data or models.

Acknowledgments

This research is supported by the National Natural Science Foundation of China (Grant No. 62276154), the Research Center for Computer Network (Shenzhen) Ministry of Education, the Natural Science Foundation of Guangdong Province (Grant No. 2023A1515012914), the Shenzhen Science and Technology Program (Grant No. WDZC20231128091437002), the Basic Research Fund of Shenzhen City (Grant No. JCYJ20210324120012033 and GJHZ202402183000101), and the Major Key Project of PCL for Experiments and Applications (PCL2021A06).

Declaration of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The data used in our work will be made available on request.


References

Yu et al. (2023) T. Yu, C. Jiang, C. Lou, S. Huang, X. Wang, W. Liu, J. Cai, Y. Li, Y. Li, K. Tu, et al., Seqgpt: An out-of-the-box large language model for open domain sequence understanding, arXiv preprint arXiv:2308.10529 (2023).
Wu et al. (2023) S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, G. Mann, Bloomberggpt: A large language model for finance, arXiv preprint arXiv:2303.17564 (2023).
Wang et al. (2023) H. Wang, C. Liu, N. Xi, Z. Qiang, S. Zhao, B. Qin, T. Liu, Huatuo: Tuning llama model with chinese medical knowledge, arXiv preprint arXiv:2304.06975 (2023).
Ma et al. (2022) S. Ma, Y. Li, R. Sun, Q. Zhou, S. Huang, D. Zhang, Y. Li, R. Liu, Z. Li, Y. Cao, H. Zheng, Y. Shen, Linguistic rules-based corpus generation for native chinese grammatical error correction, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Association for Computational Linguistics, 2022, pp. 576–589. URL: https://doi.org/10.18653/v1/2022.findings-emnlp.40. doi:10.18653/V1/2022.FINDINGS-EMNLP.40.
Fang et al. (2023) T. Fang, S. Yang, K. Lan, D. F. Wong, J. Hu, L. S. Chao, Y. Zhang, Is chatgpt a highly fluent grammatical error correction system? A comprehensive evaluation, CoRR abs/2304.01746 (2023).
Li et al. (2023) Y. Li, H. Huang, S. Ma, Y. Jiang, Y. Li, F. Zhou, H.-T. Zheng, Q. Zhou, On the (in) effectiveness of large language models for chinese text correction, arXiv preprint arXiv:2307.09007 (2023).
Ye et al. (2023) J. Ye, Y. Li, Q. Zhou, Y. Li, S. Ma, H.-T. Zheng, Y. Shen, CLEME: Debiasing multi-reference evaluation for grammatical error correction, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 6174–6189. URL: https://aclanthology.org/2023.emnlp-main.378. doi:10.18653/v1/2023.emnlp-main.378.
Song et al. (2023) Y. Song, K. Krishna, R. Bhatt, K. Gimpel, M. Iyyer, Gee! grammar error explanation with large language models, CoRR abs/2311.09517 (2023).
Kaneko and Okazaki (2023) M. Kaneko, N. Okazaki, Controlled generation with prompt insertion for natural language explanations in grammatical error correction, CoRR abs/2309.11439 (2023).
Liu et al. (2022) R. Liu, Y. Li, L. Tao, D. Liang, H. Zheng, Are we ready for a new paradigm shift? A survey on visual deep MLP, Patterns 3 (2022) 100520.
Dong et al. (2023) C. Dong, Y. Li, H. Gong, M. Chen, J. Li, Y. Shen, M. Yang, A survey of natural language generation, ACM Comput. Surv. 55 (2023) 173:1–173:38.
Liu et al. (2023) A. Liu, X. Hu, L. Wen, P. S. Yu, A comprehensive evaluation of chatgpt’s zero-shot text-to-sql capability, CoRR abs/2303.13547 (2023).
Li et al. (2023) Y. Li, S. Ma, X. Wang, S. Huang, C. Jiang, H. Zheng, P. Xie, F. Huang, Y. Jiang, Ecomgpt: Instruction-tuning large language models with chain-of-task tasks for e-commerce, CoRR abs/2308.06966 (2023).
Huang et al. (2023) S. Huang, S. Ma, Y. Li, M. Huang, W. Zou, W. Zhang, H. Zheng, Lateval: An interactive llms evaluation benchmark with incomplete information from lateral thinking puzzles, CoRR abs/2308.10855 (2023).
Li et al. (2023) Y. Li, Z. Xu, S. Chen, H. Huang, Y. Li, Y. Jiang, Z. Li, Q. Zhou, H. Zheng, Y. Shen, Towards real-world writing assistance: A chinese character checking benchmark with faked and misspelled characters, CoRR abs/2311.11268 (2023).
Li et al. (2022a) Y. Li, Q. Zhou, Y. Li, Z. Li, R. Liu, R. Sun, Z. Wang, C. Li, Y. Cao, H. Zheng, The past mistake is the future wisdom: Error-driven contrastive probability optimization for chinese spell checking, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, Association for Computational Linguistics, 2022a, pp. 3202–3213. URL: https://doi.org/10.18653/v1/2022.findings-acl.252. doi:10.18653/V1/2022.FINDINGS-ACL.252.
Li et al. (2022b) Y. Li, S. Ma, Q. Zhou, Z. Li, Y. Li, S. Huang, R. Liu, C. Li, Y. Cao, H. Zheng, Learning from the dictionary: Heterogeneous knowledge guided fine-tuning for chinese spell checking, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Association for Computational Linguistics, 2022b, pp. 238–249. URL: https://doi.org/10.18653/v1/2022.findings-emnlp.18. doi:10.18653/V1/2022.FINDINGS-EMNLP.18.
Zhang et al. (2023) D. Zhang, Y. Li, Q. Zhou, S. Ma, Y. Li, Y. Cao, H. Zheng, Contextual similarity is more valuable than character similarity: An empirical study for chinese spell checking, in: IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2023, Rhodes Island, Greece, June 4-10, 2023, IEEE, 2023, pp. 1–5. URL: https://doi.org/10.1109/ICASSP49357.2023.10095675. doi:10.1109/ICASSP49357.2023.10095675.
Ye et al. (2022) J. Ye, Y. Li, S. Ma, R. Xie, W. Wu, H. Zheng, Focus is what you need for chinese grammatical error correction, CoRR abs/2210.12692 (2022).
Ma et al. (2023) S. Ma, Y. Li, H. Huang, S. Huang, Y. Li, H. Zheng, Y. Shen, Progressive multi-task learning framework for chinese text error correction, CoRR abs/2306.17447 (2023).
Penteado and Perez (2023) M. C. Penteado, F. Perez, Evaluating GPT-3.5 and GPT-4 on grammatical error correction for brazilian portuguese, CoRR abs/2306.15788 (2023).
Qu and Wu (2023) F. Qu, Y. Wu, Evaluating the capability of large-scale language models on chinese grammatical error correction task, CoRR abs/2307.03972 (2023).
Kwon et al. (2023) S. Y. Kwon, G. Bhatia, E. M. B. Nagoudi, M. Abdul-Mageed, Beyond english: Evaluating llms for arabic grammatical error correction, in: H. Sawaf, S. R. El-Beltagy, W. Zaghouani, W. Magdy, A. Abdelali, N. Tomeh, I. A. Farha, N. Habash, S. Khalifa, A. Keleg, H. Haddad, I. Zitouni, K. Mrini, R. N. Al-Matham (Eds.), Proceedings of ArabicNLP 2023, Singapore (Hybrid), December 7, 2023, Association for Computational Linguistics, 2023, pp. 101–119. URL: https://aclanthology.org/2023.arabicnlp-1.9.
Ye et al. (2023) J. Ye, Y. Li, Y. Li, H. Zheng, Mixedit: Revisiting data augmentation and beyond for grammatical error correction, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, Association for Computational Linguistics, 2023, pp. 10161–10175. URL: https://aclanthology.org/2023.findings-emnlp.681.
Huang et al. (2023) H. Huang, J. Ye, Q. Zhou, Y. Li, Y. Li, F. Zhou, H. Zheng, A frustratingly easy plug-and-play detection-and-reasoning module for chinese spelling check, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, Association for Computational Linguistics, 2023, pp. 11514–11525. URL: https://aclanthology.org/2023.findings-emnlp.771.
Davis et al. (2024) C. Davis, A. Caines, Ø. E. Andersen, S. Taslimipoor, H. Yannakoudakis, Z. Yuan, C. Bryant, M. Rei, P. Buttery, Prompting open-source and commercial language models for grammatical error correction of english learner text, CoRR abs/2401.07702 (2024).
Fan et al. (2023) Y. Fan, F. Jiang, P. Li, H. Li, Grammargpt: Exploring open-source llms for native chinese grammatical error correction with supervised fine-tuning, in: F. Liu, N. Duan, Q. Xu, Y. Hong (Eds.), Natural Language Processing and Chinese Computing - 12th National CCF Conference, NLPCC 2023, Foshan, China, October 12-15, 2023, Proceedings, Part III, volume 14304 of Lecture Notes in Computer Science, Springer, 2023, pp. 69–80. URL: https://doi.org/10.1007/978-3-031-44699-3_7. doi:10.1007/978-3-031-44699-3_7.
Zhang et al. (2023) Y. Zhang, L. Cui, D. Cai, X. Huang, T. Fang, W. Bi, Multi-task instruction tuning of llama for specific scenarios: A preliminary study on writing assistance, CoRR abs/2305.13225 (2023).
Su et al. (2023) C. Su, X. Zhao, X. Qiao, M. Zhang, H. Yang, J. Zhu, M. Zhu, W. Ma, Hwcgec:hw-tsc’s 2023 submission for the nlpcc2023’s chinese grammatical error correction task, in: F. Liu, N. Duan, Q. Xu, Y. Hong (Eds.), Natural Language Processing and Chinese Computing - 12th National CCF Conference, NLPCC 2023, Foshan, China, October 12-15, 2023, Proceedings, Part III, volume 14304 of Lecture Notes in Computer Science, Springer, 2023, pp. 59–68. URL: https://doi.org/10.1007/978-3-031-44699-3_6. doi:10.1007/978-3-031-44699-3_6.
Kaneko and Okazaki (2023) M. Kaneko, N. Okazaki, Reducing sequence length by predicting edit spans with large language models, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Association for Computational Linguistics, 2023, pp. 10017–10029. URL: https://aclanthology.org/2023.emnlp-main.619.
Östling et al. (2023) R. Östling, K. Gillholm, M. Kurfali, M. Mattson, M. Wirén, Evaluation of really good grammatical error correction, CoRR abs/2308.08982 (2023).
Sottana et al. (2023) A. Sottana, B. Liang, K. Zou, Z. Yuan, Evaluation metrics in the era of GPT-4: reliably evaluating large language models on sequence to sequence tasks, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Association for Computational Linguistics, 2023, pp. 8776–8788. URL: https://aclanthology.org/2023.emnlp-main.543.
Zhang et al. (2022) Y. Zhang, Z. Li, Z. Bao, J. Li, B. Zhang, C. Li, F. Huang, M. Zhang, Mucgec: a multi-reference multi-source evaluation dataset for chinese grammatical error correction, in: M. Carpuat, M. de Marneffe, I. V. M. Ruíz (Eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, Association for Computational Linguistics, 2022, pp. 3118–3130. URL: https://doi.org/10.18653/v1/2022.naacl-main.227. doi:10.18653/V1/2022.NAACL-MAIN.227.
Zhang (2009) B. Zhang, Features and functions of the hsk dynamic composition corpus, International Chinese Language Education 4 (2009) 71–79.
Zhao et al. (2018) Y. Zhao, N. Jiang, W. Sun, X. Wan, Overview of the NLPCC 2018 shared task: Grammatical error correction, in: M. Zhang, V. Ng, D. Zhao, S. Li, H. Zan (Eds.), Natural Language Processing and Chinese Computing - 7th CCF International Conference, NLPCC 2018, Hohhot, China, August 26-30, 2018, Proceedings, Part II, volume 11109 of Lecture Notes in Computer Science, Springer, 2018, pp. 439–445. URL: https://doi.org/10.1007/978-3-319-99501-4_41. doi:10.1007/978-3-319-99501-4_41.
Dahlmeier and Ng (2012) D. Dahlmeier, H. T. Ng, Better evaluation for grammatical error correction, in: E. Fosler-Lussier, E. Riloff, S. Bangalore (Eds.), Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Montréal, Canada, 2012, pp. 568–572. URL: https://aclanthology.org/N12-1067.
Katsumata and Komachi (2020) S. Katsumata, M. Komachi, Stronger baselines for grammatical error correction using a pretrained encoder-decoder model, in: K.-F. Wong, K. Knight, H. Wu (Eds.), Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Association for Computational Linguistics, Suzhou, China, 2020, pp. 827–832. URL: https://aclanthology.org/2020.aacl-main.83.
Xue et al. (2021) L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mt5: A massively multilingual pre-trained text-to-text transformer, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, Association for Computational Linguistics, 2021, pp. 483–498. URL: https://doi.org/10.18653/v1/2021.naacl-main.41. doi:10.18653/V1/2021.NAACL-MAIN.41.
Omelianchuk et al. (2020) K. Omelianchuk, V. Atrasevych, A. Chernodub, O. Skurzhanskyi, GECToR – grammatical error correction: Tag, not rewrite, in: J. Burstein, E. Kochmar, C. Leacock, N. Madnani, I. Pilán, H. Yannakoudakis, T. Zesch (Eds.), Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, Association for Computational Linguistics, Seattle, WA, USA → Online, 2020, pp. 163–170. URL: https://aclanthology.org/2020.bea-1.16. doi:10.18653/v1/2020.bea-1.16.
OpenAI (2023) OpenAI, GPT-4 technical report, CoRR abs/2303.08774 (2023).
Alibaba (2023) Alibaba, Qwen technical report, arXiv preprint arXiv:2309.16609 (2023).
Shao et al. (2021) Y. Shao, Z. Geng, Y. Liu, J. Dai, F. Yang, L. Zhe, H. Bao, X. Qiu, CPT: A pre-trained unbalanced transformer for both chinese language understanding and generation, CoRR abs/2109.05729 (2021).
Zhang et al. (2021) Z. Zhang, H. Zhang, K. Chen, Y. Guo, J. Hua, Y. Wang, M. Zhou, Mengzi: Towards lightweight yet ingenious pre-trained models for chinese, CoRR abs/2110.06696 (2021).
Wang et al. (2020) W. Wang, B. Bi, M. Yan, C. Wu, J. Xia, Z. Bao, L. Peng, L. Si, Structbert: Incorporating language structures into pre-training for deep language understanding, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020. URL: https://openreview.net/forum?id=BJgQ4lSFPH.

Appendix A Prompt of LLMs as Corrector

To enable the LLM to directly provide corrected versions of the original sentences, we used the following prompt:

请你针对给出的中文文本中的标点错误、拼写错误、词语错误和句法错误等提供合理且忠实的纠正。

例如：

请你纠正（直接输出纠正后的句子，无需任何解释）：
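As a rough sketch of how the three prompt fragments above could be assembled into a zero-shot or few-shot corrector prompt, consider the helper below. The function `build_corrector_prompt`, the `输入：`/`输出：` demonstration labels, and the choice to omit the `例如：` block in the zero-shot case are assumptions made for illustration; the source does not specify the exact layout. The demonstration pair is the error and golden sentence from Table 4.

```python
INSTRUCTION = "请你针对给出的中文文本中的标点错误、拼写错误、词语错误和句法错误等提供合理且忠实的纠正。"
EXAMPLE_HEADER = "例如："
REQUEST = "请你纠正（直接输出纠正后的句子，无需任何解释）："


def build_corrector_prompt(sentence, demonstrations=()):
    """Join the instruction, optional demonstrations, and the sentence to correct."""
    lines = [INSTRUCTION]
    if demonstrations:
        lines.append(EXAMPLE_HEADER)
        for source, target in demonstrations:
            # Hypothetical demonstration layout: input sentence, then its correction.
            lines.append(f"输入：{source}")
            lines.append(f"输出：{target}")
    lines.append(REQUEST)
    lines.append(sentence)
    return "\n".join(lines)


demo = (
    "这段话给我们有道理的，虽然现在黑暗，但等着、忍着，光明会到来的。",
    "这段话是有道理的，虽然现在黑暗，但等着、忍着，光明就会到来。",
)
print(build_corrector_prompt("待纠正的句子", demonstrations=[demo]))
```

Passing an empty `demonstrations` sequence gives the zero-shot variant of this sketch; adding pairs gives a few-shot variant.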

Appendix B Results and Analysis of LLMs as Corrector

Results

In Table 1, we observe that in the zero-shot scenario, GPT-3.5-Turbo scores 25.01 and 27.99 for Word-Level and Character-Level $\text{F}_{0.5}$, respectively, while Qwen-72B-Chat scores 28.75 and 32.90. Under our proposed SEE evaluation method, the $\text{F}_{0.5}$ scores for GPT-3.5-Turbo and Qwen-72B-Chat are 46.51 and 56.76, respectively. Figure 5 and Figure 6 show that in the few-shot scenario, both GPT-3.5-Turbo and Qwen-72B-Chat improve their scores at the Word-Level and Character-Level.

Analysis

From these experimental results, it is evident that even with the enhancement provided by few-shot learning, there remains a significant gap in the correction capabilities of LLMs. Despite their strong language generation abilities, current LLMs score lower than smaller models under traditional evaluation metrics, which does not align with human perception, as seen in Figure 4. However, our SEE method maintains a high level of alignment with human judgment.

Appendix C Our Designed Prompts for EXAM and SEE

To guide LLMs to perform our designed tasks as expected, we carefully design the instruction prompts based on the characteristics of the CGEC task. The prompts for explanation are shown in Figure 7, and the prompts for evaluation are shown in Figure 8. In addition, as mentioned in the main text, to make the results generated by LLMs more accurate, we also provide task examples (i.e., demonstrations) to the LLMs to stimulate their in-context learning capabilities. Because the prompts with in-context learning examples are very long, we upload them as software supplementary materials to facilitate peer review.

Appendix D Implementation Details and Hyperparameters

The hyperparameter values of the small models in our experiments are shown in Table 5. In addition, the Seq2Seq models are trained with label-smoothed cross-entropy loss, and the Seq2Edit model is trained with cross-entropy loss.
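For readers unfamiliar with the two losses, the sketch below shows one common formulation of label-smoothed cross-entropy for a single token, in which the smoothing mass is spread uniformly over all classes. The smoothing value of 0.1 is an assumption (the source does not report it), and training toolkits differ in how exactly they distribute the smoothing mass.

```python
import math


def label_smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Loss for one prediction: (1 - s) * NLL(target) + s * mean NLL over all classes.

    `smoothing=0.1` is an assumed value; with `smoothing=0.0` this reduces
    to the plain cross-entropy.
    """
    # Numerically stable log-softmax.
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    log_probs = [x - log_norm for x in logits]

    nll = -log_probs[target]                    # standard cross-entropy term
    uniform = -sum(log_probs) / len(log_probs)  # cross-entropy against a uniform target
    return (1.0 - smoothing) * nll + smoothing * uniform


logits = [2.0, 0.0, -1.0]
print(label_smoothed_cross_entropy(logits, target=0, smoothing=0.0))  # plain cross-entropy
print(label_smoothed_cross_entropy(logits, target=0, smoothing=0.1))  # label-smoothed
```

In this formulation the smoothed loss penalizes over-confident predictions, since probability mass pushed entirely onto the gold class increases the uniform term.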

Configurations	BART-Large	mT5-Base	GECToR-Chinese
Model type	Seq2Seq	Seq2Seq	Seq2Edit
Epochs	10	10	20 (2 cold epochs)
Batch size	256	256	128
Optimizer	Adam	Adam	Adam
$\beta_{1}$	0.9	0.9	0.9
$\beta_{2}$	0.999	0.999	0.999
$\epsilon$	$1\times 10^{-8}$	$1\times 10^{-8}$	$1\times 10^{-8}$
Learning rate	$3\times 10^{-6}$	$5\times 10^{-5}$	$1\times 10^{-5}$ ($1\times 10^{-3}$ for cold epochs)

Table 5: Hyperparameter values of the small models to be enhanced in our experiments.
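The values in Table 5 can be written down as a plain configuration, together with a small helper for the GECToR-Chinese learning rate. This is a sketch under assumptions: the dictionary keys and the function `learning_rate` are hypothetical names, and the code assumes that the two cold epochs are the first two of the 20 epochs and that epochs are counted from zero.

```python
# Adam settings shared by all three models in Table 5.
ADAM = {"beta1": 0.9, "beta2": 0.999, "eps": 1e-8}

CONFIGS = {
    "BART-Large": {"type": "Seq2Seq", "epochs": 10, "batch_size": 256, "lr": 3e-6},
    "mT5-Base": {"type": "Seq2Seq", "epochs": 10, "batch_size": 256, "lr": 5e-5},
    "GECToR-Chinese": {
        "type": "Seq2Edit",
        "epochs": 20,
        "batch_size": 128,
        "lr": 1e-5,
        "cold_epochs": 2,  # assumed to be the first two epochs
        "cold_lr": 1e-3,
    },
}


def learning_rate(model_name, epoch):
    """Return the learning rate for a zero-based epoch index (assumed convention)."""
    config = CONFIGS[model_name]
    if epoch < config.get("cold_epochs", 0):
        return config["cold_lr"]
    return config["lr"]


for epoch in range(4):
    print(epoch, learning_rate("GECToR-Chinese", epoch))
```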

Appendix E Main Results on NaCGEC

The main results of EXAM and SEE on NaCGEC are presented in Table 6. Note that the models we test on NaCGEC are all trained on HSK data. The HSK data consists of sentences with grammatical errors made by foreign learners of Chinese, while NaCGEC consists of grammatical errors made by native Chinese speakers in daily life. Ma et al. (2022) have shown that native CGEC data such as NaCGEC is more difficult than Chinese-as-a-Second-Language (CSL) data such as HSK, because the grammatical errors made by native speakers are more subtle than those made by learners. Therefore, as shown in Table 6, the low performance of CGEC models trained on HSK data and tested on NaCGEC is understandable and expected.

From Table 6, we can draw conclusions similar to those on NLPCC. EXAM brings stable and competitive enhancements to small models trained on small-scale data, and the performance enhanced by EXAM is comparable to that of small models trained on full-scale data. Meanwhile, SEE still provides reliable evaluation of CGEC models. The experiment on NaCGEC reflects the robustness of EXAM and SEE to different data sources, that is, they are effective for both CSL CGEC data and native CGEC data.

Training Data	Model	Word-Level			Character-Level			SEE
		P	R	$\textbf{F}_{0.5}$	P	R	$\textbf{F}_{0.5}$	P	R	$\textbf{F}_{0.5}$
None	GPT-3.5-Turbo	13.84	11.67	13.35	9.58	9.66	9.59	39.65	12.17	27.31
None	Qwen-72B-Chat	14.23	11.33	13.53	10.32	8.83	9.98	32.55	4.74	23.14
Sampled (15K)	mT5-Base	5.38	0.65	2.19	4.50	0.64	2.03	36.11	4.40	14.79
Full (156K)	mT5-Base	2.78	3.72	2.93	1.98	3.17	2.14	18.25	8.20	14.65
Sampled (15K)	w/ EXAM (GPT)	11.06^↑	4.03^↑	8.20^↑	8.34^↑	3.51^↑	6.54^↑	34.26^↓	8.80^↑	21.70^↑
Sampled (15K)	w/ EXAM (Qwen)	10.51^↑	3.11^↑	7.12^↑	7.60^↑	2.55^↑	5.44^↑	32.66^↓	7.70^↑	19.81^↑
Sampled (15K)	BART-Large	7.07	2.34	5.04	5.59	2.15	4.24	29.45	5.96	16.46
Full (156K)	BART-Large	11.08	4.07	8.24	9.39	4.05	7.43	39.34	9.01	23.52
Sampled (15K)	w/ EXAM (GPT)	10.11^↑	4.48^↑	8.08^↑	8.64^↑	4.49^↑	7.29^↑	30.00^↑	9.50^↑	20.97^↑
Sampled (15K)	w/ EXAM (Qwen)	8.46^↑	3.52^↑	6.60^↑	7.06^↑	3.41^↑	5.81^↑	31.22^↑	5.99^↑	16.94^↑
Sampled (15K)	GECToR-Chinese	2.40	0.11	0.46	3.82	0.19	0.80	26.31	3.08	10.48
Full (156K)	GECToR-Chinese	8.53	1.12	3.67	4.22	0.93	2.47	27.89	3.23	11.03
Sampled (15K)	w/ EXAM (GPT)	12.08^↑	2.19^↑	6.35^↑	9.26^↑	1.87^↑	5.17^↑	30.55^↑	4.74^↑	14.62^↑
Sampled (15K)	w/ EXAM (Qwen)	11.09^↑	2.63^↑	6.74^↑	9.01^↑	1.96^↑	5.24^↑	31.35^↑	5.01^↑	15.28^↑

Table 6: Performance of various models on the NaCGEC benchmark. Note that 15K and 156K represent the amount of HSK training data. ^↑ means that EXAM improves performance over the baseline with the same training data, and ^↓ means that performance decreases.
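The $\text{F}_{0.5}$ columns follow the standard F-beta definition with $\beta = 0.5$, which weights precision more heavily than recall. The sketch below recomputes it for three Word-Level rows of Table 6; the function name `f_beta` is chosen here for illustration, and small differences from the reported values are expected because the table's P and R are already rounded.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1.0 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# (row label, P, R, reported F0.5), Word-Level columns of Table 6.
rows = [
    ("GPT-3.5-Turbo", 13.84, 11.67, 13.35),
    ("mT5-Base, Sampled (15K)", 5.38, 0.65, 2.19),
    ("BART-Large, Full (156K)", 11.08, 4.07, 8.24),
]

for label, p, r, reported in rows:
    print(f"{label}: recomputed {f_beta(p, r):.2f}, reported {reported:.2f}")
```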