
Developing a New Grammatical Error

Correction System
for the Second Language Classroom

Sam Davidson
Department of Linguistics
University of California, Davis
1 Shields Ave., Davis, CA 95616
[email protected]

Automated grammatical error correction (GEC) for language learners provides students the
valuable opportunity to receive real-time feedback to help improve their writing skills. It also
can provide much needed relief to instructors by minimizing the number of errors in a student’s
final submission. However, research into the development of GEC and other forms of automated
written corrective feedback (AWCF) for languages other than English has been quite limited. This
dearth of research is in part due to the large demand for English instruction, but is also driven
by the limited training data available for non-English languages. Even for English, the amount
of training data is not sufficient to train the large neural models now used for GEC. Recent work
has focused on the development of artificial training data for GEC; while effective, these systems
make no effort to replicate the error patterns seen in real learner data. This paper proposes a new
method for the generation of artificial training data for GEC, which explicitly replicates the error
patterns seen in learner data, and which is readily adaptable to student proficiency and L1, as
well as to instructor pedagogical goals.

1. Introduction

The recent explosion of work in machine learning and language processing techniques
has led to a new generation of educational applications for language learners. For
example, recent work on NLP systems for the second language (L2) classroom has
resulted in grammatical error correction (GEC) systems (Nadejde and Tetreault 2019),
language readability assessment (Xia, Kochmar, and Briscoe 2016), and automated essay
scoring tools (Nadeem et al. 2019), among others.
This paper presents a proposed new system for GEC designed to provide written
corrective feedback to language learners, specifically learners of Spanish as a second
and heritage language. However, the methods described herein can be adapted to any
language for which at least a small corpus of parallel error-corrected data exists. While
corrective feedback, the process of providing feedback to language learners concern-
ing errors made in production, has not been shown to be helpful in first language
acquisition in children, there is a broad consensus that this type of negative evidence
plays an important role in second language acquisition, especially in the classroom
setting (Tatawy 2002). Tools such as GEC are useful to language teachers who wish to
utilize written corrective feedback for language instruction. GEC tools can provide both
automated synchronous (provided while students write) and asynchronous (provided
after writing is complete) feedback to students, allowing learners to correct their own
mistakes (Shintani 2016), thereby reducing teacher workload and potentially preventing
issues related to grammatical error fossilization (Tajeddin, Alemi, and Pashmforoosh
2017).
Like many machine learning tasks, the models on which GEC tools are built rely
on annotated data for training; applications for use in the L2 language classroom are
most often trained using annotated L2 student writing. Even those models which use
artificially generated error corpora for training (including many state-of-the-art neural
machine translation (NMT) based GEC systems) rely on annotated L2 corpora for
system fine-tuning and evaluation to achieve best results (Grundkiewicz and Junczys-
Dowmunt 2019). However, due to the high cost and time investment required to de-
velop a new corpus, annotated corpora of L2 student data are quite limited in size and
availability, especially for languages other than English. This paucity of non-English
learner data is especially evident in GEC; almost all of the large parallel error-corrected
datasets used for training GEC systems are in English. As a result, the large majority
of recent work in developing data-driven GEC systems is targeted toward English
learners. Additionally, due to the limited size of corrected L2 corpora, current state-
of-the-art GEC systems for both English and non-English languages rely heavily on the
generation of artificially error-augmented data (Junczys-Dowmunt et al. 2018; Grund-
kiewicz and Junczys-Dowmunt 2019) to train their large transformer-based (Vaswani
et al. 2017) correction models. This paper proposes a new method to generate additional
training data for GEC systems by augmenting error-free text with artificial errors which
more closely approximate the error distributions seen in L2 learner data. Because the
proposed method uses actual learner error distribution data to generate its artificial
errors, the method is readily adaptable to the writing patterns of students with different
levels of proficiency and diverse L1s. To explain the potential efficacy of my proposed
method, I present and discuss detailed error distribution data drawn from a recently
published corpus of L2 Spanish, the Corpus of Written Spanish - L2 and Heritage learners
(COWS-L2H). This corpus was developed at the University of California, Davis to study
language acquisition in L2 and Heritage learners of Spanish and to serve as a resource
for building NLP tools for Spanish learners. I also present several GEC models for
Spanish trained on error-augmented text generated using various data augmentation
techniques and fine-tuned using real student data from COWS-L2H. This paper seeks
to address the following research questions:

• How closely do noising methods used in current state of the art GEC
systems simulate the error distribution seen in learner data?
• How well do current methods generalize to errors seen at various
proficiency levels and in students with different L1 languages?
• Does an approach which explicitly replicates the error distribution seen in
learner data improve the performance of a GEC system?

2. Background and Related Work

This section will cover two aspects of the use of GEC for written corrective feedback
(CF) in second language acquisition (SLA): pedagogical research into the benefits and
drawbacks of automated written corrective feedback (AWCF), and technical considera-
tions related to GEC systems and their application in the SLA classroom.

2.1 Pedagogical background

The use of automated written corrective feedback in the second language classroom
has a relatively short but controversial history. Corrective feedback is considered an
essential part of second language learning by many researchers. For example, Gass
(1991) and Ellis (2002) see CF in the role of “noticing”; in order to acquire a second lan-
guage, learners must be able to notice the differences between their production and the
correct form in the target language (Tatawy 2002). However, providing CF to students,
especially in a written format, is an extremely time-consuming prospect for instructors
(Shintani 2016). Automated written corrective feedback (AWCF) is seen as a way to
lighten instructor workloads and assist students in correcting their own writing (Li,
Link, and Hegelheimer 2015; Ranalli 2018). However, some researchers (Cheville 2004)
and teacher groups, such as the National Council of Teachers of English (NCTE 2014),
oppose the use of AWCF and other types of computer-mediated automated assessment.
According to the NCTE:

Automated assessment programs do not respond as human readers. While they may
promise consistency, they distort the very nature of writing as a complex and
context-rich interaction between people. They simplify writing in ways that can
mislead writers to focus more on structure and grammar than on what they are saying
by using a given structure and style (NCTE 2014).

In a study motivated by the NCTE statement and other critics of the use of auto-
mated feedback, Li, Link, and Hegelheimer (2015) demonstrate the utility of automated
written feedback in improving student writing, as indicated by both number of correct
revisions made by students and by teacher evaluation of the feedback system. Addi-
tionally, Stevenson and Phakiti (2014) demonstrate modest improvements to student
texts written using AWCF; however, the authors note that there is little evidence that
these improvements transfer to the students’ writing when not using AWCF (Li, Link,
and Hegelheimer 2015; Bitchener and Ferris 2012). Stevenson and Phakiti (2014) state
that more research is needed to establish that AWCF actually leads to improvement in
overall student writing proficiency. More recent work into the use of AWCF by lan-
guage learners indicates that higher proficiency language learners tend to question and
under-utilize AWCF, while lower proficiency students tend to over-rely on corrections
proposed by AWCF (Koltovskaia 2020). This finding raises an important question: how
can AWCF systems, specifically those driven by GEC, be better adapted to differences
in learner groups, such as proficiency level, L1, and previous language learning ex-
perience, to improve the ability of students to utilize AWCF systems? Additionally, it
should be noted that nearly all research in AWCF has focused on English as a second
language; the present paper proposes a GEC-based AWCF system built for use by
Spanish language learners. However, the proposed method should be applicable to any
language for which a small amount of annotated learner text is available for analysis,
and for which a larger amount of text is available for generation of synthetic training
data.

2.1.1 Effect of linguistic background. At UC Davis, several prominent groups of
students should be considered when developing Spanish-language GEC systems to
provide AWCF in the language classroom. Most students studying Spanish at UCD are
native English speakers, though these students are quite diverse in terms of their profi-
ciency in Spanish. A second major group of students are international students whose
L1 is a language other than English or Spanish; the majority of these students are native
speakers of Mandarin. The University of California system has a sizeable international
student population. As of Fall 2017, 14% of all UC Davis undergraduates, and 11% of
undergraduates across the UC system, are international students; of those students,
nearly 70% are from mainland China. This means that 8% of UC undergraduates,
and roughly 10% of UC Davis undergraduates, are Chinese (University of California).
This large population which speaks Mandarin and related Chinese languages poses
a challenge to language instructors due to possible effects of these students’ L1 on
the way that the students learn the target language. When dealing with such a large
population of students whose L1 is not English, do language instructors need to modify
their teaching methods to accommodate L1 transfer that is different from that of the
majority English-speaking students (Cummins 2008)? For example, Mandarin does not
use articles in its syntactic system in the same way that many Indo-European languages,
including English and Spanish, do (Snape 2009). When English-speaking students are
learning Spanish, they have prior exposure to the use of articles in their L1, despite
differences in the gender systems of English and Spanish (Ionin and Montrul 2010).
Mandarin speakers, on the other hand, come to Spanish from an L1 which has given
them little experience with the use of articles in their native tongue (Ionin and Montrul
2010). Does this difference affect the way in which Mandarin speakers acquire the use
of Spanish articles relative to their native English counterparts? However, one must also
consider the fact that these students have been exposed to the use of articles through
learning English, which they have all learned as an L2 as a prerequisite for admission to
UC Davis. How does the students’ knowledge of article usage from their L2 interact
with syntactic transfer from their L1 (Cai and Cai 2015)? Which linguistic system has
the greater impact on these students’ acquisition of Spanish as an L3 (Rothman and
Cabrelli Amaro 2010)? To answer these questions, and to design an error correction
system adapted to these students’ needs, an in-depth analysis of their error patterns is
needed.
Heritage speakers of Spanish - students who come to the classroom with exposure
to Spanish from the home environment - compose a third major group of UC Davis
Spanish language students, though like native English speakers, these students’ Spanish
proficiency varies greatly. Heritage learners make up an increasingly large segment of
many Spanish programs’ students (Montrul 2010). To gauge the number of poten-
tial Heritage learners at California universities, consider that 28.8% of Californians speak Spanish at
home (StatisticalAtlas.com), and 22% of students at University of California, Davis self-
identify as Hispanic (“Student Profile,” 2015). While Spanish-English bilinguals have
long represented a significant proportion of students at many universities, Spanish
departments have become more cognizant of the fact that these students have needs
which differ from those of their L2 learner peers. In response to this need, Spanish
departments at many Hispanic-serving institutions have developed courses specifically
designed to help Heritage Learners of Spanish to retain Spanish and to introduce these
learners to Spanish in an academic register.
Although Spanish departments have sought to address the specific needs of Her-
itage learners, a one-size-fits-all approach for Heritage learners is problematic, as Her-
itage learners are a linguistically and culturally diverse group whose proficiency in
Spanish, and the register and dialect of Spanish they use, varies widely from individual
to individual. The diversity of Heritage learners can be seen in the many definitions of
“Heritage learner” or “Heritage speaker” which have been offered by researchers. The
most widely used definition, offered by Guadalupe Valdés, defines a Heritage Learner
as a “language student who is raised in a home where a non-English language is spoken,
who speaks or at least understands the language, and who is to some degree bilingual
in that language and in English” (Polinsky and Kagan 2007; Valdés 2000). While this
definition is useful to language educators in particular, many researchers consider this
view of Heritage learners to be too narrow. Polinsky and Kagan (2007) point to two
conceptions of Heritage learners which have been proposed in the literature, and which
they term “broad” and “narrow” definitions. The “broad” conception emphasizes the
connection between cultural and linguistic heritage. For example, Fishman (2001) and
Van Deusen-Scholl (2003) both argue that a student’s status as a Heritage learner should
be based on her familial and cultural connections to the language in question. Van
Deusen-Scholl refers to such students as “learners with a heritage motivation” (AAAS
2016). Polinsky & Kagan argue that the view taken by Fishman and Van Deusen-Scholl
is too broad, since it focuses on a student’s motivation for studying a language, rather
than on linguistic knowledge. For example, a student who is learning a language for
the first time as an adult may be culturally motivated to do so, but that does not make
that student a Heritage learner/speaker under Valdés’ definition, which requires that
the student actually acquired the language in question in the home.
In addition, Lynch (2008) argues that the implementation of Heritage learner pro-
grams has somewhat impeded research into Heritage speakers’ abilities, especially with
reference to L2 learners, by implying a dichotomous relationship between these two
groups, when no such dichotomy exists. Heritage learners are often treated as ‘native
speakers’ when this label may not reflect the true extent of their linguistic abilities. Silva-
Corvalán (1994) states that Heritage speakers of Spanish in the United States exist along
a continuum ranging from standard Spanish to limited, “emblematic” usage expressing
social and cultural identity. Silva-Corvalán also points out that many Heritage speakers
exhibit limited domain knowledge of their Heritage language, as the use of minority
languages is frequently limited to the home. This limited domain knowledge results in
difficulty with abstract topics, such as politics and science, and with complex syntactic
constructions for many Heritage speakers (Lynch 2008). Further, Lynch argues that the
limited social use of minority languages results in simplified grammatical systems that
introduce “innovative, that is, non-normative, elements at the lexical and discourse
levels,” and that these innovative patterns are conditioned by the dominant language
(Lynch 2008). Thus, Heritage language tends to adopt usage patterns from the majority
language surrounding it. These influences result in Heritage learners speaking a tongue
that can be markedly different from that spoken by non-Heritage native speakers of
the language, reinforcing the need to study and better understand the error patterns of
Heritage learners.
This diversity of students brings to light one of the major criticisms of most AWCF
systems: that they use a “one-size-fits-all” approach which “takes little or no account of
individual differences” (Ranalli 2018). According to Ranalli (2018), “Current-generation
AWE tools are not designed to differentiate among users with different L2 proficiencies,
L1s, writing skills, or educational backgrounds.” This weakness of many AWCF tools
leads to the issue pointed out by Koltovskaia (2020) in which more proficient students
under-utilize and lower proficiency users over-rely on AWCF recommendations. Bitch-
ener and Ferris (2012) specifically called on researchers to investigate the impact of
L2 proficiency and other student-specific factors on the utility of AWCF for language
learning. The only GEC research which has specifically attempted to adapt GEC to L1
and proficiency is Nadejde and Tetreault (2019). In their work, they fine-tune a general-
purpose GEC system with data from learners with a specific L1, proficiency, or L1-
proficiency pair. However, their method requires at least 11,000 training examples for
fine-tuning for each targeted L1 or level. This is a major limitation, as English is the only
language for which sufficient training data is available to make this method feasible.

2.1.2 Effect of target language proficiency. While the linguistic background of learners
is one major consideration in the types of errors learners make, another key component
to developing an effective AWCF system is adapting to the proficiency of learners
(Ranalli 2018; Bitchener and Ferris 2012; Nadejde and Tetreault 2019). For example,
Bitchener and Ferris (2012) argue that unfocused feedback - feedback which suggests
that an error has been made without suggesting a solution - may result in cognitive
overload for lower-proficiency students. On the other hand, more advanced students
may benefit from unfocused feedback in that it does not seek to constrain their writing
style; rather, it forces them to decide how to best resolve a potential error themselves,
thereby reinforcing language learning (Bitchener and Ferris 2012). According to Ranalli
(2018), “current generation AWE tools provide few if any options for fine-tuning feed-
back based on user characteristics or pedagogical goals.” While a first year Spanish
student may make many grammatical and stylistic errors in their writing, pointing out
errors that are beyond the scope of their learning objectives may serve to confuse the
student and reduce the utility of the AWCF system. This weakness can be overcome
by designing a GEC-based AWCF system which is trained to provide output aligned
to student L1, proficiency, and specific pedagogical goals. Additionally, the limited
research available which investigates the role of corrective feedback in improving the
use of specific linguistic forms has shown that simple errors, such as gender agreement
and use of past tense, are more readily learned from corrective feedback (Bitchener and
Ferris 2012), though it should be noted that the study in question used teacher-provided
asynchronous feedback rather than synchronous AWCF.

2.1.3 Synchronous and asynchronous feedback. One of the major advantages of AWCF,
beyond the potential time savings to instructors (Stevenson and Phakiti 2014), is the
fact that it facilitates synchronous feedback to student writers - that is, tagging of errors
and suggesting corrections in near-real-time while students are writing (Dikli 2006). In a
study which analyzed differences between synchronous (SCF) and asynchronous (ACF)
corrective feedback, Shintani (2016) found:

(1) SCF created an interactive writing process similar in some respects to oral corrective
feedback; (2) both the SCF and ACF promoted noticing-the-gap, but self-correction was
more successful in the SCF condition; (3) focus on meaning and form took place
contiguously in the SCF condition while it occurred separately in the ACF condition;
and (4) both types of feedback facilitated metalinguistic understanding of the target
feature, reflecting the unique features of writing (i.e., its slow pace, its permanency and
the need of accuracy) (Shintani 2016).

Shintani (2016)’s study did not use AWCF to provide synchronous feedback, but
rather used Google Docs to allow instructors to provide synchronous feedback as
students wrote. However, their findings clearly highlight the benefits of synchronous
feedback, assuming that the feedback provided is of sufficiently high quality and well
targeted to the student’s proficiency level.

2.1.4 Error tolerance in AWCF. The question of high quality feedback raises an addi-
tional question regarding the use of GEC-based AWCF in the language classroom: what
is the error-tolerance of students and instructors using AWCF systems? Unfortunately,
I am unable to find significant research into how much error students are willing to
accept when using AWCF and how much errors made by such systems impact their
usability. The fact is that the errors identified and corrections suggested by a GEC
model will include some number of errors; the current state of the art GEC system,
GECToR (Omelianchuk et al. 2020) achieves a precision of 78.9 and a recall of 58.2 (for
an F0.5 of 73.6) on the BEA-2019 shared task test set (Bryant et al. 2019). The relatively low
recall for these systems, indicating that the systems are missing a large portion of the
errors in the texts, is one reason that general-purpose, domain agnostic GEC systems
have not been fully integrated into commercial AWCF systems such as Grammarly and
Criterion. While the usability of general-purpose GEC for AWCF may still be somewhat
limited by system performance, a more fine-grained analysis is necessary to determine
which error types these systems are good at identifying and correcting, and if their
performance with this subset of error types outperforms more well-established and less
computationally expensive statistical and rule-based models.

2.2 Technical background

Data-driven grammatical error correction, the machine learning problem of correcting
writer errors in text, is a difficult task due to the non-deterministic nature of error
correction. For many types of errors, no single way exists to correct the error in a
text; how to resolve the error is largely a matter of choice for the corrector. Traditional
methods of dataset and prediction evaluation, such as inter-annotator agreement and
BLEU score, are not applicable to the GEC task (Bryant and Ng 2015) due to the fact that
agreement between annotators or alignment to a specific correction is not necessary for
an edit to be an appropriate correction of an error. While the correction of some errors
is relatively straightforward, such as subject-verb agreement, even these scenarios are
open to interpretation and subject to corrector choice. Should we change a pronoun
to make it agree with the verb, or change the verb to agree with the pronoun? More
difficult decisions arise when considering non-standard forms which, while acceptable
in certain registers or dialects, do not conform to an academic standard. For example,
do we correct a student who writes “I get to work at 8am” versus “I arrive at work at
8am”? While this choice may be considered stylistic rather than grammatical (Fraser
and Hodson 1978), the line between the two is often not clear. These questions are of
particular relevance in the L2 teaching field, in which grammatical error correction
tools have the potential to provide rapid feedback to students and expand instructor
bandwidth.
Fraser and Hodson (1978) make the following distinction between grammar and
usage:

Each language has its own systematic ways through which words and sentences are
assembled to convey meaning. This system is grammar. But within the general
grammar of a language, certain alternative ways of speaking and writing acquire
particular social status, and become the conventional usage habits of dialect groups.

While this distinction is easy to maintain in rule-based GEC systems, neural and
statistical systems learn transformations from training data; thus, if a correction to
style or usage is prevalent in the training data, that correction will be enforced by
the resulting system. This is particularly true of language model based systems which
learn the structure of the language in an unsupervised manner from large amounts of
unannotated text.

2.2.1 Rule-based and Classifier-based GEC. The simplest GEC systems, and, until quite
recently, those most commonly encountered by end-users in tools such as Microsoft
Word and Google Docs, use hand-crafted rules, regular expressions, and classification
to correct specific grammatical errors. For example, a simple regular expression can be
used to enforce the correct use of “a” or “an” depending on the following phoneme.
Similarly, spelling correction can be implemented by checking words in a text against a
dictionary list, then identifying the most-similar word based on edit-distance. How-
ever, many grammatical errors, such as subject-verb agreement mismatches, are too
complex to be identified and corrected with simple string matching rules. To overcome
this challenge, most rule-based GEC systems take advantage of part-of-speech (POS)
information and parse trees derived from automated tagging and parsing algorithms
(McCoy, Pennington, and Suri; Sidorov et al. 2013). For example, as discussed in Bryant,
Felice, and Briscoe (2017) and Sidorov et al. (2013), subject-verb mismatches can be
identified by a rule which checks that the nominal subject (nsubj) of the sentence has the
same person and number as the verb. The grammatical relation between the subject and
the verb can be obtained from a dependency parser, while the person and number are
often encoded in POS tags. The example in Figure 1 demonstrates how dependency
and POS tag data can be used in rules to identify grammatical errors. Sidorov et al.
(2013) further demonstrates the use of a rule to identify the correct verb form from a
verb list to resolve identified subject-verb agreement errors. Rules to correct other types
of errors can be constructed in a similar way.
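As an illustrative sketch (not a component of the proposed system), the rule described above can be expressed in a few lines of Python using spaCy's dependency parse and POS tags; it assumes the English "en_core_web_sm" model and covers only the singular/plural present-tense mismatch.

import spacy

nlp = spacy.load("en_core_web_sm")

def find_sv_agreement_errors(text):
    """Return (subject, verb) pairs whose number appears to mismatch."""
    errors = []
    for token in nlp(text):
        if token.dep_ == "nsubj" and token.head.pos_ in ("VERB", "AUX"):
            subj_singular = token.tag_ == "NN"      # singular common noun
            subj_plural = token.tag_ == "NNS"       # plural common noun
            verb_3sg = token.head.tag_ == "VBZ"     # 3rd person singular present
            verb_non3sg = token.head.tag_ == "VBP"  # non-3rd person present
            if (subj_singular and verb_non3sg) or (subj_plural and verb_3sg):
                errors.append((token.text, token.head.text))
    return errors

print(find_sv_agreement_errors("The cats chases the mouse."))
# -> [('cats', 'chases')]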
A major benefit of rule-based GEC systems is that they do not require training data
to implement. This fact allows rule-based systems to be quickly built to identify and cor-
rect specific errors in low-resource languages and domains with little available training
data (Bryant, Felice, and Briscoe 2017). Additionally, the rules tend to be straightforward
to implement and can be precisely targeted (Bryant, Felice, and Briscoe 2017). Such
rules are effective when designing a system which targets a specific type or set of errors.
However, designing a system which is able to identify and correct a broad range of
errors can quickly become unfeasible due to the number of rules, as each error type
requires its own set of potentially complex hand-crafted rules.
While rules must be implemented manually, statistical classifiers attempt to learn
the function which determines the appropriate token in a given position from training
data. Like rule-based methods, classifier-based methods must restrict the class of errors
which they attempt to correct in order to limit the number of categories into which a
token can be classed (Bryant 2019). Much work has focused on the correction of errors in
article, verb, and preposition usage (Rozovskaya and Roth 2014). These parts-of-speech
are particularly appropriate for classifier-based GEC, as the number of potential targets
is relatively small. For example, an article classifier need only determine whether the article
before a noun should be a, an, the, or omitted. Similarly, a verb classifier, which relies
on information drawn from verb inflection databases, need only determine whether
the given verb eat should be classified as eat, eats, ate, eating, or eaten. Obviously, this
task becomes more complex in languages which have richer verb morphology. The
learning of the classification function can be done using various statistical approaches,
such as naive Bayes, logistic regression, decision trees, support vector machines, and
other statistical methods (Bryant 2019).
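A minimal sketch of the classifier-based approach is shown below, training a toy article classifier with scikit-learn; the feature set and the handful of training pairs are illustrative assumptions rather than any system described in the literature cited here.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def features(prev_word, next_word):
    # Simple lexical context features for the article slot.
    return {
        "prev": prev_word.lower(),
        "next": next_word.lower(),
        "next_starts_vowel": next_word[0].lower() in "aeiou",
    }

train_X = [
    features("saw", "cat"), features("saw", "apple"),
    features("ate", "apples"), features("read", "book"),
]
train_y = ["a", "an", "NONE", "the"]  # "NONE" = article omitted

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_X, train_y)
print(model.predict([features("saw", "orange")]))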

Figure 1
“cat” is the nominal subject (nsubj) of “chases” and the POS tags show that both words are
singular (NN (singular noun) and VBZ (third person singular verb) rather than NNS (plural
noun) and VBP (non-3rd person present)). From (Bryant, Felice, and Briscoe 2017)

2.2.2 Language Model-based GEC. Using language models to identify and correct
errors relies on the fact that, when analyzed using a well-trained language model, the
probability of an ungrammatical sentence should be lower than that of a grammatical
one (Bryant and Briscoe 2018). For example, the probability of the sentence *"I seed
the cat" should be lower than the probability of "I saw the cat." Current techniques in
LM-based GEC involve correcting errors for only a limited subset of items in a given
sentence; for example, Bryant and Briscoe (2018) target only non-word errors (spelling
errors which result in a non-word), morphological errors such as noun number and verb
tense, and articles and prepositions. The method used by Dahlmeier and Ng (2012), Lee
and Lee (2014) and Bryant and Briscoe (2018) involves creating confusion sets for each
token in a sentence which falls into one of their predefined categories based on the part-
of-speech of the token. They create the confusion sets using various external resources,
such as spell-checkers for non-words and inflection databases for morphological errors.
Once the confusion sets have been generated, they iterate through the various changes
proposed in the confusion sets and re-score the sentence using the trained LM. They
then choose the resulting sentence with the lowest LM perplexity score as the best
correction of the target sentence (Bryant and Briscoe 2018).
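The confusion-set-and-rescore loop can be sketched as follows with the KenLM Python bindings; the model path and the tiny confusion set are illustrative assumptions, and for brevity the sketch keeps only the single highest-scoring edit rather than iterating over all combinations.

import kenlm

lm = kenlm.Model("es_wiki.arpa")  # hypothetical pre-trained Spanish n-gram LM

def best_correction(tokens, confusion_sets):
    """confusion_sets: {token position: [alternative tokens]}"""
    best = " ".join(tokens)
    best_score = lm.score(best)  # log10 probability; higher = lower perplexity
    for pos, alternatives in confusion_sets.items():
        for alt in alternatives:
            candidate = " ".join(tokens[:pos] + [alt] + tokens[pos + 1:])
            score = lm.score(candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best

tokens = "Yo vivo en el ciudad".split()
print(best_correction(tokens, {3: ["la", "los", "las"]}))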
Most available work which uses LMs to rank proposed corrections use N-gram
LMs such as KenLM (Heafield 2011). However, more recent work utilizes large neural
language models like BERT (Devlin et al. 2018) to rerank proposed corrections (Kaneko
et al. 2019), or to generate corrections using BERT’s masked LM framework (Li, Anas-
tasopoulos, and Black 2020). Li, Anastasopoulos, and Black (2020) propose a two-stage
process in which they first label each token in the sequence with one of four labels:
remain, substitution, insert, delete. They then mask all tokens labeled "substitution"
and insert a mask token for the "insert" labels, allowing the BERT model to propose
corrections for these items.
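The masking step can be illustrated with the Hugging Face "fill-mask" pipeline; the multilingual BERT checkpoint used here is an assumption, and the two-stage labeling model of Li, Anastasopoulos, and Black (2020) is not reproduced in this sketch.

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

# Suppose the token "el" was labeled "substitution" by the first-stage tagger;
# it is replaced with the mask token and BERT proposes replacements.
sentence = "Yo vivo en [MASK] ciudad."
for prediction in fill_mask(sentence)[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))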
The key advantage of LM-based GEC is that it does not require annotated training
data of human-corrected sentences; confusion sets can be built for any language for
which the necessary linguistic resources such as spell-checkers exist. In Bryant and
Briscoe (2018)’s system, annotated data is used only for system tuning, though system
tuning is not a strict requirement of an LM-based GEC system. The drawbacks of this
type of system are that 1) they require linguistic resources such as spell-checkers and
inflection databases to generate confusion sets necessary to generate alternate proposals
for scoring; and 2) since these systems, at least as proposed, make changes only on the
token level, they are not capable of correcting multi-word grammatical and stylistic
errors. Additionally, as pointed out by Bryant (2019), probability is not always a perfect
proxy for grammaticality; for example, the sentence "I is the ninth letter of the alphabet"
would have a lower probability than "I am the ninth letter of the alphabet" according to
most LMs, despite the fact that the former sentence is actually the more appropriate in
context (Bryant 2019).

2.2.3 NMT-based GEC. The machine translation approach to GEC frames error cor-
rection as a monolingual translation task in which the source and target languages
are "with errors" and "without errors," respectively (Ng et al. 2014). Grammatical error
correction can be viewed as a noisy-channel model, a task to which machine translation
is particularly well suited (Flachs, Lacroix, and Søgaard 2019). Statistical approaches
developed originally for translation between different languages have been adapted
and applied successfully to grammatical error correction (Napoles and Callison-Burch
2017; Leacock et al. 2010). Similar monolingual translation approaches have been used
for paraphrase generation (Quirk, Brockett, and Dolan 2004) and text simplification
(Coster and Kauchak 2011). Recent work has shown neural machine translation (NMT)
(Bahdanau, Cho, and Bengio 2014) to be an effective approach to the GEC task (Zhao
et al. 2019; Chollampatt and Ng 2018; Junczys-Dowmunt et al. 2018).
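As a rough sketch of the monolingual-translation framing, the following code takes a single training step of a pretrained sequence-to-sequence model on an errorful/corrected sentence pair using the Hugging Face transformers library; the mT5 checkpoint, learning rate, and toy data are illustrative assumptions, not the configuration of any system cited above.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# Parallel "errorful -> corrected" data; a real system would use thousands of pairs.
pairs = [("Yo vivo en el ciudad .", "Yo vivo en la ciudad .")]

model.train()
for src, tgt in pairs:
    batch = tokenizer(src, text_target=tgt, return_tensors="pt")
    loss = model(**batch).loss  # standard cross-entropy over the corrected sequence
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()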
While framing the problem of error correction as a monolingual translation task
is promising, the approach requires parallel training data (Rei et al. 2017), which if
not publicly available, must be created by manually correcting text containing errors
or by artificially generating errors in grammatical text. Kasewa, Stenetorp, and Riedel
(2018) demonstrate the use of artificial errors to train a GEC system; however, their
method requires real-world parallel text to train their noise model used to generate
artificial errors in grammatical text. Similarly, Xie et al. (2018) use a noising model
trained on a "seed corpus" of parallel sentences to build an effective GEC system trained
on artificially generated parallel noised data. Junczys-Dowmunt et al. (2018) show that
effective neural GEC can be achieved with a relatively small amount of parallel training
data when techniques such as transfer learning are employed. Each of these approaches
was applied to error correction in English only. Grundkiewicz and Junczys-Dowmunt
(2019) expands this work, demonstrating a GEC system for German and Russian which
uses small corpora of corrected text to fine-tune a baseline system trained on artificial
data. Grundkiewicz and Junczys-Dowmunt (2019)’s MAGEC system generates artificial
data by creating confusion sets drawn from spelling correction suggestions made by
HunSpell (Ooms 2018). MAGEC then randomly selects a portion of tokens in a large
corpus of correct text, replacing each token with one of the items from that token’s
confusion set. Additionally, MAGEC deletes, transposes, and scrambles a percentage
of the words in the source dataset to mimic omission, transposition, and non-word
spelling errors. In my initial tests replicating MAGEC in Spanish, I found that the
MAGEC system performs better (F0.5 0.109) than a model trained on randomly noised
data (F0.5 0.024). Thus, while the performance of MAGEC and its predecessor system
(Grundkiewicz, Junczys-Dowmunt, and Heafield 2019) clearly leave much room for
improvement, they offer a promising path toward developing artificial training data
to allow GEC on lower-resource languages.
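The following sketch illustrates MAGEC-style noising with HunSpell-derived confusion sets; the dictionary paths, sampling rates, and the pyhunspell binding are assumptions, and the actual MAGEC pipeline operates at a much larger scale with additional operations.

import random
import hunspell  # pyhunspell binding; dictionary paths below are assumptions

speller = hunspell.HunSpell("/usr/share/hunspell/es_ES.dic",
                            "/usr/share/hunspell/es_ES.aff")

def noise_sentence(tokens, sub_rate=0.15, del_rate=0.05, swap_rate=0.05, scramble_rate=0.02):
    tokens = tokens[:]
    for i, tok in enumerate(tokens):
        r = random.random()
        if r < sub_rate:
            suggestions = speller.suggest(tok)
            if suggestions:
                tokens[i] = random.choice(suggestions)           # confusion-set substitution
        elif r < sub_rate + del_rate:
            tokens[i] = ""                                       # token deletion (omission)
        elif r < sub_rate + del_rate + swap_rate and i + 1 < len(tokens):
            tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]  # transposition
        elif r < sub_rate + del_rate + swap_rate + scramble_rate:
            tokens[i] = "".join(random.sample(tok, len(tok)))    # character scramble (non-word)
    return " ".join(t for t in tokens if t)

print(noise_sentence("En 1990 se fundó la radio pública Ràdio Nacional d'Andorra .".split()))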

2.3 Current State of the Art

As with many NLP tasks, the current state-of-the-art in GEC involves using large
masked language models such as BERT (Devlin et al. 2018). The method used by Kaneko
et al. (2019), previously described in the Language Model-based GEC section of this
paper, has been refined by Omelianchuk et al. (2020) to achieve an F0.5 of 73.6 on the
combined Write & Improve (Yannakoudakis et al. 2018) and LOCNESS (Granger 1998)
test corpus used for the BEA 2019 Shared Task on Grammatical Error Correction (Bryant
et al. 2019). Specifically, Omelianchuk et al. (2020)’s GECToR reframes the GEC task
as a sequence labelling task rather than a sequence transformation task. For example,
the transformation {go→goes} would instead be tagged as $VERB_FORM_VB_VBZ,
indicating the change of a base verb form to a third person singular form. Their sequence
tagging model is built using a fine-tuned BERT transformer stacked with two linear
softmax layers to assign tags to tokens.
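The tagging formulation can be illustrated with a toy decoder that applies GECToR-style edit tags back to a token sequence; the tag names follow the paper's notation, but the tiny transformation table below is an illustrative assumption rather than GECToR's full g-transformation vocabulary.

# Each token carries an edit tag ($KEEP, $DELETE, or a transformation such as
# $VERB_FORM_VB_VBZ) which is decoded back into a surface change.
TRANSFORMS = {
    "$VERB_FORM_VB_VBZ": lambda w: w + "es" if w.endswith(("o", "s", "x")) else w + "s",
}

def apply_tags(tokens, tags):
    out = []
    for token, tag in zip(tokens, tags):
        if tag == "$KEEP":
            out.append(token)
        elif tag == "$DELETE":
            continue
        elif tag in TRANSFORMS:
            out.append(TRANSFORMS[tag](token))
    return " ".join(out)

print(apply_tags(["She", "go", "to", "school"],
                 ["$KEEP", "$VERB_FORM_VB_VBZ", "$KEEP", "$KEEP"]))
# -> "She goes to school"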
While the tagging model proposed by Kaneko et al. (2019) and Omelianchuk et al.
(2020) is able to achieve state-of-the-art results in GEC tasks, the approach is limited
in that, even with iterative decoding which allows multiple changes to the source text,
such models are ultimately dependent on token-level transformations rather than larger,
stylistic changes which may help improve learner writing. For example, Sakaguchi et al.
(2016) argue "for a fundamental and necessary shift in the goal of GEC, from correcting
small, labeled error types, to producing text that has native fluency." Additionally,
Chollampatt, Wang, and Ng (2019) argue that GEC should take cross-sentence error
corrections into consideration, which the current SOTA systems such as GECToR are
unable to do. While GECToR and similar systems may perform well on GEC tasks as
currently defined, sequence generation systems, which are largely based in NMT, may
be better at achieving these types of fluency rewrites and longer-span corrections.

3. The COWS-L2H Corpus

3.1 Motivation & Corpus Details

While annotated learner corpora of English are widely available, large learner corpora
of Spanish are less common, and as a result, the field has seen little data-driven research
on the developmental processes that underlie Spanish language learning, or on the
development of NLP tools to assist teachers and students of Spanish. This may come as
a surprise, considering the relatively high demand for learning
Spanish; in 2013, fifty-one percent of students enrolled in university language courses
in the United States studied Spanish (AAAS 2016) and there are over 21 million learners
of L2 Spanish across the globe (Cervantes 2019). This paucity of non-English data is
especially evident in GEC; almost all of the large parallel error-corrected datasets used
for training GEC systems are in English. As a result, the large majority of recent work
in developing data-driven GEC systems is targeted toward English learners.
Due to this lack of available data in learner Spanish, researchers in the Spanish and
Linguistics departments at the University of California, Davis developed the Corpus
of Written Spanish of L2 and Heritage Learners (COWS-L2H) (Davidson et al. 2020).
This corpus of over 1,138,000 words is composed of 4,483 personal essays written by
2,463 students enrolled in various levels of undergraduate Spanish instruction at UC
Davis. We currently have an ongoing data collection, anonymization and annotation
process, and additional essays will be added to the public release of the corpus as soon
as possible. We have recently entered into a collaboration with a Spanish university to
assist in our annotation and anonymization efforts, which we anticipate will greatly
increase the rate at which we are able to make additional data public.
The corpus contains essays from L2 (and L3) learners of Spanish in seven instruction
levels (SPA 1, 2, 3 and SPA 21, 22, 23, and 24), as well as essays written by 293 Heritage
learners, at three levels of instruction (SPA 31, 32, and 33). The distribution of the essays
across the levels is uneven due to the distribution of students enrolled in our Spanish
courses. Because more students enroll in beginning Spanish courses than in advanced
levels, a larger number of essays submitted to the corpus come from these beginner-
level courses.

Course Level               Essays      Tokens
Beginner (SPA 1-3)          2,519     594,556
Intermediate (SPA 21-22)      525     140,521
Composition (SPA 23-24)       784     220,752
Heritage (SPA 31-33)          545     155,230
Unknown level                 110      27,038
Total                       4,483   1,138,097

Table 1
Summary of corpus composition

During each academic quarter (ten weeks of instruction), participants are
asked to write two essays in Spanish that adhere to a minimum of 250 and a maximum
of 500 words, though students enrolled in Spanish 1 are allowed to write essays with
a lower minimum word count, due to the fact that many of these students are true
beginners in L2 Spanish who would possess relatively little vocabulary and grammat-
ical resources of their own. Participants are asked to write each essay they submit in
response to one of two short prompts. Participants of all levels followed the same two
prompts during the same academic quarter, to allow lexical and syntactic comparisons
across levels which are not influenced by topic variation in the writing samples. Both
prompts are deliberately brief, to allow a broad degree of creative
liberty and open-ended interpretation on the part of the writer. To test the effect of
prompt on student writing and promote diversity in our corpus, we periodically change
the prompts presented to students. To date we have presented six essay prompts. For the
first set of compositions, collected from 2017 to 2018, participants were asked to write
about “a famous person” and “the perfect vacation.” For essays collected from 2018 to
2019, the prompts were “a special person in your life” and “a terrible story”. Finally, the
prompts for more recent compositions, collected from early 2020 to the present, ask students to write
“a description of yourself” and “a beautiful story”. We have collected an average of 900
essays in response to each of the prompts we have used to date.
Given the diverse backgrounds of our students, especially those who enroll in
courses for Heritage speakers, identifying the specific variety of Spanish in the essays is
challenging; however, our courses are generally taught using a standard variety of aca-
demic Spanish, so we expect this to be the predominant variety in the corpus. Students
provide information about their linguistic background which we include as metadata
in the corpus; this metadata may elucidate variability in usage resulting from students’
past experience with Spanish. The metadata also allows us to test the effects of variables
such as L1 on student writing. Finally, the linguistic metadata may facilitate the use of
filtered subcorpora for targeted training of NLP systems; for example, as mentioned
previously, Nadejde and Tetreault (2019) demonstrate that grammatical error correction
systems benefit from adaptation to L1 and proficiency level.

3.2 Error annotation

One of the primary goals of the COWS-L2H project is to annotate grammatical errors
in the corpus in such a way that writing patterns typical of Spanish as a foreign language
produced by student participants can be identified, catalogued, and easily utilized by
researchers who use the corpus. To this end, we have begun the process of error-tagging
the corpus based on specific error types; the first two error types for which we have
completed annotation are gender and number agreement, and usage of the Spanish a
personal. We chose to annotate these specific error types based on research questions
we wished to explore, but we intend to expand our error annotations in the future, as
our annotation scheme can be readily adapted to additional error types we choose to
annotate. Further, we encourage other researchers to adapt the annotation scheme to
the annotation of other error types and contribute their work to the COWS-L2H project.
Our current team of annotators consists of four graduate-level Spanish instructors
who have native or near-native fluency in Spanish. As previously mentioned, we are ex-
panding our error annotation project through a collaboration with a Spanish university,
which will allow us to significantly expand both the number of annotators and the scope
of our error annotation project in the near future. Our in-text error-tagging scheme is as
follows:
[error]{edit}<annotation>.
Consider the example error in (1), and its annotation in (2):
(1) Yo vivo en el ciudad.
“I live in the city.”
(2) Yo vivo en [el]{la}<ga:fm:art> ciudad.
In (2), the first set of brackets encloses the words in the error in question, the curly
brackets that follow give the corrected edit, and the angle brackets house the error tags.
In this case, the tags indicate that the error was a gender agreement error (ga), that
masculine gender was erroneously produced in place of the correct feminine gender
(fm), and that the error occurred on the article (art). A full description of the error
annotation scheme is provided with the dataset in the corpus GitHub repository.
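Because the scheme is plain text, annotations of this form can be extracted with a simple regular expression, as in the minimal sketch below; this is illustrative only and is not the project's official tooling.

import re

ANNOTATION = re.compile(r"\[(?P<error>[^\]]*)\]\{(?P<edit>[^}]*)\}<(?P<tags>[^>]*)>")

def parse_annotations(text):
    """Return (erroneous span, proposed edit, error tags) for each annotation."""
    return [(m.group("error"), m.group("edit"), m.group("tags").split(":"))
            for m in ANNOTATION.finditer(text)]

print(parse_annotations("Yo vivo en [el]{la}<ga:fm:art> ciudad."))
# -> [('el', 'la', ['ga', 'fm', 'art'])]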
Each essay is annotated by at least two of our four annotators to ensure the accuracy
of our annotations and the suitability of our annotation scheme. Due to the open-ended
nature of the annotation task (any token can be considered a possible position of anno-
tation), determining the best measurement for inter-annotator agreement is challenging.
In Table 2, we report Krippendorff’s α (Krippendorff 2011) considering every token as an
annotation position. Thus, if both annotators choose to not annotate a token, indicating
that the token is correct, we treat this lack of annotation as agreement. This choice makes
sense because, by not making an explicit annotation on a given token, the annotators are
implicitly labeling the token as correct. An alternative method of calculating agreement
would be to consider only positions where at least one annotator indicated an error;
however, this choice would ignore all positions at which both annotators agreed that
no error exists, which is itself a form of agreement. To put our agreement values in
a more familiar context, we also report the F0.5 score, commonly used in GEC, using one
annotator as ground-truth. In terms of both Krippendorff’s α and F0.5, our annotators
show strong agreement.
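For reference, the token-level agreement computation can be sketched as follows for two annotators with binary labels (1 = error annotated, 0 = no annotation), using a direct implementation of Krippendorff's α for nominal data; the toy label sequences are illustrative.

from collections import Counter

def krippendorff_alpha_nominal(labels_a, labels_b):
    """Krippendorff's alpha for two annotators, nominal data, no missing values."""
    assert len(labels_a) == len(labels_b)
    # Coincidence counts: each unit (token) contributes both ordered pairs (a, b) and (b, a).
    coincidences = Counter()
    for a, b in zip(labels_a, labels_b):
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1
    n_c = Counter()
    for (c, _), count in coincidences.items():
        n_c[c] += count
    n = sum(n_c.values())  # total number of values = 2 * number of tokens
    observed = sum(count for (c, k), count in coincidences.items() if c != k) / n
    expected = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1.0 - observed / expected if expected else 1.0

# Every token is an annotation position; untouched tokens count as implicit "no error" labels.
ann1 = [0, 0, 1, 0, 1, 0, 0, 0]
ann2 = [0, 0, 1, 0, 0, 0, 0, 0]
print(round(krippendorff_alpha_nominal(ann1, ann2), 3))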

3.3 Parallel corrected text

In addition to annotation of selected errors, our goal is to include corrected versions
of the essays in the corpus. Currently, the compositions collected in this project are
corrected by two doctoral student associate instructors of Spanish. Both have native
or near-native command of Spanish, have previously taught the Spanish courses from
which the students have been recruited to participate in this project, and thus are
accustomed to recognizing, interpreting, and correcting errors made by students of L2
Spanish.

Error type α F0.5
Gender-Number 0.780 0.784
“a personal” 0.741 0.730
Average 0.761 0.757

Table 2
Inter-annotator agreement: error annotations

Course Level    Essays    Tokens    Errors
Beginner           448   125,985    19,577
Intermediate         8     2,598       267
Composition         24     7,138     1,080
Heritage            89    29,025     3,108
Total            3,516   892,023    24,032

Table 3
Summary of corrected essays & error count

To date, we have corrected approximately one-fifth of the essays in the corpus, for
12,678 sentences (168,937 tokens) of corrected text. The distribution of corrected essays
is shown in Table 3. Unlike the error annotations, which target specific errors, the
corrections made to this set of essays are more holistic in the manner of an instructor
correcting a student’s work. The result of the correction process is a corrected version of
the text, from which corrections can be extracted using NLP tools such as ERRANT
(Bryant, Felice, and Briscoe 2017). Additionally, we align the original and corrected
sentences to create parallel data that can be used for training NLP systems such as
grammatical error correction. To our knowledge, our corpus represents the first parallel
dataset of corrected Spanish text available to researchers.
As with our error annotations, we are in the process of completing additional
corrections and anonymization, and will make more data publicly available as soon
as practical. As can be seen in Table 3, the largest portion of our currently annotated
corpus comes from beginning students; completing additional corrections will allow
us to present a larger number of errors from students at more advanced levels. Given
the wide variety of ways a sentence can be corrected, our goal is to have each essay
corrected by three individuals. Multiple corrections will increase error coverage in our
training data and will provide additional test references for NLP researchers who are
trying to build automated error identification and correction models.

4. Grammatical Error Correction

4.1 Motivation

While current state of the art methods for GEC achieve impressive results considering
the difficulty of the GEC task, they are not well motivated by the nature of the task.
Specifically, current state-of-the-art methods rely on generating artificial training data
in a manner which is only loosely guided by the error distribution seen in real user
data. Other systems limit the class of errors which they attempt to correct. I seek to
design a system which is able to correct both grammatical and stylistic errors, resulting
in fluent output to assist L2 learners in improving their writing skills. In this paper,
I present a GEC system trained on COWS-L2H data. Additionally, I demonstrate a
novel method for generating artificial training data which is more closely aligned with
the error-distribution patterns seen in learner data. Previous work which has used
structured methods of data augmentation to improve grammatical error correction
performance (Grundkiewicz and Junczys-Dowmunt 2019) has not attempted to bring
artificial training data in line with error distributions seen in real data.

4.2 Adaptation to L1 and level

To adapt a GEC system to learner L1, level, and other individual factors, the ideal
scenario would be to train a supervised sequence-to-sequence model on a sufficient
quantity of parallel original and corrected text from the target group of learners.
However, given the paucity of available parallel learner data, there is insufficient data
available to train an effective GEC model on parallel data alone; this becomes an even
greater problem as we subdivide the data based on student language and proficiency
level. The obvious solution to this problem is to generate artificial training data which
replicates the types of errors made by each group of students. In order to understand
the errors made by each student group, we need a better understanding of the types of
errors made by students who are native English, Mandarin, and Spanish speakers, as
well as error patterns across the various proficiency levels represented in the COWS-
L2H corpus.

4.2.1 Error Analysis. To determine how the error patterns of students vary by level and
L1, I conduct a detailed analysis of error patterns in the COWS-L2H corpus. An effective
analysis of error distribution in language learner writing requires error annotations
which include the location and type of each error. The manual annotation of errors is
a time-consuming and labor-intensive process which requires extensive training on the
annotation scheme to ensure consistent and accurate annotations. Currently, the COWS-
L2H corpus contains manual error annotations for two specific error types, gender and
number agreement and usage of the personal “a”. However, because I wish to use error
data from the corpus to inform a machine translation based GEC system capable of
correcting a wide variety of grammatical and stylistic errors, I must conduct a broader
evaluation of the errors contained in student writing.
Given that the COWS-L2H corpus contains a subset of 3,516 essays which have
been fully corrected by graduate instructors of Spanish, I am able to automatically
extract and tag a diverse set of errors for analysis. While this method tends to be less
accurate than manual error annotation by skilled annotators, it allows for the rapid
analysis of multiple types of errors. The corrections made by instructors include
strictly grammatical corrections, such as replacing an erroneously selected article with
its correct counterpart, spelling and orthographic corrections, and stylistic corrections
such as word choice. See Table 4 for examples of the types of errors identified and
corrected by instructors in the COWS-L2H corpus, as well as the resulting automatic
annotations.
To automatically annotate the errors identified by the instructors, I align sentences
from the original and error-corrected subset of the COWS-L2H corpus to create a
parallel sentence dataset containing approximately 12,000 sentence pairs. Because the
corrections can include both splitting run-on sentences and merging sentence frag-
ments, I include both individual sentences and concatenated consecutive sentences in
the search space for sentence alignment.

Original:  Él era un actor y bueno persona también .
Corrected: Él fue actor y una buena persona también .
A 1 2 ||| R:AUX ||| fue ||| REQUIRED ||| -NONE- ||| 0
A 2 3 ||| U:DET |||||| REQUIRED ||| -NONE- ||| 0
A 5 5 ||| M:DET ||| una ||| REQUIRED ||| -NONE- ||| 0
A 5 6 ||| R:ADJ:FORM ||| buena ||| REQUIRED ||| -NONE- ||| 0

Table 4
Example of ERRANT automated annotation

Additionally, due to sentence reordering and
merger during correction, sentences in the parallel original and corrected essays cannot
be aligned based on sentence order. Rather, I calculate the Levenshtein distance between
each original sentence and each sentence and concatenated sentence pair in the cor-
rected text. I then align the sentences with the lowest Levenshtein edit distance. While
this method does not account for merging sentence fragments, I found that splitting
of run-on sentences was far more common in the correction process. This method
also does not account for the well-known issue of linguistically nonsensical word-
level alignments which result from the Levenshtein algorithm (Xue and Hwa 2014).
For example, as shown by Xue and Hwa (2014), because the Levenshtein algorithm
seeks only to minimize the number of edits, it is likely to align words like “repair”
and “reparations”. However, the sentence-level alignment at this stage is meant only to
identify which two sentences correspond to one another in the parallel texts; word-level
alignment to extract specific errors is completed after sentence-level alignment.
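A minimal sketch of this alignment step is given below, assuming a plain Levenshtein distance over sentence strings and a candidate set built from single and concatenated corrected sentences; the toy sentences are illustrative.

def levenshtein(a, b):
    # Standard dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def align(original_sents, corrected_sents):
    # Candidates include single corrected sentences and adjacent pairs,
    # to handle run-on sentences that were split during correction.
    candidates = list(corrected_sents) + [
        corrected_sents[i] + " " + corrected_sents[i + 1]
        for i in range(len(corrected_sents) - 1)
    ]
    return [(orig, min(candidates, key=lambda c: levenshtein(orig, c)))
            for orig in original_sents]

orig = ["Yo vivo en el ciudad y me gusta mucho es muy bonita ."]
corr = ["Yo vivo en la ciudad y me gusta mucho .", "Es muy bonita ."]
print(align(orig, corr))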
Once aligned, extracting error corrections from sentence pairs is a matter of aligning
words and identifying all edits made to transform the original sentence into its cor-
rected version. This process is achieved using the ERRor ANnotation Toolkit (ERRANT)
(Bryant, Felice, and Briscoe 2017), which locates, categorizes and annotates correc-
tions in parallel original and corrected sentences. ERRANT uses a modified version
of Damerau-Levenshtein distance (Damerau 1964) developed by Felice, Bryant, and
Briscoe (2016) to align words in parallel texts. Specifically, Felice, Bryant, and Briscoe
(2016)’s word alignment method seeks to introduce linguistic information into the align-
ment algorithm by creating a substitution cost function which considers differences in
lemma form and part-of-speech, in addition to the character-level differences used by
the original Damerau-Levenshtein algorithm. Substitutions in which the aligned words
share the same lemma form and/or part of speech (as determined by SpaCy (Honnibal
and Montani 2017)) cost less than do linguistically unrelated substitutions. As a result,
words which have similar spelling and linguistic function are more likely to be aligned.
Felice, Bryant, and Briscoe (2016) argues that the resulting alignments are more natural
and human-like than alignments generated by simple character-level alignment used
by Levenshtein.
Once word-level alignment is completed, errors can be readily identified by com-
paring the differences between the original and corrected versions of the parallel text.
ERRANT uses a set of approximately fifty ordered rules to classify each identified error
into one of three operations and one of seventeen general error classes based on the de-
pendency label and part of speech of both the original and corrected word form.

L1 / Error type    Error rate (% of total errors)
English
  Determiners      13.2%
  Verbs             5.9%
  Spelling          7.9%
Spanish
  Determiners      10.8%
  Verbs             6.6%
  Spelling          6.4%
Mandarin
  Determiners      14.9%
  Verbs             7.7%
  Spelling          3.0%

Table 5
Sample of error types as percentage of total errors for different L1s

Words
removed during correction are tagged with the operation label “U” for “unnecessary”.
For those words inserted during correction, the operation label is “M” for “missing”,
while the operation label for replacements is “R”. Most errors identified by ERRANT
correspond to part of speech tags, such as Noun and Verb. However, the system also
includes specific tags for word order, morphological, spelling, and orthographic errors.
For example, two words with the same lemma are tagged as morphological variants
(“MORPH”). Aligned words which are not identical but which share at least half of their
characters are tagged as “SPELL”, while words which differ only in capitalization
are tagged “ORTH”. Additionally, two adjacent words whose order is swapped by
correction are tagged “WO” for “word order”. See Table 4 for examples of these error
types. A detailed explanation of the ERRANT system’s rule-based error tagging method
can be found in Bryant, Felice, and Briscoe (2017).
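The relationship between edit spans and the three operation labels can be illustrated with a toy sketch; the finer-grained error classes are assigned by ERRANT's rule set, which is not reproduced here:

```python
# Toy illustration of how ERRANT's three operation labels relate to edit spans;
# the real toolkit applies roughly fifty ordered rules over POS, lemma, and
# dependency information to assign the finer-grained error classes.

def operation_label(orig_tokens, corr_tokens):
    if not corr_tokens:        # tokens deleted during correction
        return "U"             # "unnecessary"
    if not orig_tokens:        # tokens inserted during correction
        return "M"             # "missing"
    return "R"                 # otherwise a replacement

print(operation_label(["un"], []))             # U: unnecessary determiner
print(operation_label([], ["una"]))            # M: missing determiner
print(operation_label(["bueno"], ["buena"]))   # R: replacement
```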
Although ERRANT was originally designed for error analysis of English texts, I
have modified the system to process and tag Spanish parallel texts. To achieve this mod-
ification, I set the SpaCy library settings used by ERRANT to the “es_core_news_sm”
Spanish model, allowing the system to utilize SpaCy’s Spanish dependency parser and
part-of-speech tagger. I also provide the system with a Spanish word list generated from
the Spanish HunSpell dictionary so that it may identify spelling errors in Spanish text.
Finally, I remove English-specific error-tagging rules. My modified ERRANT code is
available on GitHub at https://github.com/ucdaviscl/cowsl2h.
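The two resource substitutions can be sketched as follows; the word-list file name is a placeholder, and the actual changes are made inside the modified ERRANT code linked above:

```python
import spacy

# Sketch of the two resource swaps described above; the word-list file name is
# a placeholder.
nlp = spacy.load("es_core_news_sm")   # Spanish POS tagger and dependency parser

# Word list generated from the Spanish HunSpell dictionary, used to flag
# spelling errors.
with open("es_wordlist.txt", encoding="utf-8") as f:
    spanish_words = {line.strip().lower() for line in f if line.strip()}

def is_spelling_error(token: str) -> bool:
    return token.isalpha() and token.lower() not in spanish_words
```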
A sample of the error analysis data by L1 is shown in Table 5.

L1          Error type     Error rate
English     Determiners    13.2%
English     Verbs           5.9%
English     Spelling        7.9%
Spanish     Determiners    10.8%
Spanish     Verbs           6.6%
Spanish     Spelling        6.4%
Mandarin    Determiners    14.9%
Mandarin    Verbs           7.7%
Mandarin    Spelling        3.0%

Table 5
Sample of error types as percentage of total errors for different L1s

4.3 System Development

One goal of this paper is to compare the various methods of generating artificial error
data for training supervised GEC models with an NMT system. Currently, I have
completed implementation of three of the four planned methods: a random noising
method similar to Zhao et al. (2019), an NMT-based noising method as proposed in
Xie et al. (2018), and the reimplementation in Spanish of the method proposed in
Grundkiewicz and Junczys-Dowmunt (2019) and Grundkiewicz, Junczys-Dowmunt, and
Heafield (2019).
I am still in the process of implementing my error injection method, which is based
on the error analysis conducted on the parallel corrected COWS-L2H data. In addition,
we are recruiting additional graduate students to correct student essays, which will
increase the amount of error data available for analysis and fine-tuning.

4.3.1 Training Data. First, I generate artificial parallel training data by randomly adding
noise to a dataset of 1.3 million grammatical sentences from the Polyglot dump of
Spanish Wikipedia (Al-Rfou, Perozzi, and Skiena 2013), with 150,000 sentences set aside
for validation. Similar to the method of Zhao et al. (2019), the addition of random
noise consists of three operations – token deletion, token insertion (from the fifty most
common tokens in the corpus), and token scrambling, with each operation applied to
10% of tokens. The resulting dataset thus has approximately 30% of tokens mutated in
some way, as shown in Table 6. Adding noise to the data in this manner follows the idea
behind a denoising auto-encoder (Vincent et al. 2008) which learns underlying features
in the process of denoising data. By training on our artificial noisy data, the system
builds a language model of Spanish which enables it to construct grammatical Spanish
sentences from noisy input. While Junczys-Dowmunt et al. (2018) and Grundkiewicz
and Junczys-Dowmunt (2019) show that selective augmentation using suggestions from
a spell-checker results in better performing GEC systems, I chose to test this simpler
method of data noising as a baseline and for the purpose of demonstrating the value of
the COWS-L2H data.

Original: En 1990 se fundó la radio pública Ràdio Nacional d’Andorra .
Noised:   1990 se fundó la radio pública Ràdio nnliocaa años d’Andorra .

Table 6
Artificial noising of data
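A minimal sketch of this noising procedure, applied to one tokenized sentence, is shown below; the operation rates and the use of the fifty most common tokens follow the description above, while the function name is illustrative:

```python
import random

# Minimal sketch of the random-noising step applied to one tokenized sentence.
# Each operation targets roughly 10% of tokens, as described above.

def add_random_noise(tokens, common_tokens, p=0.10):
    noised = []
    for tok in tokens:
        r = random.random()
        if r < p:                         # token deletion
            continue
        if r < 2 * p:                     # token insertion from the 50 most common tokens
            noised.append(random.choice(common_tokens))
            noised.append(tok)
            continue
        if r < 3 * p:                     # token scrambling
            chars = list(tok)
            random.shuffle(chars)
            noised.append("".join(chars))
            continue
        noised.append(tok)                # ~70% of tokens pass through unchanged
    return noised
```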
Second, to generate artificial training data using a noising model as proposed in
Xie et al. (2018), I first train a simple, LSTM-based neural translation model using the
parallel corrected data from the COWS-L2H corpus. By placing the corrected sentences
on the source side, and the original sentences (containing errors) on the target side,
I create a noising model which is intended to translate from “correct” Spanish to
“learner” Spanish. I again generate the artificial training data by applying the model
described to a dataset of 1.3 million grammatical sentences from the Polyglot dump of
Spanish Wikipedia (Al-Rfou, Perozzi, and Skiena 2013), with 150,000 sentences set aside
for validation.
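Preparing the training files for this reverse model amounts to swapping the two sides of the parallel corpus, as in the sketch below (file names are placeholders):

```python
# Sketch of preparing training files for the reverse "noising" model: the
# corrected learner sentences become the source side and the original,
# error-containing sentences the target side. File names are placeholders.

with open("cowsl2h.corrected.txt", encoding="utf-8") as corrected, \
     open("cowsl2h.original.txt", encoding="utf-8") as original, \
     open("noiser.src", "w", encoding="utf-8") as src, \
     open("noiser.tgt", "w", encoding="utf-8") as tgt:
    for corr_line, orig_line in zip(corrected, original):
        src.write(corr_line)   # "correct" Spanish as model input
        tgt.write(orig_line)   # learner Spanish (with errors) as model output
```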
Third, to generate artificial training data using reverse spelling correction, I fol-
low the MAGEC method outlined in Grundkiewicz and Junczys-Dowmunt (2019). As
discussed previously, this method consists of generating confusion sets from the top
twenty suggestions from an open-source spell check system. The system then replaces
a set percentage of tokens in the source text with a suggestion from the confusion
set for the target token. Additionally, the system transposes, deletes, and scrambles a
fixed percentage of tokens. To implement this method, I modified the scripts from
the GitHub repository released by Grundkiewicz and Junczys-Dowmunt (2019) to
generate confusion sets for Spanish texts. As with the previous methods, I generate
the artificial training data by applying the method described to 1.3 million Spanish
Wikipedia sentences.
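The core replacement step can be sketched as follows; the `suggest` argument stands in for the suggestion function of an open-source spell checker (for example, a HunSpell binding), and the rates shown are placeholders:

```python
import random

# Illustrative sketch of the MAGEC-style replacement step. `suggest` stands in
# for an open-source spell checker's suggestion function; the replacement rate
# and confusion-set size are configurable placeholders.

def magec_noise(tokens, suggest, replace_rate=0.10, top_k=20):
    noised = []
    for tok in tokens:
        if random.random() < replace_rate:
            confusion_set = suggest(tok)[:top_k]   # top-20 spell-checker suggestions
            if confusion_set:
                noised.append(random.choice(confusion_set))
                continue
        noised.append(tok)
    return noised
```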
My fourth noising method, which I am still in the process of implementing, consists
of adding artificial errors to the Polyglot Spanish Wikipedia data at the rates identified
in the COWS-L2H corpus. For example, on average, approximately 15% of determiners
contain an error across the entire COWS-L2H corpus. For Mandarin speakers, this rate
is slightly higher, and it is slightly lower for Spanish speakers. Similarly, the rate of
determiner errors by word is higher for Spanish 1 students (2.2% of words are a deter-
miner error), while for Spanish 21 students this number drops to 1.4%. To implement
this method I will first tag the source data for part of speech, using FreeLing (Padró
and Stanilovsky 2012). I will then select a percentage of each POS corresponding to the
observed error rate in the corrected COWS-L2H data; the error rate for each essay will be
drawn from the distribution of observed error rates to avoid an unrealistic uniformity in
error distribution. Additionally, as in both MAGEC and the random noising, a portion
of tokens in each sentence will be scrambled, transposed, or deleted to simulate spelling
errors and other typos at a rate similar to that observed in the parallel learner data. The
same operations are also carried out at the character level; a percentage of characters are
randomly replaced with other characters, transposed, or deleted.
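A simplified sketch of the proposed injection step is shown below. spaCy's Spanish tagger is used only to keep the example self-contained (the actual pipeline will use FreeLing); the determiner rate follows the corpus-wide figure cited above, while the remaining rates, the jitter, and the confusion sets are placeholders, and character-level perturbations are omitted for brevity:

```python
import random
import spacy

# Simplified sketch of the proposed error-injection step. The actual pipeline
# will tag with FreeLing; spaCy is used here only to keep the example
# self-contained. The 0.15 determiner rate follows the corpus-wide figure
# cited above; the other rates and the jitter are placeholders.
nlp = spacy.load("es_core_news_sm")

ERROR_RATES = {"DET": 0.15, "VERB": 0.07, "NOUN": 0.05}

def inject_errors(essay_sentences, confusion_sets, rate_jitter=0.3):
    # Draw this essay's per-POS rates around the corpus averages, so the error
    # distribution is not unrealistically uniform across essays.
    rates = {pos: max(0.0, random.gauss(r, r * rate_jitter))
             for pos, r in ERROR_RATES.items()}
    noised = []
    for sentence in essay_sentences:
        out = []
        for tok in nlp(sentence):
            if (random.random() < rates.get(tok.pos_, 0.0)
                    and tok.text in confusion_sets):
                out.append(random.choice(confusion_sets[tok.text]))
            else:
                out.append(tok.text)
        noised.append(" ".join(out))
    return noised
```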
While the method proposed in MAGEC does result in markedly improved GEC
performance compared to the random noising method, the confusion sets do not tend
to represent the types of errors likely to be made by students. For example, the MAGEC
confusion set for the verb “had” is “hard, head, hand, gad, has, ha, ad, hat”. The verb
form “has” is a likely error, but the remainder represent simple spelling mistakes which
can be easily recreated with character level perturbations. The MAGEC confusion set is
missing likely verb errors for “had” such as “have” and “having”. In a language such
as Spanish, which has more complex verb morphology than English, the confusion set
for a verb would be much larger. In order to create a system in which artificial errors
more closely resemble errors student learners are likely to make, the confusion sets
for each word in my vocabulary will be based on part-of-speech as well. For example,
confusion sets for verbs will consist of alternate conjugations for the same and similarly
spelled verbs, to simulate likely errors made by students choosing the incorrect verb
form. Verb morphology information is drawn from Fred Jehle’s Conjugated Spanish
Verb Database (Jehle 1987). The confusion sets for closed classes such as prepositions
and determiners will consist of fixed sets of closed class alternatives. For example, the
determiner confusion set consists of the alternate determiner forms which could result
from an incorrect gender assignment or incorrect use of an article. Nouns, adjectives,
and other non-verb open-classes will continue to follow the method of MAGEC, in
which the confusion sets consist of spelling suggestions from an open-source spell-
checker.
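The resulting POS-aware confusion sets can be sketched as follows; the determiner list illustrates the closed-class idea, while `verb_forms` stands in for a lookup table derived from Jehle's conjugation database and `suggest` for the spell checker used for the remaining open classes:

```python
# Sketch of POS-aware confusion sets; `verb_forms` and `suggest` are
# placeholders for the resources named in the text.

DETERMINERS = ["el", "la", "los", "las", "un", "una", "unos", "unas"]

def build_confusion_set(word, pos, verb_forms, suggest, top_k=20):
    if pos == "DET":
        # alternate gender, number, or definiteness choices
        return [d for d in DETERMINERS if d != word]
    if pos == "VERB" and word in verb_forms:
        # alternate conjugations of the same (or a similarly spelled) verb,
        # simulating an incorrect choice of tense, person, or mood
        return [f for f in verb_forms[word] if f != word]
    # nouns, adjectives, and other open classes fall back to MAGEC-style
    # spell-checker suggestions
    return suggest(word)[:top_k]
```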
Once I implement the artificial data generation method described above, I can
then readily generate additional artificial datasets tuned to the error patterns observed
among specific groups of students. Tuning the training data, and the resultant GEC
system, to individual students becomes a simple matter of changing the number and
proportion of each error type injected into the source data. Additionally, since the
injection of errors is readily controllable, it is possible to generate training sets which
focus only on a specific type, or a group of specific types, of errors. Should an instructor
wish to provide students with a GEC system designed to focus on gender-number
agreement, for example, generating the necessary training data would require injecting
only those types of errors into the source data, then using the resulting data to train a
GEC system.

Our learner data, which we reserve for fine-tuning and testing, consists of 10,000
parallel uncorrected and corrected sentences drawn from COWS-L2H, with 1,400 sen-
tences set aside for validation and 1,400 for testing. However, this fine-tuning data can
be modified depending on the output desired from the final GEC system. Because ER-
RANT provides span-based annotation of edits to the original student text, modifying
the text to include only changes which correct specific error types is possible. Thus, if an
instructor wants a system trained only to correct determiner errors, this can be achieved
by modifying both the artificial noising method for pretraining data generation and the
final fine-tuning data to include only the target corrections.
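For example, fine-tuning targets restricted to determiner corrections could be built from ERRANT's M2 output roughly as follows (a sketch assuming the M2 format illustrated in Table 4; the set of retained edit types is illustrative):

```python
# Sketch of building fine-tuning targets that apply only a chosen subset of
# ERRANT edits (here, determiner edits), reading the M2 format shown in Table 4.

def apply_selected_edits(m2_block, keep_types=("M:DET", "U:DET", "R:DET")):
    lines = m2_block.strip().split("\n")
    tokens = lines[0][2:].split()            # the "S ..." line holds the original sentence
    edits = []
    for line in lines[1:]:                   # "A start end|||type|||correction|||..."
        fields = [f.strip() for f in line[2:].split("|||")]
        start, end = map(int, fields[0].split())
        etype, correction = fields[1], fields[2]
        if etype in keep_types:
            repl = correction.split() if correction and correction != "-NONE-" else []
            edits.append((start, end, repl))
    # apply edits right to left so earlier token offsets stay valid
    for start, end, repl in sorted(edits, reverse=True):
        tokens[start:end] = repl
    return " ".join(tokens)
```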

4.3.2 Model and training procedure. All GEC models are trained using a neural ma-
chine translation (NMT) setup with a 6-layer Transformer (Vaswani et al. 2017) encoder
and decoder, both with embedding vector size of 512, 8-head self-attention layers, and
feed-forward layers of size 2048. I implement the model using OpenNMT-py (Klein
et al. 2017). I use the Adam optimizer (Kingma and Ba 2014) with a learning rate of 2,
a dropout of 0.1, an Adam beta2 of 0.98 (as suggested by Vaswani et al. (2017)), and
a batch size of 2048. The decoder uses beam search with a beam size of 5. Validation
is conducted every 10,000 steps, and training is stopped if model perplexity has not
improved for four validation steps. I train all models for 200,000 steps on 1.04 million
sentences of the artificially noised Wikipedia data. After this initial training is complete,
I fine-tune the model for a maximum of 5,000 training steps, with validation every 1,000
steps, on 10,000 sentences of parallel COWS-L2H data to achieve our final models.

4.3.3 Results. I evaluate model performance using the ERRANT scorer (Bryant, Felice,
and Briscoe 2017), which, although designed to compare automatically annotated En-
glish text, is capable of aligning edits in Spanish text with slight modifications. I carefully
reviewed the ERRANT output after making the required modifications to ensure that
alignments were accurate. With a model trained as described above on randomly noised
pretraining data and fine-tuned using COWS-L2H data, I achieve an F0.5 score of 0.224.
As expected based on Grundkiewicz and Junczys-Dowmunt (2019)’s results in English,
German and Russian, the MAGEC method significantly outperforms the random noise
method, achieving an F0.5 score of 0.392. This increase is similar to the increase seen on
English data; when I implement my random noising system using English Wikipedia
data for pretraining and BEA-2019 data for fine-tuning, I achieve an F0.5 of 0.276 on the
BEA-2019 test set. Grundkiewicz and Junczys-Dowmunt (2019) report an F0.5 of 0.443 on
the same test set. Surprisingly, the noising model method, based on Xie et al. (2018), does
not perform better than the random noising method, achieving an F0.5 score of 0.223. I
suspect that this is the result of too little training data being available to train an effective
noising model. The fact that a model trained to correct errors on the parallel COWS-
L2H data alone achieves an F0.5 of only 0.101 supports the conclusion that there is simply too little
genuine parallel data to train an effective translation model.
While these figures are lower than those of state-of-the-art error correction systems for En-
glish, several considerations must be taken into account. First, far more genuine learner
data is available for system fine-tuning of English language GEC systems. Additionally,
I have not conducted model ensembling, which is common in current state-of-the-art
GEC systems. My goal in this paper is to propose a new method of artificial training data
generation and to discuss the effectiveness of various previously proposed methods;
LM ensembling is reserved for future research. Finally, I must conduct further research
to determine the overall performance of these systems relative to other possible model
configurations using the COWS-L2H dataset. As the systems discussed in this paper are
the first NMT-based grammatical error correction systems for Spanish learners, I have
no specific baseline with which to compare my models. One important consideration is
that the manual corrections made to the COWS-L2H essays were done in the manner
of a teacher correcting a student’s writing, and thus include many corrections which
are more stylistic than grammatical in nature. I can, however, confirm the effectiveness
of my training procedure that combines artificially noised data with Spanish learner
text. In Table 7, I show results for training on artificially noised data alone, MAGEC
data alone, and parallel data alone. I also present results for final models pretrained on
the random data and MAGEC data, respectively, then fine-tuned on parallel COWS-
L2H data. As previously discussed, I believe that performance could be improved by
increasing the amount of artificial training data and obtaining more genuine parallel data
for fine-tuning. Additionally, I hypothesize that my proposed method of inserting errors
based more closely on the actual errors made by student learners will result in
better-performing and more adaptable models.

Model                     Precision   Recall   F0.5
Random only                   0.026    0.019   0.024
Noise model only              0.067    0.070   0.068
MAGEC only                    0.174    0.044   0.109
Parallel only                 0.094    0.139   0.101
Fine-tuned Random             0.254    0.153   0.224
Fine-tuned Noise model        0.244    0.167   0.223
Fine-tuned MAGEC              0.480    0.226   0.392

Table 7
Model results

5. Error Correction in the Language Classroom

One major unanswered question in using GEC to provide AWCF is how tolerant stu-
dents are of system error, and how much system error negatively impacts learning
objectives. Unfortunately, little experimental work has been conducted to determine
how much system error affects the usability of AWCF systems, and what constitutes
an acceptable amount of error. Quinlan, Higgins, and Wolff (2009) aim for a system
accuracy of 80%, while Burstein, Chodorow, and Leacock (2003) sought to achieve an
accuracy of 90% when developing their Criterion AWCF system. However, both of these
targets seem somewhat arbitrary, and do not appear to be supported by empirical evidence.
In one of the few studies evaluating the accuracy of Criterion, Lavolette, Polio, and
Kahng found that Criterion accurately tags 75% of errors, and that students utilize its
suggestions 73% of the time. However, Ranalli (2018) reports that even Criterion, which
was designed to achieve 90% accuracy, fell below 50% for some types of errors.
While identifying all errors would be the ideal result, most GEC and AWCF re-
searchers argue that the more important goal is avoiding false positives, that is, flagging
as an error something that is in fact correct (Ranalli 2018). This means that precision is
more important than recall; fortunately, NMT-based GEC systems, such as those reported
in this paper, have much higher precision than recall. In fact, the precision of the current
state-of-the-art GEC system for English, GECToR (Omelianchuk et al. 2020), is 78.9,
which approaches the accuracy requirement cited by Quinlan, Higgins, and
Wolff (2009). Given this performance, I believe that a well-designed neural GEC system,
with a targeted error generation method for pretraining and sufficient learner data for
fine-tuning, can be effectively integrated into an automated corrective feedback system
designed to provide synchronous feedback to students.
Another possible issue with AWCF raised by Ranalli (2018) is that feedback should
be targeted to the proficiency level of the student and the pedagogical goals of the
instructor. One major benefit of my proposed artificial data generation method is that
creating a system which targets specific error types becomes straightforward; since
errors are inserted based on part of speech, one need only identify the parts of speech
into which errors should be inserted to create targeted training data.
Additionally, should an instructor wish to insert specific types of errors, such as gender
agreement errors, they would need only to modify the confusion sets for the target
items. Finally, for fine-tuning, the genuine learner data can be modified using the
automatically annotated ERRANT output to enforce only those edits which meet the
model designer’s guidelines.
Once an effective GEC system is trained, the decision to provide focused or un-
focused feedback is a choice which can be made based on the proficiency level and
needs of students. A GEC system can provide output which identifies errors and proposes
corrections, or it can simply identify the locations of potential errors. If trained using the
sequence tagging method proposed by GECTOR (Omelianchuk et al. 2020), the model
could additionally identify the type of the error. Finally, the system could allow students
to have access to progressively more detailed information at their discretion if they are
unable to correct the error with the limited information initially provided.

6. Conclusion and Future Work

Large corpora of language learner text are critical to the development of effective NLP
tools for the L2 classroom. But for languages other than English, such corpora are
fairly small and limited in scope. Even more limited is the type of parallel original
and corrected data needed by machine learning systems to learn how to correct errors
in student texts. Modern neural machine learning models require large amounts of
training data to tune their huge number of parameters. Current learner datasets simply
do not contain enough parallel data to train effective GEC systems on parallel data
alone, as demonstrated by the F0.5 of 0.101 achieved by the genuine-data-only system
previously discussed. To solve this problem, researchers in GEC have largely turned
to generating artificial error data by purposefully injecting errors into well-written
text. The ideal manner to generate artificial data for training GEC systems remains an
open question and the target of active research. This paper proposes a new method for
generating artificial training data for GEC based on the error patterns observed in real
learner text. To that end, I have conducted an analysis of the error patterns observed
in the COWS-L2H corpus of learner Spanish, and described a new system which can
generate artificial error data that replicates these error patterns. Based on the review
of previous error generation methods, I am able to address my research questions as
follows:
1) As discussed, current methods for generating artificial training data do not
attempt to replicate actual learner error patterns in their output. Random noising is
unlikely to resemble true error patterns to any extent, with the possible exception of a
small subset of character-level transpositions. Noising models, such as those described
by Xie et al. (2018), may be capable of replicating user error patterns, but such systems
require large amounts of genuine parallel data to train an effective noising model. The
ineffectiveness of noising models trained with limited parallel data is confirmed by
my implementation using COWS-L2H data, which did not perform any better than
the random noise model. Finally, the MAGEC method described by Grundkiewicz and
Junczys-Dowmunt (2019) performs significantly better than the other noising models
tested, but again it makes no explicit effort to replicate user error patterns. Any resem-
blance in the errors injected by MAGEC to real-world error patterns is likely due to the
fact that the spell-check system used by MAGEC may include errors commonly made
by users in its confusion sets (such as “has” in the confusion set for “had”), but this is in
no way guaranteed. Ultimately, current methods do not appear to closely replicate the
error distribution seen in learner data in any consistent manner.
2) The only research attempting to adapt GEC to student L1 and proficiency is
Nadejde and Tetreault (2019). However, as mentioned previously, their work is currently
applicable only to English due to the amount of training data required. Current methods
for generating artificial training data, which make GEC possible for a far larger num-
ber of languages, do not currently attempt to adapt training data to specific student
attributes such as L1 and proficiency. The method proposed in this paper would be the
first method for generating artificial data which could be tuned to student attributes
and the pedagogical goals of instructors.
3) As I have not completed implementing my method for explicitly replicating
learner error distributions in artificial parallel training data, I am not able to fully answer
my final research question at this time. However, I can say that MAGEC, which seems
to replicate some aspects of learner error patterns due to its use of an open-source spell
checker, performs far better than a system pre-trained on randomly noised data. Thus,
I hypothesize that my proposed implementation will result in better GEC performance.
Additionally, given that Nadejde and Tetreault (2019) found that adapting GEC systems
to student L1 and proficiency improves system performance, I believe that my system,
which will be readily adaptable to student attributes, will allow for more useful GEC
for language learners.
In future work, I plan to complete the implementation of the proposed method for
generating artificial parallel data for training GEC models. Once I have built a general-
purpose GEC system for Spanish learners using COWS-L2H data, I will continue to
refine the system to generate training data adapted to student L1 and proficiency level.
Finally, I plan to test whether the error patterns seen in Spanish can be extended to
learners of other, related languages, allowing the building of GEC systems for languages
for which genuine parallel training data may not exist.
The development of accurate, GEC-driven AWCF systems for language learners is
an important step forward in the field of computer-assisted language learning. Such
tools could assist teachers with time-consuming grading and assessment, and give
students rapid feedback while writing. While annotated learner corpora of English are
widely available, large learner corpora of other languages are less common, and those
that exist do not include annotations needed for development of error correction tools
for language learners. As a result, the field has seen little research in the development
of NLP tools designed to benefit non-English language learners and teachers. This
paper proposes a new method for training GEC systems using annotated corpora of
learner data such as COWS-L2H, a freely available corpus of Spanish learner data which
includes raw text, student demographic and linguistic information, error annotations
and parallel corrected text. Once implemented, I hope that the proposed method for
generating artificial data will lead to a workable AWCF system for Spanish learners
across a variety of proficiency levels and L1s. Finally, I hope that this research will
encourage further inquiry into GEC-driven AWCF for the language classroom and into
the development of GEC models for other lower-resourced languages.

References
AAAS. 2016. The State of Languages in the U.S.: A Statistical Portrait.
Al-Rfou, Rami, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed Word
Representations for Multilingual NLP. In Proceedings of the Seventeenth Conference on
Computational Natural Language Learning, pages 183–192, Association for Computational
Linguistics, Sofia, Bulgaria.
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by
jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Bitchener, John and Dana R Ferris. 2012. Written Corrective Feedback in Second Language
Acquisition and Writing. Routledge.
Bryant, Christopher and Ted Briscoe. 2018. Language model based grammatical error correction
without annotated training data. In Proceedings of the Thirteenth Workshop on Innovative Use of
NLP for Building Educational Applications, pages 247–253.
Bryant, Christopher, Mariano Felice, Øistein E Andersen, and Ted Briscoe. 2019. The BEA-2019
shared task on grammatical error correction. In Proceedings of the Fourteenth Workshop on
Innovative Use of NLP for Building Educational Applications, pages 52–75.
Bryant, Christopher, Mariano Felice, and Edward John Briscoe. 2017. Automatic annotation and
evaluation of error types for grammatical error correction. Association for Computational
Linguistics.
Bryant, Christopher and Hwee Tou Ng. 2015. How far are we from fully automatic high quality
grammatical error correction? In Proceedings of the 53rd Annual Meeting of the Association for
Computational Linguistics and the 7th International Joint Conference on Natural Language Processing
(Volume 1: Long Papers), pages 697–707.
Bryant, Christopher Jack. 2019. Automatic annotation of error types for grammatical error correction.
Ph.D. thesis, University of Cambridge.
Burstein, Jill, Martin Chodorow, and Claudia Leacock. 2003. Criterion: Online essay evaluation: An
application for automated evaluation of test-taker essays. In Fifteenth Annual Conference on
Innovative Applications of Artificial Intelligence, Acapulco, Mexico.
Cai, Hansong and Luna Jing Cai. 2015. An Exploratory Study on the Role of L1 Chinese and L2
English in the Cross-Linguistic Influence in L3 French. Online Submission, 9(3):1–30.
Cervantes, Instituto. 2019. El español una lengua viva-Informe 2019.
Cheville, Julie. 2004. Automated Scoring Technologies and the Rising Influence of Error. The
English Journal, 93(4):47–52.
Chollampatt, Shamil and Hwee Tou Ng. 2018. Neural Quality Estimation of Grammatical Error
Correction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language
Processing, pages 2528–2539.
Chollampatt, Shamil, Weiqi Wang, and Hwee Tou Ng. 2019. Cross-sentence grammatical error
correction. In Proceedings of the 57th Annual Meeting of the Association for Computational
Linguistics, pages 435–445.
Coster, William and David Kauchak. 2011. Simple English Wikipedia: a new text simplification
task. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics:
Human Language Technologies: short papers-Volume 2, pages 665–669, Association for
Computational Linguistics.
Cummins, Jim. 2008. Teaching for transfer: Challenging the two solitudes assumption in
bilingual education. Encyclopedia of language and education, 5:65–75.
Dahlmeier, Daniel and Hwee Tou Ng. 2012. A beam-search decoder for grammatical error
correction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language
Processing and Computational Natural Language Learning, pages 568–578.
Damerau, Fred J. 1964. A technique for computer detection and correction of spelling errors.
Communications of the ACM, 7(3):171–176.
Davidson, Sam, Aaron Yamada, Paloma Fernandez-Mira, Agustina Carando, Claudia H
Sanchez-Gutierrez, and Kenji Sagae. 2020. Developing NLP tools with a new corpus of learner
Spanish. In Proceedings of The 12th Language Resources and Evaluation Conference, pages
7238–7243.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dikli, Semire. 2006. An overview of automated scoring of essays. The Journal of Technology,
Learning and Assessment, 5(1).
Ellis, Rod. 2002. Grammar teaching: Practice or consciousness-raising. Methodology in Language
Teaching: An Anthology of Current Practice, pages 167–174.
Felice, Mariano, Christopher Bryant, and Ted Briscoe. 2016. Automatic Extraction of Learner
Errors in ESL Sentences Using Linguistically Enhanced Alignments. In Proceedings of COLING
2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages
825–835.
Fishman, Joshua A. 2001. Can threatened languages be saved?: Reversing language shift, revisited: A
21st century perspective, volume 116. Multilingual Matters.
Flachs, Simon, Ophélie Lacroix, and Anders Søgaard. 2019. Noisy channel for low resource
grammatical error correction. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP
for Building Educational Applications, pages 191–196.
Fraser, Ian S and Lynda M Hodson. 1978. Twenty-One Kicks at the Grammar Horse: Close-Up:
Grammar and Composition. English Journal, 67(9):49–54.
Gass, Susan M. 1991. Grammar instruction, selective attention and learning processes.
Foreign/Second Language Pedagogy Research, pages 134–141.
Granger, Sylviane. 1998. The Computer Learner Corpus: a versatile new source of data for SLA research.
na.
Grundkiewicz, Roman and Marcin Junczys-Dowmunt. 2019. Minimally-Augmented
Grammatical Error Correction. In Proceedings of the 5th Workshop on Noisy User-generated Text
(W-NUT 2019), pages 357–363.
Grundkiewicz, Roman, Marcin Junczys-Dowmunt, and Kenneth Heafield. 2019. Neural
grammatical error correction systems with unsupervised pre-training on synthetic data. In
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational
Applications, pages 252–263.
Heafield, Kenneth. 2011. KenLM: Faster and smaller language model queries. In Proceedings of
the Sixth Workshop on Statistical Machine Translation, pages 187–197.
Honnibal, Matthew and Ines Montani. 2017. spaCy 2: Natural language understanding with
Bloom embeddings, convolutional neural networks and incremental parsing. To appear, 7(1).
Ionin, Tania and Silvina Montrul. 2010. The role of L1 transfer in the interpretation of articles
with definite plurals in L2 English. Language learning, 60(4):877–925.
Jehle, Fred. 1987. A free-form dialog program in Spanish. Calico Journal, pages 11–22.
Junczys-Dowmunt, Marcin, Roman Grundkiewicz, Shubha Guha, and Kenneth Heafield. 2018.
Approaching neural grammatical error correction as a low-resource machine translation task.
In Proceedings of the 2018 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 595–606.
Kaneko, Masahiro, Kengo Hotate, Satoru Katsumata, and Mamoru Komachi. 2019. TMU
Transformer System Using BERT for Re-ranking at BEA 2019 Grammatical Error Correction
on Restricted Track. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for
Building Educational Applications, pages 207–212.
Kasewa, Sudhanshu, Pontus Stenetorp, and Sebastian Riedel. 2018. Wronging a Right:
Generating Better Errors to Improve Grammatical Error Detection. arXiv preprint
arXiv:1810.00668.
Kingma, Diederik P and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980.
Klein, Guillaume, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M Rush. 2017.
Opennmt: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810.
Koltovskaia, Svetlana. 2020. Student engagement with automated written corrective feedback
(AWCF) provided by Grammarly: A multiple case study. Assessing Writing, 44:100450.
Krippendorff, Klaus. 2011. Computing Krippendorff’s alpha-reliability.
Lavolette, Elizabeth, Charlene Polio, and Jimin Kahng. The accuracy of computer-assisted
feedback and students’ responses to it. Language Learning & Technology.
Leacock, Claudia, Martin Chodorow, Michael Gamon, and Joel Tetreault. 2010. Automated
Grammatical Error Detection for Language Learners. Synthesis Lectures on Human Language
Technologies, 3(1):1–134.
Lee, Kyusong and Gary Geunbae Lee. 2014. Postech grammatical error correction system in the
CoNLL-2014 shared task. In Proceedings of the Eighteenth Conference on Computational Natural
Language Learning: Shared Task, pages 65–73.
Li, Jinrong, Stephanie Link, and Volker Hegelheimer. 2015. Rethinking the role of automated
writing evaluation (AWE) feedback in ESL writing instruction. Journal of Second Language
Writing, 27:1–18.
Li, Yiyuan, Antonios Anastasopoulos, and Alan W Black. 2020. Towards Minimal Supervision
BERT-Based Grammar Error Correction (Student Abstract). In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 34, pages 13859–13860.
Lynch, Andrew. 2008. The linguistic similarities of Spanish heritage and second language
learners. Foreign Language Annals, 41(2):252–381.
McCoy, Kathleen F, Christopher A Pennington, and Linda Z Suri. English error correction: A
syntactic user model based on principled mal-rule scoring.
Montrul, Silvina. 2010. Current issues in heritage language acquisition. Annual Review of Applied
Linguistics, 30:3.
Nadeem, Farah, Huy Nguyen, Yang Liu, and Mari Ostendorf. 2019. Automated essay scoring
with discourse-aware neural models. In Proceedings of the Fourteenth Workshop on Innovative
Use of NLP for Building Educational Applications, pages 484–493, Association for Computational
Linguistics, Florence, Italy.
Nadejde, Maria and Joel Tetreault. 2019. Personalizing Grammatical Error Correction:
Adaptation to Proficiency Level and L1. In Proceedings of the 5th Workshop on Noisy
User-generated Text (W-NUT 2019), pages 27–33, Association for Computational Linguistics,
Hong Kong, China.
Napoles, Courtney and Chris Callison-Burch. 2017. Systematically adapting machine translation
for grammatical error correction. In Proceedings of the 12th Workshop on Innovative use of NLP for
Building Educational Applications, pages 345–356.
Ng, Hwee Tou, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and
Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In
Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task,
pages 1–14.
NTCE. 2014. NCTE.
Omelianchuk, Kostiantyn, Vitaliy Atrasevych, Artem Chernodub, and Oleksandr Skurzhanskyi.
2020. GECToR–Grammatical Error Correction: Tag, Not Rewrite. arXiv preprint
arXiv:2005.12592.
Ooms, Jeroen. 2018. Hunspell: High-performance stemmer, tokenizer, and spell checker.
Retrieved from https://CRAN.R-project.org/package=hunspell.
Padró, Lluís and Evgeny Stanilovsky. 2012. Freeling 3.0: Towards wider multilinguality. In
LREC2012.
Polinsky, Maria and Olga Kagan. 2007. Heritage languages: In the ‘wild’ and in the classroom.
Language and Linguistics Compass, 1(5):368–395.
Quinlan, Thomas, Derrick Higgins, and Susanne Wolff. 2009. Evaluating the construct-coverage
of the E-Rater® scoring engine. ETS Research Report Series, 2009(1):i–35.
Quirk, Chris, Chris Brockett, and William Dolan. 2004. Monolingual machine translation for
paraphrase generation. In Proceedings of the 2004 Conference on Empirical Methods in Natural
Language Processing, pages 142–149.
Ranalli, Jim. 2018. Automated written corrective feedback: how well can students make use of
it? Computer Assisted Language Learning, 31(7):653–674.
Rei, Marek, Mariano Felice, Zheng Yuan, and Ted Briscoe. 2017. Artificial error generation with
machine translation and syntactic patterns. In Proceedings of the 12th Workshop on Innovative
Use of NLP for Building Educational Applications, pages 287–292.
Rothman, Jason and Jennifer Cabrelli Amaro. 2010. What variables condition syntactic transfer?
A look at the L3 initial state. Second Language Research, 26(2):189–218.
Rozovskaya, Alla and Dan Roth. 2014. Building a state-of-the-art grammatical error correction
system. Transactions of the Association for Computational Linguistics, 2:419–434.
Sakaguchi, Keisuke, Courtney Napoles, Matt Post, and Joel Tetreault. 2016. Reassessing the goals
of grammatical error correction: Fluency instead of grammaticality. Transactions of the
Association for Computational Linguistics, 4:169–182.
Shintani, Natsuko. 2016. The effects of computer-mediated synchronous and asynchronous
direct corrective feedback on writing: a case study. Computer Assisted Language Learning,
29(3):517–538.
Sidorov, Grigori, Anubhav Gupta, Martin Tozer, Dolors Catala, Angels Catena, and Sandrine
Fuentes. 2013. Rule-based system for automatic grammar correction using syntactic n-grams
for English language learning (l2). In Proceedings of the Seventeenth Conference on Computational
Natural Language Learning: Shared Task, pages 96–101.
Silva-Corvalán, Carmen. 1994. Language contact and change: Spanish in Los Angeles. ERIC.
Snape, Neal. 2009. Exploring Mandarin Chinese speakers’ L2 article use. Representational deficits
in SLA: Studies in honor of Roger Hawkins, pages 27–51.
StatisticalAtlas.com. Languages in California (State).
Stevenson, Marie and Aek Phakiti. 2014. The effects of computer-generated feedback on the
quality of writing. Assessing Writing, 19:51 – 65. Feedback in Writing: Issues and Challenges.
Tajeddin, Zia, Minoo Alemi, and Roya Pashmforoosh. 2017. Acquisition of Pragmatic Routines
by Learners of L2 English: Investigating Common Errors and Sources of Pragmatic
Fossilization. Tesl-Ej, 21(2):n2.
Tatawy, Mounira. 2002. Corrective feedback in second language acquisition. Studies in Applied
Linguistics and TESOL, 2.
Valdés, Guadalupe. 2000. The teaching of heritage languages: An introduction for
Slavic-teaching professionals. The Learning and Teaching of Slavic Languages and Cultures,
pages 375–403.
Van Deusen-Scholl, Nelleke. 2003. Toward a Definition of Heritage Language: Sociopolitical and
Pedagogical Considerations. Journal of Language, Identity, and Education, 2(3):211–230.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In Advances in Neural
Information Processing Systems, pages 5998–6008.
Vincent, Pascal, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008.
Extracting and composing robust features with denoising autoencoders. In Proceedings of the
25th International Conference on Machine Learning, pages 1096–1103, ACM.
Xia, Menglin, Ekaterina Kochmar, and Ted Briscoe. 2016. Text Readability Assessment for
Second Language Learners. In Proceedings of the 11th Workshop on Innovative Use of NLP for
Building Educational Applications, pages 12–22.
Xie, Ziang, Guillaume Genthial, Stanley Xie, Andrew Ng, and Dan Jurafsky. 2018. Noising and
denoising natural language: Diverse backtranslation for grammar correction. In Proceedings of
the 2018 Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long Papers), pages 619–628.
Xue, Huichao and Rebecca Hwa. 2014. Improved correction detection in revised ESL sentences.
In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume
2: Short Papers), pages 599–604.
Yannakoudakis, Helen, Øistein E Andersen, Ardeshir Geranpayeh, Ted Briscoe, and Diane
Nicholls. 2018. Developing an automated writing placement system for ESL learners. Applied
Measurement in Education, 31(3):251–267.
Zhao, Wei, Liang Wang, Kewei Shen, Ruoyu Jia, and Jingming Liu. 2019. Improving
Grammatical Error Correction via Pre-Training a Copy-Augmented Architecture with
Unlabeled Data. arXiv preprint arXiv:1903.00138.
