Building A Statistical Machine Translation System From Scratch
[Figure 1: two panels of text-coverage curves; y-axis: text coverage (in %); x-axis: corpus size (in thousand tokens); left panel: English, right panel: Tamil.]
Figure 1: Text coverage on previously unseen text for English (left) and Tamil (right). The upper line in each graph shows the coverage by tokens that have been seen at least once; the lower line shows the coverage by tokens that have been seen at least 5 times. The error bars indicate standard deviation.
is being gathered.

In this case, the post-editor clearly misinterpreted the translator. What the translator meant to say, and actually did say, is that information was being gathered about the schools in which the migrants/war refugees who had arrived in Kudaanadu had found shelter. The post-editor, however, interpreted the phrase people who migrated to Kudaana as describing immigrants and assumed that information was being gathered about their education rather than their housing.

3 Evaluation Experiments

3.1 A Priori Considerations

The richer the morphology of a language, the greater the total number of distinct word forms a given corpus consists of, and the smaller the probability that any particular word form actually occurs in a given text segment. Figure 1 shows the percentage of word forms in unseen text that have occurred in previously seen text, as a function of the amount of previously seen text. The graph on the left shows the curves for English, the one on the right the curves for Sri Lankan Tamil. The graphs show the averages of 100 runs on different text fragments; the error bars indicate standard deviation.

The numbers were computed in the following manner: a corpus of 120,000 tokens was split into segments; for each corpus size, we then measured what percentage of the tokens in held-out text had been seen at least once, and what percentage consisted of tokens that had been seen at least five times before.
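The exact procedure is not fully recoverable from this copy of the paper; as a minimal sketch under the assumptions above (randomly placed held-out segments, averaged over 100 runs), the two curves could be computed as follows:

    import random
    from collections import Counter

    def coverage_curves(tokens, sizes, test_size=5000, runs=100, seed=0):
        """For each training size m, estimate the mean percentage of held-out
        tokens seen at least once / at least five times in m training tokens."""
        rng = random.Random(seed)
        sums = {m: [0.0, 0.0] for m in sizes}
        for _ in range(runs):
            # Hold out one randomly placed segment as the "unseen" text.
            start = rng.randrange(len(tokens) - test_size)
            test = tokens[start:start + test_size]
            train = tokens[:start] + tokens[start + test_size:]
            for m in sizes:
                counts = Counter(train[:m])  # the "previously seen" corpus
                seen_once = sum(counts[t] >= 1 for t in test)
                seen_five = sum(counts[t] >= 5 for t in test)
                sums[m][0] += 100.0 * seen_once / len(test)
                sums[m][1] += 100.0 * seen_five / len(test)
        return {m: (s1 / runs, s5 / runs) for m, (s1, s5) in sums.items()}

Plotting the two returned series against m for English and for Tamil text would produce curves of the kind shown in Figure 1.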
For the purpose of statistical NLP, it seems reasonable to assume that the lower curve gives a better indication of what percentage of previously unseen text we can expect to be "known" to a statistical model trained on a corpus of m tokens.

At a corpus size of 24,000 tokens, which is approximately the size of the parallel corpus we were able to create during our experiment, about 28% of all word forms in previously unseen Sri Lankan Tamil text cannot be found in the corpus, and 50% have been seen fewer than five times. In other words, if we train a system on this data, we can expect it to stumble over every other word! At a corpus size of 100,000 tokens, the numbers are 17% and 33%. For English, the numbers are 9%/23% for a corpus of 24K tokens and 0%/8% for a corpus of 100K tokens.

In order to boost the text coverage, we built a simple text stemmer for Tamil, based on the Tamil inflection tables in Steever (1990) and some additional inspection of our parallel corpus. The stemmer uses regular-expression matching to cut off inflectional endings and to introduce extra tokens for negation and certain case markings (such as locative and genitive), which are all marked morphologically in Tamil. It should be noted that the stemmer is far from perfect and was intended only as an interim solution.
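To make the idea concrete, here is a toy version of such a regex stemmer. The endings and markers below are illustrative stand-ins only; the actual rules were derived from the inflection tables in Steever (1990) and from the corpus:

    import re

    # Illustrative rules only; the real stemmer's endings come from the Tamil
    # inflection tables in Steever (1990) plus inspection of the parallel corpus.
    # Each rule cuts off an inflectional ending and may emit an extra token
    # (e.g. +LOC, +GEN, +NEG) so the morphological marking is preserved.
    RULES = [
        (re.compile(r"^(.{3,})illai$"), r"\1 +NEG"),  # negation
        (re.compile(r"^(.{3,})il$"),    r"\1 +LOC"),  # locative case
        (re.compile(r"^(.{3,})in$"),    r"\1 +GEN"),  # genitive case
        (re.compile(r"^(.{2,})kal$"),   r"\1"),       # plural ending, dropped
    ]

    def stem_token(token):
        """Apply the first matching rule; leave unmatched tokens untouched."""
        for pattern, replacement in RULES:
            if pattern.match(token):
                return pattern.sub(replacement, token)
        return token

    def stem_text(text):
        return " ".join(stem_token(t) for t in text.split())

Cutting off endings in this way collapses many inflected forms onto one stem, which is exactly what raises the coverage figures above.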
The performance increases are displayed in Figure 2. For a corpus size [...]

[Figure 2: text coverage (in %); other details of the figure not recovered.]

[...] methods, namely glossing (replacing each Tamil word by the most likely candidate from the translation tables created with the EGYPT toolkit) and Model 4 decoding (Brown et al., 1995; Germann et al., 2001).
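Of the two, glossing is the simpler: it is a word-by-word argmax lookup in the translation table. A minimal sketch, assuming the t-table has already been loaded into a plain dictionary (the format here is our assumption, not EGYPT's actual file layout):

    def gloss(tamil_sentence, ttable):
        """Gloss a sentence word by word: replace each token with its most
        probable English candidate; pass unknown tokens through unchanged."""
        words = []
        for token in tamil_sentence.split():
            candidates = ttable.get(token)  # e.g. {"school": 0.4, "schools": 0.2}
            words.append(max(candidates, key=candidates.get) if candidates else token)
        return " ".join(words)

Model 4 decoding, by contrast, searches for the English string that maximizes the IBM Model 4 probability of the source/translation pair, which is far more expensive.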
[Table 1: results of the document classification task; original caption not recovered.]

         input     pegging?^a   transfer       correct   partially correct^b   incorrect
    1    raw       no           M4 decoding^c     7               4                 4
    2    stemmed   yes          M4 decoding       8               3                 4
    3    stemmed   no           M4 decoding      13               2                 0
    4    raw       no           gloss            13               1                 1
    5a   stemmed   yes          gloss             8               3                 4
    5b   stemmed   yes          gloss            12               2                 1
    6    stemmed   no           gloss            11               2                 2

    ^a Pegging causes the training algorithm to consider a larger search space.
    ^b Correct top-level category but incorrect sub-category.
    ^c Translation by maximizing the IBM Model 4 probability of the source/translation pair (Brown et al., 1993; Brown et al., 1995).
classification might be performed by automatic procedures rather than humans. If we dare to accept the top performances of our human subjects as a tentative upper bound on what can be achieved with the current system, using a translation model trained on 85K tokens of Tamil text and the corresponding English translations, we can conclude that classification accuracy can exceed 86% (13/15) for fine-grained classification and reach 100% for coarse-grained classification. However, given the extremely small sample size in this evaluation, the evidence should not be considered conclusive.

3.2.2 Document Retrieval Task
The document retrieval task and the question answering task (see below) were combined into one task. The subjects received 14 texts (from the TamilNet corpus) and 15 lead questions plus 13 additional follow-up questions. Their task was to identify the document(s) containing the answer to each question and to answer the questions asked. Typical lead questions were What is the security situation in Trincomalee? or Who is S. Thivakarasa?; typical follow-up questions were Who is in control of the situation?, What happened to him/her?, or How did (other) people react to what happened?. As in the previous experiment, each subject received the output of a different system.

Table 2 shows the results of the document retrieval task. Again, the sample size was too small to draw any final conclusions, but our results seem to suggest the following. Firstly, the test subjects in the group dealing with the output of systems trained on the bigger training set tended to perform better than those dealing with the results of training on less data. This suggests that the jump from 24K to 85K tokens of training data might improve system performance in a significant manner. We were surprised that even with the poor translation performance of our system, recall as high as 93% at a precision of 88% could be achieved.
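The paper does not spell out the formulas behind these figures; assuming the standard set-based definitions over identified and answer-bearing documents, they would be computed as:

    def retrieval_scores(identified, relevant):
        """Set-based recall and precision for one test subject.

        identified: set of document IDs the subject marked as containing answers
        relevant:   set of document IDs that actually contain an answer
        """
        hits = len(identified & relevant)
        recall = hits / len(relevant) if relevant else 0.0
        precision = hits / len(identified) if identified else 0.0
        return recall, precision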
Secondly, the data show that gaps are not randomly distributed over the data: some questions clearly seem to have been more difficult than others. One particularly difficult aspect of the task was the spelling of names. Question 11, for example, asked What happened to Chandra Kumar Abayasingh?. In the translations, however, the name was rendered in simple transliteration: cantirakumaara apayacingka. It requires a considerable degree of tenacity and imagination to find this connection.
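Bridging such spelling gaps is the kind of problem crude string similarity can help with. As a purely hypothetical illustration (not part of the system described here), matching over normalized spellings already brings the two forms close:

    from difflib import SequenceMatcher

    def name_similarity(a, b):
        """Crude similarity between two spellings of a name, ignoring case,
        spaces, and punctuation."""
        norm = lambda s: "".join(ch for ch in s.lower() if ch.isalpha())
        return SequenceMatcher(None, norm(a), norm(b)).ratio()

    # name_similarity("cantirakumaara apayacingka", "Chandra Kumar Abayasingh")
    # tends to score well above typical unrelated name pairs.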
training set tend to perform better than the ones deal- text. There were some cases where answers were par-
ing with the results of training on less data. This sug- tially or even fully correct even though the correct doc-
gests that the jump from 24K to 85K tokens of train- ument had not been identified. In retrospect we con-
ing data might improve system performance in a sig- clude that it would have been better to have the test
Table 2: Recall and precision on the document retrieval task. Test subjects were asked to identify the document(s) containing the answers to 15 lead questions. Black dots indicate successful identification of at least one document containing the answer. [The per-question dot columns (1-15) did not survive extraction and are omitted here.]

         training corpus       transl. method    recall   precision
     1   TN^a                  glossing           67%       79%
     2   TN                    M4 dec.^b          53%       80%
     3   TN                    both^c             60%       48%
     4   TN^d                  both               67%       79%
     5   B+TN.1^e              glossing           80%       87%
     6   B+TN.1                M4 dec.            67%       86%
     7   B+TN.1                both               80%       86%
     8   B+TN.1^f              both               60%       73%
     9   B+TN.1^g              both               93%       88%
    10   B+TN.2^h              both               87%       75%
         Human Translations                      100%      100%

    ^a TamilNet corpus only; stemmed; 1,291 aligned text chunks; 23,359 tokens on the Tamil side; 1,000 training iterations.
    ^b IBM Model 4 decoding.
    ^c Both glossing and IBM Model 4 decoding were available to the test subject.
    ^d Same as above, but trained with the pegging option (more thorough search during training); 10 training iterations.
    ^e Berkeley and TamilNet corpora; 5,069 aligned text chunks; 85,421 tokens on the Tamil side; 100 training iterations.
    ^f Same as above; 10 training iterations.
    ^g Same as above, trained with the pegging option; 10 training iterations.
    ^h Berkeley and TamilNet corpora, raw (unstemmed); 64,439 tokens on the Tamil side; 50 training iterations.