Building A Statistical Machine Translation System From Scratch

Building a Statistical Machine Translation System from Scratch:
How Much Bang for the Buck Can We Expect?

Ulrich Germann
USC Information Sciences Institute
4676 Admiralty Way, Suite 1001
Marina del Rey, CA 90292
[email protected]
Abstract 2 Data Collection and Preparation

We report on our experience with building a 2.1 Obtaining Tamil Data
statistical MT system from scratch, includ-
Tamil data is not very difficult to find on the web.
ing the creation of a small parallel Tamil-
There are several Tamil newspapers and magazines
English corpus, and the results of a task-
with online editions, and the large international Tamil
based pilot evaluation of statistical MT sys-
community fosters the use of the Internet for the dis-
tems trained on sets of ca. 1300 and ca.
semination of information. After initial investigation
5000 parallel sentences of Tamil and English
of several web sites we decided to download our exper-
data. Our results show that even with appar-
imental corpus from www.tamilnet.com, a news
ently incomprehensible system output, hu-
site that provides local news on Sri Lanka in both
mans without any knowledge of Tamil can
Tamil and English. The Tamil and English news texts
achieve performance rates as high as 86%
on this site do not seem to be translations of each other.
accuracy for topic identification, 93% recall
The availability of a fairly large in-domain corpus of
for document retrieval, and 64% recall on
local news on Sri Lanka in English (over 2 million
question answering (plus an additional 14%
words) allowed us to train an in-domain English lan-
partially correct answers).
guage model of Sri Lankan news.
1 Introduction 2.2 Encoding and Tokenization

Crises and disasters frequently attract international at- Tamil is written in a phonematic, non-Latin script.
tention to regions of the world that have previously Several encoding schemes exist in parallel. Even
been largely ignored by the international community. though the Unicode standard includes a set of glyphs
While it is possible to stock up on emergency relief for Tamil, it is not widely used in practice. Most web
supplies and, for the worst case, weapons, regardless sites that offer Tamil language material assume Latin-1
of where exactly they are eventually going to be used, encoding and rely on special true type fonts, which of-
this cannot be done with multilingual information pro- ten are also offered for free download at those sites.
cessing technology. This technology will often have to Tamil text is therefore fairly easy to identify on web
be developed after the fact in a quick response to the sites via the face attribute of the HTML font tag. All
given situation. Multilingual data resources for sta- that is necessary is a list of Tamil font names used by
tistical approaches, such as parallel corpora, may not the different sites, and knowledge about which encod-
always be available. ings these fonts implement. While we could restrict
In the fall of 2000, we decided to put the current ourselves to one data source and encoding for our ex-
state of the art to the test with respect to the rapid con- periment, any large-scale system would have to take
struction of a machine translation system from scratch. this into account. In order to make the source text
Within one month, we would recognizable to humans who have no knowledge of
hire translators; Tamil, we decided to work with transliterated text1 .
translate as much text as possible; and
2.3 Translating the Corpus
train a statistical MT system on the data thus cre-
ated. Originally we hoped to be able to create a parallel cor-
pus of about 100,000 words on the Tamil side within
The language of choice was Tamil, which is spoken
in Sri Lanka and in the southern part of India. Tamil is one month, using several translators. Professional
a head-last language with a very rich morphology and 1
Translations, however, were produced from the original
therefore quite different from English. Tamil.
Form Approved
Report Documentation Page OMB No. 0704-0188
Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and
maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information,
including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington
VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if it
does not display a currently valid OMB control number.
1. REPORT DATE 3. DATES COVERED

2. REPORT TYPE
2001 00-00-2001 to 00-00-2001
4. TITLE AND SUBTITLE 5a. CONTRACT NUMBER
Building a Statistical Machine Translation System from Scratch: How 5b. GRANT NUMBER
Much Bang for the Buck Can We Expect?
5c. PROGRAM ELEMENT NUMBER
6. AUTHOR(S) 5d. PROJECT NUMBER
5e. TASK NUMBER
5f. WORK UNIT NUMBER
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 8. PERFORMING ORGANIZATION

REPORT NUMBER
Information Sciences Institute,University of Southern California,4676
Admirality Way Suite 1001,Marina del Rey,CA,90292
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR/MONITOR’S ACRONYM(S)
11. SPONSOR/MONITOR’S REPORT

NUMBER(S)
12. DISTRIBUTION/AVAILABILITY STATEMENT

Approved for public release; distribution unlimited
13. SUPPLEMENTARY NOTES
14. ABSTRACT
15. SUBJECT TERMS
16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF 18. NUMBER 19a. NAME OF
ABSTRACT OF PAGES RESPONSIBLE PERSON
a. REPORT b. ABSTRACT c. THIS PAGE
8
unclassified unclassified unclassified
Standard Form 298 (Rev. 8-98)

Prescribed by ANSI Std Z39-18
translation services in the US currently charge rates of estimated 2.5 person months. However, a good part of
about 30 cents per English word for translations from this effort led to resources that can also be used for
Tamil into English. Given that the English transla- other purposes.
tion of a Tamil text usually contains about 1.2 times
as many words as the Tamil original, the translation of 2.4 Lessons Learned for Future Projects
a corpus of 100,000 Tamil words would cost approxi- If we were to give advice for future, similar projects,
mately USD 36,000. This was far beyond our budget. we would emphasize and recommend the following:
In India, by comparison, raw translations may cost
2.4.1 Good translators are not easy to find
as little as one cent per Tamil word2 . However, out-
sourcing the translation work abroad was not feasible It is difficult to find good translators for a short-
for us, since we had neither the administrative infras- term commitment. Unless one is willing to pay a pre-
tructure nor the time to manage such an effort. Also, mium price, it is unlikely that one will find profes-
working with partners so remote would have made it sional translators who are willing to commit much of
very difficult to communicate our exact needs and to their time for a limited period of time and on short no-
implement proper quality control. tice.
We finally decided to hire as translators four enter- 2.4.2 Make the translation job attractive
ing and second-year graduate students in the depart- As foreign students, our translators would each have
ment of engineering whose native language is Tamil been allowed to work up to twenty hours per week.
and who had responded to an ad posted in the local None of them did, because the work was frustrating
mailing list for students from India. and boring, and because they found more attractive,
In order to manage the corpus translation process, long term employment on campus. Our translators’
we set up a web interface through which the transla- frustration may have been fostered by several factors:
tors could retrieve source texts and upload their trans-
lations, post-editors could post-edit text online, the the differences between Sri Lankan Tamil (the va-
project progress could be monitored, and all incoming riety used in our corpus) and the Tamil spoken in
text was available to other project members as soon as Southern India (the native language of our trans-
it was submitted. lators), which made translating, according to our
We originally assumed that translators would be translators, very difficult;
able to translate about 500 words per hour if we were
the lack of translation experience of our transla-
content with raw translations and hardly any format-
tors; and
ting, and if we allowed them to skip difficult words
or sentences. This estimate was based on an inter- our high expectations. We originally told our
nal evaluation, in which multilingual members of our translators that since they were not working on
group translated sample documents from their native site, we would expect the translation of 500
language (Arabic, German, Romanian) into English words per hour reported. When we later switched
and kept track of the time they spent on this. to hourly pay regardless of translation volume,
It turned out that our expectations were very much the translation volume picked up slightly.
exaggerated with both respect to translation speed and
the quality of translation. The actual translation speed 2.4.3 Be prepared to post-edit
for Tamil varied between 156 and 247 words per hour In professional translating, translators typically
with an average of 170 words per hour. In 139 hours of translate into their native language only. One may not
reported translation time (over a period of eventually 6 be able to find translators with English as their na-
weeks), about 24,000 words / 1,300 sentences of Tamil tive language for low density or “small” languages,
text were translated, at an effective cost of ca. 10.8 so it may be necessary to have the translations post-
cents per Tamil word (translators’ compensation plus edited by people with greater language proficiency in
administrative overhead). This figure does not include English.
the effort for manually post-editing the translations by 2.4.4 Have translators and post-editors work on
a native speaker of English (12-16 person hours). site
The overall organization of the project (source data
It is better to have translators and post-editors work
retrieval, hiring and management of the translators,
on site and ideally as teams, so that they can resolve
design and implementation of the web interface for
ambiguities and misunderstandings immediately with-
managing the project via the Internet, development of
out the delays of communicating indirectly, be it by
transliterator and stemmer, etc.) required an additional
email or other means. A post-editor who does not
2
Personal communications with Thomas Malten, University know the source language may misinterpret the trans-
of Cologne. lator, as the following case from our corpus illustrates:
100 100
80 80
text coverage (in %)

60 60
40 40
20 20
0 0
0 20 40 60 80 100 120 0 20 40 60 80 100 120
corpus size (in thousand tokens) corpus size (in thousand tokens)
English Tamil
Figure 1: Text coverage on previously unseen text for English (left) and Tamil (right). The upper line in each
graph shows the coverage by tokens that have been seen at least once, the lower line shows the coverage by tokens
that have been seen at least 5 times. The error bars indicate standard deviation.
ments of 1000 tokens each. For each segment nk ,

Raw translation: Information about the schools in we computed how many of the tokens had been pre-
which people who migrated to Kudaanadu are staying viously seen in the segments n1 : : : nk 1 . The upper
is being gathered. line in the graphs shows the percentage of tokens in nk
Post-edited version: Information about the schools in that had occurred at least once before in the segments
(sic!) which immigrants to Kudaanadu are attending n1 : : : nk 1 , the lower line shows the percentage of to-
is being gathered. kens that had been seen at least five times before.
For the purpose of statistical NLP, it seems reason-
In this case, the post-editor clearly misinterpreted the able to assume that the lower curve gives a better indi-
translator. What the translator meant to and actually cation of how many percent of previously unseen text
did say is that information was being gathered about we can expect to be “known” to a statistical model
the schools in which migrants/war refugees who had trained on a corpus of m tokens.
arrived in Kudaanadu had found shelter. However, At a corpus size of 24,000 tokens, which is approx-
the post-editor interpreted the phrase people who mi- imately the size of the parallel corpus we were able to
grated to Kudaana as describing immigrants and as- create during our experiment, about 28% of all word
sumed that information was being gathered about their forms in previously unseen Sri Lankan Tamil text can-
education rather than their housing. not be found in the corpus, and 50% have been seen
less than 5 times. In other words, if we train a system
3 Evaluation Experiments on this data, we can expect it to stumble over every
other word! At a corpus size of 100,000 tokens, the
3.1 A Priori Considerations numbers are 17% and 33%.
The richer the morphology of a language is, the greater For English, the numbers are 9%/23% for a corpus
is the total number of distinct word forms that a given of 24K tokens and 0%/8% for a corpus of 100K to-
corpus consists of, and the smaller is the probability kens.
that a certain word form actually occurs in any given In order to boost the text coverage we built a simple
text segment. Figure 1 shows the percentage of word text stemmer for Tamil, based on the Tamil inflection
forms in unseen text that have occurred in previously tables in Steever (1990) and some additional inspec-
seen text as a function of the amount of previously tion of our parallel corpus. The stemmer uses regular
seen text. The graph on the left shows the curves for expression matching to cut off inflectional endings and
English, the one on the right the curves for Sri Lankan introduce some extra tokens for negation and certain
Tamil. The graphs show the averages of 100 runs on case markings (such as locative and genitive), which
different text fragments; the error bars indicate stan- are all marked morphologically in Tamil. It should be
dard deviation. noted that the stemmer is far from perfect and was only
The numbers were computed in the following man- intended to be an interim solution. The performance
ner: A corpus of 120,000 tokens was split into seg- increases are displayed in Figure 2. For a corpus size
100 methods, namely glossing (replacing each Tamil word
by the most likely candidate from the translation tables
80 created with the E GYPT toolkit) and Model 4 decoding
(Brown et al., 1995; Germann et al., 2001).
Figure 3 shows the output of the different systems

60
in comparison with the human translation.
We then conducted the following experiments.
40
3.2.1 Document Classification Task
Seven human subjects without any knowledge of
20
Tamil were given translations of a set of 15 texts (all
from the Berkeley corpus) and asked to categorize
0 them according to the following topic hierarchy:
0 20 40 60 80 100 120
corpus size (in thousand tokens)
News about Sri Lanka
Figure 2: Text coverage increase by stemming for Reports about clashes between the Sri
Tamil. The solid lines indicate text coverage for un- Lankan army and the Liberation Tigers
stemmed data (seen at least once and at least five times, Sri Lankan security-related news (arrests,
respectively), the dashed lines the text coverage for arms deals, etc.)
stemmed data. Sri Lankan political news (strikes, transport,
telecom)
concerns Sri Lanka but doesn’t fit any of the
of 24K tokens, the percentages of unknown items drop above
to 19% (from 28%; never seen before) and 36% (from News about Pakistan/India
50%; seen less than 5 times). For a training corpus
of 100K tokens, the numbers are 12% and 23% (from Nuclear tests in Pakistan and India, includ-
17%/33%). ing their aftermath (international reactions,
etc.)
3.2 Task-Based Pilot Evaluation Corruption investigation against Benazir
Buto
Given these numbers, it is obvious that one cannot ex-
News about Pakistan/India but none of the
pect much performance from a system that relies on
above
models trained on only 24K tokens of data. As a mat-
ter of fact, it is close to impossible to make any sense International news
whatsoever of the output of such a system (cf. Fig. 3). Disasters, accidents
To get an estimate of the performance with more Nelson Mandela’s birthday
training data, we augmented our corpus with a paral- Other international news
lel corpus of international news texts in Southern In- Impossible to tell
dian Tamil which was made available to us by Fred
Gey of the University of California at Berkeley (hence- Except for one duplicate set, each subject received a
forth: Berkeley corpus). This corpus contains ca. different set of translations. The sets differed in train-
3,800 sentence pairs with 75,800 Tamil tokens after ing parameters and the translation method used. Ta-
stemming (before stemming: 60,000; the difference ble 1 shows the results of this evaluation. The dif-
is due to the introduction of additional markers dur- ference between the subjects 5a and 5b, who received
ing stemming). Some of the parallel data was with- the same set of translations, suggests that the individ-
held for system evaluation; the augmented training ual classifiers’ accuracy influences the results so much
corpus (Berkeley and TamilNet corpus; short B+TN) as to blur the effect of the other parameters. There
had a size of 85K tokens on the Tamil side. The aug- seems to be a tendency for glossing to work better than
mented training corpus had a text coverage of 81% Model 4 decoding. Glossing, in our system, is a simple
(seen at least once; 75% without augmentation), and base line algorithm that provides the most likely word
67% (seen at least 5 times; 60% without augmenta- translation for each word of input. Translation can-
tion), respectively, for Sri Lankan Tamil. We trained didates and their probabilities are retrieved from the
IBM Translation Model 4 (Brown et al., 1993) both on translation table, which is part of the translation model
our corpus alone and on the augmented corpus, using trained on the parallel corpus.
the E GYPT toolkit (Knight et al., 1999; Al-Onaizan et The document classification test is foremost and
al., 1999), and then translated a number of texts us- above all a measure of the quality of the translation ta-
ing different translation models and different transfer ble for frequently occurring words. In practice, actual
Small Corpus Augmented Corpus
24K tokens (Tamil) of training data 85K tokens (Tamil) of training data
Human translation Gloss Model 4 decodinga Gloss Model 4 decoding
Government - United vanni united national government united

children united the government with
National Party party kalloya national party meet
national party united national party
meeting will not take tomorrow day
tomorrow . meets day .
place tomorrow. naTaipeRamaaTTa naTaipeRamaaTTa
The proposed friday progressive it throughout

meeting tomorrow, and united national progressive
Thursday, between same kaTcikkumiTai government the
the Peoples Front tomorrow thursday on friday health united national in throughout the
Government and the naTaipeRaviru new mothers united kaTcikkumiTai day government and the
United National political vaappu national tomorrow on thursday leno new united national party
Party regarding the information kalloya thursday new political vaappu . day thursday leno
New Political piRitoru political found on found meet piRitoru new political found
Document was tin2attiRkupa information . tin2attiRkupa meet movies .
announced as pin2pooTappaTT- pin2pooTappaTT-
postponed to a later uLLataaka found uLLataaka movies
date. is .
of lankan president
Presidential of lankan to and chandrika sankari
Secretariat sources freedom tamilian freedom now the sri lankan forces
say that this meeting now ’s lanka to advisor veLinaaTTukkuc of president
was postponed as the veLinaaTTukkuc freedom of the cen2Riruppat chandrika sankari
Sri Lankan President cen2Riruppat tamilian team to iccantippu freedom presently
Chandrika iccantippu secretariat is called piRpooTappaTT- secretariat circles
Bandaranayakka has piRpooTappaTT- minister . uLLataaka president reported .
gone to a foreign uLLataaka to secretariat circles
country. secretariat in * is reported .
At the same time it is

to be noted that the
meeting of the sub
he countries
committee that about returning returning the
examining the and iccantippu her about the military his country and the
iccantippu
political document naTaipeRumen2a and they were returning to increase
naTaipeRumen2a the
between the United and ivvaTTaaram announced . the radio .
ivvaTTaaram more
National Party and more and * is the reported .
the Government was
held yesterday
Tuesday evening
a
Brown et al. (1993); Brown et al. (1995)
Figure 3: Sample output of various systems

Table 1: Results of the Document Classification Task. Test subjects were asked to classify the translations of 15
documents into 4 major and 11 minor categories.
partially
input pegginga ? transfer correct incorrect
correctb
c
1 raw no M4 decoding 7 4 4
2 stemmed yes M4 decoding 8 3 4
3 stemmed no M4 decoding 13 2 0
4 raw no gloss 13 1 1
5a stemmed yes gloss 8 3 4
5b stemmed yes gloss 12 2 1
6 stemmed no gloss 11 2 2
a
pegging causes the training algorithm to consider a larger search space
b
correct top level category but incorrect sub-category
c
translation by maximizing the IBM Model 4 probability of the source/translation pair (Brown
et al., 1993; Brown et al., 1995)
classification might be performed by automatic pro- nificant manner. We were surprised that even with
cedures rather than humans. If we dare to accept the the poor translation performance of our system, re-
top performances of our human subjects as the ten- calls as high as 93% at a precision of 88% could be
tative upper bound of what can be achieved with the achieved. Secondly, the data shows that gaps are not
current system using a translation model trained on randomly distributed over the data, but that some ques-
85K tokens of Tamil text and the corresponding En- tions clearly seem to have been more difficult than oth-
glish translations, we can conclude that the classifica- ers. One of the particular difficult aspects of the task
tion accuracy can exceed 86% (13/15) for fine-grained was the spelling of names. Question 11, for exam-
classification and reach 100% for coarse-grained clas- ple, asked What happened to Chandra Kumar Abayas-
sification. However, given the extremely small sample ingh?. In the translations, however, it was rendered in
size in this evaluation, the evidence should not be con- simple transliteration: cantirakumaara apayacingka.
sidered conclusive. It requires a considerable degree of tenacity and imag-
ination to find this connection.
3.2.2 Document Retrieval Task
The document retrieval task and the question an- 3.2.3 Question Answering Task
swering task (see below) were combined into one task. In order to measure the performance in the ques-
The subjects received 14 texts (from the TamilNet cor- tion answering part of this evaluation, we considered
pus) and 15 lead questions plus 13 additional follow- only questions relevant to the documents that the test
up questions. Their task was to identify the docu- subjects had identified correctly. Because of the diffi-
ment(s) that contain(s) the answer to the question and culty of the task, we were lenient to some degree in the
to answer the questions asked. Typical lead questions evaluation. For example, if the correct answer was the
were questions such as What is the security situation former president of the teacher’s union and the answer
in Trinconmalee?, or Who is S. Thivakarasa?; typi- given was an official of the teacher’s union, we still
cal follow-up questions were Who is in control of the counted this as “close enough” and therefore correct.
situation?, What happened to him/her?, or How did In addition, we also allowed partially correct answers,
(other) people react to what happened?. As in the pre- that is, answers that went into the right direction but
vious experiment, each subject received the output of were not quite correct. For example, if the correct an-
a different system. swer was The army imposed a curfew on fishing, we
Table 2 shows the result of the document retrieval counted the answer the army is stopping fishing boats
task. Again, the sample size was too small to draw as partially correct. All in all, it was very difficult to
any final conclusions, but our results seem to suggest evaluate this section of the task, because it was often
the following. Firstly, the test subject in the group close to impossible to determine whether the answer
dealing with output of systems trained on the bigger was just an educated guess or actually based on the
training set tend to perform better than the ones deal- text. There were some cases where answers were par-
ing with the results of training on less data. This sug- tially or even fully correct even though the correct doc-
gests that the jump from 24K to 85K tokens of train- ument had not been identified. In retrospect we con-
ing data might improve system performance in a sig- clude that it would have been better to have the test
Table 2: Recall and precision on the document retrieval task. Test subjects were asked to identify the document(s)
containing the answers to 15 lead questions. Black dots indicate successful identification of at least one document
containing the answer.
training transl.
corpus method 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 recall precision
1 TNa glossing 67% 79%
2 TN M4dec.b 53% 80%
3 TN bothc 60% 48%
4 TNd both 67% 79%
5 B+TN.1 glossing 80% 87%
6 B+TN.1 M4dec. 67% 86%
7 B+TN.1 both 80% 86%
8 B+TN.1f both 60% 73%
9 B+TN.1g both 93% 88%
10 B+TN.2h both 87% 75%
Human Translations 100% 100%
a
TamilNet corpus only; stemmed; 1291 aligned text chunks; 23,359 tokens on Tamil side; 1000 training iterations.
b
IBM Model 4 decoding.
c
Both glossing and IBM Model 4 decoding were available to the test subject.
d
same as above, but trained with pegging option (more thorough search during training); 10 training iterations.
e
Berkeley and TamilNet corpora; 5069 aligned text chunks; 85421 tokens on Tamil side; 100 training iterations.
f
same as above; 10 training iterations.
g
same as above, trained with pegging option; 10 training iterations.
h
Berkeley and TamilNet Corpora, raw (unstemmed); 64439 tokens on Tamil side, 50 training iterations.
subjects mark up those text passages in the text that

Table 3: Accuracy on question answering. The test
justify their answers.
sets are the same as in Table 2. Only questions con-
Again, the data suggests that the difference in train-
cerning documents that were identified correctly were
ing corpus size does affect the amount of information
considered in this evaluation.
that is available from the system output. Subjects us-
ing output of a system based on a translation model test training No. of rele- partially
that was trained on only the TamilNet data tend to per- set corpus vant questions
correct correct
form worse than subjects using output from a system 1 TN 17 2 = 12% 0= 0%
based on a translation model trained on the larger cor- 2 TN 16 3 = 19% 4= 25%
pus. The poor performance on test set No. 6 may 3 TN 18 2 = 11% 0= 0%
suggest that for this task and at this level of transla- 4 TN 17 7 = 41% 5= 19%
tion quality, glossing provides more informative out- 4 B+TN.1 23 14 = 61% 6= 26%
put than Model 4 decoding. This result is not partic- 6 B+TN.1 19 5 = 26% 3= 16%
ularly surprising, since we noticed that Model 4 de- 7 B+TN.1 22 14 = 64% 3= 14%
coding tends to leave out more words than acceptable. 8 B+TN.1 19 12 = 63% 1= 5%
Clearly, this is one area where the translation model 9 B+TN.1 26 14 = 54% 5= 19%
has to be improved. 10 B+TN.2 24 8 = 33% 9= 38%
Test set 10 is the only set produced by a system human 28 24 = 86% 2= 7%
using a translation model trained on raw, unstemmed
data. It is unclear whether the poor performance on
question answering for this test set is due to a princi-
pally worse translation quality or the (lack of) tenacity 14% partially correct answers!). However, using a
and willingness of the test subject to work her way system such as the one discussed in the paper is not
through the system output. an option for actual information processing. Espe-
All in all, we were astonished by the amount of in- cially those subjects that had to deal with the output
formation that our test subjects were able to retrieve of systems trained on the smaller corpus experienced
from the material they received (the top recall for the the task as utterly frustrating and would not want to do
question answering task is 64%, plus an additional it again.
4 Conclusions performance improvement when the MT systems
are scaled up? One approach may show rapid
We have reported on our experience with rapidly improvements initially but also reach a plateau
building a statistical MT system from scratch. Within quickly, whereas another may show slow but
ca. 140 translator hours, we were able to create a par- steady improvements.
allel corpus of about 1300 sentence pairs with 24,000 Is there any potential for bootstrapping the re-
tokens on the Tamil side, at an average translation rate source creation process by using knowledge that
of approximately 170 Tamil words per hour. can be extracted from little and poor data to speed
Very clearly, the effort needed to create parallel data up the creation of more and better data?
is one of the biggest obstacles to the rapid development
These are some of the the questions that will need
of statistical MT systems for new languages.
to be addressed in future research on Quick MT.
With the output of a system which uses a translation
model trained on the small amount of parallel data that 5 Acknowledgments
we created during the course of our experiment, hu-
man test subjects achieved a recall of over 50% on This research has been funded by the DARPA TIDES
the document retrieval task but generally performed program under grant No. N66001-00-1-8914. We
poorly on question answering (less than 20%). would like to thank our translators as well as Fred
The addition of an additional corpus of 3,800 sen- Gey of the University of California at Berkeley and
tence pairs allowed us to estimate the benefits of in- Thomas Malten of the University of Cologne for their
creasing the overall corpus size by roughly 300%. kind support of our investigations into Tamil, and our
Based on our experience with translating the TamilNet test subjects for their tenacity and patience during this
corpus, this additional effort would require an addi- experiment.
tional 450 translator and 36 to 48 post-editor hours.
With the additional training data, we were able to References
produce output that increased the performance on our
Yaser Al-Onaizan, David Purdy, Jan Curin, Michael
evaluation tasks (document retrieval and question an- Jahr, Kevin Knight, John Lafferty, Dan Melamed,
swering) to up to 93% for document retrieval and 64% Noah A. Smith, Franz Josef Och, and David
for question answering. Yarowsky. 1999. Statistical machine translation.
With respect to the scenario of “MT in a month”, Final report, Center for Language and Speech Pro-
we can now make the following calculation: If we as- cessing, John Hopkins University.
sume that the average translator translates at a rate of Peter F. Brown, Stephen A. Della Pietra, Vincent J.
170 words/hour and is able to spend 6-7 hours per day Della Pietra, and Robert L. Mercer. 1993. The
on actual translations, then a translator can translate mathematics of statistical machine translation: Pa-
about 1000-1200 words per day. In order to translate a rameter estimation. Computational Linguistics,
corpus of 100,000 words within one month (assuming 19(2):263–311.
a five-day work week), we therefore need four to five
Peter Brown, John Cocke, Stephen Della Pietra, Vin-
full time translators. For this effort, we can expect
cent Della Pietra, Frederick Jelinek, Jennifer Lai,
a translation system whose performance resembles the and Robert Mercer. 1995. Method and system for
one shown in our evaluation. natural language translation. U.S. Patent 5,477,451,
This, of course, raises the following questions, Dec 19.
which we are only able to ask but not to answer at this
point. Ulrich Germann, Michael Jahr, Kevin Knight, Daniel
Marcu, and Kenji Yamada. 2001. Fast decoding
Can the translation model and the algorithms for and optimal decoding for MT. In 39th Annual Meet-
statistical training be improved so that they re- ing of the Association for Computational Linguis-
quire less data to produce acceptable results? tics (ACL-2001).
Are there more efficient uses of scarce resources Kevin Knight, Yaser Al-Onaizan, David Purdy, Jan
(such as language experts and translators) for Curin, Michael Jahr, John Lafferty, Dan Melamed,
building a statistical (or any other) MT system Noah Smith, Franz Josef Och, and David Yarowsky.
quickly, for example the creation of less but more 1999. EGYPT: a statistical machine translation
informative data, e.g. a parallel corpus with toolkit. http://www.clsp.jhu.edu/ws99/
alignments on the word level, or the compilation projects/mt/.
of a glossary/dictionary of the most frequently Sanford B. Steever. 1990. Tamil and the Dravidian
used terms? languages. In Bernard Comrie, editor, The World’s
How do the various approaches compare with re- Major Languages, pages 725–746. Oxford Univer-
spect to the ratio of construction effort versus sity Press, New York, Oxford.

Building A Statistical Machine Translation System From Scratch

Uploaded by

Copyright:

Available Formats

Building A Statistical Machine Translation System From Scratch

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Building A Statistical Machine Translation System From Scratch

Uploaded by

Copyright:

Available Formats

Building a Statistical Machine Translation System from Scratch:

How Much Bang for the Buck Can We Expect?

Abstract 2 Data Collection and Preparation

1 Introduction 2.2 Encoding and Tokenization

1. REPORT DATE 3. DATES COVERED

6. AUTHOR(S) 5d. PROJECT NUMBER

5e. TASK NUMBER

5f. WORK UNIT NUMBER

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 8. PERFORMING ORGANIZATION

11. SPONSOR/MONITOR’S REPORT

12. DISTRIBUTION/AVAILABILITY STATEMENT

15. SUBJECT TERMS

Standard Form 298 (Rev. 8-98)

text coverage (in %)

ments of 1000 tokens each. For each segment nk ,

Figure 3 shows the output of the different systems

Government - United vanni united national government united

The proposed friday progressive it throughout

At the same time it is

Figure 3: Sample output of various systems

subjects mark up those text passages in the text that

You might also like