An Omnifont Open-Vocabulary OCR System For English and Arabic


IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 21, NO. 6, JUNE 1999, p. 495


An Omnifont Open-Vocabulary OCR System
for English and Arabic
Issam Bazzi, Richard Schwartz, and John Makhoul, Fellow, IEEE
Abstract: We present an omnifont, unlimited-vocabulary OCR system for English and Arabic. The system is based on Hidden
Markov Models (HMM), an approach that has proven to be very successful in the area of automatic speech recognition. In this paper
we focus on two aspects of the OCR system. First, we address the issue of how to perform OCR on omnifont and multi-style data,
such as plain and italic, without the need to have a separate model for each style. The amount of training data from each style, which
is used to train a single model, becomes an important issue in the face of the conditional independence assumption inherent in the
use of HMMs. We demonstrate mathematically and empirically how to allocate training data among the different styles to alleviate
this problem. Second, we show how to use a word-based HMM system to perform character recognition with unlimited vocabulary.
The method includes the use of a trigram language model on character sequences. Using all these techniques, we have achieved
character error rates of 1.1 percent on data from the University of Washington English Document Image Database and 3.3 percent
on data from the DARPA Arabic OCR Corpus.
Index Terms: Optical character recognition, speech recognition, Hidden Markov Models, omnifont OCR, language modeling,
Arabic OCR, segmentation-free recognition.

1 INTRODUCTION
The introduction of Hidden Markov Models (HMM) to
the area of automatic speech recognition has brought
several useful aspects to this technology, some of which are:
language-independent training and recognition methodol-
ogy; no separate segmentation is required at the phoneme
and word levels; and automatic training on non-segmented
data. In previous papers [13], [18], we presented a method
for using existing continuous speech recognition technol-
ogy for OCR. After a line-finding stage, followed by a sim-
ple feature-extraction stage, the system utilizes the BBN
BYBLOS continuous speech recognition system [15] to per-
form the training and recognition.
In this paper, we present techniques for handling multi-
ple print styles using a single HMM model for each char-
acter. It had been assumed that a natural mix of data from
different fonts and styles is best for training a recognition
system, as it leads to a matched condition between training
and test. We show how the HMM conditional independ-
ence assumption leads to less than optimal recognition ac-
curacy when a natural mix of data from different styles is
used for training, and we present a method for improving
system performance by allocating training data properly
among the different styles.
In our previous papers, we focused mainly on closed-
vocabulary experiments, in which the lexicon contained all
the words in the training and test sets. In this paper, we
show how the same word-based system can be used to deal
with unlimited vocabularies with the use of a lexicon of
characters and a statistical language model at the character
level. We report on results for English and Arabic. We also
discuss the effects of the language model on the perform-
ance of the character-based recognition system and how
language model perplexity relates to the recognition rate.
This paper is organized as follows. In Section 2, we give
a short literature review on related work in the area. In Sec-
tions 3 and 4, we present an overview of the system de-
scribing our approach to using HMMs for character recog-
nition. In Section 5, we present initial results on English and
Arabic. In Section 6, we present an approach for improving
accuracy with multiple print styles. In Section 7, we show
how the word-based system can be used to perform char-
acter-level recognition with unlimited vocabulary.
2 LITERATURE REVIEW
A number of research efforts have been made that use
HMMs in off-line printed and handwriting recognition [2],
[3], [5], [23]. In all these efforts, the recognition of only a
single language is attempted. The approach we take is most
similar to those of references [1], [6], [8], [9], [10], [14] in that
they also extract features from thin slices of the image
which, in principle, could make these systems language-
independent. The approach of Elms and Illingworth [8] is similar in that they use vertical thin slices to extract one set of features, but they also use horizontal slices to extract another set of features, which makes some presegmentation at the character level necessary, hence making the system inappropriate for language-independent recognition, especially for languages with connected script. They used their system to perform recognition of printed Roman characters.
Aas and Eikvil [1] draw a bounding box around each word to be recognized and extract features from vertical thin slices. They report results on a single printed Roman font.

0162-8828/99/$10.00 © 1999 IEEE

The authors are with BBN Technologies, GTE Internetworking, Cambridge, MA 02138 USA. E-mail: [email protected].
Manuscript received 1 May 1998; revised 13 Feb. 1999. Recommended for acceptance by J. Hull.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 107677.
Kornai [10] also uses features extracted from vertical thin
slices to perform recognition of handwritten addresses from
the CEDAR corpus.
There has been little work in the use of HMMs for the
recognition of Arabic script [2]. Three references [3], [5],
[21] report on the use of HMMs to perform recognition of
Arabic script. Allam [3] uses contour tracing to locate
groups of connected characters; the recognition is then
performed on each such group as a whole, using features
extracted from vertical slices. The feature extraction of the
papers by Ben Amara and Belaid [5] and by Yarman-Vural and Atici [21] appears to be specific to Arabic and may not be easily generalizable.
The approach presented in this paper presents a num-
ber of departures from other OCR approaches. First, our
approach is focused on the problem of language-
independent recognition; the major components of the
system (feature extraction, training, and recognition) are
intended and designed to be script-independent, and
have already been demonstrated in two very different
script families: Arabic script and Roman script. Second,
the training and recognition are performed using an ex-
isting continuous speech recognition system, with no
modification; the only difference in the OCR system is in
the preprocessing and feature extraction. Third, our ap-
proach does not perform any presegmentation, neither at the character level nor at the word level. This contrasts
with other work where presegmentation may or may not
be performed at the character level, but almost always
presegmentation is assumed at the word level which, at
least for Arabic, can be problematic. Such a segmenta-
tion-free approach is also important for the recognition
of degraded documents (e.g., fax) where characters are
often connected.
3 PROBABILISTIC FRAMEWORK
3.1 Problem Setup
For the scanned data from a line of text, the goal is to find
that sequence of characters C that maximizes P(C| X), the
probability of the sequence of characters C, given a se-
quence of feature vectors X that represents the input text.
Using Bayes Rule, we can write:
P(C| X) = P(C)P(X| C)/ P(X).
We call P(X| C) the feature model and P(C) the language model
(or grammar). P(X| C) is a model of the input data for any
particular character sequence C; P(C) is the a priori prob-
ability of the sequence of characters, which describes what
is allowable in that language and with what probability;
and P(X) is the a priori probability of the data. Since P(X) is
the same for all C, maximizing P(C| X) can be accomplished
by maximizing the product P(X| C)P(C).
The feature model P(X| C) is approximated by taking
the product of the component probabilities for the different characters, P(X_i | c_i), where X_i is the sequence of feature vectors corresponding to character c_i. The feature model
for each character is given by a specific HMM. The lan-
guage model P(C) is described by a lexicon of allowable
characters and words and by a statistical language model
that can provide the probability of different sequences of
characters and words. The most popular language model
used for recognition is an n-gram Markov model, which
computes P(C) by multiplying the probabilities of con-
secutive groups of n words (or characters) in the sequence
C. Typically, bigram (n = 2) and trigram (n = 3) statistical
models are used.
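To make the character n-gram concrete, the sketch below estimates a trigram model by counting and scores a character sequence in the log domain. The add-one smoothing and the sentence-boundary markers are simplifications of ours, not the estimation procedure used in BYBLOS.

```python
from collections import defaultdict
import math

def train_trigram(corpus_lines):
    """Count character trigrams and their bigram histories from text lines."""
    tri, bi = defaultdict(int), defaultdict(int)
    charset = set()
    for line in corpus_lines:
        chars = ["<s>", "<s>"] + list(line) + ["</s>"]
        charset.update(chars)
        for i in range(2, len(chars)):
            tri[(chars[i - 2], chars[i - 1], chars[i])] += 1
            bi[(chars[i - 2], chars[i - 1])] += 1
    return tri, bi, len(charset)

def log_prob(line, tri, bi, vocab_size):
    """log P(C) under the trigram model, with add-one smoothing."""
    chars = ["<s>", "<s>"] + list(line) + ["</s>"]
    lp = 0.0
    for i in range(2, len(chars)):
        num = tri[(chars[i - 2], chars[i - 1], chars[i])] + 1
        den = bi[(chars[i - 2], chars[i - 1])] + vocab_size
        lp += math.log(num / den)
    return lp
```

A sequence whose trigrams were seen in training scores higher than one whose trigrams were not, which is exactly the preference the recognizer exploits.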
Grammar perplexity is usually used to measure the com-
plexity of a recognition task. Given a language model, the
test set perplexity Q on an independent test set of data is
defined as:
Q = P(c_1 c_2 c_3 ... c_M)^(-1/M),

where c_1 c_2 ... c_M is the sequence of characters from the entire test set, and P is the probability of the whole sequence of
characters in the test set. As a first approximation, perplex-
ity measures the difficulty of a recognition task: the smaller
the perplexity, the easier the recognition task.
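In practice the perplexity is computed from the summed log-probability rather than the raw product, which would underflow for any realistic test set. A minimal transcription of the definition:

```python
import math

def perplexity(total_log_prob, num_chars):
    """Test-set perplexity Q = P(c_1 ... c_M)^(-1/M), evaluated in the
    log domain as Q = exp(-log P / M). total_log_prob is the natural log
    of the probability of the whole test-set character sequence."""
    return math.exp(-total_log_prob / num_chars)
```

For example, a model that assigns all 26 letters equal probability at every position yields a perplexity of exactly 26.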
3.2 Overall System
In this section, we give a brief overview of our system for
OCR. Fig. 1 shows a block diagram of the OCR system.
The system depends on the estimation of character mod-
els, as well as a lexicon and grammar, from training data.
The training system takes scanned-text training data, cou-
pled with ground truth, as input. After a preprocessing
stage in which the page is deskewed and lines of text are
located, each line is divided into narrow overlapping ver-
tical windows. Then, we extract a set of simple features
for each window (see below). The character modeling
component then takes the feature vectors and the corre-
sponding ground truth and estimates the character mod-
els. Since each line of text is transcribed in terms of words,
the character modeling component also makes use of a
lexicon obtained from a large text corpus. A language
model for recognition (a grammar) is also estimated from
the same text.
The training process also makes use of orthographic
rules that depend on the type of script. For example, the
rules state that the text consists of text lines and tell
whether the lines are horizontal or vertical, and whether
the text is read from left-to-right (as in Roman script) or
right-to-left (as in Arabic script).
The recognition system in Fig. 1 has a preprocessing and
feature extraction component identical to that used in
training. Then, using the output of the feature extraction,
the recognition uses the different knowledge sources esti-
mated in the training (character models, lexicon, and
grammar) to find the character sequence that has the high-
est likelihood.
In our system, all knowledge sources, shown as ellipses
in Fig. 1, depend on the particular language or script.
However, the whole training and recognition system,
shown as rectangular boxes, is designed to be language-
independent. That means that the same basic system can
be used to recognize most of the world's languages with
little or no modification.
4 OCR SYSTEM DETAILS
4.1 Preprocessing and Feature Extraction
In order to use HMMs, we need to compute a feature vector
as a function of an independent variable. In speech, we di-
vide the speech signal into a sequence of windows (which
we call frames) and compute a feature vector for each frame;
the independent variable then is clearly time. The same
method has been applied successfully to on-line handwrit-
ing recognition, where a feature vector is computed as a
function of time also [20]. However, in OCR, we are usually
faced with the problem of recognizing a whole page of text,
so there is no obviously natural way of defining a feature
vector as a function of some independent variable and, in
fact, different approaches have been taken in the literature
[2]. At this stage in our work, we have chosen a line of text
as our major unit for training and recognition. Therefore,
we segment a page into a set of lines (which we assume to
be horizontal, without loss of generality) and then use hori-
zontal position along the line as the independent variable.
Therefore, we scan a line of text from left to right (right to
left for Arabic script), and at each horizontal position, we
compute a feature vector that represents a narrow vertical
strip of the input, which we call a frame. The result is a fea-
ture vector as a function of horizontal position.
Prior to finding the lines, we find the skew angle of the
page and rotate the image so that the lines are horizontal.
We then use a horizontal projection of the page to help find
the lines. For each line of text, we find the top and bottom
of the line. Once each line is located, we are ready to per-
form feature extraction.
We divide a line into a sequence of overlapping frames.
Each frame is a narrow vertical strip whose width is a small
fraction (typically about 1/15) of the height of the line, and
the height is normalized to minimize the dependence on
font size. The overlap from one frame to the next is a sys-
tem parameter; currently, the overlap is equal to two-thirds
of the frame width (see Fig. 2). Fig. 2 also shows that each
frame is divided into 20 equal overlapping cells (again, the
cell overlap is a system parameter). The features we com-
pute are simple and script-independent:
intensity (percentage of black pixels within each cell)
as a function of vertical position;
vertical derivative of intensity (across vertical cells);
horizontal derivative of intensity (across overlapping
frames);
local slope and correlation across a window of two
cells square.
Note that we have specifically chosen not to include fea-
tures that require any form of partial recognition, such as
subcharacter pieces (e.g., lines, curves, dots), nor did we
want to include features that are specific to a particular type
of script.
Although the intensity features alone represent the entire
image, we include other features, such as vertical and hori-
zontal derivatives, and local slope and correlation, so as to
include more global information and to help overcome the
limitation imposed by the conditional independence as-
sumption inherent in HMMs. The result is a set of 80 simple
features per frame.
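A simplified version of the framing and intensity features can be sketched as follows. The cells here do not overlap and the local slope and correlation features are omitted, so this illustrates the framing scheme rather than reproducing the exact 80-feature front end.

```python
import numpy as np

def line_features(line_img, n_cells=20):
    """Illustrative intensity features for one height-normalized line image
    (2D array, 1 = black pixel). Each frame is a narrow vertical strip about
    1/15 of the line height wide, advanced by roughly one-third of its width
    (about 2/3 overlap), and split into vertical cells."""
    h, w = line_img.shape
    fw = max(1, h // 15)            # frame width ~1/15 of line height
    step = max(1, fw // 3)          # advance ~1/3 frame -> ~2/3 overlap
    frames = []
    for x in range(0, w - fw + 1, step):
        strip = line_img[:, x:x + fw]
        cells = np.array_split(strip, n_cells, axis=0)
        # intensity = fraction of black pixels in each cell
        frames.append(np.array([c.mean() for c in cells]))
    feats = np.stack(frames)                    # (n_frames, n_cells)
    vert_deriv = np.gradient(feats, axis=1)     # across cells within a frame
    horiz_deriv = np.gradient(feats, axis=0)    # across successive frames
    return np.concatenate([feats, vert_deriv, horiz_deriv], axis=1)
```

The output is a feature vector per frame, indexed by horizontal position, which is what the HMMs consume.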
Fig. 1. A block diagram of the OCR system.
Fig. 2. Dividing a line of text into frames and each frame into cells.
4.2 HMM Character Structure
The central model of the OCR system is the HMM of each
character. For each model, we need to specify the number
of states and the allowable transitions among the states.
Associated with each state is a probability distribution over
the features. The model for a word, then, is a concatenation
of the different character models.
Our HMM character structure is a left-to-right structure
that is similar to the one used in speech (see Fig. 3). The
loops and skips in Fig. 3 allow relatively large, nonlinear
variations in horizontal position. During training and rec-
ognition, for any particular instance of a character, several
of the input frames may get mapped to each of the states
and other states may not be used at all. In our current sys-
tem, we are using 14 states for all character HMMs. This
structure was chosen subjectively to represent the charac-
ters with the greatest number of horizontal transitions.
While it would be possible to model different characters
with different numbers of states, we find it easier to use the
same number of states for all characters.
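The loop-and-skip topology can be encoded as a banded transition matrix. The probability values below are arbitrary placeholders of ours; the paper fixes only the structure (self-loops, skips of one state), and the actual transition probabilities are trained from data.

```python
import numpy as np

def left_to_right_transitions(n_states=14, p_loop=0.3, p_skip=0.1):
    """Transition matrix for a left-to-right HMM with self-loops and
    one-state skips. p_loop and p_skip are hypothetical initial values."""
    A = np.zeros((n_states, n_states))
    for s in range(n_states):
        A[s, s] = p_loop                                  # self-loop
        if s + 1 < n_states:
            A[s, s + 1] = 1.0 - p_loop - (p_skip if s + 2 < n_states else 0.0)
        if s + 2 < n_states:
            A[s, s + 2] = p_skip                          # skip one state
    A[-1, -1] = 1.0                                       # final state absorbs
    return A
```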
4.2.1 Probability Structure
Each state in our HMMs has an associated probability den-
sity on the feature vector. For the experiments discussed in
this paper, we employed what is known in the speech rec-
ognition literature as a tied-mixture Gaussian structure [4].
For computational reasons, we divide our 80-dimensional
feature vector into eight separate subvectors of 10 features
each, which we model as conditionally independent; thus
our probability densities can be expressed as a product of
eight probabilities, one for each subvector. The probability
density corresponding to each subvector is modeled using a
mixture of a shared pool of 64 Gaussian densities. The pa-
rameters for these densities are estimated using a clustering
process [11] that is run on a subset of the training data.
Thus, each state has, for each of the eight subvectors, 64
mixture weights (one for each Gaussian) that represent
the probability density for that subvector.
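The tied-mixture state likelihood is then a product over the eight subvectors of mixtures over the shared 64-Gaussian pool. The sketch below assumes diagonal covariances (common in such systems, though not stated in the paper) and computes the log-density of one 80-dimensional frame for one state:

```python
import numpy as np

def tied_mixture_logprob(x, means, variances, weights):
    """Log-density of an 80-dim frame under one tied-mixture HMM state.
    means, variances: (8, 64, 10) shared Gaussian pool (diagonal covariance);
    weights: (8, 64) state-specific mixture weights. Shapes and the diagonal
    assumption are ours; the paper gives only the overall structure."""
    x = x.reshape(8, 1, 10)                       # 8 subvectors of 10 dims
    # log N(x; mu, diag(var)) for every Gaussian in the pool -> (8, 64)
    log_norm = -0.5 * (np.log(2 * np.pi * variances)
                       + (x - means) ** 2 / variances).sum(axis=2)
    # log-sum-exp over the 64 mixture components, per subvector
    a = np.log(weights) + log_norm
    m = a.max(axis=1, keepdims=True)
    log_p = m[:, 0] + np.log(np.exp(a - m).sum(axis=1))
    return log_p.sum()                            # product over 8 subvectors
```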
4.3 Training and Recognition
The training algorithm remains unchanged from that used
in our BYBLOS speech recognition system [15]. We do not
require the scanned data to be hand segmented nor aligned,
neither at the character level nor at the word level. We only
use simple ground truth transcriptions, which specify the
sequence of characters to be found on each line. The prob-
ability densities corresponding to the states in the various
HMMs are all initialized to be uniform densities. We then
use the forward-backward training algorithm to derive es-
timates of the model parameters [17]. The resulting models
maximize the likelihood of the training data.
In addition to the character HMMs, we also compute a
lexicon and a language model. These are usually obtained
using a large text corpus. The language model is usually a
bigram or trigram model that contains the probabilities of
all pairs or triples of words. Note that only the text is
needed for language modeling; it is not necessary to have
the corresponding scanned image. In this way, much larger
amounts of text can be used to develop more powerful lan-
guage models, without having to get more scanned training
data.
The recognition algorithm is also identical to that used in
our speech recognition system. Given the output of the
analysis of a line of text, the recognition process consists
mainly in a search for the most likely sequence of characters
given the sequence of input features, the lexicon, and the
language model. Since the Viterbi algorithm would be quite
expensive when the state space includes a very large vo-
cabulary and a bigram or trigram language model, we use a
multi-pass search algorithm [19].
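For reference, the single-model Viterbi recurrence that the multi-pass search approximates looks like this; with a large vocabulary and an n-gram history folded into the state space, this direct form becomes too expensive, which motivates the multi-pass algorithm.

```python
import numpy as np

def viterbi(log_A, log_pi, log_obs):
    """Most likely state sequence for one HMM.
    log_A: (n, n) log transition matrix; log_pi: (n,) log initial probs;
    log_obs: (T, n) per-frame log observation probabilities."""
    T, n = log_obs.shape
    delta = log_pi + log_obs[0]
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A        # score[from, to]
        back[t] = scores.argmax(axis=0)        # best predecessor per state
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):              # trace back pointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```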
5 BASELINE EXPERIMENTS
In this section, we present baseline results on Arabic and
English which we use as a control for our discussion on
fonts and unlimited vocabulary.
For all our results, we measured the average Character
Error Rate (CER) following speech recognition conventions
(i.e., we added the number of substitutions, deletions, and
insertions, and divided by the total number of characters in
the transcriptions provided). Each recognized line was
aligned with ground truth (using dynamic programming)
so as to minimize the error rate just defined.
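A minimal implementation of this metric uses the standard Levenshtein dynamic program; the alignment mentioned above is the minimum-edit-distance path.

```python
def character_error_rate(ref, hyp):
    """CER = (substitutions + deletions + insertions) / len(ref),
    from a minimum-edit-distance alignment of hypothesis to reference."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[m][n] / m
```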
5.1 Closed-Vocabulary Arabic Experiments
5.1.1 The Arabic Corpus
For our Arabic experiments, we chose the DARPA Arabic
OCR Corpus, collected at SAIC [7]. The corpus consists of
345 pages of Arabic text (~670k characters) scanned at 600
dots per inch from a variety of sources of varying quality,
including books, magazines, newspapers, and four com-
puter fonts. Shown in Fig. 4 are examples from the DARPA
Arabic OCR Corpus. In Fig. 4a, four text lines from the four
synthetic computer fonts are shown in the order: Geeza,
Baghdad, Kufi, and Nadim. Fig. 4b shows a sample from a
magazine page, Fig. 4c shows a book sample, and Fig. 4d
shows a newspaper sample.
Associated with each image in the corpus is the text
transcription, indicating the sequence of characters on each
line. But the location of the lines and the location of the
characters within each line are not provided. The corpus
transcription contains 89 unique characters, including
punctuation and special symbols. However, the shapes of
Arabic characters can vary a great deal, depending on their
context. Fig. 5 shows examples of a few Arabic words, each
written with the characters isolated and as the word ap-
pears in normal print. Note that not all characters are con-
nected inside a word and that the shape of a character de-
pends on the neighboring characters. The various shapes,
including ligatures and context-dependent forms, were not
identified in the ground truth transcriptions.
The fact that an Arabic character can take four or more forms made it possible to have a model for each form of each character, expanding the character set from 89 characters to 157 forms. Therefore, we treat the different forms of an Arabic character as different characters.

Fig. 3. Hidden Markov model (HMM) for characters. Shown is a seven-state model. For OCR, we have used 14-state models.
We also added to the character set four forms of the two
most common ligatures (see Fig. 6). Fig. 7 shows the char-
acter set we used in our system for Arabic.
5.1.2 Experimental Results
We ran experiments under two conditions: unifont and
omnifont. In the unifont experiment, the system was
trained on each of the four computer fonts separately and
tested on different data from the same font. In the omnifont
experiment, 40 pages were randomly chosen from the cor-
pus, two-thirds of which were used for training and the
remaining one-third of the pages was used as the test set.
For all our training and test sets, we used only pages where
the line-finding procedure found the right number of lines
for the page. For the DARPA Arabic corpus, the right num-
ber of lines was found for more than 99 percent of the
pages.
The average CER was 0.4 percent for the unifont recog-
nition experiment and 2.6 percent for the omnifont experi-
ment. As expected, the error rate increases from the unifont
to the omnifont condition.
The conditions of this experiment were unrealistic in one
respect: The recognition system employed a closed-vocabulary lexicon of 30k words (i.e., all the words in the test were in the lexicon). In Section 7, we will show how to remove this constraint by using an open-vocabulary recognition system that does recognition at the character level instead of the word level.

Fig. 4. Sample fonts. (a) Four computer fonts: Geeza, Baghdad, Kufi, and Nadim. (b) Sample from a magazine page. (c) Sample taken from a book. (d) A newspaper sample.
5.2 Closed-Vocabulary English Experiments
For our English experiments, we chose the University of
Washington (UW) English Document Image Database I
[16]. This corpus contains 958 pages scanned from technical
articles. Each page is divided into relatively homogeneous
zones, each of which is classified into one of 14 categories
(e.g., text, table, text-with-special-symbols, math), and a
ground truth transcription is provided. Of a total of 13,238
zones, 10,654 are text zones. Text zones were also classified
into one of three styles: plain, italic, or bold. The label was
assigned depending on the dominant style in the zone; for
example, a plain zone could contain some italic words in it.
For these experiments, we used only data from the text
zones.
In order to deal with this data, we made only two changes
to our Arabic system: We took account of the fact that Eng-
lish is written left-to-right instead of right-to-left, and we
parameterized the system to take other sampling rates (in
this case 300 dpi instead of 600 dpi). Just like in the Arabic
system, we used 14-state HMMs to model the characters.
For our initial experiment, we used only 2,441 text zones.
Using a random choice, we took two-thirds of the zones for
training (~600k characters) and the remainder for test. We used
a closed-vocabulary lexicon of about 30k words taken from all
the words in the training and test, and the size of the character
set was 90. This character set included all uppercase and low-
ercase English characters, as well as all punctuation symbols
used in English. We obtained a CER of 1.2 percent.
A breakdown of the errors revealed that the error rate on
nonitalic characters was only 0.5 percent, while italics had a
6 percent error rate, even though 15 percent (~100k charac-
ters) of the training was italics. This observation raised two
questions: First, are italic characters more difficult to recog-
nize using our system? Second, is the use of 100k characters
of italic training insufficient for good performance? This is
what we try to answer in the next section.
6 DEALING WITH MULTIPLE PRINT STYLES
6.1 Italic Models
We began by considering the case in which we know a priori that our data is italics. In such a case, we could train a model for the expressed purpose of recognizing italics. By using information provided in the UW corpus, we collected a data set consisting of lines that are predominantly italic.

Fig. 5. Arabic words, written (a) with the characters isolated and (b) as they normally appear in print.

Fig. 6. The two most common ligatures.

Fig. 7. Forms and ligatures of Arabic characters used in the OCR system.
We trained a model using a subset of this data (~50k char-
acters) and tested on the remainder (~10k characters). In
doing this, we achieved a CER of 0.6 percent when testing
on italic data. The fact that this result was close to the 0.5
percent result for plain (nonitalic) data demonstrated that it
is not much harder to recognize italic data with our system.
It also showed that we did have enough italic training in
our original experiment.
The italic result raised an important question: Know-
ing that we have enough training for both plain and
italics, why does a system trained with a natural mix of
the two styles have a much higher error rate on one of
the two styles? This is what we try to answer in the fol-
lowing subsection.
6.2 A Mathematical Explanation and Solution
Although the use of HMMs has contributed to major ad-
vances in automatic speech recognition, there are some in-
herent limitations of this approach. One major limitation is
the statistical independence assumption, i.e., the assump-
tion that successive observations are independent, and
therefore the probability of a sequence of observations can
be written as a product of probabilities of individual obser-
vations [17]. This assumption is clearly not valid in many
cases. For example, consecutive frames of the same word
have the same font, style and noise correlation and cannot
be statistically independent.
Here, we explore the effect of the statistical independence
assumption of HMMs on training a multistyle (or multifont)
system and how this relates to the higher error rate on italic
(6 percent error rate) compared to plain (0.5 percent error
rate) when a single model is used for both styles.
We consider two cases: using separate style-dependent
models versus using a single model for all styles. When
using a style-dependent model, our objective is to find the
character c and the style style that is most likely given the
data, or maximizing P(c, style| X), where X is the sequence
of frames corresponding to c. From Bayes rule:
P(c, style | X) = P(c, style) P(X | c, style) / P(X).
Since a character can exist in many styles, we can make
the reasonable assumption that the character and the style
are independent. The above equation can then be written as:
P(c, style | X) = P(c) P(style) P(X | c, style) / P(X).
If we now assume that the output observations are statisti-
cally independent, then:
P(c, style | X) = [P(c) / P(X)] P(style) ∏_{f ∈ frames} P(X_f | c, style),   (1)

where X_f is the observation at frame f.
The other condition we consider is when a single model
is used for all styles. Here, our objective is to find the char-
acter c that will maximize P(c| X), regardless of style. Simi-
lar to the exposition above, we can write:
P(c | X) = P(c) P(X | c) / P(X),
and with the independence assumption, we have:
P(c | X) = [P(c) / P(X)] ∏_{f ∈ frames} P(X_f | c).
Expanding over all styles:
P(c | X) = [P(c) / P(X)] ∏_{f ∈ frames} Σ_{styles} P(style) P(X_f | c, style).   (2)
Comparing (1) and (2), we note that in (1), P(style) is
used to weight P(c, style| X) only once, while in (2), P(style)
is contributing to the value of P(c| X) at every frame
through the multiplication of every probability by P(style).
Thus, in (2), the overall effect is to weight the output prob-
abilities by P(style) raised to the number of frames (usually,
we have an average of 20 to 25 frames per character). When
different styles in training are present in different propor-
tions, that difference is vastly exaggerated through the pro-
cess of exponentiation.
For example, assume that 15 percent of the training data
is italic and 85 percent is plain. Also assume that the aver-
age number of frames per character is 20. Then, the italic
data will be weighted by a factor of 0.15^20 ≈ 3.3 × 10^-17 and the plain data will be weighted by a factor of 0.85^20 ≈ 3.9 × 10^-2. By taking the ratio of the two weightings, it becomes
clear that the plain data totally dominates the estimation of
the probability model and the effect of the italic data be-
comes negligible. The result is effectively the same as if the
model were trained on plain data alone. During recogni-
tion, such a model would be expected to do well with plain
data but should result in much higher error rates for italic
data, as we have witnessed in our experiments.
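The numbers in this example are easy to reproduce:

```python
# Under the independence assumption, P(style) multiplies the likelihood
# once per frame; with ~20 frames per character, the training shares are
# effectively raised to the 20th power.
italic_share, plain_share, n_frames = 0.15, 0.85, 20

w_italic = italic_share ** n_frames   # ~3.3e-17
w_plain = plain_share ** n_frames     # ~3.9e-2

# Plain outweighs italic by roughly 15 orders of magnitude, so a single
# model trained on the natural mix behaves as if trained on plain alone.
dominance = w_plain / w_italic
```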
Based on the mathematical argument above, we can find
the right amount of training data from each style so that the
final weighting corresponds to the natural mix between the
two styles. For our numerical example, if we choose x per-
cent (x to be determined) as the percentage of italic training,
then we should have:
(x / (100 - x))^20 = 15 / 85.
Solving for x yields 47.8 percent. Therefore, if we allo-
cate our training data as 47.8 percent italic and 52.2 per-
cent plain, the net effect will be weighting the final models
with the natural mix ratio of 15 percent and 85 percent,
respectively.
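The 47.8 percent figure follows from requiring that the training shares, raised to the assumed 20-frames-per-character power, reproduce the natural 15/85 mix; the condition (x/(100 - x))^20 = 15/85 solves in closed form:

```python
# Solve (x / (100 - x))**20 = 15/85 for the italic training share x.
ratio = (15 / 85) ** (1 / 20)     # required per-frame italic/plain ratio
x_italic = 100 * ratio / (1 + ratio)
x_plain = 100 - x_italic          # x_italic ~ 47.8, x_plain ~ 52.2
```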
Finding the right amount of training cannot be easily
generalized to more than two styles (fonts). Even in the
two-style case, we do not know the durations of each
character a priori. Therefore, we have tried to ameliorate
the exponentiation problem by following the approxi-
mate solution of using equal amounts of training from
the different styles. Having equal amounts of training
results in similar weighting for all styles. This approach
results in a somewhat unmatched condition between
training and test since the final model will have similar
weighting for all styles as opposed to the natural
weighting (15 percent to 85 percent in our example
above). However, this mismatch is far less severe than
that due to exponentiation and, therefore, should result
in better performance.
6.3 Balanced Training Models
To test the above solution, we performed a balanced train-
ing experiment where we used an equal amount of italic
and plain text for training (~50k characters each) but tested
the resulting model on the same data as before. The result is
shown in the last row of Table 1. Using a single model for
all data, the CER was the same, 0.8 percent, for the plain and
italic portions of the test. This is to be contrasted with the
second row of the table which shows the results of the pre-
vious multifont experiment, which used a natural mix for
training the model.
The use of equal amounts of data for training from each
font reduced the CER on italic from 6 percent to 0.8 percent
while the CER for the plain data increased from 0.5 percent
to 0.8 percent. So, on average, the use of a single balanced
training model reduced the overall error rate from 1.2 percent
to 0.8 percent, a significant 33 percent overall reduction
in character error rate. For comparison purposes, we
also show in the first row of Table 1 the results of the more
computationally expensive method of using multiple style-
dependent models.
7 UNLIMITED VOCABULARY OCR
The BYBLOS system is a word-based system that allows for
recognition of only a closed set of words that constitute the
system's lexicon. To overcome this limitation, we
perform character-level recognition by allowing the char-
acter to play the role of the word in the system. Thus, the
lexicon is a list of the possible characters and the language
model is an n-gram (bigram or trigram) on sequences of
characters.
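As an illustrative sketch (not the BYBLOS implementation), an n-gram language model over character sequences can be estimated simply by counting; the add-one smoothing and the 90-character vocabulary size used here are assumptions for the example:

```python
# Sketch of a character trigram language model: each character plays the
# role of a word, and the "lexicon" is just the character set.
from collections import Counter

def train_char_trigram(text):
    # Count character trigrams and their bigram contexts; '#' pads the start.
    tri, bi = Counter(), Counter()
    padded = "##" + text
    for i in range(len(padded) - 2):
        bi[padded[i:i + 2]] += 1
        tri[padded[i:i + 3]] += 1
    return tri, bi

def trigram_prob(tri, bi, context, char, vocab_size=90):
    # P(char | context) with add-one smoothing over an assumed
    # 90-character vocabulary (illustrative, not the paper's smoothing).
    return (tri[context + char] + 1) / (bi[context] + vocab_size)
```

In practice, the recognizer would use smoothed log probabilities of this form in place of the word-level language model scores.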
In Sections 7.1 and 7.2, we present results on character-
based recognition for English and Arabic, respectively. As
expected, the CER increases when compared to the word-
based closed-vocabulary results. Then, in Section 7.3, we
show how good performance can be achieved for unlimited
vocabularies through a hybrid approach that combines
character and word level recognition.
7.1 Character-Based Recognition for English
The first unlimited vocabulary experiment we present here
is on English. Under the same experimental conditions as
our word-based balanced-training English experiment of
0.8 percent CER, we instead built character models and per-
formed recognition using both a bigram and a trigram
grammar on characters.
Table 2 summarizes the results of the experiments using
different language models. For each model, the table gives
the character perplexity and the corresponding CER under
balanced training. The first two results are for character-based
recognition (using no lexicon of words) and the
third row is the word-based, closed-vocabulary recognition
result, given in Table 1, using the 30k-word lexicon. We can
see two effects for changing the language model (as described
in Section 3.1). First, as we go from a bigram on
characters to a trigram on characters, perplexity goes down
from 13.0 to 8.6. When we then use a lexicon and a bigram
on words, the perplexity decreases to 2.8, indicating an
easier recognition task. Second, the CER decreases when we
use a trigram instead of a bigram on characters, as can be
seen from the last column of Table 2, and it decreases fur-
ther when we use a lexicon and a bigram on words. Doing
character-based recognition for English without the use of a
lexicon allowed for unlimited vocabulary but degraded the
performance by roughly a factor of three (from 0.8 percent
to 2.1 percent).
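For reference, the perplexity values in Table 2 follow the standard definition: the geometric mean of the inverse per-character probabilities the language model assigns to the test set. A minimal sketch:

```python
import math

def perplexity(char_probs):
    # Perplexity = 2 ** (-(1/N) * sum(log2 p)). A perplexity of 8.6 means
    # the model is, on average, as uncertain as a uniform choice among
    # 8.6 characters.
    n = len(char_probs)
    return 2 ** (-sum(math.log2(p) for p in char_probs) / n)

print(perplexity([0.5, 0.5, 0.5, 0.5]))  # 2.0
```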
7.2 Character-Based Recognition for Arabic
As mentioned in Section 5.1, our Arabic character set con-
sists of all the forms and ligatures shown in Fig. 7. For
training and test sets chosen from 40 pages of the DARPA
corpus, as presented in Section 5.1, we obtained a CER of
2.6 percent using our closed-vocabulary, word-based sys-
tem. Using the same training and test but performing un-
limited-vocabulary recognition with the form-based models
and a trigram language model on forms, we obtained a
CER of 4.5 percent. Similar to English, the performance de-
grades in going from word to character recognition, but in
this case with a degradation of only a factor of two (from
2.6 percent to 4.5 percent) as opposed to a factor of three for
English.
TABLE 1. CHARACTER ERROR RATES FOR ENGLISH UNDER DIFFERENT TRAINING AND TEST CONDITIONS
TABLE 2. ENGLISH CER VERSUS MODEL PERPLEXITY
7.3 A Hybrid Recognition System
Turning the system into a character-based recognition sys-
tem as we did in Sections 7.1 and 7.2 allows for any se-
quence of characters and hence for unlimited vocabulary.
However, in doing this, we lose a significant amount of
prior language information that the lexicon provides during
recognition. As we saw in the two previous subsections, not
using a lexicon increases the error rate by a factor of two to
three.
To solve this problem, we used a hybrid approach where
we perform character-based recognition together with some
higher level constraints set by a word lexicon and a uni-
gram language model at the word level. Using this hybrid
approach for English and Arabic, Table 3 summarizes the
results. The first two columns summarize our previous re-
sults for word-based and character-based recognition, while
the third column shows the results for the hybrid system.
Using the hybrid recognition system, we obtained an error
rate of 1.1 percent on the balanced-training English experiment
and 3.3 percent on the 40-page Arabic experiment.
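The details of the hybrid system are deferred to a subsequent paper; purely as an illustrative assumption, one plausible way to impose word-level constraints on character-level output is to rescore a list of character-hypothesis strings with a word-unigram score plus an out-of-vocabulary penalty. All names, weights, and penalties below are hypothetical:

```python
# Hypothetical rescoring sketch: combine a character-model score with a
# word-level unigram score from a lexicon (not the authors' actual method).
def rescore(hypotheses, lexicon_logprob, oov_logprob=-12.0, weight=0.5):
    # hypotheses: list of (text, char_model_logprob) pairs.
    # Words outside the lexicon fall back to a flat OOV penalty, so any
    # character sequence remains admissible (open vocabulary).
    def word_score(text):
        return sum(lexicon_logprob.get(w, oov_logprob) for w in text.split())
    return max(hypotheses, key=lambda h: h[1] + weight * word_score(h[0]))
```

Under this scheme, a hypothesis containing a lexicon word is favored over an acoustically similar non-word, which is consistent with the improvements reported in Table 3.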
These results show surprisingly large improvements
over the error rates we obtained by doing only character-based
recognition. The hybrid system, without the
closed-vocabulary constraint, performed close to the
closed-vocabulary system (compare the first and third columns
in Table 3), which leads us to believe that this is a
good approach for unlimited vocabularies while still making use
of a lexicon and a word-level language model. The details
of the hybrid system will be presented in a subsequent
paper.
Comparing Arabic to English, the average error rate for
Arabic was about three times that of English (3.3 percent
versus 1.1 percent). We speculate that the higher error rate
for Arabic is due to several causes:
1) the greater similarity, and hence confusability, of Ara-
bic characters;
2) the connectedness of Arabic characters and the exis-
tence of ligatures;
3) the wider diversity of fonts in the Arabic corpus; and
4) the lower quality of some of the Arabic data.
Finally, even though the English and Arabic corpora we
used for our training and testing are also used by other re-
searchers in the field, no standard training and testing sets
have been defined that could allow for comparing our re-
sults to those of other methods used on the same corpora.
8 CONCLUSION
In this paper, we presented an omnifont open-vocabulary
OCR system for English and Arabic that is based on Hid-
den Markov Models. We showed that our HMM-based
OCR system has several benefits, two of which are: no
segmentation is required at the word or character levels,
and the system is language-independent, that is, the same
system can be used for different languages with little or
no modification.
We addressed the issue of the simultaneous recognition
of two English styles: plain and italic. We showed, mathe-
matically and empirically, that our initial high error rate on
italics, when using a single overall model, was due to the
conditional independence assumption inherent in the
HMM framework. We presented a method for ameliorating
the problem by balancing the training between the two
styles.
We also presented a technique for using our word-based
system to handle unlimited vocabularies. Using a lexicon
consisting only of the characters themselves (the size of the
character set, e.g., 90 for English) and letting these characters
play the role of words enables our system to recognize any
sequence of characters, thus allowing for unlimited-vocabulary
recognition. We presented results that show the effect of using
bigram and trigram language models on sequences of char-
acters to improve recognition accuracy. To recover the lexi-
con information, we combined character and word recog-
nition, and we were able to achieve open-vocabulary per-
formance of 1.1 percent CER for English and 3.3 percent for
Arabic, results which are close to the closed-vocabulary
system performance.
When performing character-based decoding, the recognition
speed is around 35 characters per second; with
word-based decoding, where the vocabulary is much larger,
the recognition speed is about an order of magnitude lower.
We are currently working on a fast hybrid procedure
whose recognition speed is comparable to that of
character-based recognition alone.
For our future work, we plan to move in two directions.
First, we plan to port the system to Chinese OCR, which
will further demonstrate the language-independence aspect
of the overall methodology. Implementing Chinese OCR
will also bring a new challenge because of the large cardi-
nality of the Chinese character set. Second, we plan to test
our system on degraded and noisy data, such as fax and
nth-generation photocopies.
ACKNOWLEDGMENT
An earlier version of this paper was presented at ICDAR in
1997 [22].
REFERENCES
[1] K. Aas and L. Eikvil, "Text Page Recognition Using Grey-Level Features and Hidden Markov Models," Pattern Recognition, vol. 29, pp. 977-985, 1996.
TABLE 3. ENGLISH AND ARABIC HYBRID RECOGNITION RESULTS
[2] B. Al-Badr and S. Mahmoud, "Survey and Bibliography of Arabic Optical Text Recognition," Signal Processing, vol. 41, no. 1, pp. 49-77, 1995.
[3] M. Allam, "Segmentation Versus Segmentation-Free for Recognizing Arabic Text," Proc. SPIE, vol. 2,422, pp. 228-235, 1995.
[4] J. Bellegarda and D. Nahamoo, "Tied Mixture Continuous Parameter Models for Large Vocabulary Isolated Speech Recognition," IEEE Int'l Conf. Acoustics, Speech, Signal Processing, vol. 1, pp. 13-16, Glasgow, Scotland, May 1989.
[5] N. Ben Amara and A. Belaid, "Printed PAW Recognition Based on Planar Hidden Markov Models," 13th Int'l Conf. Pattern Recognition, vol. 2, pp. 220-224, Vienna, 1996.
[6] W. Cho, S.-W. Lee, and J.H. Kim, "Modeling and Recognition of Cursive Words With Hidden Markov Models," Pattern Recognition, vol. 28, pp. 1,941-1,953, 1995.
[7] R.B. Davidson and R.L. Hopley, "Arabic and Persian OCR Training and Test Data Sets," Proc. Symp. Document Image Understanding Technology (SDIUT97), pp. 303-307, Annapolis, Md., 1997.
[8] A.J. Elms and J. Illingworth, "Modelling Polyfont Printed Characters With HMMs and a Shift Invariant Hamming Distance," Proc. Int'l Conf. Document Analysis and Recognition, pp. 504-507, Montreal, Canada, 1995.
[9] A. Kaltenmeier, T. Caesar, J.M. Gloger, and E. Mandler, "Sophisticated Topology of Hidden Markov Models for Cursive Script Recognition," Proc. Int'l Conf. Document Analysis and Recognition, pp. 139-142, Tsukuba City, Japan, 1993.
[10] A. Kornai, "Experimental HMM-Based Postal OCR System," Proc. Int'l Conf. Acoustics, Speech, Signal Processing, vol. 4, pp. 3,177-3,180, Munich, Germany, 1997.
[11] J. Makhoul, S. Roucos, and H. Gish, "Vector Quantization in Speech Coding," Proc. IEEE, vol. 73, pp. 1,551-1,588, 1985.
[12] J. Makhoul and R. Schwartz, "State of the Art in Continuous Speech Recognition," Proc. Nat'l Acad. Sci. USA, vol. 92, pp. 9,956-9,963, Oct. 1995.
[13] J. Makhoul, R. Schwartz, C. LaPre, C. Raphael, and I. Bazzi, "Language-Independent and Segmentation-Free Techniques for Optical Character Recognition," Document Analysis Systems Workshop, pp. 99-114, Malvern, Pa., Oct. 1996.
[14] M. Mohamed and P. Gader, "Handwritten Word Recognition Using Segmentation-Free Hidden Markov Modeling and Segmentation-Based Dynamic Programming Techniques," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 5, pp. 548-554, May 1996.
[15] L. Nguyen, T. Anastasakos, F. Kubala, C. LaPre, J. Makhoul, R. Schwartz, N. Yuan, G. Zavaliagkos, and Y. Zhao, "The 1994 BBN/BYBLOS Speech Recognition System," Proc. ARPA Spoken Language Systems Technology Workshop, pp. 77-81, Austin, Texas, Jan. 1995. San Mateo, Calif.: Morgan Kaufmann Publishers, 1995.
[16] I.T. Phillips, S. Chen, and R.M. Haralick, "CD-ROM Document Database Standard," Proc. Int'l Conf. Document Analysis and Recognition, pp. 478-483, Tsukuba City, Japan, Oct. 1993.
[17] L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. IEEE, vol. 77, no. 2, pp. 257-286, Feb. 1989.
[18] R. Schwartz, C. LaPre, J. Makhoul, C. Raphael, and Y. Zhao, "Language-Independent OCR Using a Continuous Speech Recognition System," Proc. Int'l Conf. Pattern Recognition, pp. 99-103, Vienna, Aug. 1996.
[19] R. Schwartz, L. Nguyen, and J. Makhoul, "Multiple-Pass Search Strategies," in C.-H. Lee, F.K. Soong, and K.K. Paliwal, eds., Automatic Speech and Speaker Recognition: Advanced Topics, pp. 429-456. Kluwer Academic Publishers, 1996.
[20] T. Starner, J. Makhoul, R. Schwartz, and G. Chou, "On-Line Cursive Handwriting Recognition Using Speech Recognition Methods," IEEE Int'l Conf. Acoustics, Speech, Signal Processing, pp. V-125-128, Adelaide, Australia, 1994.
[21] F.T. Yarman-Vural and A. Atici, "A Heuristic Algorithm for Optical Character Recognition of Arabic Script," Proc. SPIE, vol. 2,727, part 2, pp. 725-736, 1996.
[22] I. Bazzi, C. LaPre, J. Makhoul, and R. Schwartz, "Omnifont and Unlimited Vocabulary OCR for English and Arabic," Proc. Int'l Conf. Document Analysis and Recognition, vol. 2, pp. 842-846, Ulm, Germany, 1997.
[23] C.B. Bose and S.-S. Kuo, "Connected and Degraded Text Recognition Using Hidden Markov Model," Pattern Recognition, vol. 27, pp. 1,345-1,363, 1994.
Issam Bazzi is a staff scientist at BBN Technologies, GTE Internet-
working, Cambridge, Massachusetts, and is pursuing a PhD degree at
the Massachusetts Institute of Technology (MIT). He received his BE in
computer and communication engineering from the American Univer-
sity of Beirut in 1993 and his SM in information technology from MIT in
1997. He has been working at MIT since 1993 on networked multime-
dia systems. He joined BBN in 1996, working mainly on optical char-
acter recognition, which is the topic of his PhD study at MIT.
Richard Schwartz is a principal scientist at BBN Technologies, GTE
Internetworking, Cambridge, Massachusetts. He joined BBN in 1972,
after receiving an SB in electrical engineering from MIT. Since then, he
has worked on phonetic recognition and synthesis, speech coding,
speech enhancement in noise, speaker identification and verification,
speech recognition and understanding, fast search algorithms, neural
networks, online handwriting recognition, optical character recognition,
and statistical text processing.
John Makhoul is a chief scientist at BBN Technologies, GTE Internet-
working, Cambridge, Massachusetts. He is also an adjunct professor at
Northeastern University and at the University of Massachusetts and a
research affiliate at MIT. An alumnus of the American University of
Beirut and the Ohio State University, he received a PhD from MIT in
1970 in electrical engineering. Since then, he has been with BBN,
directing various projects in speech recognition and understanding,
speech coding, speech synthesis, speech enhancement, signal proc-
essing, artificial neural networks, and character recognition. Dr. Mak-
houl is a Fellow of the IEEE and of the Acoustical Society of America.