Pattern Recognition, Vol. 31, No. 12, pp. 2055–2076, 1998
© 1998 Pattern Recognition Society. Published by Elsevier Science Ltd
All rights reserved. Printed in Great Britain
0031-3203/98 $19.00 + 0.00

PII: S0031-3203(98)00067-3

AUTOMATIC TEXT LOCATION IN IMAGES AND VIDEO FRAMES

ANIL K. JAIN and BIN YU
Department of Computer Science, Michigan State University, East Lansing, MI 48824-1027, U.S.A.

(Received 22 January 1998; in revised form 23 April 1998)

Abstract—Textual data is very important in a number of applications such as image database indexing and document understanding. The goal of automatic text location without character recognition capabilities is to extract image regions that contain only text. These regions can then be either fed to an optical character recognition module or highlighted for a user. Text location is a very difficult problem because the characters in text can vary in font, size, spacing, alignment, orientation, color and texture. Further, characters are often embedded in a complex background in the image. We propose a new text location algorithm that is suitable for a number of applications, including conversion of newspaper advertisements from paper documents to their electronic versions, World Wide Web search, color image indexing and video indexing. In many of these applications it is not necessary to extract all the text, so we emphasize extracting important text with large size and high contrast. Our algorithm is very fast and has been shown to be successful in extracting important text in a large number of test images. © 1998 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved

Automatic text location; Web search; Image database; Video indexing; Multivalued image decomposition; Connected component analysis

1. INTRODUCTION

Textual data carry useful and important information. People routinely read text in paper-based documents, on television screens, and over the Internet. At the same time, optical character recognition (OCR) techniques(1) have advanced to a point where they can be used to automatically read text in a wide range of environments. Compared with the general task of object recognition, text is composed of a set of symbols which are arranged with some placement rules. Therefore, it is easier for a machine to represent, understand, and reproduce (model) textual data. Generally, we have two goals in automatic text processing: (i) convert text from paper documents to their electronic versions (e.g. technical document conversion(2)); (ii) understand the document (e.g. image, video, paper document) using the text contained in it. It is this second goal which plays an important role in Web search, color image indexing, image database organization, automatic annotation and video indexing, where only important text is desired to be located (e.g. book titles, captions, labels and some key words). Automatic text location (the first step in automatic or semi-automatic text reading) is to locate regions that just contain text from various text carriers without recognizing characters contained in the text. The expected variations of text in terms of character font, size and style, orientation, alignment, texture and color, embedded in low-contrast and complex-background images, make the problem of automatic text location very difficult. Furthermore, a high speed of text location is desired in most applications.

We define a text as coded text if it is represented by some code from which its image can be reproduced with a predefined font library. Examples of coded text can be found in Postscript-formatted files and files used in many word processing software packages where characters are represented in ASCII code or Unicode. On the other hand, a text is defined as pixel text if it is represented by image pixels. In other words, pixel text is contained in image files. Sometimes, both these types of text appear in the same document. For instance, a web page often consists of both coded text and pixel text. Figure 1 depicts a part of a web page which consists of two images and a line of coded text "Department of Computer Science", as indicated by its source code shown in Fig. 2. The ASCII code for the coded text can be read directly from its source code, while the pixel text "Michigan State University" is contained in the image named "msurev1.gif". The problem of automatic text location is mainly concerned with the pixel text.

Several approaches to text location have been proposed for specific applications like page segmentation,(2) address block location,(3,4) form dropout(5) and graphics image processing.(6) In these applications, images generally have a high resolution and the requirement is that all the text regions be located. There are two primary methods for text location proposed in the literature. The first method regards regions of text as textured objects and uses well-known methods of texture analysis(7) such as Gabor filtering(8) and spatial variance(9) to automatically locate text regions.

Fig. 1. A part of a web page.

<a href="http://www.msu.edu/"><IMG align=left src="/img/misc/msurev1.gif"
WIDTH=229 HEIGHT=77></a>
<a href="http://www.egr.msu.edu"><IMG align=right src="/img/misc/engpic3.gif"
WIDTH=227 HEIGHT=77></a>
<br><br><br><br><br><br>
<center><h1> Department of Computer Science</h1></center>

Fig. 2. Source code of the web page in Fig. 1.

This use of texture for text location is sensitive to character font size and style. Further, this method is generally time-consuming and cannot always accurately give the text's location, which may reduce the performance of OCR when applied to the extracted characters. Figure 3(b) shows the horizontal spatial variance for the image in Fig. 3(a) proposed by Zhong et al.(9) The text location results are shown in Fig. 3(c), where there is some unpredictable offset. The second method of text location uses connected component analysis.(2,3,5,10,11) This method, which has a higher processing speed and localization accuracy, however, is applicable to only binary images. Most black and white documents can be regarded as two-valued images. On the other hand, color documents, video frames, and pictures of natural scenes are multivalued images. To handle various types of documents, we localize text through multivalued image decomposition. In this paper we will introduce: (i) multivalued image decomposition, (ii) foreground image generation and selection, (iii) color space reduction, and (iv) text location using statistical features. The proposed method has been applied to the problem of locating text in a number of different domains, including classified advertisements, embedded text in synthetic web images, color images and video frames. The significance of automatic text location in these problems is summarized below.

1.1. Conversion of newspaper advertisements

The World Wide Web (WWW) is now recognized as an excellent medium for information exchange. As a result, the number of applications which require converting paper-based documents to hypertext is growing rapidly. Most newspaper and advertisement agencies would like to put a customer's advertisements onto their web sites at the same time as they appear in the newspaper. Figure 4(a) shows an example of a typical newspaper advertisement. Since the advertisements sent to these agencies are not always in the form of coded text, there is a need to automatically convert them to electronic versions which can be further used in automatically generating Web pages. Although these images are mostly binary, both black and white objects can be regarded as foreground due to text reversal. The text in advertisements varies in terms of font, size, style and spacing. In addition to text, the advertisements also contain some graphics, logos and symbolized rulers. We use a relatively high scan resolution (150 dpi) for these images because (i) they are all binary, so storage requirements are not severe, and (ii) all the text in the advertisement, irrespective of its font, size and style, must be located for this application.

Fig. 3. Text location by texture analysis: (a) original image; (b) horizontal spatial variance; (c) text
location (shown in rectangular blocks).

1.2. Web search

Since 1993, the number of web servers has been doubling nearly every three months(12) and now exceeds 476,000.(13) Text is one of the most important components in a web page, which can be either coded text or pixel text. Through the information superhighway, users can access any authorized site to obtain information of interest. This has created the problem of automatically and efficiently finding useful pages on the web. To obtain desired information from this humongous source, a coded text-based search engine (e.g. Yahoo, Infoseek, Lycos, and AltaVista) is commonly used. For instance, the AltaVista search engine processes more than 29 million requests each day.(13) Because of the massive increase in network bandwidths and disk capacities, more and more web pages now contain images for better visual appearance and rich information content. These images, especially the pixel text embedded in them, provide search engines with additional cues to accurately retrieve the desired information. Figure 4(c) shows one such example. Therefore, a multimedia search engine which can use the information from both coded text and pixel text, image, video and audio is desired for the information superhighway.

Most web images are computer created and called synthetic images. Text in web page images varies in font, color, size and style even in the same page. Furthermore, the color and texture of the text and its background may also vary from one part of the page to the other. For these reasons, it is very difficult to locate text in Web images automatically without utilizing character recognition capabilities. Only a few simple approaches have been published for text location in Web images.(14)

1.3. Color image databases

A color image can be captured by a scanner or a camera. Figure 4(b) shows a color image scanned from a magazine cover. Automatically locating text in color images has many applications, including image database search, automatic annotation and image database organization. Some related work can be found in vehicle license plate recognition.(15)

1.4. Video indexing

The goal of video indexing is to retrieve a small number of video frames based on user queries. A number of approaches have been proposed which retrieve video frames using texture,(16) shape(17) and color(18) information contained in the query. At the same time, word spotting(19) and speech recognition(20) techniques have been used in searching for dialogue and narration for video indexing. Both caption text and non-caption text on objects contained in video can be used in interactive indexing and automatic indexing, which is the major objective of text location for video. Figure 4(d) shows a video frame which contains text. Some related work has been done for image and video retrieval where the search cues use visual properties of specific objects and captions in video databases.(9,21-23) Lienhart and Stuber(23) assume that text is monochromatic and is generated by video title machines.

1.5. Summary

There are essentially two different classes of applications involved in our work on automatic text location: (i) document conversion and (ii) web searching and image and video indexing. The first class of applications, which mostly involves binary images, requires that all the text in the input image be located. This necessitates a higher image resolution. On the other hand, it is evident that the most important requirements for the second class of applications are (i) high speed of text location, and (ii) extraction of only the important text in the input image. Usually, the larger the font size of text, the more important it is. Text which is very small in size cannot be recognized easily by OCR engines anyway.(24)

Fig. 4. Examples of input images for automatic text location applications: (a) classified advertisement
in a newspaper; (b) color scanned image; (c) web image; (d) video frame.

Since the important text in images appears mainly in the horizontal direction, our method tries to extract only horizontal text of relatively large size. Because some non-text objects can be subsequently rejected by an OCR module, we minimize the probability of missing text (false dismissal) at the cost of increasing the probability of detecting spurious regions (false alarms). Figure 5 gives an overview of the proposed system. The input can be a binary image, a synthetic web image, a color image or a video frame. After color reduction (including bit dropping and color clustering) and multivalued image decomposition, the input image is decomposed into multiple foreground images. Individual foreground images go through the same processing steps, so the connected component analysis and text identification modules can be implemented in parallel on a multiprocessor system to speed up the algorithm. Finally, the outputs from all the channels are composed together to locate the text in the input image. Text location is represented in terms of the coordinates of its bounding box.

In Section 2 we describe the decomposition method for multivalued images, including color space reduction. The connected component analysis method that is applied to the foreground images is explained in Section 3.

Fig. 5. Automatic text location system.

Section 4 introduces textual features, text identification and text composition. Finally, we report the results of experiments in a number of applications and discuss the performance of the proposed system in Section 5.

2. MULTIVALUED IMAGE DECOMPOSITION

An image I is multivalued if its pixel values u ∈ U = {0, 1, 2, ..., U−1}, where U is an integer, U > 1. Let pixels with value u_0 ∈ U be object pixels and all pixels with value u ∈ U and u ≠ u_0 be non-object pixels. A U-valued image can be decomposed into a set of U element images I = {I_i}, where

$$ \bigcup_{i=0}^{U-1} I_i = I, \qquad I_i \cap I_j = \emptyset, \quad i \neq j. $$

Figure 6(b) depicts the nine element images of the multivalued image shown in Fig. 6(a), which consists of U = 9 different pixel values. Furthermore, all object pixels are set as 1's and non-object pixels are set as 0's.

We assume that a text represented with a nearly uniform color can be composed of one or several color values, which is regarded as real foreground text. An example of real foreground text is shown in Fig. 7(a). On the other hand, text consisting of various colors and texture is assumed to be located in a background with a nearly uniform color, which can be regarded as background-complementary foreground text. An example of background-complementary foreground text is shown in Fig. 7(b). Therefore, an image I can always be completely separated into a foreground image I_F and a background image I_B, where I_F ∪ I_B = I and I_F ∩ I_B = ∅. Theoretically, a U-valued image can generate up to (2^U − 2) different foreground images. A foreground image is called a real foreground image if it is produced such that

$$ I_{RF} = \bigcup_{I_m \in \Omega_{RF}} I_m, \qquad \Omega_{RF} \subset I, $$

where Ω_{RF} denotes a set of element images of I. So, we can construct a real foreground image by combining element images which are easily extracted. A foreground image is a background-complementary foreground image if it is produced such that

$$ I_{BCF} = I - I_B, \qquad I_B = \bigcup_{I_m \in \Omega_B} I_m, \qquad \Omega_B \subset I, $$

where I_B is the background of I_{BCF}. In this case, a background image is easier to extract. Note that the union operation in constructing the real foreground images and the background images of background-complementary foreground images is simplified by the color space reduction discussed in Section 2.3. For the image in Fig. 6(a), the union of the four element images with pixel values 1, 2, 3 and 4 generates the real foreground image shown in Fig. 8(a). Let the element image with the value 9 be the background; then the corresponding background-complementary foreground image is shown in Fig. 8(b).
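To make the decomposition concrete, the following is a minimal NumPy sketch of element images and of the two kinds of foreground images (an illustration only, not the authors' implementation; the function names and the toy image are assumptions):

import numpy as np

def element_images(img):
    """Decompose a multivalued image into binary element images,
    one per distinct pixel value (object pixels = 1, others = 0)."""
    return {int(u): (img == u).astype(np.uint8) for u in np.unique(img)}

def real_foreground(elements, values):
    """Union of the element images whose pixel values are in `values`."""
    fg = np.zeros_like(next(iter(elements.values())))
    for u in values:
        if u in elements:
            fg |= elements[u]
    return fg

def background_complementary(img, background_values):
    """Foreground obtained by removing the background element images."""
    bg = real_foreground(element_images(img), background_values)
    return (1 - bg).astype(np.uint8)

if __name__ == "__main__":
    img = np.array([[1, 1, 9, 9],
                    [2, 3, 9, 9],
                    [4, 4, 9, 5]])                        # a toy multivalued image
    elems = element_images(img)
    fg = real_foreground(elems, values=[1, 2, 3, 4])      # cf. Fig. 8(a)
    bcf = background_complementary(img, background_values=[9])   # cf. Fig. 8(b)
    print(fg)
    print(bcf)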

Fig. 6. A multivalued image and its element images: (a) color image; (b) nine element images.

Fig. 7. Examples of text: (a) a real foreground text; (b) a background-complementary foreground text.

In our system, each element image can be selected as a real foreground image if there is a sufficient number of object pixels in it. On the other hand, we generate at most one background-complementary image for each multivalued image, such that the background image I_B is set as the element image with the largest number of object pixels, or the union of this element image with the element image with the second largest number of object pixels if that number is larger than a threshold.

2.1. Binary images

The advertisement images of interest to us are binary images (see Fig. 4(a)), for which U = 2. A binary image has only two element images, the given image and its inverse, each being a real foreground image or a background-complementary image with respect to the other.

2.2. Pseudo-color images

For web images, GIF and JPEG are the two most popular image formats because they both have high compression rates and simple decoding methods. The latter is commonly used for images or videos of natural scenes.(25) Most of the web images containing meaningful text are created synthetically and are stored in GIF format.

Fig. 8. Foreground images of the multivalued image in Fig. 6(a): (a) a real foreground image; (b) a
background-complementary foreground image.

Fig. 9. Histogram of the multivalued image shown in Fig. 4(c).

A GIF image is an 8-bit pseudo-color image whose pixel values are bounded between 0 and 255. A local color map and/or a global color map is attached to each GIF file to map the 8-bit image to a full color space. The GIF format has two versions, GIF87a and GIF89a. The latter can encode an image by interlacing in order to display it in a coarse-to-fine manner during transmission, and can indicate a color as a transparent background. As far as the data structure is concerned, an 8-bit pseudo-color image is no different from an 8-bit gray scale image. However, they are completely different in terms of visual perception. The pixel values in a gray scale image have a physical interpretation in terms of light reflectance, so the difference between two gray values is meaningful. However, a pixel value in a pseudo-color image is an index into a full color map. Therefore, two pixels with similar pseudo-color values may have distinct colors.

We extract text in pseudo-color images by combining two methods. One is based on foreground information and the other is based on the background information. Although the pixel values in a GIF image can range from 0 to 255, most images contain values only in a small interval, i.e., U ≪ 256. Figure 9 is the histogram of the pseudo-color image in Fig. 4(c), which shows that a large number of bins are empty. First, we regard each element image as a real foreground image. Furthermore, the number of distinct values shared by a large number of pixels is small due to the nature of synthetic images. We assume that the characters in a text are of reasonable size and that the characters occupy a sufficiently large number of pixels. Therefore, we retain those real foreground images in which the number of foreground pixels is larger than a threshold T_np (= 400). Further, we empirically choose N = 8 as the maximum number of real foreground images.

Fig. 10. Decomposition of web image of Fig. 4(c): (a)—(f ) real foreground images; (g) background-
complementary foreground image.

Fig. 11. Foreground extraction from a full color video frame: (a) original frame; (b) bit dropping;
(c) color quantization reduces the number of distinct colors to four; (d)—(g) real foreground images;
(h) background-complementary foreground image.

For text without a unique color value, we assume that its background has a unique color value. The area of the background should be large enough, so we regard the color with the largest number of pixels as the background. We also regard the color value with the second largest number of pixels as background if this number is larger than a threshold T_bg (= 10,000). Thus, a background-complementary foreground image can be generated. At most, we consider only nine foreground images (eight real foreground images plus one background-complementary foreground image). Each foreground is tagged with a foreground identification (FI). The image in Fig. 4(c) has 117 element images (see the histogram in Fig. 9) and only six of them are selected as real foreground images, which are shown in Fig. 10(a)-(f). One background-complementary foreground image is shown in Fig. 10(g).

2.3. Color images and video frames

A color image or a video frame is a 24-bit image, so the value of U can be very large. To extract only a small number of foregrounds from a full color image, with the presumption that the color of text is distinct from the color of its background, we implement (i) bit dropping for the RGB color bands and (ii) color quantization. A 24-bit color image consists of three 8-bit red, green and blue images. For our task of text location, we simply use the highest two bits of each band image, which has the same effect as color re-scaling. Therefore, a 24-bit color image is correspondingly reduced to a 6-bit color image and the value of U is reduced to 64. Figure 11(b) shows the bit dropping result for the input color image shown in Fig. 11(a), where only the highest two bits have been retained from each color band. The retained color prototypes are illustrated in Fig. 12(a).
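A minimal sketch of the bit-dropping step (illustrative only; the function name and the packing of the 6-bit code are assumptions):

import numpy as np

def drop_bits(rgb):
    """rgb: H x W x 3 uint8 image. Keep the two most significant bits of
    each band and pack them into a single 6-bit color code in [0, 63]."""
    r, g, b = (rgb[..., k] >> 6 for k in range(3))    # top 2 bits per band
    return (r << 4) | (g << 2) | b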

Fig. 12. Color prototypes: (a) after bit-dropping; (b) after color quantization.

In a bit-dropped image, text may be present in several colors which are assumed to be close in the color space. So, a color quantization scheme or clustering algorithm is used to generate a small number of meaningful color prototypes. Since we perform color quantization in the 6-bit color space, it greatly reduces the computational cost. We employ the well-known single-link clustering method(26) for quantizing the color space. The dissimilarity between two colors C_i = (R_i, G_i, B_i) and C_j = (R_j, G_j, B_j) is defined as

$$ d(C_i, C_j) = (R_i - R_j)^2 + (G_i - G_j)^2 + (B_i - B_j)^2. $$

We construct a 64 × 64 proximity matrix and, at each stage of the clustering algorithm, the two colors with the minimum proximity value are merged together. The two merged colors are replaced by the single color which has the higher value in the histogram. The color quantization/clustering algorithm terminates when the number of colors either reaches a predetermined value of 2 or the minimum value in the proximity matrix is larger than 1. The color quantization result for the image after bit dropping (Fig. 11(b)) is depicted in Fig. 11(c); the four color prototypes are illustrated in Fig. 12(b). Using the same method as for pseudo-color images, we produce real foreground and background-complementary foreground images for the color quantized images. The image in Fig. 11(c) is decomposed into five foreground images, shown in Fig. 11(d)-(h).
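The following greedy pairwise-merge sketch illustrates this quantization step (an approximation written for clarity, not the authors' single-link implementation(26); the function name, defaults and the stopping rule d_max = 1 follow the description above):

import numpy as np

def quantize_colors(codes, counts, n_min=2, d_max=1.0):
    """codes: distinct 6-bit colors as (r, g, b) tuples with 2-bit bands;
    counts: number of pixels of each color. Merge the closest pair of
    prototypes until only n_min remain or the closest pair is farther
    apart than d_max; a merged pair keeps the more frequent color."""
    protos = [np.asarray(c, dtype=float) for c in codes]
    counts = list(counts)
    label = list(range(len(codes)))            # prototype index of each input color
    while len(protos) > n_min:
        best = (None, -1, -1)
        for i in range(len(protos)):
            for j in range(i + 1, len(protos)):
                d = float(np.sum((protos[i] - protos[j]) ** 2))
                if best[0] is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > d_max:                          # remaining colors are too far apart
            break
        keep, drop = (i, j) if counts[i] >= counts[j] else (j, i)
        counts[keep] += counts[drop]
        label = [keep if l == drop else l for l in label]
        label = [l - 1 if l > drop else l for l in label]   # re-index after removal
        protos.pop(drop)
        counts.pop(drop)
    return protos, label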
3. CONNECTED COMPONENTS IN MULTIVALUED IMAGES

After decomposition of a multivalued image, we obtain a look-up table of foreground identifications (FIs) for pixel values according to the foreground images. A pixel in the original image has one or more FI values and can contribute to one or more foreground images specified by this table. This information will be used for finding connected components in gray level images as described below.

Fig. 13. A binary image and its BAG.

The block adjacency graph (BAG) has been used for efficient computation of connected components since it can be created by a one-pass procedure.(5) The BAG of an image is defined as B = (N, E), where N = {n_i} is a set of block nodes and E = {e(n_i, n_j) | n_i, n_j ∈ N} is the set of edges indicating the connection between nodes n_i and n_j. For a binary image, one of the two gray values can be regarded as foreground and the other as background. The pixels in the foreground are clustered into blocks which are adjacently linked as nodes in a graph. Figure 13 gives an example of a BAG, where a block, characterized by its upper left (X_u, Y_u) and lower right (X_l, Y_l) rectangular boundary coordinates, is the bounding box of a group of closely aligned run lengths. Note that links exist between adjacent blocks.

We have extended the traditional algorithm for creating a BAG to multivalued images, where BAGs are individually created for each of the foreground images using run lengths. A run length in a multivalued image consists of as many continuous pixels with the same FI on a row as were tagged previously. A high-level algorithm for creating the BAG of a multivalued image is presented in Fig. 14. Note that the BAG nodes for different foregrounds do not connect to each other. The following process is implemented in parallel for all the foreground images.

Each run length in the first row of the input image is regarded as a block with a corresponding FI.
For the successive rows in the image {
    For each run length r_c in the current row {
        If r_c is 8-connected to a run length in the preceding row and they have the same FI {
            If r_c is 8-connected to only one run length r_l with the same FI and the differences of the
            horizontal positions of their beginning and end pixels are, respectively, within a given
            tolerance T_a, then r_c is merged into the block node n_i involving r_l.
            Else, r_c is regarded as a new block node n_{i+1} with a corresponding FI, initialized with
            edges e(n_{i+1}, n_j) to those block nodes {n_j} which are 8-connected to r_c.
        }
        Else, r_c is regarded as a new block node n_{i+1} with a corresponding FI.
    }
}

Fig. 14. One-pass BAG generation algorithm for multivalued images.
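One possible Python reading of the procedure in Fig. 14 (a sketch under simplifying assumptions: each row is given as a list of (start, end, FI) run lengths with inclusive end columns, and t_a plays the role of the tolerance T_a; this is not the authors' implementation):

def build_bag(rows, t_a=2):
    """rows: list of rows, each a list of (start, end, fi) run lengths.
    Returns (nodes, edges): nodes are dicts holding a bounding box and an FI,
    edges connect 8-connected blocks that share the same FI."""
    nodes, edges = [], set()
    prev = []                                    # (start, end, fi, node_id) of previous row

    def new_node(y, s, e, fi):
        nodes.append({"x0": s, "x1": e, "y0": y, "y1": y, "fi": fi})
        return len(nodes) - 1

    for y, row in enumerate(rows):
        cur = []
        for s, e, fi in row:
            # runs in the previous row that are 8-connected and have the same FI
            touch = [p for p in prev
                     if p[2] == fi and p[0] <= e + 1 and p[1] >= s - 1]
            if (len(touch) == 1 and abs(touch[0][0] - s) <= t_a
                    and abs(touch[0][1] - e) <= t_a):
                nid = touch[0][3]                # merge into the existing block node
                n = nodes[nid]
                n["x0"], n["x1"] = min(n["x0"], s), max(n["x1"], e)
                n["y1"] = y
            else:
                nid = new_node(y, s, e, fi)      # start a new block node
                for p in touch:                  # link to all 8-connected blocks
                    edges.add((min(nid, p[3]), max(nid, p[3])))
            cur.append((s, e, fi, nid))
        prev = cur
    return nodes, edges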

Fig. 15. Connected component analysis for the foreground image in Fig. 11(f): (a) connected components; (b) connected component thresholding; (c) candidate text lines.

Given a BAG representation, a connected component c_i = {n_j} is a set of connected BAG nodes which satisfy the following conditions: (i) c_i ⊂ B; (ii) ∀ n_j, n_k ∈ c_i, there is a path

$$ (n_j, n_{j_1}, n_{j_2}, \ldots, n_{j_p}, n_k) $$

such that n_{j_l} ∈ c_i for l = 1, 2, ..., p and

$$ e(n_j, n_{j_1}),\, e(n_{j_1}, n_{j_2}),\, \ldots,\, e(n_{j_{p-1}}, n_{j_p}),\, e(n_{j_p}, n_k) \in E; $$

and (iii) if ∃ e(n_j, n_k) ∈ E and n_j ∈ c_i, then n_k ∈ c_i. The upper left and lower right coordinates of a connected component c_i = {n_j} are

$$ X_u(c_i) = \min_{n_j \in c_i} \{X_u(n_j)\}, \qquad X_l(c_i) = \max_{n_j \in c_i} \{X_l(n_j)\}, $$

$$ Y_u(c_i) = \min_{n_j \in c_i} \{Y_u(n_j)\}, \qquad Y_l(c_i) = \max_{n_j \in c_i} \{Y_l(n_j)\}. $$

The extracted connected components for the foreground image shown in Fig. 11(f) are depicted in Fig. 15(a). Very small connected components are deleted, as shown in Fig. 15(b). Assuming that we are looking for horizontal text, we cluster connected components in the horizontal direction and the resulting components are called candidate text lines, as shown in Fig. 15(c).
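For illustration, the components and their bounding boxes can be read off the graph with a simple traversal (a sketch that reuses the hypothetical build_bag output above; not the paper's code):

from collections import defaultdict

def connected_components(nodes, edges):
    """Group BAG nodes into connected components and return one
    bounding box (x0, y0, x1, y1) per component."""
    adj = defaultdict(set)
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, boxes = set(), []
    for start in range(len(nodes)):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:                             # depth-first traversal
            k = stack.pop()
            comp.append(k)
            for m in adj[k]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        boxes.append((min(nodes[k]["x0"] for k in comp),
                      min(nodes[k]["y0"] for k in comp),
                      max(nodes[k]["x1"] for k in comp),
                      max(nodes[k]["y1"] for k in comp)))
    return boxes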
4. TEXT IDENTIFICATION

Without character recognition capabilities, it is not easy to distinguish characters from non-characters simply based on the size of connected components. A line of text consisting of several characters can provide additional information for this classification. The text identification module in our system determines whether candidate text lines contain text or non-text based on statistical features of connected components. A candidate text line containing a number of characters will usually consist of several connected components. The number of such connected components may not be the same as the number of characters in this text line because some of the characters may be touching each other. Figure 16(b) illustrates the text lines and connected components for the text in Fig. 16(a), where the characters are well separated. On the other hand, many characters shown in Fig. 16(c) are touching each other, and a connected component shown in Fig. 16(d) may include more than one character. We have designed two different recognition strategies for touching and non-touching characters. A candidate line is recognized as a text line if it is accepted by any one of the strategies.

4.1. Inter-component features

For separated characters, their corresponding connected components should be well aligned. Therefore, we preserve those text lines in which the top and bottom edges of the contained connected components are respectively aligned, or both the width and the height values of these connected components are close to each other. In addition, the number of connected components should be in proportion to the length of the text line.
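For separated characters, a candidate line of component bounding boxes could be screened with checks of this kind (a sketch; the threshold names and values are assumptions, not the ones used in the paper):

def looks_like_separated_text(boxes, align_tol=3, size_tol=0.2, min_density=0.5):
    """boxes: (x0, y0, x1, y1) component bounding boxes in one candidate line."""
    if len(boxes) < 2:
        return False
    tops = [b[1] for b in boxes]
    bottoms = [b[3] for b in boxes]
    heights = [b[3] - b[1] + 1 for b in boxes]
    widths = [b[2] - b[0] + 1 for b in boxes]
    aligned = (max(tops) - min(tops) <= align_tol and
               max(bottoms) - min(bottoms) <= align_tol)
    similar = (max(heights) - min(heights) <= size_tol * max(heights) and
               max(widths) - min(widths) <= size_tol * max(widths))
    # the number of components should be in proportion to the line length
    line_len = max(b[2] for b in boxes) - min(b[0] for b in boxes) + 1
    dense_enough = len(boxes) >= min_density * line_len / max(heights)
    return (aligned or similar) and dense_enough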

Fig. 16. Characters in a text line: (a) well separated characters; (b) connected components and text lines for (a); (c) characters touching each other; (d) connected components and text line for (c); (e) X-axis projection profile and signature of the text in (c); (f) Y-axis projection profile and signature of the text in (c).

Fig. 17. Text composition: (a) text lines extracted from the foreground image in Fig. 10(b); (b) text line
extracted from the foreground image in Fig. 10(g); (c) composed result.

Table 1. Image size and processing time for text location

Text carrier      No. of test images    Typical size    Accuracy (%)    Avg. CPU time (s)

Advertisement     26                    548 x 769       99.2            0.15
Web image         54                    385 x 234       97.6            0.11
Color image       30                    769 x 537       72.0            0.40
Video frame       6952                  160 x 120       94.7            0.09

4.2. Projection profile features

For characters touching each other, features are extracted based on the projection profiles of the text line in both horizontal and vertical directions. The basic idea is that if there are characters in a candidate text line then there will be a certain number of humps in its X-axis projection profile and one significant hump in its Y-axis projection profile. Figure 16(e) and (f) depict the X-axis and Y-axis projection profiles of the text shown in Fig. 16(c). The signatures of the projection profiles in both directions are generated by means of thresholding and they are also shown in Fig. 16(e) and (f). The threshold for the X profile is its mean value and the threshold for the Y profile is chosen as one third of the highest value in it. The signatures can be viewed as run lengths of 1s and 0s, where a 1 represents a profile value larger than the threshold and a 0 represents a profile value below the threshold. Therefore, we consider the following features to characterize text: (i) because text should have many humps in the X profile, but only a few humps in the Y profile, the number of 1-run lengths in the X signature is required to be larger than 5 and the number of 1-run lengths in the Y signature should be less than 3; (ii) since a very wide hump in the X profile of text is not expected, the maximum length of the 1-run lengths in the X signature should be less than 1.4 times the height of the text line; and (iii) the humps in the X profile should be regular in width, i.e. the standard deviation of the length of the 1-run lengths should be less than 1.2 times their mean, and the mean should be less than 0.11 times the height of the text line.
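A compact sketch of these tests (the helper names are hypothetical; the thresholds are the ones quoted above):

import numpy as np

def run_lengths_of_ones(signature):
    """Lengths of maximal runs of 1s in a 0/1 sequence."""
    runs, n = [], 0
    for v in signature:
        if v:
            n += 1
        elif n:
            runs.append(n)
            n = 0
    if n:
        runs.append(n)
    return runs

def passes_profile_test(line_img):
    """line_img: 2-D 0/1 array of a candidate text line (foreground = 1)."""
    x_profile = line_img.sum(axis=0)                      # column sums
    y_profile = line_img.sum(axis=1)                      # row sums
    x_sig = (x_profile > x_profile.mean()).astype(int)
    y_sig = (y_profile > y_profile.max() / 3.0).astype(int)
    x_runs = run_lengths_of_ones(x_sig)
    y_runs = run_lengths_of_ones(y_sig)
    height = line_img.shape[0]
    if len(x_runs) <= 5 or len(y_runs) >= 3:
        return False                                      # criterion (i)
    if max(x_runs) >= 1.4 * height:
        return False                                      # criterion (ii)
    mean = float(np.mean(x_runs))
    return np.std(x_runs) < 1.2 * mean and mean < 0.11 * height   # criterion (iii)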

Fig. 18. Located text lines for the advertisement images.

4.3. Text composition

Connected component analysis and text identification modules are applied to the individual foreground images. Ideally, the union of the outputs from the individual foreground images should provide the location of the text. However, the text lines extracted from different foreground images may be overlapping and, therefore, they need to be merged. Two text lines are merged and replaced by a new text line if their horizontal distance is small and their vertical overlap is large. Figure 17(c) shows the final text location results for the image in Fig. 4(c). Figure 17(a) and (b) are the text lines extracted from the two foreground images shown in Fig. 10(b) and (g); Fig. 17(c) is the union of Fig. 17(a) and (b).
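One way to realize this merging rule (a sketch; the gap and overlap thresholds are assumptions):

def merge_text_lines(lines, max_gap=10, min_overlap=0.5):
    """lines: list of (x0, y0, x1, y1) boxes. Repeatedly merge pairs whose
    horizontal distance is small and whose vertical overlap is large."""
    lines = list(lines)
    merged = True
    while merged:
        merged = False
        for i in range(len(lines)):
            for j in range(i + 1, len(lines)):
                a, b = lines[i], lines[j]
                gap = max(a[0], b[0]) - min(a[2], b[2])        # horizontal gap (<= 0 if overlapping)
                overlap = min(a[3], b[3]) - max(a[1], b[1])    # vertical overlap in pixels
                min_h = min(a[3] - a[1], b[3] - b[1])
                if gap <= max_gap and min_h > 0 and overlap >= min_overlap * min_h:
                    lines[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del lines[j]
                    merged = True
                    break
            if merged:
                break
    return lines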
and, therefore, they need to be merged. Two text lines racies for other images are subjectively computed
are merged and replaced by a new text line if their based on the number of correctly located important
horizontal distance is small and their vertical overlap text regions in the image. The false alarm rate is
is large. Figure 17(c) shows the final text location relatively high for color images and is the lowest for
results for the image in Fig. 4(c). Figure 17(a) and (b) advertisement images. At the same time, the accuracy
are the text lines extracted from the two foreground for color image is the lowest because of the high
images shown in Fig. 10(b) and (g); Fig. 17(c) is the complexity of the background. The processing time is
union of Fig. 17(a) and (b). reported for a Sun UltraSPARC I system (167 MHz)

Fig. 18. (Continued.)



Fig. 19. Web images and located text regions.



Fig. 19. (Continued.)

More details of our experiments for different text carriers are explained in the following sub-sections.

5.1. Advertisement images

The test images were scanned from a newspaper at 150 dpi. Some of the text location results are shown in Fig. 18, where both normal text and reversed text are located and illustrated in red bounding boxes. The line of white blocks in the upper part of Fig. 18(b) is detected as text because the blocks are regularly arranged in terms of size and alignment. However, this region should be easily rejected by an OCR module. The text along a semicircle at the top of Fig. 18(e) cannot be detected by our algorithm. More complicated heuristics are needed to locate such text. Some punctuation and dashed lines are missed, as expected, because of their small size.

5.2. Web images

The 22 representative web images shown in Fig. 19 were down-loaded through the Internet. The corresponding results of text location are shown in gray scale in Fig. 19. The text in Fig. 19(a) is not completely aligned along a straight line.

Fig. 20. Text location in color images.



Fig. 20. (Continued.)

Fig. 21. False alarm in complex images.



Fig. 22. Video frames with low resolution.

Fig. 23. Video frames containing both caption and non-caption text.

Fig. 24. Video frames with text in sub-window.



Fig. 25. Video frames with high resolution.



Fig. 26. Locating text on containers.

The data for the image in Fig. 19(h) could not be completely down-loaded because of the accidental interruption of transmission. Even so, the word "Welcome" was successfully located. Figure 19(k) contains a title with Chinese characters which has been correctly located. Figures 19(b), (j) and (n) have text logos. Vertical text in Fig. 19(h) and most small-sized text are ignored. The "smoke" in Fig. 19(p) is regarded as text because of its good regularity. The IBM logo in Fig. 19(r) is missed since broken rendered text is not regarded as important text in our system and is also rather difficult to locate.

5.3. Scanned color images

Experimental color images are scanned at 50 dpi from magazine and book covers. Some of the results are shown in Fig. 20. Most important text with sufficiently large font size is successfully located by our system. Some text with small size is missed, but it is probably not important for image indexing. Our system can also locate handwritten text, as in Fig. 20(d) and (f). Most of the false alarms occur in images with very complex backgrounds, as shown in Fig. 21.

5.4. Video frames

A large number of video frames were selected from eight different videos covering news, sports, advertisement, movie, weather report and camera monitor events. The resolution of these videos ranges from 160 x 120 to 720 x 486. The results in Fig. 22 show the performance of our algorithm on video frames with a resolution as low as 160 x 120, where text font, size, color and contrast change within a large range. Our algorithm was applied to video frames which contained a significant amount of text. The entire text in Fig. 22(g) could not be located. Note that it is not easy even for humans to locate all the text in this image due to low resolution. The color and texture of the text in Fig. 23(b) vary and are fairly similar to those of the background. In Fig. 23(c), our system located the non-caption text on the wall. Non-caption text is more difficult to locate because of arbitrary orientation, alignment and illumination. Lienhart and Stuber's algorithm(23) worked on gray-level video frames under the assumption that text should be monochromatic and generated artificially by title machines. However, no information about processing speed was provided. Figure 24(a) shows text in a window, which is commonly used in news broadcasting. Our system is not very sensitive to the image resolution. The video frames in Fig. 25 are at a resolution of up to 720 x 486, where the text shown on the weather forecast map has not been located. We are currently working to augment our heuristics to locate the missed text in weather maps. One of the potential applications of text location is container identification, and our algorithm can be applied to such images, as shown in Fig. 26. By a simple extension of our method, we can also locate vertical text, as shown in Fig. 27, although it is not commonly encountered in practice.

Fig. 27. A video frame with vertical text.

6. CONCLUSIONS

The problem of text location in images and video frames has been addressed in this paper. Text conversion and database indexing are two major applications of the proposed text location algorithm. A method for text location based on multivalued image processing is proposed. A multivalued image, including a binary image, gray-scale image, pseudo-color image or full color image, can be decomposed into multiple real foreground and background-complementary foreground images. For full color images, a color reduction method is presented, including bit dropping and color clustering. Therefore, the connected component analysis for binary images can be used in multivalued image processing to find text lines. We have also proposed an approach to text identification which is applicable to both separated and touching characters. The text location algorithm has been applied to advertisement images, Web images, color images and video frames. The application to classified advertisement conversion demands a higher accuracy. Therefore, we use a higher scan resolution of 150 dpi. For other applications, the goal is to find all the important text for searching or indexing. Compared to the texture-based method(8) and the motion-based approach for video,(22,23) our method has a higher speed and accuracy in terms of finding a bounding box around important text regions. Because of the diversity of colors, the text location accuracy for color images is not as good as that for other input sources. Our method does not work well where the three-dimensional color histogram is sparse and there are no dominant prototypes.

REFERENCES

1. S. Mori, C. Y. Suen and K. Yamamoto, Historical review of OCR research and development, Proc. IEEE 80, 1029–1058 (1992).
2. A. Jain and B. Yu, Document representation and its application to page decomposition, IEEE Trans. Pattern Anal. Machine Intell. 20, 294–308 (1998).
3. B. Yu, A. Jain and M. Mohiuddin, Address block location on complex mail pieces, Proc. 4th Int. Conf. on Document Analysis and Recognition, Ulm, pp. 897–901 (1997).
4. S. N. Srihari, C. H. Wang, P. W. Palumbo and J. J. Hull, Recognizing address blocks on mail pieces: specialized tools and problem-solving architectures, Artificial Intelligence 8, 25–35, 38–40 (1987).
5. B. Yu and A. Jain, A generic system for form dropout, IEEE Trans. Pattern Anal. Machine Intell. 18, 1127–1134 (1996).
6. L. A. Fletcher and R. Kasturi, A robust algorithm for text string separation from mixed text/graphics images, IEEE Trans. Pattern Anal. Machine Intell. 10, 910–918 (1988).
7. I. Pitas and C. Kotropoulos, A texture-based approach to the segmentation of seismic images, Pattern Recognition 25, 929–945 (1992).
8. A. Jain and S. Bhattacharjee, Text segmentation using Gabor filters for automatic document processing, Machine Vision Applic. 5, 169–184 (1992).
9. Y. Zhong, K. Karu and A. Jain, Locating text in complex color images, Pattern Recognition 28, 1523–1535 (1995).
10. B. Yu and A. Jain, A robust and fast skew detection algorithm for generic documents, Pattern Recognition 29, 1599–1629 (1996).
11. Y. Tang, S. Lee and C. Suen, Automatic document processing: a survey, Pattern Recognition 29, 1931–1952 (1996).
12. M. Gray, Internet statistics: growth and usage of the Web and the Internet, at http://www.mit.edu/people/mkgray/net/.
13. AltaVista Web page, at http://altavista.digital.com/.
14. D. Lopresti and J. Zhou, Document analysis and the World Wide Web, Proc. Workshop on Document Analysis Systems, Malvern, pp. 417–424 (1996).
15. E. R. Lee, P. K. Kim and H. J. Kim, Automatic recognition of a car license plate using color image processing, Proc. 1st IEEE Conf. on Image Processing, Austin, pp. 301–305 (1994).
16. R. W. Picard and T. P. Minka, Vision texture for annotation, Multimedia Systems 3, 3–14 (1995).
17. S. Sclaroff and A. Pentland, Modal matching for correspondence and recognition, IEEE Trans. Pattern Anal. Machine Intell. 17, 545–561 (1995).
18. H. Sakamoto, H. Suzuki and A. Uemori, Flexible montage retrieval for image data, Proc. SPIE Conf. on Storage and Retrieval for Image and Video Databases II, Vol. SPIE 2185, San Jose, pp. 25–33 (1994).
19. A. S. Gordon and E. A. Domeshek, Conceptual indexing for video retrieval, Proc. Int. Joint Conf. on Artificial Intelligence, Montreal, pp. 23–38 (1995).
20. P. Schauble and M. Wechsler, First experiences with a system for content based retrieval of information from speech, Proc. Int. Joint Conf. on Artificial Intelligence, Montreal, pp. 59–70 (1995).
21. A. Jain and A. Vailaya, Image retrieval using color and shape, Pattern Recognition 29, 1233–1244 (1996).
22. B. Shahraray and D. Gibbon, Automatic generation of pictorial transcripts of video programs, Proc. SPIE Conf. on Multimedia Computing and Networking, Vol. SPIE 2417, San Jose, pp. 2417–2447 (1995).
23. R. Lienhart and F. Stuber, Automatic text recognition in digital videos, Proc. SPIE 2666, San Jose, pp. 180–188 (1996).
24. J. Zhou, D. Lopresti and Z. Lei, OCR for World Wide Web images, Proc. IS&T/SPIE Electronic Imaging: Document Recognition IV, San Jose (1997).
25. W. B. Pennebaker and J. L. Mitchell, JPEG: Still Image Compression Standard. Van Nostrand Reinhold, New York, NY (1993).
26. A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ (1988).

About the Author—ANIL JAIN is a University Distinguished Professor and Chair of the Department of Computer Science at Michigan State University. His research interests include statistical pattern recognition, Markov random fields, texture analysis, neural networks, document image analysis, fingerprint matching and 3D object recognition. He received the best paper awards in 1987 and 1991 and certificates for outstanding contributions in 1976, 1979, 1992, and 1997 from the Pattern Recognition Society. He also received the 1996 IEEE Trans. Neural Networks Outstanding Paper Award. He was the Editor-in-Chief of the IEEE Trans. on Pattern Analysis and Machine Intelligence (1990–94). He is the co-author of Algorithms for Clustering Data, Prentice-Hall, 1988, has edited the book Real-Time Object Measurement and Classification, Springer-Verlag, 1988, and co-edited the books Analysis and Interpretation of Range Images, Springer-Verlag, 1989, Markov Random Fields, Academic Press, 1992, Artificial Neural Networks and Pattern Recognition, Elsevier, 1993, 3D Object Recognition, Elsevier, 1993, and BIOMETRICS: Personal Identification in Networked Society, to be published by Kluwer in 1998. He is a Fellow of the IEEE and IAPR, and has received a Fulbright research award.

About the Author—BIN YU received his Ph.D. degree in Electronic Engineering from Tsinghua University
in 1990, M.S. degree in Electrical Engineering from Tianjin University in 1986 and B.S. degree in
Mechanical Engineering from Hefei Polytechnic University in 1983. Dr. Yu was a visiting scientist in the
Pattern Recognition and Image Processing Laboratory of the Department of Computer Science at
Michigan State University from 1995 to 1997. Since 1992, he has been an Associate Professor in the
Institute of Information Science at Northern Jiaotong University where he worked as a Postdoctoral
Fellow from 1990 to 1992. He is now working as a Senior Staff Vision Engineer at Electroglas, Inc., Santa
Clara. His research interests include Image Processing, Pattern Recognition and Computer Vision. Dr. Yu
has authored more than 50 journal and conference papers. He is a Member of the IEEE, a Member of the
Youth Board of the Chinese Institute of Electronics, and a Senior Member of the Chinese Institute of
Electronics.
