region, and (3) the text orientation around the pixel. Connected positive responses are considered as detected characters or text regions. For characters belonging to the same text region, Delaunay triangulation (Kang et al., 2014) is applied, after which a graph partition algorithm groups characters into text lines based on the predicted orientation attribute.

Similarly, Zhang et al. (2016) first predict a segmentation map indicating text line regions. For each text line region, MSER (Neumann and Matas, 2012) is applied to extract character candidates. Character candidates reveal information on the scale and orientation of the underlying text line. Finally, minimum bounding boxes are extracted as the final text line candidates.

He et al. (2017a) propose a detection process that also consists of several steps. First, text blocks are extracted. Then the model crops and only focuses on the extracted text blocks to extract the text center line (TCL), which is defined as a shrunk version of the original text line. Each text line represents the existence of one text instance. The extracted TCL map is then split into several TCLs. Each split TCL is then concatenated to the original image. A semantic segmentation model then classifies each pixel into ones that belong to the same text instance as the given TCL, and ones that do not.

Overall, in this stage, scene text detection algorithms still have long and slow pipelines, though they have replaced some hand-crafted features with learning-based ones. The design methodology is bottom-up and based on key components, such as single characters and text center lines.

3.1.2 Methods Inspired by Object Detection

Later, researchers drew inspiration from the rapidly developing general object detection algorithms (Liu et al., 2016a; Fu et al., 2017; Girshick et al., 2014; Girshick, 2015; Ren et al., 2015; He et al., 2017b). In this stage, scene text detection algorithms are designed by modifying the region proposal and bounding box regression modules of general detectors to localize text instances directly (Dai et al., 2017; He et al., 2017c; Jiang et al., 2017; Liao et al., 2017, 2018a; Liu and Jin, 2017; Shi et al., 2017a; Liu et al., 2017; Ma et al., 2017; Li et al., 2017b; Liao et al., 2018b; Zhang et al., 2018), as shown in Fig. 4. They mainly consist of stacked convolutional layers that encode the input images into feature maps. Each spatial location of the feature map corresponds to a region of the input image. The feature maps are then fed into a classifier to predict the existence and localization of text instances at each such spatial location. These methods greatly reduce the pipeline to an end-to-end trainable neural network component, making training much easier and inference much faster. We introduce the most representative works here.

Inspired by one-stage object detectors, TextBoxes (Liao et al., 2017) adapts SSD (Liu et al., 2016a) to fit the varying orientations and aspect ratios of text by defining default boxes as quadrilaterals with different aspect-ratio specifications.

EAST (Zhou et al., 2017) further simplifies anchor-based detection by adopting the U-shaped design (Ronneberger et al., 2015) to integrate features from different levels. Input images are encoded as one multi-channel feature map instead of multiple layers of different spatial sizes as in SSD. The feature at each spatial location is used to regress the rectangular or quadrilateral bounding box of the underlying text instance directly. Specifically, the existence of text, i.e. text/non-text, and geometries, e.g. orientation and size for rectangles, and vertex coordinates for quadrilaterals, are predicted. EAST makes a difference to the field of text detection with its highly simplified pipeline and its efficiency to perform inference at real-time speed.

Other methods adapt the two-stage object detection framework of R-CNN (Girshick et al., 2014; Girshick, 2015; Ren et al., 2015), where the second stage corrects the localization results based on features obtained by Region of Interest (ROI) pooling.

In (Ma et al., 2017), rotation region proposal networks are adapted to generate rotating region proposals, in order to fit text of arbitrary orientations, instead of axis-aligned rectangles.

In FEN (Zhang et al., 2018), a weighted sum of ROI poolings with different sizes is used. The final prediction is made by leveraging the textness scores for poolings of 4 different sizes.

Zhang et al. (2019) propose to apply the ROI and localization branch recursively, to revise the predicted position of the text instance. It is a good way to include features at the boundaries of bounding boxes, which localizes the text better than region proposal networks (RPNs).

Wang et al. (2018) propose a parametrized Instance Transformation Network (ITN) that learns to predict the appropriate affine transformation to perform on the last feature layer extracted by the base network, in order to rectify oriented text instances. Their method, with ITN, can be trained end-to-end.

To adapt to irregularly shaped text, bounding polygons (Liu et al., 2017) with as many as 14 vertexes are proposed, followed by a Bi-LSTM (Hochreiter and Schmidhuber, 1997) layer to refine the coordinates of the predicted vertexes.
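To make the dense, per-location regression concrete, the following is a minimal sketch (ours, for illustration only, not code from any of the cited papers) of how an EAST-style prediction could be decoded into candidate boxes. The 5-channel RBOX layout (four edge distances plus an angle), the score threshold, and the feature-map stride are assumptions.

```python
import numpy as np

def decode_east_rbox(score_map, geo_map, score_thresh=0.8, stride=4):
    """Decode EAST-style dense predictions into candidate rotated boxes.

    score_map: (H, W) per-location text/non-text probability.
    geo_map:   (H, W, 5) assumed layout: distances from the location to the
               top/right/bottom/left box edges, plus a rotation angle.
    stride:    downsampling factor of the feature map w.r.t. the input.
    """
    candidates = []
    ys, xs = np.where(score_map > score_thresh)
    for y, x in zip(ys, xs):
        d_t, d_r, d_b, d_l, theta = geo_map[y, x]
        cx, cy = x * stride, y * stride    # location in input-image coordinates
        w, h = d_l + d_r, d_t + d_b        # box extent around the location
        candidates.append((cx, cy, w, h, theta, float(score_map[y, x])))
    # A full pipeline would follow this with (locality-aware) NMS.
    return candidates
```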
Fig. 7: Frameworks of text recognition models. (a) represents a sequence tagging model, which uses CTC for alignment in training and inference. (b) represents a sequence-to-sequence model, which can be trained directly with cross-entropy. (c) represents segmentation-based methods.
3.2 Recognition

In this section, we introduce methods for scene text recognition. The input of these methods is cropped text instance images which contain only one word.

In the deep learning era, scene text recognition models use CNNs to encode images into feature spaces. The main difference lies in the text content decoding module. Two major techniques are Connectionist Temporal Classification (Graves et al., 2006) (CTC) and the encoder-decoder framework (Sutskever et al., 2014). We introduce recognition methods in the literature based on the main technique they employ. Mainstream frameworks are illustrated in Fig. 7.

Both CTC and the encoder-decoder framework were originally designed for 1-dimensional sequential input data, and are therefore applicable to the recognition of straight and horizontal text, which can be encoded into a sequence of feature frames by CNNs without losing important information. However, characters in oriented and curved text are distributed over a 2-dimensional space. It remains a challenge to effectively represent oriented and curved text in feature spaces in order to fit the CTC and encoder-decoder frameworks, whose decoders require 1-dimensional inputs. For oriented and curved text, directly compressing the features into a 1-dimensional form may lose relevant information and bring in noise from the background, thus leading to inferior recognition accuracy. We introduce techniques to solve this challenge below.

3.2.1 CTC-Based Methods

The CTC decoding module is adopted from speech recognition, where data are sequential in the time domain. To apply CTC in scene text recognition, the input images are viewed as a sequence of vertical pixel frames. The network outputs a per-frame prediction, indicating the probability distribution of label types for each frame. The CTC rule is then applied to edit the per-frame prediction into a text string. During training, the loss is computed as the sum of the negative log probabilities of all possible per-frame predictions that can generate the target sequence under the CTC rules. Therefore, the CTC method makes the model end-to-end trainable with only word-level annotations, without the need for character-level annotations. The first application of CTC in the OCR domain can be traced to the handwriting recognition system of Graves et al. (2008). Now this technique is widely adopted in scene text recognition (Su and Lu, 2014; He et al., 2016; Liu et al., 2016b; Gao et al., 2017; Shi et al., 2017b; Yin et al., 2017).

The first attempts can be referred to as convolutional recurrent neural networks (CRNN). These models are composed by stacking RNNs on top of CNNs and use CTC for training and inference. DTRN (He et al., 2016) is the first CRNN model. It slides a CNN model across the input images to generate convolutional feature slices, which are then fed into RNNs. Shi et al. (2017b) further improve DTRN by adopting a fully convolutional approach to encode the input images as a whole to generate feature slices, utilizing the property that CNNs are not restricted by the spatial sizes of inputs.

Instead of RNNs, Gao et al. (2017) adopt stacked convolutional layers to effectively capture the contextual dependencies of the input sequence, which is characterized by lower computational complexity and easier parallel computation.
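To illustrate how word-level supervision suffices under CTC, here is a minimal PyTorch sketch (ours; the toy alphabet, shapes, and the linear stand-in for a CRNN encoder are assumptions for illustration):

```python
import torch
import torch.nn as nn

# Assumed toy setup: 26 characters plus index 0 reserved for the CTC blank.
NUM_CLASSES, T, N = 27, 32, 4          # classes, frames per image, batch size

crnn = nn.Sequential(                  # stand-in for a CNN+RNN encoder that
    nn.Linear(512, NUM_CLASSES),       # maps per-frame features to logits
)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

features = torch.randn(T, N, 512)            # (T, N, C_feat) frame features
log_probs = crnn(features).log_softmax(2)    # (T, N, NUM_CLASSES)

targets = torch.randint(1, NUM_CLASSES, (N, 10))   # word-level labels only
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

# The loss marginalizes over every frame-wise alignment that collapses to
# the target string, so no character-level boxes are needed for training.
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```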
Yin et al. (2017) simultaneously detect and recognize characters by sliding the text line image with character models, which are learned end-to-end on text line images labeled with text transcripts.

3.2.2 Encoder-Decoder Methods

The encoder-decoder framework for sequence-to-sequence learning was originally proposed in (Sutskever et al., 2014) for machine translation. The encoder RNN reads an input sequence and passes its final latent state to a decoder RNN, which generates output in an auto-regressive way. The main advantage of the encoder-decoder framework is that it gives outputs of variable lengths, which satisfies the task setting of scene text recognition. The encoder-decoder framework is usually combined with the attention mechanism (Bahdanau et al., 2014), which jointly learns to align the input sequence and the output sequence.

Lee and Osindero (2016) present recursive recurrent neural networks with attention modeling for lexicon-free scene text recognition. The model first passes input images through recursive convolutional layers to extract encoded image features and then decodes them into output characters by recurrent neural networks with implicitly learned character-level language statistics. The attention-based mechanism performs soft feature selection for better image feature usage.
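The following minimal PyTorch sketch (ours, with assumed sizes and a single-layer GRU; not a reproduction of any cited model) shows one decoding step of such an attention-based encoder-decoder, including the soft feature selection mentioned above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderStep(nn.Module):
    """One step of a Bahdanau-style attention decoder (illustrative only;
    the dimensions and single GRU cell are assumptions)."""

    def __init__(self, enc_dim=256, hid_dim=256, vocab=40):
        super().__init__()
        self.score = nn.Linear(enc_dim + hid_dim, 1)   # additive attention
        self.rnn = nn.GRUCell(enc_dim + vocab, hid_dim)
        self.out = nn.Linear(hid_dim, vocab)

    def forward(self, enc_feats, prev_hid, prev_char):
        # enc_feats: (N, L, enc_dim); prev_hid: (N, hid_dim);
        # prev_char: (N, vocab) one-hot/embedding of the previous output.
        L = enc_feats.size(1)
        expanded = prev_hid.unsqueeze(1).expand(-1, L, -1)
        alpha = F.softmax(self.score(torch.cat([enc_feats, expanded], -1)), dim=1)
        context = (alpha * enc_feats).sum(1)           # soft feature selection
        hid = self.rnn(torch.cat([context, prev_char], -1), prev_hid)
        return self.out(hid), hid, alpha.squeeze(-1)   # logits, state, weights
```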
Cheng et al. (2017a) observe the attention drift problem in existing attention-based methods and propose to impose localization supervision on the attention scores to attenuate it.

Bai et al. (2018) propose an edit probability (EP) metric to handle the misalignment between the ground truth string and the attention's output sequence of probability distributions. Unlike the aforementioned attention-based methods, which usually employ a frame-wise maximal likelihood loss, EP tries to estimate the probability of generating a string from the output sequence of probability distributions conditioned on the input image, while considering the possible occurrences of missing or superfluous characters.

Liu et al. (2018d) propose an efficient attention-based encoder-decoder model, in which the encoder part is trained under binary constraints to reduce computation cost.

Both CTC and the encoder-decoder framework simplify the recognition pipeline and make it possible to train scene text recognizers with only word-level annotations instead of character-level annotations. Compared to CTC, the decoder module of the encoder-decoder framework is an implicit language model, and therefore it can incorporate more linguistic priors. For the same reason, the encoder-decoder framework requires a larger training dataset with a larger vocabulary. Otherwise, the model may degenerate when reading words that are unseen during training. On the contrary, CTC is less dependent on language models and has better character-to-pixel alignment. Therefore it is potentially better for languages such as Chinese and Japanese that have a large character set. The main drawback of both methods is that they assume the text to be straight, and therefore they cannot adapt to irregular text.

3.2.3 Adaptations for Irregular Text Recognition

Rectification modules are a popular solution to irregular text recognition. Shi et al. (2016, 2018) propose a text recognition system which combines a Spatial Transformer Network (STN) (Jaderberg et al., 2015) and an attention-based Sequence Recognition Network. The STN module predicts text bounding polygons with fully connected layers for Thin-Plate-Spline transformations, which rectify the input irregular text image into a more canonical form, i.e. straight text. The rectification proves to be a successful strategy and forms the basis of the winning solution (Long et al., 2019) in the ICDAR 2019 ArT² irregular text recognition competition.

² https://rrc.cvc.uab.es/?ch=14

There have also been several improved versions of rectification-based recognition. Zhan and Lu (2019) propose to perform rectification multiple times to gradually rectify the text. They also replace the text bounding polygons with a polynomial function to represent the shape. Yang et al. (2019) propose to predict local attributes, such as radius and orientation values for pixels inside the text center region, in a similar way to TextSnake (Long et al., 2018). The orientation is defined as the orientation of the underlying character boxes, instead of the text bounding polygons. Based on these attributes, bounding polygons are reconstructed in a way that rectifies the perspective distortion of characters, while the methods of Shi et al. and Zhan et al. may only rectify at the text level and leave the characters distorted.

Yang et al. (2017) introduce an auxiliary dense character detection task to encourage the learning of visual representations that are favorable to the text patterns. They adopt an alignment loss to regularize the estimated attention at each time step. Further, they use a coordinate map as a second input to enforce spatial awareness.
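The rectification idea can be sketched as follows. For simplicity, this illustrative module (ours) predicts an affine transform rather than the Thin-Plate-Spline used by Shi et al.; the tiny localization network and the sizes are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineRectifier(nn.Module):
    """Simplified STN-style rectifier (affine instead of Thin-Plate-Spline;
    the localization network below is an assumption for illustration)."""

    def __init__(self):
        super().__init__()
        self.loc = nn.Sequential(
            nn.AdaptiveAvgPool2d(8), nn.Flatten(), nn.Linear(3 * 8 * 8, 6)
        )
        # Initialize to the identity transform so training starts stably.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, img):                    # img: (N, 3, H, W)
        theta = self.loc(img).view(-1, 2, 3)   # predicted affine parameters
        grid = F.affine_grid(theta, img.size(), align_corners=False)
        # Resample the distorted input into a more canonical view, which is
        # then fed to an ordinary straight-text recognizer.
        return F.grid_sample(img, grid, align_corners=False)
```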
Cheng et al. (2017b) argue that encoding a text image as a 1-D sequence of features, as implemented in most methods, is not sufficient. They encode an input image into four feature sequences of four directions: horizontal, reversed horizontal, vertical, and reversed vertical. A weighting mechanism is applied to combine the four feature sequences.

Liu et al. (2018b) present a hierarchical attention mechanism (HAM) which consists of a recurrent RoI-Warp layer and a character-level attention layer. They adopt a local transformation to model the distortion of individual characters, resulting in improved efficiency, and it can handle different types of distortion that are hard to model with a single global transformation.

Liao et al. (2019b) cast the task of recognition as semantic segmentation, and treat each character type as one class. The method is insensitive to shapes and is thus effective on irregular text, but the lack of end-to-end training and sequence learning makes it prone to single-character errors, especially when the image quality is low. They are also the first to evaluate the robustness of their recognition method by padding and transforming test images.

Another solution to irregular scene text recognition is 2-dimensional attention (Xu et al., 2015), which has been verified in (Li et al., 2019). Different from the sequential encoder-decoder framework, the 2D attentional model maintains 2-dimensional encoded features, and attention scores are computed for all spatial locations. Similar to spatial attention, Long et al. (2020) propose to first detect characters. Afterward, features are interpolated and gathered along the character center lines to form sequential feature frames.

In addition to the aforementioned techniques, Qin et al. (2019) show that simply flattening the feature maps from 2 dimensions to 1 dimension and feeding the resulting sequential features to an RNN-based attentional encoder-decoder model is sufficient to produce state-of-the-art recognition results on irregular text, which is a simple yet efficient solution (see the sketch below).
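A minimal sketch of this flattening (ours; shapes assumed) makes clear how little machinery it requires:

```python
import torch

def flatten_2d_features(fmap):
    """Flatten a 2-D feature map into a 1-D sequence for an attentional
    encoder-decoder, in the spirit of Qin et al. (2019); shapes assumed.

    fmap: (N, C, H, W) -> (N, H*W, C), one "frame" per spatial location.
    The decoder's attention is then free to visit locations in any order,
    which is what lets this handle curved and oriented text.
    """
    n, c, h, w = fmap.shape
    return fmap.flatten(2).permute(0, 2, 1)   # (N, H*W, C)
```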
Apart from tailored model designs, Long et al. (2019) synthesize a curved text dataset, which significantly boosts recognition performance on real-world curved text datasets with no sacrifice on straight text datasets.

Although many elegant and neat solutions have been proposed, they are only evaluated and compared on a relatively small dataset, CUTE80, which only consists of 288 word samples. Besides, the training datasets used in these works contain only a negligible proportion of irregular text samples. Evaluations on larger datasets and more suitable training datasets may help us understand these methods better.

3.2.4 Other Methods

Jaderberg et al. (2014a,b) perform word recognition by classifying the image into a pre-defined vocabulary, under the framework of image classification. The model is trained on synthetic images, and achieves state-of-the-art performance on some benchmarks containing English words only. However, the application of this method is quite limited, as it cannot be applied to recognize unseen sequences such as phone numbers and email addresses.

To improve performance on difficult cases such as occlusion, which brings ambiguity to single-character recognition, Yu et al. (2020) propose a transformer-based semantic reasoning module that performs translations from the coarse, error-prone text outputs of the decoder to fine, linguistically calibrated outputs, which bears some resemblance to the deliberation networks for machine translation (Xia et al., 2017) that first translate and then re-write the sentences.

Despite the progress we have seen so far, the evaluation of recognition methods falls behind the times. As most detection methods can detect oriented and irregular text, and some even rectify them, the recognition of such text may seem redundant. On the other hand, the robustness of recognition when text is cropped with a slightly different bounding box is seldom verified. Such robustness may be more important in real-world scenarios.

3.3 End-to-End System

In the past, text detection and recognition were usually cast as two independent sub-problems that are combined to perform text reading from images. Recently, many end-to-end text detection and recognition systems (also known as text spotting systems) have been proposed, profiting a lot from the idea of designing differentiable computation graphs, as shown in Fig. 8. Efforts to build such systems have gained considerable momentum as a new trend.

Two-Step Pipelines While earlier work (Wang et al., 2011, 2012) first detects single characters in the input image, recent systems usually detect and recognize text at word level or line level. Some of these systems first generate text proposals using a text detection model and then recognize them with another text recognition model (Jaderberg et al., 2016; Liao et al., 2017; Gupta et al., 2016). Jaderberg et al. (2016) use a combination of Edge Box proposals (Zitnick and Dollár, 2014) and a trained aggregate channel features detector (Dollár et al., 2014) to generate candidate word bounding boxes. Proposal boxes are filtered and rectified before being sent into their recognition model pro-
color, and distortion. The results show that training merely on these synthetic data can achieve state-of-the-art performance, and that synthetic data can act as augmentative data sources for all datasets.

SynthText (Gupta et al., 2016) first proposes to embed text in natural scene images for the training of text detection, while most previous work only prints text on a cropped region, and such synthetic data are only used for text recognition. Printing text on whole natural images poses new challenges, as it needs to maintain semantic coherence. To produce more realistic data, SynthText makes use of depth prediction (Liu et al., 2015) and semantic segmentation (Arbelaez et al., 2011). Semantic segmentation groups pixels together as semantic clusters, and each text instance is printed on one semantic surface, not overlapping multiple ones. A dense depth map is further used to determine the orientation and distortion of the text instance. A model trained only on SynthText achieves state-of-the-art results on many text detection datasets. SynthText is later used in other works (Zhou et al., 2017; Shi et al., 2017a) as well for initial pre-training.

Further, Zhan et al. (2018) equip text synthesis with other deep learning techniques to produce more realistic samples. They introduce selective semantic segmentation so that word instances only appear on sensible objects, e.g. a desk or wall instead of someone's face. Text rendering in their work is adapted to the image so that the rendered text fits the artistic style and does not stand out awkwardly.

SynthText3D (Liao et al., 2019a) uses the famous open-source game engine Unreal Engine 4 (UE4) and UnrealCV (Qiu et al., 2017) to synthesize scene text images. Text is rendered together with the scene and thus can achieve different lighting conditions, weather, and natural occlusions. However, SynthText3D simply follows the pipeline of SynthText and only makes use of the ground-truth depth and segmentation maps provided by the game engine. As a result, SynthText3D relies on manual selection of camera views, which limits its scalability. Besides, the proposed text regions are generated by clipping maximal rectangular bounding boxes extracted from segmentation maps, and are therefore limited to the middle parts of large and well-defined regions, which is an unfavorable location bias.

UnrealText (Long and Yao, 2020) is another work using game engines to synthesize scene text images. It features deep interactions with the 3D worlds during synthesis. A ray-casting based algorithm is proposed to navigate the 3D worlds efficiently and is able to generate diverse camera views automatically. The text region proposal module is based on collision detection and can put text onto whole surfaces, thus getting rid of the location bias. UnrealText achieves significant speedup and better detector performance.

Text Editing It is also worthwhile to mention the recently proposed text editing task (Wu et al., 2019; Yang et al., 2020). Both works try to replace the text content while retaining the text styles in natural images, such as the spatial arrangement of characters, text fonts, and colors. Text editing per se is useful in applications such as instant translation using cellphone cameras. It also has great potential in augmenting existing scene text images, though we have not seen any relevant experimental results yet.

3.4.2 Weakly and Semi-Supervision

Bootstrapping for Character-Box Character-level annotations are more accurate and better. However, most existing datasets do not provide character-level annotations. Since characters are smaller and close to each other, character-level annotation is more costly and inconvenient. There has been some work on semi-supervised character detection. The basic idea is to initialize a character detector and apply rules or thresholds to pick the most reliable predicted candidates. These reliable candidates are then used as additional supervision sources to refine the character detector. Both of the following works aim to augment existing datasets with character-level annotations. Their difference is illustrated in Fig. 9.

WordSup (Hu et al., 2017) first initializes the character detector by training 5K warm-up iterations on synthetic datasets. For each image, WordSup generates character candidates, which are then filtered with word boxes. For the characters in each word box, the following score is computed to select the most possible character list:

$$s = w \cdot \frac{\mathrm{area}(B_{\mathrm{chars}})}{\mathrm{area}(B_{\mathrm{word}})} + (1 - w) \cdot \left(1 - \frac{\lambda_2}{\lambda_1}\right) \tag{1}$$

where $B_{\mathrm{chars}}$ is the union of the selected character boxes; $B_{\mathrm{word}}$ is the enclosing word bounding box; $\lambda_1$ and $\lambda_2$ are the first- and second-largest eigenvalues of a covariance matrix $C$, computed from the coordinates of the centers of the selected character boxes; and $w$ is a weighting scalar. Intuitively, the first term measures how completely the selected characters cover the word box, while the second term measures whether the selected characters are located on a straight line, which is the main characteristic of word instances in most datasets.
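Equation (1) translates directly into code. The sketch below (ours, not WordSup's implementation) assumes axis-aligned boxes in (x1, y1, x2, y2) form, at least two character candidates, and approximates the union area by a sum, i.e. it assumes non-overlapping character boxes:

```python
import numpy as np

def wordsup_score(char_boxes, word_box, w=0.5):
    """Score a candidate character list as in Eq. (1); illustrative only."""
    def area(b):
        return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

    # Union area approximated by the sum (assumes non-overlapping boxes).
    coverage = sum(area(b) for b in char_boxes) / area(word_box)

    centers = np.array([[(b[0] + b[2]) / 2, (b[1] + b[3]) / 2]
                        for b in char_boxes])
    # Eigenvalues of the 2x2 covariance of the centers, largest first;
    # lam2/lam1 -> 0 when the centers lie on a straight line.
    lam1, lam2 = np.linalg.eigvalsh(np.cov(centers.T))[::-1]
    straightness = 1.0 - lam2 / lam1
    return w * coverage + (1.0 - w) * straightness
```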
Fig. 10: Selected samples from Chars74K, SVT-P, IIIT5K, MSRA-TD 500, ICDAR 2013, ICDAR 2015, ICDAR
2017 MLT, ICDAR 2017 RCTW, and Total-Text.
Table 1: Public datasets for scene text detection and recognition. EN stands for English and CN stands for
Chinese. Note that HUST-TR 400 is a supplementary training dataset for MSRA-TD 500. ICDAR 2013 refers to
ICDAR 2013 Focused Scene Text Competition. ICDAR 2015 refers to ICDAR 2015 Incidental Text Competition.
The last two columns indicate whether the datasets provide annotations for detection and recognition tasks.
are mainly taken from street billboards, and annotated as polygons with a variable number of vertices.

The Chinese Text in the Wild (CTW) dataset (Yuan et al., 2018) contains 32,285 high-resolution street view images, annotated at the character level, including the underlying character type, bounding box, and detailed attributes such as whether word-art is used. The dataset is the largest one to date and the only one that contains detailed annotations. However, it only provides annotations for Chinese text and ignores other scripts, e.g. English.

LSVT (Sun et al., 2019) is composed of two datasets. One is fully labeled with word bounding boxes and word content. The other, while much larger, is only annotated with the word content of the dominant text instance. The authors propose to work on such partially labeled data, which are much cheaper.

IIIT 5K-Word (Mishra et al., 2012) is the largest scene text recognition dataset, containing both digital and natural scene images. Its variance in font, color, size, and other noise makes it the most challenging one to date.
Table 2: Detection on ICDAR 2013.

Method | P | R | F1
Zhang et al. (2016) | 88 | 78 | 83
Gupta et al. (2016) | 92.0 | 75.5 | 83.0
Yao et al. (2016) | 88.88 | 80.22 | 84.33
Deng et al. (2018) | 86.4 | 83.6 | 84.5
He et al. (2017a)(∗) | 93 | 79 | 85
Shi et al. (2017a) | 87.7 | 83.0 | 85.3
Lyu et al. (2018b) | 93.3 | 79.4 | 85.8
He et al. (2017d) | 92 | 80 | 86
Liao et al. (2017) | 89 | 83 | 86
Zhou et al. (2017) | 92.64 | 82.67 | 87.37
Liu et al. (2018e) | 88.2 | 87.2 | 87.7
Tian et al. (2016) | 93 | 83 | 88
He et al. (2017c) | 89 | 86 | 88
He et al. (2018) | 88 | 87 | 88
Xue et al. (2018) | 91.5 | 87.1 | 89.2
Hu et al. (2017)(∗) | 93.34 | 87.53 | 90.34
Lyu et al. (2018a)(∗) | 94.1 | 88.1 | 91.0
Zhang et al. (2018) | 93.7 | 90.0 | 92.3
Baek et al. (2019b) | 97.4 | 93.1 | 95.2

Table 3: Detection on ICDAR MLT 2017.

Method | P | R | F1
Liu et al. (2018c) | 81.0 | 57.5 | 67.3
Zhang et al. (2019) | 60.6 | 78.8 | 68.5
Wang et al. (2019a) | 73.4 | 69.2 | 72.1
Xing et al. (2019) | 70.10 | 77.07 | 73.42
Baek et al. (2019b) | 68.2 | 80.6 | 73.9
Long and Yao (2020) | 82.2 | 67.4 | 74.1

Table 4: Detection on ICDAR 2015.

Method | P | R | F1 | FPS
Zhang et al. (2016) | 71 | 43.0 | 54 | 0.5
Tian et al. (2016) | 74 | 52 | 61 | -
He et al. (2017a)(∗) | 76 | 54 | 63 | -
Yao et al. (2016) | 72.26 | 58.69 | 64.77 | 1.6
Shi et al. (2017a) | 73.1 | 76.8 | 75.0 | -
Liu et al. (2018e) | 72 | 80 | 76 | -
He et al. (2017c) | 80 | 73 | 77 | 7.7
Hu et al. (2017)(∗) | 79.33 | 77.03 | 78.16 | 2.0
Zhou et al. (2017) | 83.57 | 73.47 | 78.20 | 13.2
Wang et al. (2018) | 85.7 | 74.1 | 79.5 | -
Lyu et al. (2018b) | 94.1 | 70.7 | 80.7 | 3.6
He et al. (2017d) | 82 | 80 | 81 | -
Jiang et al. (2017) | 85.62 | 79.68 | 82.54 | -
Long et al. (2018) | 84.9 | 80.4 | 82.6 | 10.2
He et al. (2018) | 84 | 83 | 83 | 1.1
Lyu et al. (2018a) | 85.8 | 81.2 | 83.4 | 4.8
Deng et al. (2018) | 85.5 | 82.0 | 83.7 | 3.0
Zhang et al. (2020) | 88.53 | 84.69 | 86.56 | -
Wang et al. (2019a) | 86.92 | 84.50 | 85.69 | 1.6
Tian et al. (2019) | 88.3 | 85.0 | 86.6 | 3
Baek et al. (2019b) | 89.8 | 84.3 | 86.9 | 8.6
Zhang et al. (2019) | 83.5 | 91.3 | 87.2 | -
Qin et al. (2019) | 89.36 | 85.75 | 87.52 | 4.76
Wang et al. (2019b) | 89.2 | 86.0 | 87.6 | 10.0
Xing et al. (2019) | 88.30 | 91.15 | 89.70 | -

Table 5: Detection (P, R, F) and end-to-end word spotting (E2E) on Total-Text.

Method | P | R | F | E2E
Lyu et al. (2018a) | 69.0 | 55.0 | 61.3 | 52.9
Long et al. (2018) | 82.7 | 74.5 | 78.4 | -
Wang et al. (2019b) | 80.9 | 76.2 | 78.5 | -
Wang et al. (2019a) | 84.02 | 77.96 | 80.87 | -
Zhang et al. (2019) | 75.7 | 88.6 | 81.6 | -
Baek et al. (2019b) | 87.6 | 79.9 | 83.6 | -
Qin et al. (2019) | 83.3 | 83.4 | 83.3 | 67.8
Xing et al. (2019) | 81.0 | 88.6 | 84.6 | 63.6
Zhang et al. (2020) | 86.54 | 84.93 | 85.73 | -

Table 6: Detection on CTW1500.

Method | P | R | F1
Liu et al. (2017) | 77.4 | 69.8 | 73.4
Long et al. (2018) | 67.9 | 85.3 | 75.6
Zhang et al. (2019) | 69.6 | 89.2 | 78.4
Wang et al. (2019b) | 80.1 | 80.2 | 80.1
Tian et al. (2019) | 82.7 | 77.8 | 80.1
Wang et al. (2019a) | 84.84 | 79.73 | 82.2
Baek et al. (2019b) | 86.0 | 81.1 | 83.5
Zhang et al. (2020) | 85.93 | 83.02 | 84.45

Table 7: Detection on MSRA-TD 500.

Method | P | R | F1
Kang et al. (2014) | 71 | 62 | 66
Zhang et al. (2016) | 83 | 67 | 74
He et al. (2017d) | 77 | 70 | 74
Yao et al. (2016) | 76.51 | 75.31 | 75.91
Zhou et al. (2017) | 87.28 | 67.43 | 76.08
Wu and Natarajan (2017) | 77 | 78 | 77
Shi et al. (2017a) | 86 | 70 | 77
Deng et al. (2018) | 83.0 | 73.2 | 77.8
Long et al. (2018) | 83.2 | 73.9 | 78.3
Xue et al. (2018) | 83.0 | 77.4 | 80.1
Wang et al. (2018) | 90.3 | 72.3 | 80.3
Lyu et al. (2018b) | 87.6 | 76.2 | 81.5
Baek et al. (2019b) | 88.2 | 78.2 | 82.9
Tian et al. (2019) | 84.2 | 81.7 | 82.9
Liu et al. (2018e) | 88 | 79 | 83
Wang et al. (2019b) | 85.2 | 82.1 | 83.6
Zhang et al. (2020) | 88.05 | 82.30 | 85.08

4.2 Evaluation Protocols

In this part, we briefly summarize the evaluation protocols for text detection and recognition.

As metrics for performance comparison of different algorithms, we usually refer to precision, recall, and F1-score. To compute these performance indicators, the list of predicted text instances should be matched to the ground truth labels in the first place. Precision, denoted as P, is calculated as the proportion of predicted text instances that can be matched to ground truth labels. Recall, denoted as R, is the proportion of ground truth labels that have correspondents in the predicted list.
F1-score is then computed as $F_1 = \frac{2 \cdot P \cdot R}{P + R}$, taking both precision and recall into account. Note that the matching between the predicted instances and the ground truth ones comes first.

Table 8: Characteristics of the three vocabulary lists used in ICDAR 2013/2015. S stands for Strongly Contextualised, W for Weakly Contextualised, and G for Generic.
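A minimal sketch of this match-then-score protocol (ours; the IoU function and the one-to-one greedy matching policy are simplifying assumptions, and real benchmarks such as DetEval use more elaborate matching):

```python
def detection_prf(pred_boxes, gt_boxes, iou_fn, iou_thresh=0.5):
    """Precision/recall/F1 under one-to-one IoU matching (illustrative).

    iou_fn(a, b) is assumed to return the IoU of two boxes.
    """
    matched_gt = set()
    tp = 0
    for p in pred_boxes:                       # greedy matching
        for i, g in enumerate(gt_boxes):
            if i not in matched_gt and iou_fn(p, g) >= iou_thresh:
                matched_gt.add(i)
                tp += 1
                break
    precision = tp / len(pred_boxes) if pred_boxes else 0.0
    recall = tp / len(gt_boxes) if gt_boxes else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```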
Table 9: State-of-the-art recognition performance across a number of datasets. “50”, “1k”, and “Full” are lexicons. “0” means no lexicon. “90k” and “ST” are the Synth90k and the SynthText datasets, respectively. “ST+” means including character-level annotations. “Private” means private training data.
scene image that appears in a predesignated vocabulary, while other text instances are ignored. On the contrary, all text instances that appear in the scene image are included under End-to-End. Three different vocabulary lists are provided for candidate transcriptions: Strongly Contextualised, Weakly Contextualised, and Generic. The three kinds of lists are summarized in Tab. 8. Note that under End-to-End, these vocabularies can still serve as references.

Evaluation results of recent methods on several widely adopted benchmark datasets are summarized in the following tables: Tab. 2 for detection on ICDAR 2013, Tab. 4 for detection on ICDAR 2015 Incidental Text, Tab. 3 for detection on ICDAR 2017 MLT, Tab. 5 for detection and end-to-end word spotting on Total-Text, Tab. 6 for detection on CTW1500, Tab. 7 for detection on MSRA-TD 500, Tab. 9 for recognition on several datasets, and Tab. 10 for end-to-end text spotting on ICDAR 2013 and ICDAR 2015. Note that we do not report performance under multi-scale conditions if single-scale performance is reported. We use ∗ to indicate methods for which only multi-scale performance is reported. Since different backbone feature extractors are used in some works, we only report performance based on ResNet-50 unless it is not provided.

Note that the current evaluation for scene text recognition may be problematic. According to Baek et al. (2019a), most researchers are actually using different subsets when they refer to the same dataset, causing discrepancies in performance. Besides, Long and Yao (2020) further point out that half of the widely adopted benchmark datasets have imperfect annotations, e.g. ignoring case sensitivity and punctuation, and they provide new annotations for those datasets. Though most papers claim to train their models to recognize in a case-sensitive way and to include punctuation, they may be limiting their output to only digits and case-insensitive characters during evaluation.

Table 10: Performance of End-to-End and Word Spotting on ICDAR 2015 and ICDAR 2013.

Method | Word Spotting S | W | G | End-to-End S | W | G
ICDAR 2015:
Liu et al. (2018c) | 84.68 | 79.32 | 63.29 | 81.09 | 75.90 | 60.80
Xing et al. (2019) | - | - | - | 80.14 | 74.45 | 62.18
Lyu et al. (2018a) | 79.3 | 74.5 | 64.2 | 79.3 | 73.0 | 62.4
He et al. (2018) | 85 | 80 | 65 | 82 | 77 | 63
Qin et al. (2019) | - | - | - | 83.38 | 79.94 | 67.98
ICDAR 2013:
Busta et al. (2017) | 92 | 89 | 81 | 89 | 86 | 77
Liu et al. (2018c) | 92.73 | 90.72 | 83.51 | 88.81 | 87.11 | 80.81
Li et al. (2017a) | 94.2 | 92.4 | 88.2 | 91.1 | 89.8 | 84.6
He et al. (2018) | 93 | 92 | 87 | 91 | 89 | 86
Lyu et al. (2018a) | 92.5 | 92.0 | 88.2 | 92.2 | 91.1 | 86.5

5 Application

The detection and recognition of text—the visual and physical carrier of human civilization—allow vision to be further connected with the understanding of its content. Apart from the applications we have mentioned at the beginning of this paper, there have been numerous specific application scenarios across various industries and in our daily lives. In this part, we list and analyze the most outstanding ones that have,
or are to have, significant impact, improving our productivity and life quality.

Automatic Data Entry Apart from providing an electronic archive of existing documents, OCR can also improve our productivity in the form of automatic data entry. Some industries involve time-consuming data entry, e.g. express orders written by customers in the delivery industry, and hand-written information sheets in the financial and insurance industries. Applying OCR techniques can accelerate the data entry process as well as protect customer privacy. Some companies have already been using these technologies, e.g. SF-Express⁴. Another potential application is note taking, such as NEBO⁵, a note-taking application on tablets like the iPad that performs instant transcription as users write down notes.

Identity Authentication Automatic identity authentication is yet another field where OCR can be given full play. In fields such as Internet finance and customs, users/passengers are required to provide identification (ID) information, such as an identity card or passport. Automatic recognition and analysis of the provided documents would require OCR that reads and extracts the textual content, and can automate and greatly accelerate such processes. There are companies that have already started working on identification based on face and ID card, e.g. MEGVII (Face++)⁶.

Augmented Computer Vision As text is an essential element for the understanding of scenes, OCR can assist computer vision in many ways. In the scenario of autonomous vehicles, text-embedded panels carry important information, e.g. geo-location, current traffic conditions, and navigation. There have been several works on text detection and recognition for autonomous vehicles (Mammeri et al., 2014, 2016). The largest dataset so far, CTW (Yuan et al., 2018), also places extra emphasis on traffic signs. Another example is instant translation, where OCR is combined with a translation model. This is extremely helpful and time-saving as people travel or read documents written in foreign languages. Google's Translate application⁷ can perform such instant translation. A similar application is instant text-to-speech software equipped with OCR, which can help those with visual disabilities and those who are illiterate⁸.

Intelligent Content Analysis OCR also allows the industry to perform more intelligent content analysis, mainly for platforms like video-sharing websites and e-commerce. Text can be extracted from images and subtitles as well as real-time commentary subtitles (a kind of floating comment added by users, e.g. those on Bilibili⁹ and Niconico¹⁰). On the one hand, such extracted text can be used in automatic content tagging and recommendation systems. It can also be used to perform user sentiment analysis, e.g. which part of a video attracts users most. On the other hand, website administrators can impose supervision and filtration of inappropriate and illegal content, such as terrorist advocacy.

⁴ Official website: http://www.sf-express.com/cn/sc/
⁵ Official website: https://www.myscript.com/nebo/
⁶ https://www.faceplusplus.com/face-based-identification/
⁷ https://translate.google.com/
⁸ https://en.wikipedia.org/wiki/Screen_reader#cite_note-Braille_display-2
⁹ https://www.bilibili.com
¹⁰ www.nicovideo.jp/
¹¹ https://www.ethnologue.com/guides/how-many-languages

6 Conclusion and Discussion

6.1 Status Quo

Algorithms: The past several years have witnessed significant development of algorithms for text detection and recognition, mainly due to the deep learning boom. Deep learning models have replaced the manual search for, and design of, patterns and features. With the improved capability of models, research attention has been drawn to challenges such as oriented and curved text detection, and considerable progress has been achieved.

Applications: Apart from efforts towards a general solution for all sorts of images, these algorithms can be trained and adapted to more specific scenarios, e.g. bankcards, ID cards, and driver's licenses. Some companies have been providing such scenario-specific APIs, including Baidu Inc., Tencent Inc., and MEGVII Inc. The recent development of fast and efficient methods (Ren et al., 2015; Zhou et al., 2017) has also allowed the deployment of large-scale systems (Borisyuk et al., 2018). Companies including Google Inc. and Amazon Inc. are also providing text extraction APIs.

6.2 Challenges and Future Trends

We look at the present through a rear-view mirror. We march backward into the future (Liu, 1975). We list and discuss challenges, and analyze what the next valuable research directions would be in the field of scene text detection and recognition.

Languages: There are more than 1000 languages in the world¹¹. However, most current algorithms and datasets
have primarily focused on text in English. While English has a rather small alphabet, other languages such as Chinese and Japanese have much larger ones, with tens of thousands of symbols. RNN-based recognizers may suffer from such enlarged symbol sets. Moreover, some languages have much more complex appearances, and they are therefore more sensitive to conditions such as image quality. Researchers should first verify how well current algorithms can generalize to text in other languages, and further to mixed text. Unified detection and recognition systems for multiple types of languages are of important academic value and application prospects. A feasible solution might be to explore compositional representations that can capture the common patterns of text instances in different languages, and to train the detection and recognition models with text examples of different languages generated by text synthesizing engines.

Robustness of Models: Although current text recognizers have proven able to generalize well to different scene text datasets even when trained only on synthetic data, recent work (Liao et al., 2019b) shows that robustness against flawed detection is not a negligible problem. Actually, such instability in prediction has also been observed for text detection models. The reason behind this kind of phenomenon is still unclear. One conjecture is that the robustness of models is related to the internal operating mechanism of deep neural networks.

Generalization: Few detection algorithms except for TextSnake (Long et al., 2018) have considered the problem of generalization ability across datasets, i.e. training on one dataset and testing on another. Generalization ability is important, as some application scenarios require adaptability to varying environments. For example, instant translation and OCR in autonomous vehicles should be able to perform stably under different situations: zoomed-in images with large text instances, far and small words, blurred words, different languages, and different shapes. It remains unverified whether simply pooling all existing datasets together is enough, especially when the target domain is totally unknown.

Evaluation: Existing evaluation metrics for detection stem from those for general object detection. Matching based on IoU score or pixel-level precision and recall ignores the fact that missing parts and superfluous backgrounds may hurt the performance of the subsequent recognition procedure. For each text instance, pixel-level precision and recall are good metrics. However, their scores are assigned to 1.0 once the instances are matched to ground truth, and are thus not reflected in the final dataset-level score. An off-the-shelf alternative method is to simply sum up the instance-level scores under DetEval instead of first assigning them to 1.0.

Synthetic Data: While training recognizers on synthetic datasets has become routine and the results are excellent, detectors still rely heavily on real datasets. It remains a challenge to synthesize diverse and realistic images to train detectors. The potential benefits of synthetic data, such as generalization ability, are not yet fully explored. Synthesis using 3D engines and models can simulate different conditions such as lighting and occlusion, and is thus worth further development.

Efficiency: Another shortcoming of deep-learning-based methods lies in their efficiency. Most of the current systems cannot run in real time when deployed on computers without GPUs or on mobile devices. Apart from model compression and lightweight models that have proven effective in other tasks, it is also valuable to study how to build custom speedup mechanisms for text-related tasks.

Bigger and Better Datasets: The sizes of most widely adopted datasets are small (∼1k images). It is worthwhile to study whether the improvements gained by current algorithms scale up, or whether they are just accidental results of better regularization. Besides, most datasets are only labeled with bounding boxes and text. Detailed annotation of different attributes (Yuan et al., 2018), such as word-art and occlusion, may guide researchers with pertinence. Finally, datasets characterized by real-world challenges, such as densely located text on products, are also important in advancing research progress. Another related problem is that most of the existing datasets do not have validation sets. It is highly possible that the currently reported evaluation results are actually upward biased due to overfitting on the test sets. We suggest that researchers focus on large datasets, such as ICDAR MLT 2017, ICDAR MLT 2019, ICDAR ArT 2019, and COCO-Text.

References

J. Almazán, A. Gordo, A. Fornés, and E. Valveny. Word spotting and recognition with embedded attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(12):2552–2566, 2014.

P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.

J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, and H. Lee. What is wrong with scene text recognition model comparisons? Dataset and model analysis. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.
L. Kang, Y. Li, and D. Doermann. Orientation robust text line detection in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4034–4041, 2014.

D. Karatzas and A. Antonacopoulos. Text extraction from web images based on a split-and-merge segmentation method using colour perception. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), volume 2, pages 634–637. IEEE, 2004.

D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. de las Heras. ICDAR 2013 robust reading competition. In 2013 12th International Conference on Document Analysis and Recognition (ICDAR), pages 1484–1493. IEEE, 2013.

D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. ICDAR 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 1156–1160. IEEE, 2015.

T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

C.-Y. Lee and S. Osindero. Recursive recurrent nets with attention modeling for OCR in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2231–2239, 2016.

J.-J. Lee, P.-H. Lee, S.-W. Lee, A. Yuille, and C. Koch. Adaboost for text detection in natural scene. In 2011 International Conference on Document Analysis and Recognition (ICDAR), pages 429–434. IEEE, 2011.

S. Lee and J. H. Kim. Integrating multiple character proposals for robust scene text extraction. Image and Vision Computing, 31(11):823–840, 2013.

H. Li, P. Wang, and C. Shen. Towards end-to-end text spotting with convolutional recurrent neural networks. In The IEEE International Conference on Computer Vision (ICCV), 2017a.

H. Li, P. Wang, C. Shen, and G. Zhang. Show, attend and read: A simple and strong baseline for irregular text recognition. AAAI, 2019.

R. Li, M. En, J. Li, and H. Zhang. Weakly supervised text attention network for generating text proposals in scene images. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 324–330. IEEE, 2017b.

M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu. Textboxes: A fast text detector with a single deep neural network. In AAAI, pages 4161–4167, 2017.

M. Liao, B. Shi, and X. Bai. Textboxes++: A single-shot oriented scene text detector. IEEE Transactions on Image Processing, 27(8):3676–3690, 2018a.

M. Liao, Z. Zhu, B. Shi, G.-s. Xia, and X. Bai. Rotation-sensitive regression for oriented scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5909–5918, 2018b.

M. Liao, B. Song, M. He, S. Long, C. Yao, and X. Bai. SynthText3D: Synthesizing scene text images from 3D virtual worlds. arXiv preprint arXiv:1907.06007, 2019a.

M. Liao, J. Zhang, Z. Wan, F. Xie, J. Liang, P. Lyu, C. Yao, and X. Bai. Scene text recognition from two-dimensional perspective. AAAI, 2019b.

F. Liu, C. Shen, and G. Lin. Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5162–5170, 2015.

L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen. Deep learning for generic object detection: A survey. arXiv preprint arXiv:1809.02165, 2018a.

W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), pages 21–37. Springer, 2016a.

W. Liu, C. Chen, K.-Y. K. Wong, Z. Su, and J. Han. Star-net: A spatial attention residue network for scene text recognition. In BMVC, volume 2, page 7, 2016b.

W. Liu, C. Chen, and K. Wong. Char-net: A character-aware neural network for distorted scene text recognition. In AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, 2018b.

X. Liu. Old Book of Tang. Beijing: Zhonghua Book Company, 1975.

X. Liu and J. Samarabandu. An edge-based text region extraction algorithm for indoor mobile robot navigation. In Mechatronics and Automation, 2005 IEEE International Conference, volume 2, pages 701–706. IEEE, 2005a.

X. Liu and J. K. Samarabandu. A simple and fast text localization algorithm for indoor mobile robot navigation. In Image Processing: Algorithms and Systems IV, volume 5672, pages 139–151. International Society for Optics and Photonics, 2005b.
X. Liu, D. Liang, S. Yan, D. Chen, Y. Qiao, and J. Yan. FOTS: Fast oriented text spotting with a unified network. CVPR, 2018c.

Y. Liu and L. Jin. Deep matching prior network: Toward tighter multi-oriented text detection. 2017.

Y. Liu, L. Jin, S. Zhang, and S. Zhang. Detecting curve text in the wild: New dataset and new solution. arXiv preprint arXiv:1712.02170, 2017.

Y. Liu, L. Jin, Z. Xie, C. Luo, S. Zhang, and L. Xie. Tightness-aware evaluation protocol for scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9612–9620, 2019.

Z. Liu, Y. Li, F. Ren, H. Yu, and W. Goh. Squeezedtext: A real-time scene text recognition by binary convolutional encoder-decoder network. AAAI, 2018d.

Z. Liu, G. Lin, S. Yang, J. Feng, W. Lin, and W. Ling Goh. Learning markov clustering networks for scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6936–6944, 2018e.

S. Long and C. Yao. UnrealText: Synthesizing realistic scene text images from the unreal world. arXiv preprint arXiv:2003.10608, 2020.

S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao. TextSnake: A flexible representation for detecting text of arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

S. Long, Y. Guan, B. Wang, K. Bian, and C. Yao. Alchemy: Techniques for rectification based irregular scene text recognition. arXiv preprint arXiv:1908.11834, 2019.

S. Long, Y. Guan, K. Bian, and C. Yao. A new perspective for flexible feature gathering in scene text recognition via character anchor pooling. In ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2458–2462. IEEE, 2020.

P. Lyu, M. Liao, C. Yao, W. Wu, and X. Bai. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), 2018a.

P. Lyu, C. Yao, W. Wu, S. Yan, and X. Bai. Multi-oriented scene text detection via corner localization and region segmentation. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018b.

J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue. Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia, 2018, 2017.

A. Mammeri, E.-H. Khiari, and A. Boukerche. Road-sign text recognition architecture for intelligent transportation systems. In 2014 IEEE 80th Vehicular Technology Conference (VTC Fall), pages 1–5. IEEE, 2014.

A. Mammeri, A. Boukerche, et al. Mser-based text detection and communication algorithm for autonomous vehicles. In 2016 IEEE Symposium on Computers and Communication (ISCC), pages 1218–1223. IEEE, 2016.

A. Mishra, K. Alahari, and C. Jawahar. An mrf model for binarization of natural scene text. In ICDAR – International Conference on Document Analysis and Recognition. IEEE, 2011.

A. Mishra, K. Alahari, and C. Jawahar. Scene text recognition using higher order language priors. In BMVC – British Machine Vision Conference. BMVA, 2012.

L. Neumann and J. Matas. A method for text localization and recognition in real-world images. In Asian Conference on Computer Vision, pages 770–783. Springer, 2010.

L. Neumann and J. Matas. Real-time scene text localization and recognition. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3538–3545. IEEE, 2012.

L. Neumann and J. Matas. On combining multiple segmentations in scene text recognition. In 2013 12th International Conference on Document Analysis and Recognition (ICDAR), pages 523–527. IEEE, 2013.

S. Nomura, K. Yamanaka, O. Katai, H. Kawakami, and T. Shiose. A novel adaptive morphological approach for degraded character image segmentation. Pattern Recognition, 38(11):1961–1975, 2005.

C. Parkinson, J. J. Jacobsen, D. B. Ferguson, and S. A. Pombo. Instant translation system, Nov. 29 2016. US Patent 9,507,772.

S. Qin, A. Bissacco, M. Raptis, Y. Fujii, and Y. Xiao. Towards unconstrained end-to-end text spotting. In Proceedings of the IEEE International Conference on Computer Vision, pages 4704–4714, 2019.

W. Qiu, F. Zhong, Y. Zhang, S. Qiao, Z. Xiao, T. S. Kim, and Y. Wang. Unrealcv: Virtual worlds for computer vision. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1221–1224. ACM, 2017.

T. Quy Phan, P. Shivakumara, S. Tian, and C. Lim Tan. Recognizing text with perspective distortion in natural scenes. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 569–576, 2013.

J. Redmon and A. Farhadi. Yolo9000: Better, faster, stronger. arXiv preprint, 2017.
age Processing (ICIP), pages 2601–2604. IEEE, 2011.

Z. Tu, Y. Ma, W. Liu, X. Bai, and C. Yao. Detecting texts of arbitrary orientations in natural images. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1083–1090. IEEE, 2012.

S. Uchida. Text localization and recognition in images and video. In Handbook of Document Image Processing and Recognition, pages 843–883. Springer, 2014.

S. Wachenfeld, H.-U. Klein, and X. Jiang. Recognition of screen-rendered text. In 18th International Conference on Pattern Recognition (ICPR 2006), volume 2, pages 1086–1089. IEEE, 2006.

T. Wakahara and K. Kita. Binarization of color character strings in scene images using k-means clustering and support vector machines. In 2011 International Conference on Document Analysis and Recognition (ICDAR), pages 274–278. IEEE, 2011.

C. Wang, F. Yin, and C.-L. Liu. Scene text detection with novel superpixel based character candidate extraction. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 929–934. IEEE, 2017.

F. Wang, L. Zhao, X. Li, X. Wang, and D. Tao. Geometry-aware scene text detection with instance transformation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1381–1389, 2018.

K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In 2011 IEEE International Conference on Computer Vision (ICCV), pages 1457–1464. IEEE, 2011.

T. Wang, D. J. Wu, A. Coates, and A. Y. Ng. End-to-end text recognition with convolutional neural networks. In 2012 21st International Conference on Pattern Recognition (ICPR), pages 3304–3308. IEEE, 2012.

W. Wang, E. Xie, X. Li, W. Hou, T. Lu, G. Yu, and S. Shao. Shape robust text detection with progressive scale expansion network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019a.

X. Wang, Y. Jiang, Z. Luo, C.-L. Liu, H. Choi, and S. Kim. Arbitrary shape scene text detection with adaptive text region representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6449–6458, 2019b.

J. Weinman, E. Learned-Miller, and A. Hanson. Fast lexicon-based scene text recognition with sparse belief propagation. In ICDAR, pages 979–983. IEEE, 2007.

C. Wolf and J.-M. Jolion. Object count/area graphs for the evaluation of object detection and segmentation algorithms. International Journal of Document Analysis and Recognition (IJDAR), 8(4):280–296, 2006.

L. Wu, C. Zhang, J. Liu, J. Han, J. Liu, E. Ding, and X. Bai. Editing text in the wild. In Proceedings of the 27th ACM International Conference on Multimedia, pages 1500–1508, 2019.

Y. Wu and P. Natarajan. Self-organized text detection with minimal post-processing via border learning. In Proceedings of the IEEE Conference on CVPR, pages 5000–5009, 2017.

Y. Xia, F. Tian, L. Wu, J. Lin, T. Qin, N. Yu, and T.-Y. Liu. Deliberation networks: Sequence generation beyond one-pass decoding. In Advances in Neural Information Processing Systems, pages 1784–1794, 2017.

L. Xing, Z. Tian, W. Huang, and M. R. Scott. Convolutional character networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 9126–9136, 2019.

K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.

C. Xue, S. Lu, and F. Zhan. Accurate scene text detection through border semantics awareness and bootstrapping. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

M. Yang, Y. Guan, M. Liao, X. He, K. Bian, S. Bai, C. Yao, and X. Bai. Symmetry-constrained rectification network for scene text recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 9147–9156, 2019.

Q. Yang, H. Jin, J. Huang, and W. Lin. Swaptext: Image based texts transfer in scenes. arXiv preprint arXiv:2003.08152, 2020.

X. Yang, D. He, Z. Zhou, D. Kifer, and C. L. Giles. Learning to read irregular text with attention mechanisms. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), pages 3280–3286, 2017.

C. Yao, X. Bai, B. Shi, and W. Liu. Strokelets: A learned multi-scale representation for scene text recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4042–4049, 2014.

C. Yao, X. Bai, N. Sang, X. Zhou, S. Zhou, and Z. Cao. Scene text detection via holistic, multi-channel prediction. arXiv preprint arXiv:1606.09002, 2016.

Q. Ye and D. Doermann. Text detection and recognition in imagery: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(7):1480–1500, 2015.

Q. Ye, W. Gao, W. Wang, and W. Zeng. A robust text detection algorithm in images and video frames. IEEE ICICS-PCM, pages 802–806, 2003.
C. Yi and Y. Tian. Text string detection from natural scenes by structure-based partition and grouping. IEEE Transactions on Image Processing, 20(9):2594–2605, 2011.

F. Yin, Y.-C. Wu, X.-Y. Zhang, and C.-L. Liu. Scene text recognition with sliding convolutional character models. arXiv preprint arXiv:1709.01727, 2017.

X.-C. Yin, X. Yin, K. Huang, and H.-W. Hao. Robust text detection in natural scene images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

X.-C. Yin, Z.-Y. Zuo, S. Tian, and C.-L. Liu. Text detection, tracking and recognition in video: A comprehensive survey. IEEE Transactions on Image Processing, 25(6), 2016.

D. Yu, X. Li, C. Zhang, J. Han, J. Liu, and E. Ding. Towards accurate scene text recognition with semantic reasoning networks. arXiv preprint arXiv:2003.12294, 2020.

T.-L. Yuan, Z. Zhu, K. Xu, C.-J. Li, and S.-M. Hu. Chinese text in the wild. arXiv preprint arXiv:1803.00085, 2018.

F. Zhan and S. Lu. ESIR: End-to-end scene text recognition via iterative image rectification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

F. Zhan, S. Lu, and C. Xue. Verisimilar image synthesis for accurate detection and recognition of texts in scenes. 2018.

C. Zhang, B. Liang, Z. Huang, M. En, J. Han, E. Ding, and X. Ding. Look more than once: An accurate detector for text of arbitrary shapes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

D. Zhang and S.-F. Chang. A bayesian framework for fusing multiple word knowledge models in videotext recognition. In Computer Vision and Pattern Recognition, 2003. IEEE.

S. Zhang, Y. Liu, L. Jin, and C. Luo. Feature enhancement network: A refined scene text detector. In Proceedings of AAAI, 2018.

S.-X. Zhang, X. Zhu, J.-B. Hou, C. Liu, C. Yang, H. Wang, and X.-C. Yin. Deep relational reasoning graph network for arbitrary shape text detection. arXiv preprint arXiv:2003.07493, 2020.

Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai. Multi-oriented text detection with fully convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Z. Zhiwei, L. Linlin, and T. C. Lim. Edge based binarization for video text images. In 2010 20th International Conference on Pattern Recognition (ICPR), pages 133–136. IEEE, 2010.

X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang. EAST: An efficient and accurate scene text detector. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Y. Zhu, C. Yao, and X. Bai. Scene text detection and recognition: Recent advances and future trends. Frontiers of Computer Science, 10(1):19–36, 2016.

C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In Proceedings of the European Conference on Computer Vision (ECCV), pages 391–405. Springer, 2014.