Optical Character Recognition Using MATLAB: Sandeep Tiwari, Shivangi Mishra, Priyank Bhatia, Praveen Km. Yadav
Abstract -- Character recognition techniques associate a symbolic identity with the image of a character. In a typical OCR system, input characters are digitized by an optical scanner. Each character is then located and segmented, and the resulting character image is fed into a pre-processor for noise reduction and normalization. Certain characteristics are then extracted from the character for classification. Feature extraction is critical, and many different techniques exist, each with its strengths and weaknesses. After classification, the identified characters are grouped to reconstruct the original symbol strings, and context may then be applied to detect and correct errors.
The almost simultaneous advent, about 1980, of microprocessors for personal computers and of charge-coupled array scanners resulted in a huge cost decrease that paralleled that of general-purpose computers. Today, shrink-wrapped OCR software is often an add-on to desktop scanners that cost about the same as a printer or facsimile machine. Our purpose is to examine in some detail examples of the errors committed by current OCR systems and to speculate about their cause and possible remedy.
579
All Rights Reserved © 2013 IJARECE
ISSN: 2278 – 909X
International Journal of Advanced Research in Electronics and Communication Engineering (IJARECE)
Volume 2, Issue 5, May 2013
[...] totally dependent on the quality of the bilevel image. Still, the thresholding performed by the scanner is usually very simple. A fixed threshold is used, where gray levels below this threshold are said to be black and levels above it are said to be white. For a high-contrast document with a uniform background, a prechosen fixed threshold can be sufficient. However, many documents encountered in practice have a rather large range in contrast.
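The fixed-threshold rule can be sketched in a few lines. The example below is a minimal Python illustration, not the authors' MATLAB code; the function name `binarize`, the threshold value of 128 and the choice to treat a pixel exactly at the threshold as white are assumptions made here.

```python
def binarize(gray, threshold=128):
    """Fixed-threshold binarization of a gray-level image.

    gray is a list of rows of gray levels (0 = darkest, 255 = brightest).
    Returns a bilevel image where 1 = black (ink) and 0 = white (background).
    Assumption: a pixel exactly at the threshold counts as white.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


# A tiny high-contrast sample: dark strokes on a bright, uniform background.
page = [
    [250, 240, 30, 245],
    [248, 20, 25, 251],
    [252, 247, 35, 249],
]

for row in binarize(page):
    print(row)
```

On a page with uneven background or low contrast, a single global value like this would misclassify pixels, which is the limitation the paragraph points out.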
C. Preprocessing
The image resulting from the scanning process may contain a certain amount of noise. Smoothing, used to reduce this noise, implies both filling and thinning. Filling eliminates small breaks, gaps and holes in the digitized characters, while thinning reduces the width of the line. The most common smoothing technique moves a window across the binary image of the character, applying certain rules to the contents of the window. Normalization is then applied to obtain characters of uniform size, slant and rotation. To correct for rotation, the angle of rotation must first be found. For rotated pages and lines of text, variants of the Hough transform are commonly used to detect skew.
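A window-based smoothing pass of the kind described here might look as follows. This is a hedged Python sketch rather than the authors' implementation: the 3x3 window, the function name `smooth` and both rules (fill a white pixel with at least seven black neighbours, clear a black pixel with none) are assumptions chosen for illustration, and thinning is not covered.

```python
def smooth(img, fill_at=7):
    """One pass of a 3x3 window over a binary image (1 = black, 0 = white).

    Assumed rules, for illustration only:
      - fill:  a white pixel with at least `fill_at` black neighbours turns black
      - clean: a black pixel with no black neighbours turns white
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]  # write to a copy so rules see the original
    for r in range(h):
        for c in range(w):
            # Count black pixels among the neighbours that lie inside the image.
            neighbours = sum(
                img[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, h))
                for cc in range(max(c - 1, 0), min(c + 2, w))
                if (rr, cc) != (r, c)
            )
            if img[r][c] == 0 and neighbours >= fill_at:
                out[r][c] = 1
            elif img[r][c] == 1 and neighbours == 0:
                out[r][c] = 0
    return out


# A stroke with a one-pixel hole, plus an isolated speck in the bottom-right corner.
noisy = [
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 0, 1],
]

for row in smooth(noisy):
    print(row)
```

The pass fills the hole and removes the speck while leaving the rest of the stroke unchanged. Real systems typically use more elaborate rule sets.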
D. Feature Extraction
The techniques for extraction of such features are often divided into three main groups, where the features are found from:
• The distribution of points.
• Transformations and series expansions.
• Structural analysis.
In MATLAB, the mat2cell command is used to extract the image in the form of a cell for correlating with the saved templates. Fig. 2 shows the extraction of a character in matrix form.
E. Template-matching and correlation techniques
These techniques differ from the others in that no features are actually extracted. Instead, the matrix containing the image of the input character is directly matched with a set of prototype characters representing each possible class. The distance between the pattern and each prototype is computed, and the class of the prototype giving the best match is assigned to the pattern. The technique is simple and easy to implement in hardware and has been used in many commercial OCR machines. However, this technique is sensitive to noise and style variations and has no way of handling rotated characters.
Fig.2.(b) Character extraction in form of Matrix
F. Post Processing
Post-processing encompasses grouping, error detection and correction techniques. The result of plain symbol recognition on a document is a set of individual symbols. However, these symbols in themselves usually do not contain enough information. Instead, we would like to associate the individual symbols that belong to the same string with each other, making up words and numbers. The process of performing this association of symbols into strings is commonly referred to as grouping. The grouping of the symbols into strings is based on the symbols' location in the document. Symbols that are found to be sufficiently close are grouped together. Up until the grouping, each character has been treated separately, and the context in which each character appears has usually not been exploited. However, in advanced optical text recognition problems, a system
consisting only of single-character recognition will not be sufficient. Even the best recognition systems will not give 100% correct identification of all characters, but some of these errors may be detected or even corrected by the use of context.
III. WHY MATLAB?
MATLAB stands for MATrix LABoratory. Here you work with matrices. Hence, an image (or any other data, such as sound) can be converted to a matrix, and various operations can then be performed on it to get the desired results and values. Image processing is quite a vast field to deal with. We can identify colors, intensity, edges, texture or pattern in an image. In this tutorial, we would be restricting ourselves to detecting colours (using RGB values) only. Using MATLAB you can solve technical computing problems faster than with traditional programming languages such as C, C++, Java and Fortran. There is a wide range of applications, including signal and image processing, image acquisition, neural networks, etc.
[...] rate of 1% means 20 undetected errors per page. In postal applications for mail sorting, where an address contains about 50 characters, an error rate of 1% implies an error on every other piece of mail.
V. RESULTS
To illustrate the accuracy of the proposed OCR algorithm, implemented in MATLAB, on English handwritten and sample text images, performance was measured using the samples. Figures 3 and 4 show the sample documents scanned with an HP Deskjet scanner at 300 dpi. The images were then filtered, binarized, clipped and resized. Lines of text were then extracted from the images. The font size was identified, and segmentation was performed on each line to segment characters, taking into consideration the characteristics of the English Verdana font templates. MATLAB (R2012a, 64-bit) was used to implement the proposed OCR algorithm. The recognition accuracy was 85% to 90%; the errors are attributed to improperly formed handwritten characters. The templates of all characters and numbers are 24×42 pixels.
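The template-matching step of Section E, applied to fixed-size templates such as the 24×42 pixel ones mentioned in the results, can be sketched as follows. This is an illustrative Python example, not the authors' MATLAB code. It scores similarity with the 2-D correlation coefficient (the same formula as MATLAB's corr2) in place of a distance. Whether the authors used corr2 is an assumption, and the function name `classify` and the toy 3x3 templates are hypothetical.

```python
from math import sqrt


def corr2(a, b):
    """2-D correlation coefficient of two equally sized matrices."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0  # constant images carry no shape information


def classify(pattern, templates):
    """Return the label of the template that correlates best with the pattern."""
    return max(templates, key=lambda label: corr2(pattern, templates[label]))


# Toy 3x3 templates (1 = black); real templates would be much larger.
templates = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}

noisy_L = [[1, 0, 0], [1, 0, 0], [1, 1, 0]]  # an "L" with one pixel missing

print(classify(noisy_L, templates))
```

Because every pixel is compared position by position, the input character must first be normalized to the template size. A rotated or differently styled character correlates poorly with its own template, which matches the weakness noted for this technique.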
VII. CONCLUSION
Today optical character recognition is most successful for constrained material, that is, documents produced under some control. However, in the future it seems that the need for constrained OCR will decrease. The reason is that control of the production process usually means that the document is produced from material already stored on a computer. Hence, if a computer-readable version is already available, data may be exchanged electronically or printed in a more computer-readable form, for instance barcodes. The applications for future OCR systems lie in the recognition of documents where control over the production process is impossible. This may be material where the recipient is cut off from an electronic version and has no control of the production process, or older material which at production time could not be generated electronically. This means that future OCR systems intended for reading printed text must be omnifont. Another important area for OCR is the recognition of manually produced documents. Within postal applications, for instance, OCR must focus on reading addresses on mail produced by people without access to computer technology. Already, it is not unusual for companies and others with access to computer technology to mark mail with barcodes. The relative importance of handwritten text recognition is therefore expected to increase.
ACKNOWLEDGEMENT
It has been a pleasure, and a great blessing of God, to work on the project “Optical Character Recognition Using MATLAB”, in which we gained knowledge under the able leadership of our Head of Department, Mr. Vaibhav Purwar, who helped and supported us in every sphere of the project. We thank our project in-charge, Mr. Asheesh Gupta, who supported us and gave us his precious time. We are grateful to our supervisor, Mr. Gaurav Porwal, for his valuable advice and help throughout the duration of this project. We also thank the whole E.C. department for their appreciation and kind support.
REFERENCES
[1] H.S. Baird & R. Fossey. A 100-Font Classifier. Proceedings ICDAR-91, Vol. 1, p. 332-340, 1991.
[2] R. Bradford & T. Nartker. Error Correlation in Contemporary OCR Systems. Proceedings ICDAR-91, Vol. 2, p. 516-524, 1991.
[3] J-P. Caillot. Review of OCR Techniques. NR-note, BILD/08/087.
[4] R. G. Casey & K. Y. Wong. Document-Analysis Systems and Techniques. Image Analysis Applications, eds: R. Kasturi & M. Trivedi, p. 1-36.
[5] Product help: http://www.mathworks.com/pl_homepage
Sandeep Tiwari is currently pursuing his B.Tech (final year) in Electronics and Communication at Kanpur Institute of Technology (G.B.T.U). His main areas of interest are MATLAB, electronic devices and optical communications.
Shivangi Mishra is currently pursuing her B.Tech (final year) in Electronics and Communication at Kanpur Institute of Technology (G.B.T.U). Her areas of interest are wireless networks and MATLAB.
Priyank Bhatia is currently pursuing his B.Tech (final year) in Electronics and Communication at Kanpur Institute of Technology (G.B.T.U). His areas of interest are wireless networks, MATLAB and sensors.
Praveen Km. Yadav is currently pursuing his B.Tech (final year) in Electronics and Communication at Kanpur Institute of Technology (G.B.T.U). His areas of interest are basic electronics and MATLAB.