Intelligent Systems and Applications in Computer Vision
Features:
The book aims to familiarize readers with the fundamentals of computational intelligence as well as recent advancements in related technologies, such as smart applications of digital images and other enabling technologies, in the context of image processing and computer vision. It further covers important topics such as image watermarking, steganography, morphological processing, and optimized image segmentation. It will serve as an ideal reference text for senior undergraduate and graduate students and academic researchers in fields including electrical engineering, electronics, communications engineering, and computer engineering.
Intelligent Systems
and Applications in
Computer Vision
Edited by
Nitin Mittal
Amit Kant Pandit
Mohamed Abouhawwash
Shubham Mahajan
Front cover image: Blackboard/Shutterstock
First edition published 2024
by CRC Press
2385 NW Executive Center Dr, Suite 320, Boca Raton, FL 33431
and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
CRC Press is an imprint of Taylor & Francis Group, LLC
© 2024 selection and editorial matter, Nitin Mittal, Amit Kant Pandit, Mohamed Abouhawwash and Shubham Mahajan; individual chapters, the contributors
Reasonable efforts have been made to publish reliable data and information, but the author and pub-
lisher cannot assume responsibility for the validity of all materials or the consequences of their use.
The authors and publishers have attempted to trace the copyright holders of all material reproduced
in this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged please write and let us know so
we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information storage
or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com
or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. For works that are not available on CCC please contact mpkbookspermissions@
tandf.co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are
used only for identification and explanation without intent to infringe.
ISBN: 978-1-032-39295-0 (hbk)
ISBN: 978-1-032-59187-2 (pbk)
ISBN: 978-1-003-45340-6 (ebk)
DOI: 10.1201/9781003453406
Typeset in Sabon
by Newgen Publishing UK
Contents
Index 321
About the editors
Nitin Mittal received his B.Tech and M.Tech degrees in Electronics and Communication Engineering (ECE) from Kurukshetra University, Kurukshetra, India in 2006 and 2009, respectively. He completed his PhD in ECE from Chandigarh University, Mohali, India in 2017. He worked as a professor and assistant dean of research in the ECE Department at Chandigarh University. Presently, he is working as a skill assistant professor in the Department of Industry 4.0 at Shri Vishwakarma Skill University. His research interests include Wireless Sensor Networks, Image Segmentation, and Soft Computing.
Mohamed Abouhawwash received the BSc and MSc degrees in statistics and
computer science from Mansoura University, Mansoura, Egypt, in 2005
and 2011, respectively. He finished his PhD in Statistics and Computer Science in 2015 through a channel program between Michigan State University and Mansoura University. He is with Computational Mathematics, Science, and Engineering (CMSE), Biomedical Engineering (BME) and Radiology, Institute for Quantitative Health Science and Engineering (IQ), Michigan State University, and is an assistant professor with the Department of Mathematics, Faculty of Science, Mansoura University. In 2018, Dr. Abouhawwash was a Visiting Scholar with the Department of Mathematics and Statistics, Faculty of Science, Thompson Rivers University, Kamloops, BC, Canada. His current research interests include
evolutionary algorithms, machine learning, image reconstruction, and
mathematical optimization. Dr. Abouhawwash was a recipient of the
best master’s and PhD thesis awards from Mansoura University in 2012
and 2018, respectively.
List of contributors
Yukti Upadhyay
Manav Rachna International
Institute of Research and Studies,
Faridabad, Haryana, India
Chapter 1
A review approach on deep learning algorithms in computer vision
1.1 INTRODUCTION
The topic of “computer vision” has grown to encompass a wide range of
activities, from gathering raw data to extracting patterns from images and
interpreting data. Most computer vision tasks involve extracting features from input scenes (digital images) in order to obtain information about events or descriptions. Computer vision combines pattern detection and image processing, and image understanding emerges from the computer vision process. The field of computer vision, in contrast to com-
puter graphics, focuses on extracting information from images. Computer
technology is essential to the development of computer vision, whether it
is for image quality improvement or image recognition. Since the design
of the application system determines how well a computer vision system
performs, numerous scholars have made extensive efforts to broaden
and classify computer vision into a variety of fields and applications,
including assembly line automation, robotics, remote sensing, computer
and human communications, assistive technology for the blind, and other
technologies [1]. Deep learning (DL) is a member of the AI method family. It is based on Artificial Neural Networks (ANNs), which receive an input, analyze it, and produce a result. Because of the massive amount of data generated every minute
by digital transformation, AI is becoming more and more popular. The
majority of organizations and professionals use technology to lessen their
reliance on people [2].
In machine learning, the majority of features taken into account during analysis must be picked manually by a specialist in order to identify patterns more quickly. DL algorithms, by contrast, gradually learn high-level features from the data. Deep learning, a subset of machine learning, is depicted in Figure 1.1. ANNs, which have similar capabilities to human neurons, are the inspiration for deep learning.
Training adjusts the weights and biases until the cost is as low as it can be. The rate at which the cost changes as a result of the weights and biases is known as the gradient.
• Convolution Layers
The convolution layer computes the scalar product between the weights of the neurons and the regions of the input volume to which they are connected, so each neuron's output is related to a particular region of the input.
• Pooling Layers
The pooling layer then simply downsamples the input along the spatial dimensions, further lowering the number of parameters in that activation [8].
• Batch Normalization
Batch Normalization is the method through which the activation
nodes are scaled and adjusted to normalize the input layer neurons.
The output from the preceding layer is normalized using batch normalization by subtracting the batch mean and then dividing by the batch standard deviation [9].
• Dropout
In order to avoid over-fitting, the "dropout layer" sets input units to 0 at random with a given rate at each training step. The sum of all inputs is maintained by scaling up the non-zero inputs by 1/(1 − rate).
• Fully Connected Layers
The fully connected layers then carry out the same tasks as regular ANNs and produce classification scores from the activations. ReLU has also been proposed for use between these layers in order to enhance performance.
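To make the layer stack above concrete, the following is a minimal sketch in PyTorch; the layer sizes (16 filters, a 32×32 RGB input, 10 output classes) are illustrative assumptions rather than values from the chapter.

```python
import torch
import torch.nn as nn

# Convolution -> batch norm -> pooling -> dropout -> fully connected,
# mirroring the layer roles described above.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution layer
    nn.BatchNorm2d(16),                          # batch normalization
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling: spatial downsampling
    nn.Flatten(),
    nn.Dropout(p=0.5),   # zeroes inputs at rate p; survivors scaled by 1/(1-p)
    nn.Linear(16 * 16 * 16, 10),                 # fully connected classifier
)

x = torch.randn(1, 3, 32, 32)  # one dummy 32x32 RGB image
print(model(x).shape)          # torch.Size([1, 10]) classification scores
```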
A Restricted Boltzmann Machine (RBM) consists of two layers: an input (visible) layer and a hidden layer [10]. The Restricted Boltzmann Machine exhibits strong feature extraction and representation capabilities. The
Restricted Boltzmann machine is a probabilistic network that picks up on
the hidden representation, h as well as the probability distribution of its
inputs v. The two-layer, typical Restricted Boltzmann Machine method is
shown in Figure 1.6. The fundamental benefit of the RBM algorithm is that
there are no links between units in the same layer because all components,
both visible and concealed, are separate.
The Restricted Boltzmann Machine algorithm seeks to reconstruct the
inputs as precisely as possible [11]. In the forward stage, the input is modified based on the weights and biases and begins to trigger the hidden layer. The hidden layer's activations are then modified based on the weights and biases and transmitted back to the input layer in the following steps: the input layer treats the updated activations as a reconstruction of the input and compares them against the original input.
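The forward and reconstruction stages just described can be sketched in a few lines of NumPy; the layer sizes and the random data below are illustrative assumptions, not values from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden = 6, 3
W = rng.normal(0, 0.1, size=(n_visible, n_hidden))  # weights between layers
b = np.zeros(n_visible)                             # visible (input) biases
c = np.zeros(n_hidden)                              # hidden biases

v = rng.integers(0, 2, size=n_visible).astype(float)  # binary input vector

# Forward stage: the input, modified by the weights and biases,
# triggers the hidden layer.
h_prob = sigmoid(v @ W + c)
h = (rng.random(n_hidden) < h_prob).astype(float)

# Backward stage: the hidden activations are sent back to the input
# layer, producing a reconstruction of the input.
v_recon = sigmoid(h @ W.T + b)

# Compare the reconstruction against the original input.
print("reconstruction error:", np.mean((v - v_recon) ** 2))
```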
The DBM's structure can be defined more clearly through its energy function. For the two-layer model, the energy is given by Equation 1.1:

E(v, h(1), h(2)) = −vᵀW h(1) − (h(1))ᵀV h(2) − d(1)ᵀh(1) − d(2)ᵀh(2)  (1.1)

where W and V are the weight matrices between successive layers, and d(1) and d(2) are the hidden-layer bias vectors. The DBM can be thought of as a bipartite graph with two sets of vertices.
Figure 1.7 shows the intended recognition weights R1, R2, and R3 of the DBM. The Deep Boltzmann Machine (DBM), a deep generative undirected model, is composed of several hidden layers. It makes use of a top-down connection pattern to affect how lower-level characteristics are learned. R1, R2, and R3 are the recognition model weights, which are doubled at every layer to compensate for the absence of top-down feedback [12].
Auto-encoders are trained using contraction, de-noising, and sparseness techniques.
In auto-encoders, some random noise is injected into the input during
de-noising. The original input must be reproduced by the encoder. Regular
neural networks will perform better in terms of generalization if inputs
are randomly deactivated during training [17]. Setting the number of nodes in the hidden layer of contractive auto-encoders to substantially fewer than the number of input nodes drives the network to perform dimensionality reduction. As a result, it is unable to learn the identity function, since the hidden layer does not have enough nodes to adequately store the input. Sparse auto-encoders are trained by giving the weight update function a sparsity penalty: the overall size of the connection weights is penalized, and the majority of the weights take low values as a result. In stacked training, at each stage the hidden layers of the existing k-layer network are reused and a new network with k+1 hidden layers is constructed, with the (k+1)th hidden layer using the output of the kth hidden layer as input. The weights in the final deep network are initialized using the weights from the individual layer training, and the architecture as a
Figure 1.9 Autoencoders.
whole is then tweaked. On the other hand, the network can be tweaked
using back propagation by adding an additional output layer on top. Deep
networks only benefit from back propagation if the weights are initialized
very close to a good solution. This is ensured by the layer-by-layer pre-
training. There are also alternative methods for fine-tuning deep networks,
such as dropout and maxout.
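As a concrete illustration of the de-noising scheme described above, the following PyTorch sketch injects random noise into the input and trains the network to reproduce the original; the layer sizes, noise level, and optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

# An 8-unit bottleneck for a 64-dimensional input: the undersized hidden
# layer forces dimensionality reduction, as discussed above.
encoder = nn.Sequential(nn.Linear(64, 8), nn.ReLU())
decoder = nn.Sequential(nn.Linear(8, 64), nn.Sigmoid())
params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(32, 64)  # a dummy batch of inputs in [0, 1]
for step in range(100):
    noisy = x + 0.1 * torch.randn_like(x)  # inject random noise (de-noising)
    recon = decoder(encoder(noisy))
    loss = loss_fn(recon, x)               # reproduce the *original* input
    opt.zero_grad()
    loss.backward()
    opt.step()
```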
A de-noising auto-encoder thus learns traits helpful for denoising and is significantly more powerful than the identity function [22].
| Parameter | Convolutional Neural Networks | Restricted Boltzmann Machines | Deep Belief Networks | Auto-encoders |
| --- | --- | --- | --- | --- |
| Type of learning | Supervised | Unsupervised | Supervised | Unsupervised |
| Input data | 3-D structured data | Any type of data | Text, image | Any type of data |
| Output | Classified, predicted | Reconstructed output | Classified, predicted | Reconstructed output |
| Application | Image and voice analysis, classification, detection, recognition | Dimensionality reduction/classification | NLP, dimensionality reduction | Dimensionality reduction |
REFERENCES
1. Victor Wiley & Thomas Lucas, ‘Computer Vision and Image Processing: A
Paper Review’. International Journal of Artificial Intelligence Research, Vol.
2, No 1, pp. 28–36, 2018.
2. Savita K. Shetty & Ayesha Siddiqa, ‘Deep Learning Algorithms and
Applications in Computer Vision’. International Journal of Computer
Sciences and Engineering, Vol. 7, pp. 195–201, 2019.
3. Ksheera R. Shetty, Vaibhav S. Soorinje, Prinson Dsouza & Swasthik, ‘Deep
Learning for Computer Vision: A Brief Review’. International Journal of
Advanced Research in Science, Communication and Technology, Vol. 2,
pp. 450–463, 2022.
4. Dr. G. Ranganathan, ‘A Study to Find Facts Behind Preprocessing on Deep
Learning Algorithms’. Journal of Innovative Image Processing, Vol. 3,
pp. 66–74, 2021.
5. Swapna G, Vinayakumar R & Soman K. P, ‘Diabetes detection using deep
learning algorithms’. ICT Express, Vol. 4, pp. 243–246, 2018.
6. Dulari Bhatt, Chirag Patel, Hardik Talsania, Jigar Patel, Rasmika Vaghela,
Sharnil Pandya, Kirit Modi & Hemant Ghayvat, ‘CNN Variants for
Computer Vision: History, Architecture, Application, Challenges and Future
Scope’. MDPI Publisher of Open Access Journals, Vol. 10, p. 2470, 2021.
7. Rachana Patel & Sanskruti Patel, ‘A Comprehensive Study of Applying
Convolutional Neural Network for Computer Vision’. International Journal
of Advanced Science and Technology, Vol. 29, pp. 2161–2174, 2020.
8. Keiron O’Shea, Ryan Nash, ‘An Introduction to Convolutional Neural
Networks’. arXiv, 2015.
9. Shaveta Dargan, Munish Kumar, Maruthi Rohit Ayyagari & Gulshan
Kumar, ‘A Survey of Deep Learning and Its Applications: A New Paradigm
to Machine Learning’. Archives of Computational Methods in Engineering,
Vol. 27, pp. 1071–1092, 2020.
10. Ali A. Alani, ‘Arabic Handwritten Digit Recognition Based on Restricted
Boltzmann Machine and Convolutional Neural Networks’. MDPI Publisher
of Open Access Journals, 2017.
11. Voxid, Fayziyev, Xolbek, Xolyorov, & Kamoliddin, Xusanov, ‘Sorting
the Object Based on Neural Networks Computer Vision Algorithm of the
System and Software’. ijtimoiy fanlarda innovasiya onlayn ilmiy jurnali, Vol.
3, No 1, 67–69, 2023.
12. Roy, Arunabha M., Bhaduri, Jayabrata, Kumar, Teerath, & Raj, Kislay,
‘WilDect-YOLO: An Efficient and Robust Computer Vision-based Accurate
Object Localization Model for Automated Endangered Wildlife Detection’.
Ecological Informatics, Vol. 75, 101919, 2023.
13. Yang Fu, Yun Zhang, Haiyu Qiao, Dequn Li, Huamin Zhou & Jurgen
Leopold, ‘Analysis of Feature Extracting Ability for Cutting State Monitoring
Using Deep Belief Networks’. CIRP Conference on Modelling of Machine
Operations, pp. 29–34, 2015.
14. Weibo Liu, Zidong Wang, Xiaohui Liu, Nianyin Zeng, Yurong Liu &
Fuad E. Alsaadi, ‘A Survey of deep neural network architectures and their
applications’. Neurocomputing, Vol. 234, pp. 11–26, 2017.
A review approach on deep learning algorithms in computer vision 15
15. Jiaojiao Li, Bobo Xi, Yunsong Li, Qian Du & Keyan Wang, ‘Hyperspectral
Classification Based on Texture Feature Enhancement and Deep Belief
Networks’. Remote Sensing MDPI Publisher of Open Access Journals, Vol.
10, p. 396, 2018.
16. Mehrez Abdellaoui & Ali Douik, ‘Human Action Recognition in Video Sequences Using Deep Belief Networks’. International Information and Engineering Technology Association, Vol. 37, pp. 37–44, 2020.
17. V. Pream Sudha & R. Kowsalya, ‘A Survey on Deep Learning Techniques, Applications and Challenges’. International Journal of Advance Research in Science and Engineering, Vol. 4, pp. 311–317, 2015.
18. Parvaiz, Arshi, Khalid, Muhammad Anwaar, Zafar, Rukhsana, Ameer, Huma,
Ali, Muhammad, & Fraz, Muhammad Mouzam, ‘Vision Transformers in
Medical Computer Vision—A Contemplative Retrospection’. Engineering
Applications of Artificial Intelligence, Vol. 122, 106126, 2023.
19. Malik, Karim, Robertson, Colin, Roberts, Steven A, Remmel, Tarmo K, &
Long, Jed A., ‘Computer Vision Models for Comparing Spatial Patterns:
Understanding Spatial Scale’. International Journal of Geographical
Information Science, Vol. 37, No 1, 1–35, 2023.
20. Sharma, T., Diwakar, M., Singh, P., Lamba, S., Kumar, P., & Joshi,
‘Emotion Analysis for predicting the emotion labels using Machine Learning
approaches’. IEEE 8th Uttar Pradesh Section International Conference on
Electrical, Electronics and Computer Engineering (UPCON), pp. 1–6, 2021.
21. Joshi, K., Kirola, M., Chaudhary, S., Diwakar, M., & Joshi, ‘Multi-focus
image fusion using discrete wavelet transform method’. International
Conference on Advances in Engineering Science Management & Technology
(ICAESMT), 2019.
22. Ambore, B., Gupta, A. D., Rafi, S. M., Yadav, S., Joshi, K., & Sivakumar, ‘A
Conceptual Investigation on the Image Processing Using Artificial Intelligence
and Tensor Flow Models through Correlation Analysis’. International
Conference on Advance Computing and Innovative Technologies in
Engineering (ICACITE) IEEE, pp. 278–282, 2022.
23. Niall O’ Mahony, Sean Campbell, Anderson Carvalho, Suman Harapanahalli,
Gustavo Velasco Hernandez, Lenka Krpalkova, Daniel Riordan & Joseph
Walsh, ‘Deep Learning vs. Traditional Computer Vision’. Computer Vision
Conference, pp. 128–144, 2019.
24. R. Kumar, M. Memoria, A. Gupta, & M. Awasthi, ‘Critical Analysis
of Genetic Algorithm under Crossover and Mutation Rate’. 2021 3rd
International Conference on Advances in Computing, Communication
Control and Networking (ICAC3N), 2021, pp. 976–980.
Chapter 2
Object extraction using edge based approach
2.1 INTRODUCTION
Object extraction deals with finding out distinct objects in the image that
can further govern the control of some mechanism. The object extraction
can be part of some counter-based system wherein, on the basis of count,
the system follows the progress [1]. Images are a great source of information and can record observations, but it is very difficult to process them manually for information extraction. The extraction of objects from images is one of the most challenging tasks faced in making systems fully automatic [2, 3].
The main principle of object extraction is to increase the similarity within each class while decreasing the similarity between different classes. This ensures that the objects in the image are separated and can be extracted without overlapping. The output of this system can serve as input to an object identification system [4, 5].
Object extraction is used in the object recognition methodology shown in Figure 2.1. The captured image is a 2-D record of a 3-D scene. In most cases the nontrivial regions of the numerous objects in the scene are enough to represent the related content and therefore play a significant role in different image frameworks, for example content-based image retrieval systems [6]. Therefore, continuous research is carried out in the direction of designing automatic systems for the extraction of objects from images. The work is intended to make the systems more efficient in terms of extracting overlapped objects and converting them into meaningful information. Further, work is done to perform edge linking so that the object boundaries can be connected to form a closed structure [7, 8]. This helps determine the number and types of objects present in the image. The basic edge-based approaches include application of
sent in the image. The basic edge-based approaches include application of
a mask, which will be done in both x and y directions and then performing
element by element multiplication of pixels with the mask coefficients. The
image pixels which are mapped to the center of mask will be modified in the
process and assigned the updated value produced by the multiplication operation.
1. Vehicular tracking
2. Optical character recognition
3. Tracking people in video frames
4. Ball tracking in different sports
5. Object extraction from satellite images
6. License number plate detection
7. Logo detection from the images
8. Disease detection
9. Medical imaging to detect the tumors in the broad sense
10. Robotics
11. Counter-based applications
12. In agricultural fields to detect any anomalies
The applications listed above are fields wherein traditional methods face different limitations. Each of these applications aims at reducing human effort and providing an automated mechanism by detecting the different objects present in the image. In our proposed model we perform extraction with the help of the edge-filtering approach. One important application is the introduction of cameras at toll booths to identify automobile number plates through the extraction process and then control the gate-opening mechanism. In traditional methods, human power was utilized to read and record the numbers. This process was often erroneous, as huge numbers of cars pass through the toll booths in a day and the persons at the windows cannot handle so many incoming cars, so the need for automation can be visualized in these kinds of scenarios. Thus, a systematic model needs to be developed for managing these kinds of problems [9–12].
An edge occurs where the gradient is greatest. The Sobel method [13] applies a 2-D spatial gradient quantity to a picture, emphasizing high spatial frequency regions that correspond to edges. It is typically used to determine the approximate absolute gradient magnitude at each point in a grayscale input image. The operator comprises a pair of 3×3 convolution kernels, where one kernel is simply the other rotated by 90 degrees, as shown in Figure 2.3. The Roberts Cross operator is very similar to this.
The gradient magnitude is given by:

|G| = √(Gi² + Gj²)

The angle of orientation of the edge (relative to the pixel grid) giving rise to the spatial gradient is given by:

θ = arctan(Gi/Gj)
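A minimal NumPy/SciPy sketch of this computation follows; the kernels are the standard Sobel pair, and the random image is a stand-in for a grayscale input.

```python
import numpy as np
from scipy import ndimage

# Sobel kernels: one kernel is the other rotated by 90 degrees.
Kj = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]], dtype=float)
Ki = Kj.T

image = np.random.rand(64, 64)      # stand-in for a grayscale input image
Gj = ndimage.convolve(image, Kj)    # horizontal gradient component
Gi = ndimage.convolve(image, Ki)    # vertical gradient component

magnitude = np.sqrt(Gi**2 + Gj**2)  # |G| = sqrt(Gi^2 + Gj^2)
theta = np.arctan2(Gi, Gj)          # edge orientation relative to the grid
```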
2.3.3 Prewitt’s Operator
The Prewitt Operator is similar to the Sobel operator and is used for
detecting vertical and horizontal edges in images as shown in Figure 2.4.
Compound images include scanned copies of documents and photographs of documents that may possess different sorts of information. The median filter is applied to different scanned images, and a quantitative analysis of the filtration is performed by obtaining the values of the parameters. A similar set of approaches is proposed in [17–19], where a direct or a variant median filter is deployed to minimize the effect of noise. The variants of the median filter add certain functionalities when computing the median value: they can apply a different set of weights to the mask values, so that while computing the median some of the pixels are strengthened as per the mask weight value.
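The following sketch contrasts a plain median filter with a simple weighted variant along the lines described above; the 3×3 weight mask, which strengthens the center pixel, is an illustrative assumption.

```python
import numpy as np
from scipy import ndimage

image = np.random.rand(64, 64)  # stand-in for a scanned image

# Plain median filter over a 3x3 neighbourhood.
plain = ndimage.median_filter(image, size=3)

# Weighted median: each pixel in the mask is repeated according to its
# weight before the median is taken, strengthening the center pixel.
weights = np.array([[1, 1, 1],
                    [1, 3, 1],
                    [1, 1, 1]])

def weighted_median(window, w=weights.ravel()):
    return np.median(np.repeat(window, w))

weighted = ndimage.generic_filter(image, weighted_median, size=3)
```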
An approach for object extraction from images is developed in [20], where a method and apparatus are proposed for the extraction. An object has similar intensity values, so the process begins with selecting a seed pixel and then combining the neighboring pixels depending upon a threshold criterion. This joining of pixels to the center pixel yields a group of pixels, thereby defining an object. Wherever there is a sharp variation in intensity, it is treated as an edge pixel. Ideally, the edge pixels are utilized to extract the threshold values and to determine the number of thresholds to be used, so that the extraction process can be performed with a high level of accuracy.
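A minimal sketch of this seed-based grouping, assuming a single fixed intensity threshold and 4-connectivity, might look as follows.

```python
import numpy as np
from collections import deque

def region_grow(image, seed, threshold):
    """Group the pixels around a seed whose intensity stays within
    `threshold` of the seed value; a sharp intensity variation marks
    an edge pixel and stops the growth."""
    h, w = image.shape
    seed_val = float(image[seed])
    region = np.zeros((h, w), dtype=bool)
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        if region[r, c]:
            continue
        region[r, c] = True
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # 4-neighbours
            rr, cc = r + dr, c + dc
            if (0 <= rr < h and 0 <= cc < w and not region[rr, cc]
                    and abs(float(image[rr, cc]) - seed_val) <= threshold):
                queue.append((rr, cc))
    return region

img = np.random.randint(0, 256, (32, 32))
mask = region_grow(img, seed=(16, 16), threshold=30)
```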
The edge detection mechanisms and different approaches were discussed and proposed in the literature by different researchers [21–25] with an aim to reduce the number of false edges. The edge detection process is determined by, and dependent on, many factors, such as the presence of noise and the similarity between different objects present in the image. An edge is located through a sharp variation in the intensity, where the edge pixels have pixel values different from their neighbors. As far as object extraction is concerned, it is governed by combining the different edge lines obtained from filtering the image. The edges form connected structures, and connecting these edges yields the object boundary. Object extraction makes image processing capable of finding applications in real-time situations where we need to process images to extract meaningful information.
The detected edges need to be smoothed in order to extract the shape of the required object in the image. Smoothing is directly related to the removal of noise pixels. This can be accomplished through a suitable spatial filter, such as an average filter or a weighted median filter. These filters work on the image pixel by pixel and manipulate the intensity value of the pixel mapped to the center of the mask. Other kinds of smoothing masks include the Gaussian, min-max, and adaptive median filters [26–32]. Some fuzzy-logic-based models built on fuzzification rules have also been proposed by researchers [33]. The common goal of achieving edge detection with high accuracy is considered while designing them.
This chapter proposes a new hybrid technique based on the Aquila optimizer (AO) [38–41] and the arithmetic optimization algorithm (AOA). Both AO and AOA are recent meta-heuristic optimization techniques, and they can be used to solve a variety of problems in areas such as image processing and machine learning.
Figure 2.7 Sample 1 (L to R) – The input image; Grayscale converted image; output image.
Figure 2.8 Sample 2 (L to R) The input image; Grayscale converted image; output image.
Figure 2.9 Sample 3 (L to R) The input image; Grayscale converted image; output image.
The algorithm is not able to detect edges that are merged or are similar to the background. The analysis of the results can be done through visual inspection of the output images. From the outputs achieved, a few observations can be drawn, and a comparison is shown in Table 2.2.
Figure 2.10 Sample 4 (L to R) The input image; Grayscale converted image; output image.
2.7 CONCLUSION
In this chapter we have presented an approach for image-object extraction using an edge detection approach. The broad process and purpose of image-object extraction and recognition have been described. The objects formed through connected lines and points can be segregated from the background by detecting the edge values in the image and then joining those edges to extract the different shapes present in the image. Future applications of object detection require highly efficient systems for object extraction. In our work we have designed the system to cancel the effects of noise added to the image during the acquisition stage. The spatial filter for noise removal is selected so as to remove the noise as well as preserve the edge strength; thus, the blurring effect is reduced. Though a few edges fail to be detected by the system, the overall edges are preserved in good number. In the future we look forward to extending these outputs to serve object recognition models, which are the backbone of many real-time applications.
REFERENCES
1. Buttler, David, Ling Liu, and Calton Pu. “A fully automated object extrac-
tion system for the World Wide Web.” In Proceedings 21st International
Conference on Distributed Computing Systems, pp. 361–370. IEEE (2001).
18. Kumar, N. Rajesh, and J. Uday Kumar. “A spatial mean and median filter
for noise removal in digital images.” International Journal of Advanced
Research in Electrical, Electronics and Instrumentation Engineering 4, no. 1
(2015): 246–253.
19. Wang, Gaihua, Dehua Li, Weimin Pan, and Zhaoxiang Zang. “Modified
switching median filter for impulse noise removal.” Signal Processing 90, no.
12 (2010): 3213–3218.
20. Kamgar- Parsi, Behrooz. “Object extraction in images.” U.S. Patent
5,923,776, issued July 13, 1999.
21. Shrivakshan, G. T., and Chandramouli Chandrasekar. “A comparison of
various edge detection techniques used in image processing.” International
Journal of Computer Science Issues (IJCSI) 9, no. 5 (2012): 269.
22. Sharifi, Mohsen, Mahmood Fathy, and Maryam Tayefeh Mahmoudi.
“A classified and comparative study of edge detection algorithms.” In
Proceedings. International conference on information technology: Coding
and computing, pp. 117–120. IEEE, 2002.
23. Nadernejad, Ehsan, Sara Sharifzadeh, and Hamid Hassanpour. “Edge detec-
tion techniques: evaluations and comparisons.” Applied Mathematical
Sciences 2, no. 31 (2008): 1507–1520.
24. Middleton, Lee, and Jayanthi Sivaswamy. “Edge detection in a hexagonal-
image processing framework.” Image and Vision Computing 19, no. 14
(2001): 1071–1081.
25. Ziou, Djemel, and Salvatore Tabbone. “Edge detection techniques –an
overview.” Pattern Recognition and Image Analysis C/C of Raspoznavaniye
Obrazov I Analiz Izobrazhenii 8 (1998): 537–559.
26. Lee, Jong-Sen. “Digital image smoothing and the sigma filter.” Computer
vision, graphics, and image processing 24, no. 2 (1983): 255–269.
27. Ramponi, Giovanni. “The rational filter for image smoothing.” IEEE Signal
Processing Letters 3, no. 3 (1996): 63–65.
28. Meer, Peter, Rae- Hong Park, and K. J. Cho. “Multiresolution adaptive
image smoothing.” CVGIP: Graphical Models and Image Processing 56,
no. 2 (1994): 140–148.
29. Kačur, Jozef, and Karol Mikula. “Solution of nonlinear diffusion appearing
in image smoothing and edge detection.” Applied Numerical Mathematics
17, no. 1 (1995): 47–59.
30. Hong, Tsai-Hong, K. A. Narayanan, Shmuel Peleg, Azriel Rosenfeld, and
Teresa Silberberg. “Image smoothing and segmentation by multiresolution
pixel linking: further experiments and extensions.” IEEE Transactions on
Systems, Man, and Cybernetics 12, no. 5 (1982): 611–622.
31. Tottrup, C. “Improving tropical forest mapping using multi-date Landsat
TM data and pre-classification image smoothing.” International Journal of
Remote Sensing 25, no. 4 (2004): 717–730.
32. Fang, Dai, Zheng Nanning, and Xue Jianru. “Image smoothing and
sharpening based on nonlinear diffusion equation.” Signal Processing 88,
no. 11 (2008): 2850–2855.
33. Taguchi, Akira, Hironori Takashima, and Yutaka Murata. “Fuzzy filters for
image smoothing.” In Nonlinear Image Processing V, vol. 2180, pp. 332–
339. International Society for Optics and Photonics, 1994.
34. Chen, Tao, Kai- Kuang Ma, and Li- Hui Chen. “Tri- state median filter
for image denoising.” IEEE Transactions on Image Processing 8, no. 12
(1999): 1834–1838.
35. Arce, McLoughlin. “Theoretical analysis of the max/ median filter.”
IEEE transactions on acoustics, speech, and signal processing 35, no. 1
(1987): 60–69.
36. Canny, John. “A computational approach to edge detection.” IEEE
Transactions on pattern analysis and machine intelligence 6 (1986): 679–698.
37. Marr, David, and Ellen Hildreth. “Theory of edge detection.” Proceedings
of the Royal Society of London. Series B. Biological Sciences 207, no. 1167
(1980): 187–217.
38. Mahajan, S., Abualigah, L., Pandit, A. K. et al. “Fusion of modern meta-
heuristic optimization methods using arithmetic optimization algorithm for
global optimization tasks.” Soft Comput 26 (2022): 6749–6763.
39. Mahajan, S., Abualigah, L., Pandit, A. K. et al. “Hybrid Aquila optimizer
with arithmetic optimization algorithm for global optimization tasks.” Soft
Comput 26 (2022): 4863–4881.
40. Mahajan, S. and Pandit, A.K. “Hybrid method to supervise feature selection
using signal processing and complex algebra techniques.” Multimed Tools
Appl (2021).
41. Mahajan, S., Abualigah, L. & Pandit, A.K. “Hybrid arithmetic optimization
algorithm with hunger games search for global optimization.” Multimed
Tools Appl 81 (2022): 28755–28778.
Chapter 3
Deep learning techniques for image captioning
In the encoder-decoder framework, the encoder converts the input images into fixed-length vector features, while the decoder converts the image features back into word-by-word descriptions.
Attention(S, T, U) = softmax(STᵀ / Bᵀ) U  (3.4)

Z = E(A)  (3.5)

h = D(Z)  (3.6)

V = Vo ∪ Va ∪ Vr  (3.7)
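A minimal NumPy sketch of Equation 3.4 follows, with S as the queries, T as the keys, and U as the values; rendering the normalizer as the square root of the key dimension follows the standard transformer formulation and is an assumption here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(S, T, U):
    """Scaled dot-product attention: softmax(S T^T / sqrt(d)) U."""
    d = T.shape[-1]
    scores = S @ T.T / np.sqrt(d)
    return softmax(scores) @ U

S = np.random.rand(4, 8)  # 4 query vectors of dimension 8
T = np.random.rand(6, 8)  # 6 key vectors
U = np.random.rand(6, 8)  # 6 value vectors
print(attention(S, T, U).shape)  # (4, 8)
```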
The visual semantic units are fed to the first LSTM at each time step. Figure 3.2 shows the VSU and decoder for the caption generator, and Figure 3.3 shows the O and R units of the VSU acting on the given input image.
The decoder can frame sentences such as “leaves hold rose,” “grass grows in sand,” and “rose planted in sand.” Hierarchy parsing architecture can be used for image captioning and functions as an image encoder to read the hierarchical structure in images. In order to further improve sentence generation, a tree-structured topology has been added to all instance-level, region-level, and image-level features [24].
3.5 CONCLUSION
This chapter provides knowledge of the DL techniques involved in image captioning, covering the deep models for object detection, their differences, the limitations of DL, and traditional image feature methods. The conventional translation captioning approaches use word-by-word decoding, which may change the meaning of the caption; an attention mechanism works well for this problem. Also, out of the A, O, and R units, the O model alone gives a greater improvement in performance than using the combined A and R units: using A and R increases the computational load because of the residual connections. An increase in the relationship unit will lift the Consensus-Based Image Description Evaluation (CIDEr) score. Finally, we presented the common challenges faced by captioning systems. Utilizing automated measures, while somewhat beneficial, is still inadequate because they ignore the image: when scoring varied and descriptive captions, their scores frequently remain insufficient and perhaps even misleading.
REFERENCES
[1] Bonaccorso, G. (2018) Machine Learning Algorithms. Popular Algorithms
for Data Science and Machine Learning, Packt Publishing Ltd, 2nd Edn.
[2] O’Mahony, N., Murphy, T., Panduru, K., et al. (2017) Real-time monitoring
of powder blend composition using near infrared spectroscopy. In: 2017
Eleventh International Conference on Sensing Technology (ICST). IEEE.
[3] Lan, Q., Wang, Z., Wen, M., et al. (2017) High Performance Implementation
of 3D Convolutional Neural Networks on a GPU. Comput Intell Neurosci
(2017).
[4] Diligenti, M., Roychowdhury, S., Gori, M. (2017) Integrating Prior
Knowledge into Deep Learning. In: 2017 16th IEEE International
Conference on Machine Learning and Applications (ICMLA). IEEE.
[5] Zeng, G., Zhou, J., Jia, X., et al. (2018) Hand-Crafted Feature Guided
Deep Learning for Facial Expression Recognition. In: 2018 13th IEEE
International Conference on Automatic Face and Gesture Recognition (FG
2018). IEEE, pp. 423–430.
[6] Li, F., Wang, C., Liu, X., et al. (2018) A Composite Model of Wound
Segmentation Based on Traditional Methods and Deep Neural Networks.
Comput Intell Neurosci 2018.
[7] AlDahoul, N., Md. Sabri, A. Q., Mansoor, A. M. (2018) Real-Time Human
Detection for Aerial Captured Video Sequences via Deep Models. Comput
Intell Neurosci 2018.
[8] Alhaija, H. A., Mustikovela, S. K., Mescheder, L., et al. (2017) Augmented
Reality Meets Computer Vision: Efficient Data Generation for Urban
Driving Scenes, International Journal of Computer Vision.
[9] Tsai F. C. D. (1994) Geometric hashing with line features. Pattern Recognit
27:377–389.
[10] Rosten, E., and Drummond, T. (2006) Machine Learning for High-Speed
Corner Detection. Springer: Berlin, Heidelberg, pp. 430–443.
[11] Horiguchi, S., Ikami, D., Aizawa, K. (2017) Significance of Soft-max-based
Features in Comparison to Distance Metric Learning-based Features, IEEE
Xplore.
[12] Karami, E., Shehata, M., and Smith, A. (2017) Image Identification
Using SIFT Algorithm: Performance Analysis against Different Image
Deformations.
[13] Dumoulin, V., Visin, F., Box, G. E. P. (2018) A Guide to Convolution
Arithmetic for Deep Learning.
[14] Wang, J., Ma, Y., Zhang, L., Gao, R. X. (2018) Deep learning for smart
manufacturing: Methods and applications. J Manuf Syst.
[15] Tsai F. C. D. (1994) Geometric hashing with line features. Pattern Recognit
27:377–389. https://doi.org/10.1016/0031-3203(94)90115-5
[16] Khan, A., Sohail, A., Zahoora, U. et al. (2020) A survey of the recent
architectures of deep convolutional neural networks. Artif Intell Rev 53,
5455–5516.
[17] Alom, Md. Zahangir, Taha, Tarek, Yakopcic, Christopher, Westberg,
Stefan, Hasan, Mahmudul, Esesn, Brian, Awwal, Abdul & Asari, Vijayan.
(2018). The History Began from AlexNet: A Comprehensive Survey on
Deep Learning Approaches.
[18] Wu, Xiaoxia, Ward, Rachel, and Bottou, Léon. (2018). WNGrad: Learn
the Learning Rate in Gradient Descent.
[19] He, S., Liao, W., Tavakoli, H. R., Yang, M., Rosenhahn, B., and Pugeault,
N. (2021). Image Captioning Through Image Transformer. In: Ishikawa,
H., Liu, CL., Pajdla, T., and Shi, J. (eds) Computer Vision –ACCV 2020.
Lecture Notes in Computer Science, Vol. 12625.
[20] Huang, L., Wang, W., Chen, J., and Wei, X. Y. (2019). Attention on
attention for image captioning. In: Proceedings of the IEEE International
Conference on Computer Vision. 4634–4643.
[21] Yao, T., Pan, Y., Li, Y., and Mei, T. (2018) Exploring visual relationship
for image captioning. In: Proceedings of the European Conference on
Computer Vision (ECCV), 684–699.
[22] Yang, X., Tang, K., Zhang, H., and Cai, J. (2019) Auto-encoding scene
graphs for image captioning. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. 10685–10694.
[23] Guo, L., Liu, J., Tang, J., Li, J., Luo, W., Lu, H. (2019) Aligning linguistic
words and visual semantic units for image captioning. In: Proceedings of
the 27th ACM International Conference on Multimedia. 765–773.
[24] Yao, T., Pan, Y., Li, Y., Mei, T. (2019) Hierarchy parsing for image
captioning. In: Proceedings of the IEEE International Conference on
Computer Vision. 2621–2629.
[25] www.ibm.com/blogs/research/2019/06/image-captioning/ (accessed on 15
July 2022).
Chapter 4
Deep learning-based object detection for computer vision tasks
4.1 INTRODUCTION
Computer vision is a field in which a 3D scene can be recreated or
interpreted using basic 2D images. The subject of computer vision has been
fast evolving due to the continual advancement of sophisticated technolo-
gies such as Machine Learning (ML), Deep Learning (DL), and transformer
neural networks. Figure 4.1 represents the overall learning process of ML
and DL.
In ML, handcrafted features are used with proper feature selection
techniques [16], whereas DL models can directly extract salient informa-
tion from images or videos [22]. Thus, advances in DL have made com-
puter vision technologies more precise and trustworthy. The Convolutional
Neural Networks (CNN) in DL have made it appropriate for many indus-
trial applications and a trustworthy technology to invest in for businesses
wishing to automate their work and duties.
DL enables computational models with several processing layers to
learn and represent data at different levels of abstraction, simulating how
the brain processes and comprehends multimodal information and impli-
citly capturing complex structures of big data. Further, the DL model
uses different optimization algorithms [15] to have an impact on accuracy
and training speed. A wide range of unsupervised and supervised feature
learning techniques are included in the DL family, which also includes
neural networks and hierarchical probabilistic models. DL techniques per-
form better than prior state-of-the-art techniques because of a huge volume
of input from different sources such as visual, audio, medical, social, and
sensor. With the help of DL, significant progress has been made in several
computer vision issues, including object detection, motion tracking, action
recognition, human posture estimation, and semantic segmentation [17].
CNNs act as the mainstream network in the field of computer vision, as shown in Figure 4.2. The development of deep networks that emerged for computer vision tasks is shown in Table 4.1.
4.2 OBJECT DETECTION
Object detection is an essential task in computer vision that involves iden-
tifying and localizing objects within an image or video. The primary goal
of object detection is to provide machines with the ability to perceive and
understand their surroundings by detecting and recognizing the objects
present in them. This capability serves as a foundation for various other
computer vision tasks, such as instance segmentation, object tracking, and
image captioning.
The traditional methods for object detection, such as the Viola-Jones face
detector, utilized techniques such as Adaboost with cascade classifiers, inte-
gral images, and the Haar wavelet. The Histogram of Oriented Gradients
(HOG) and Deformable Part Models (DPM) were also introduced as powerful
feature descriptors. However, the performance of these methods reached a
saturation point before the development of deep learning techniques. Recent
advancements in deep learning, particularly in CNN, have revolutionized
the field of object detection. DL-based object detection methods employ
supervised learning, where a model is trained on annotated images to detect
objects. These models can handle complex scenes with varying illumin-
ation, occlusions, and object orientations. Although collecting a significant
amount of annotated data for training deep learning models is challenging,
the availability of benchmark datasets like MS-COCO, PASCAL VOC, KITTI, OpenImages, and ILSVRC with annotated images for object detection has helped to address this challenge.
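Predicted boxes are matched against such annotations using the standard Intersection-over-Union (IoU) overlap measure, sketched below for axis-aligned boxes; this helper is common practice rather than something specific to any one detector in this chapter.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ~= 0.143
```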
4.3.1 R-CNN
Region-based Convolutional Neural Network (R-CNN) extracts the object
proposals (region boxes) by merging similar pixels into regions. R-CNN
provides nearly two thousand object proposals and identifies the regions
having the probability of being an object using a selective search algorithm
[25]. Each selected region is reshaped to a fixed size (warped) and inputted
to the backbone CNN architecture to extract the features. Thus, each region
proposal is rescaled and processed by the CNN, due to the fixed size input
representation of the Fully Connected (FC) layer. Further, the classifier and
regressor process the feature vector to obtain the class label and bounding
box respectively. Figure 4.3 depicts the structure of R-CNN model.
However, R-CNN faces certain issues, such as a slow processing rate in extracting candidate proposals using selective search and redundant CNN feature computation due to overlapped region proposals. Moreover, training time is increased by the fixed multi-stage process of extracting candidate proposals, and the model shows a high prediction time of 47 seconds per image.
4.3.2 SPPNet
Spatial Pyramid Pooling Network (SPPNet) [5] is a modification of R-CNN
that can handle images of arbitrary size and aspect ratio. SPPNet processes
the entire image with the CNN layer and adds a pooling layer before the FC
layer. The region proposal is extracted using selective search, and candidate
regions are mapped onto the feature maps of the last convolutional layer.
Next, the candidate feature maps are inputted to the spatial pooling layer
and then the FC layer. Finally, classification and regression are performed.
SPPNet addresses the warping-based overlapped CNN computation issue
by fine-tuning the FC layer. However, the previous layers based on region
proposal selection are still not addressed, leading to an increase in training
and prediction time. SPPNet architecture is shown in Figure 4.4.
4.3.3 Fast RCNN
Fast RCNN addresses the issue of training multiple region proposals separately, as in R-CNN and SPPNet, by utilizing a single trainable system [3]. In Fast RCNN the entire image is inputted to the convolutional layer to obtain the features. The candidate region proposals are obtained using a selective search algorithm; such regions are called Regions of Interest (ROI). These region proposals are mapped onto the final feature maps of the CNN layer. Further, ROI pooling concatenates the feature maps of the corresponding region proposals. Thus, a feature map is obtained for every region proposal and then fed to the FC layer. The final layers of classification and regression then perform object detection.
4.3.4 Faster RCNN
Faster RCNN introduced the Region Proposal Network (RPN) to generate
candidate region proposals instead of selective search. RPN makes use of
an anchor, a fixed bounding box with different aspect ratios to localize the
object [20]. The RPN module consists of a fully convolutional network with
a classifier and a bounding box regressor to provide an objectness score. The
image is inputted to the CNN part to obtain the feature maps, which are
provided as input to the RPN module. Anchor boxes are selected to predict the objectness score, and those with low scores are removed. RPN utilizes
multi-task loss optimization for classification and regression. The convolu-
tional feature maps and predicted region proposal are concatenated using
ROI pooling. Faster RCNN addresses the issue of slow selective search with
a convolutional RPN model, which makes the network learn region proposal
along with object detection. The prediction time of Faster RCNN is
improved to five frames per second. Figure 4.5 shows the network structure
of the Faster RCNN model.
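A minimal sketch of anchor generation at one feature-map location follows; the scales, aspect ratios, and the ratio convention are illustrative assumptions rather than the exact RPN settings.

```python
import numpy as np

def make_anchors(cx, cy, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Fixed bounding boxes (anchors) of several sizes and aspect ratios,
    centred at one feature-map location projected to (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:             # r is taken here as width/height
            w = s * np.sqrt(r)
            h = s / np.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)

print(make_anchors(100, 100).shape)  # (9, 4): 9 anchors per location
```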
4.3.5 R-FCN
The Region-based Convolutional Neural Network (R-CNN) model utilized the fully connected (FC) layer before the object detection layer, which made localization difficult due to the translation-invariant property of CNN. To overcome this limitation, Jifeng et al. [2] modified the FC layer
of CNN. To overcome this limitation, Jifeng et al. [2] modified the FC layer
with a fully convolutional layer. However, the performance of the model
did not improve significantly. Thus, the Region-based Fully Convolutional
Network (R-FCN) was introduced, which includes the position-sensitive
score to capture the spatial information of the object, and localization is
performed by pooling. The R-FCN model as shown in Figure 4.6, uses
ResNet-101 CNN to extract feature maps, and the position-sensitive score
map is combined with RPN output for classification and regression. While it
has a faster detection speed than other models, its improvement in accuracy
is not substantial compared to Faster RCNN.
4.3.6 FPN
Feature Pyramid Network (FPN) addresses the issue of capturing small objects in the image, which is faced by the Faster RCNN model [12]. FPN combines high-resolution, semantically weak features with low-resolution, semantically strong features through a top-down pathway and lateral connections, so that objects can be detected at multiple scales.
4.3.7 Mask RCNN
Mask RCNN is an extension of Faster RCNN, and the structure is depicted
in Figure 4.8. Mask RCNN includes a branch for the prediction of pixel-wise
object segmentation in parallel with existing object detection [4]. The fully
convolutional layer is applied to the final region proposal output to obtain
the object mask. The ROI pooling layer is modified with ROI alignment to avoid quantization error and preserve the exact spatial locations needed for pixel-wise masks.
4.3.8 G-RCNN
Granulated RCNN (G-RCNN) is an improved version of Faster RCNN
designed for video-based object detection [18]. G-RCNN utilizes a network similar to the AlexNet model, which includes 5 convolutional layers, 3
pooling layers, and 3 fully connected layers. Additionally, it incorporates
a granulation layer, ROI generation, and anchor process. To extract region
proposals in an unsupervised manner, granules (clusters) are formed after
the first pooling layer. G-RCNN effectively combines spatial and tem-
poral granules, obtained from static images and video sequences, to cap-
ture spatio-temporal information. The granules are processed through the
AlexNet layer, anchored for region proposals, and fed to the classifier and
regressor for detecting class labels and bounding boxes. Figure 4.9 depicts
the detailed view of the G-RCNN model.
4.4.1 YOLO
You Only Look Once (YOLO) is a single CNN model that predicts object
classes and their bounding boxes simultaneously on the full image [19]. It
divides the image into K*K grid cells and assigns each cell the responsibility of detecting the objects it contains. YOLO uses anchor boxes to provide multiple bounding boxes for each grid cell based on the aspect ratios of the objects.
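A minimal sketch of the grid-cell assignment follows; the 7×7 grid and 448×448 input mirror the values used in the original YOLO paper and are illustrative here.

```python
def responsible_cell(box_center, image_size, K=7):
    """Return the (row, col) of the K*K grid cell responsible for an
    object whose bounding-box centre falls inside it."""
    x, y = box_center
    w, h = image_size
    col = min(int(x / w * K), K - 1)
    row = min(int(y / h * K), K - 1)
    return row, col

print(responsible_cell((300, 150), (448, 448)))  # (2, 4)
```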
4.4.2 CenterNet
A new perspective of object detection is performed by modeling objects
as points instead of bounding boxes [27]. CenterNet uses the stacked
hourglass-101 model as a backbone for feature extraction, which is pre-
trained on the ImageNet dataset. The network provides three outputs as
shown in Figure 4.10(b), namely: (1) keypoint heatmap to detect the center
of the object; (2) offset to correct the location of an object; and (3) dimen-
sion to determine the object aspect ratio. The model training is fine-tuned
using the multitask loss of the three outputs. The computationally expensive Non-Maximum Suppression (NMS) technique is not required, because the model detects object points instead of boxes. The prediction of a bounding box is generated using the offset output. The network achieves high accuracy with less prediction time compared with previous models. However, it lacks the generalization ability to use different backbone architectures.
4.4.3 SSD
Single Shot Multi-box Detector (SSD) is an object detection model proposed by Liu et al. [14], which outperforms Faster RCNN and YOLO in terms of average precision and object localization. The model uses VGG16 as a backbone and adds multi-scale feature layers to detect objects at different scales, including small objects. The multi-scale layers provide the offsets of default boxes with specific heights and widths. The model is optimized using a weighted combination of localization and confidence losses, and applies NMS to remove duplicate predictions.
Although SSD enables real-time object detection, it has difficulty detecting
small objects, which can be improved by using VGG19 and Resnet models
as backbones. Figure 4.11(a) illustrates the SSD architecture. The authors
used SSD for multiple real-time object identification [9].
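The NMS step mentioned above can be sketched as follows; the greedy, score-ordered formulation is the standard one, and the 0.5 IoU threshold is an illustrative default.

```python
import numpy as np

def _iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it by more
    than iou_thresh, and repeat on the remainder."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(int(best))
        order = [i for i in order if _iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: box 1 is suppressed as a duplicate of 0
```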
4.4.4 RetinaNet
The performance reduction of one-stage detectors compared with two-stage detectors is predominantly due to the high class imbalance between foreground and background objects. Lin et al. [13] proposed the RetinaNet model using a new loss function named focal loss, which assigns a lower loss to easily classified samples so that the detector focuses on the hard misclassified samples. RetinaNet uses ResNet and the FPN model as the
backbone to extract the features and two sub-networks of fully convolu-
tional layers for classification and regression. Each pyramidal scale layer
of FPN is processed by the subnetworks to detect the object class and
bounding box in different scales. The diagrammatic representation of
RetinaNet is shown in Figure 4.11(b). Thus, RetinaNet, which is simple,
fast, and easy to implement and train, has outperformed previous models
and paved the way for enhancing model optimization through a new loss
function.
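Focal loss can be sketched directly from its definition, FL(p_t) = −α_t (1 − p_t)^γ log(p_t); the defaults α = 0.25 and γ = 2 follow Lin et al. [13], while the tensor shapes are illustrative.

```python
import torch

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """`p` holds predicted foreground probabilities, `target` holds 0/1
    labels; easy (well-classified) samples are strongly down-weighted."""
    p_t = torch.where(target == 1, p, 1 - p)
    alpha_t = torch.where(target == 1,
                          torch.tensor(alpha), torch.tensor(1 - alpha))
    return (-alpha_t * (1 - p_t) ** gamma
            * torch.log(p_t.clamp(min=1e-8))).mean()

p = torch.tensor([0.9, 0.1, 0.6])  # easy positive, easy negative, hard positive
t = torch.tensor([1, 0, 1])
print(focal_loss(p, t))            # easy examples contribute little loss
```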
4.4.5 EfficientDet
EfficientDet is a model that improves detection accuracy and speed by scaling
in different dimensions [24]. It uses multi-scale features, Bi-directional FPN
layers, and model scaling. The backbone network is EfficientNet, and mul-
tiple BiFPN layers are used to extract features. As in Figure 4.12, the final
output is processed by a classifier and regressor network. EfficientDet uses a compound scaling method that jointly scales the resolution, the backbone, and the BiFPN and prediction networks.
4.4.6 YOLOR
YOLOR, a novel object detection model proposed by Wang et al. [26], combines explicit and implicit knowledge to create a unified representation. It uses an architecture called the scaled YOLOv4-CSP model for object detection and performs multi-task detection using implicit deep learning that generalizes to different tasks. YOLOR applies modifications such as kernel alignment, manifold space reduction, feature alignment, prediction refinement, and multitask learning in a single model. These modifications ensure that an appropriate kernel space is selected for different tasks, and that kernels are translated, rotated, and scaled to match the appropriate output kernel space. The model achieves significant performance and speed compared with current state-of-the-art models, making it a promising new approach to object detection.
Table 4.2 compares the object detectors on the MS COCO dataset in terms of their mAP and fps. Similarly, Table 4.3 shows the object detector performance on the Pascal VOC 2012 dataset. The authors of G-RCNN have not discussed AP0.5, instead providing the mAP value of 80.9 percent on the Pascal VOC 2012 dataset with AlexNet as the backbone architecture.
The performance of object detectors is mainly based on the input size,
training method, optimization, loss function, feature extractor, and so on.
Therefore, a common benchmark dataset is required to analyze model improvements in terms of accuracy and inference time. Thus, the study utilized standard benchmark datasets like PASCAL VOC and MS COCO. From the analysis, it is inferred that for real-time object detection
YOLOv4 and YOLOR perform better concerning average precision and
inference time.
4.5.1 Future trends
Despite the development of various object detectors, the field of object
detection has plenty of room for improvement.
4.6 CONCLUSION
This chapter offers a comprehensive review of deep learning-based object
detection methods. It categorizes the object detection methods into
single-stage and two-stage deep learning algorithms. Recent algorithmic
advancements and their architecture are covered in depth. The chapter pri-
marily discusses developments in CNN-based methods because they are the
most widely used and ideal for image and video processing. Most notably,
some recent articles have shown that some CNN-based algorithms have
already become more accurate than human raters.
However, despite the encouraging outcomes, more development is still required – for instance, to meet the current market demand for high-precision systems built from lightweight models for edge devices. This work
highlights the ongoing research in improving deep neural network-based
object detection, which presents various challenges and opportunities for
improvement across different dimensions, such as accuracy, speed, robust-
ness, interpretability, and resource efficiency.
REFERENCES
[1] Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: Yolov4: Optimal speed and
accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020).
[2] Dai, J., Li, Y., He, K., Sun, J.: R-fcn: Object detection via region-based
fully convolutional networks. Advances in neural information processing
systems. P. 29 (2016).
[3] Girshick, R.: Fast r-cnn. In: Proceedings of the IEEE international confer-
ence on computer vision. pp. 1440–1448 (2015).
[4] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings
of the IEEE international conference on computer vision. pp. 2961–2969
(2017).
[5] He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep con-
volutional networks for visual recognition. IEEE transactions on pattern
analysis and machine intelligence 37(9), 1904–1916 (2015).
[6] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image rec-
ognition. In: 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR). pp. 770–778 (2016).
[7] Howard, A., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand,
T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural
networks for mobile vision applications (2017).
[8] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely
connected convolutional networks. In: 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR). pp. 2261–2269 (2017).
[9] Kanimozhi, S., Gayathri, G., Mala, T.: Multiple real-time object identification using single shot multi-box detection. In: 2019 International Conference on Computational Intelligence in Data Science (ICCIDS). pp. 1–5. IEEE (2019).
[10] Kanimozhi, S., Mala, T., Kaviya, A., Pavithra, M., Vishali, P.: Key object
classification for action recognition in tennis using cognitive mask rcnn. In:
Proceedings of International Conference on Data Science and Applications.
pp. 121–128. Springer (2022).
[11] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with
deep convolutional neural networks. Advances in neural information pro-
cessing systems 25 (2012).
[12] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.:
Feature pyramid networks for object detection. In: Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 2117–2125
(2017).
[13] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense
object detection. In: Proceedings of the IEEE international conference on
computer vision. pp. 2980–2988 (2017).
[14] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg,
A.C.: Ssd: Single shot multibox detector. In: European conference on com-
puter vision. pp. 21–37. Springer (2016).
[15] Mahajan, S., Abualigah, L., Pandit, A.K., Nasar, A., Rustom, M.,
Alkhazaleh, H.A., Altalhi, M.: Fusion of modern meta-heuristic optimiza-
tion methods using arithmetic optimization algorithm for global optimiza-
tion tasks. Soft Computing, pp. 1–15 (2022).
[16] Mahajan, S., Pandit, A.K.: Hybrid method to supervise feature selection
using signal processing and complex algebra techniques. Multimedia Tools
and Applications, pp. 1–22 (2021).
[17] Mahajan, S., Pandit, A.K.: Image segmentation and optimization
techniques: a short overview. Medicon Eng Themes 2(2), 47–49 (2022).
[18] Pramanik, A., Pal, S.K., Maiti, J., Mitra, P.: Granulated rcnn and multi-
class deep sort for multi-object detection and tracking. IEEE Transactions
on Emerging Topics in Computational Intelligence 6(1), 171–181 (2022).
[19] Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once:
Unified, real-time object detection. In: Proceedings of the IEEE conference
on computer vision and pattern recognition. pp. 779–788 (2016).
[20] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object
detection with region proposal networks. Advances in neural information
processing systems 28 (2015).
[21] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-
scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[22] Sulthana, T., Soundararajan, K., Mala, T., Narmatha, K., Meena, G.:
Captioning of image conceptually using bi-lstm technique. In: International
Conference on Computational Intelligence in Data Science. pp. 71–77.
Springer (2021).
[23] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions.
In: 2015 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR). pp. 1–9 (2015).
[24] Tan, M., Pang, R., Le, Q.V.: Efficientdet: Scalable and efficient object detec-
tion. In: Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition. pp. 10781–10790 (2020).
[25] Uijlings, J.R., Van De Sande, K.E., Gevers, T., Smeulders, A.W.: Selective
search for object recognition. International Journal of Computer Vision
104(2), 154–171 (2013).
[26] Wang, C.Y., Yeh, I.H., Liao, H.Y.M.: You only learn one representation:
Unified network for multiple tasks. arXiv preprint arXiv:2105.04206
(2021).
[27] Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv preprint
arXiv:1904.07850 (2019).
Chapter 5
Deep learning algorithms for computer vision
5.1 INTRODUCTION
The exponential rise in the availability of information and big data over the past few years drives the motivation to filter and extract highly specific information from raw sensor data, for example speech, images, and videos. Computers do not perceive images as the human eye does; they natively understand only numeric notation. To perceive an image in machine-readable form, the first and foremost step for any computer is to convert the information contained in the image into a numeric representation [1,2,3]. Since images are constructed from a grid of pixels covering every tiny part of the scene, each pixel can be considered a "spot" of a single color, and a greater number of pixels in an image corresponds to a higher resolution. It is known that the human brain associates important "features" (size, shape, color, etc.) with each object, which lets it focus solely on those features to recognize objects correctly [4,5,6,7]. The same strategy delivers highly accurate results when used to extract particular "features" from images and to map each feature to an individual category of objects. A convolution matrix identifies the patterns or "features" to be extracted from the raw visual data, which in turn supports image identification. A neural network, on the other hand, is a succession of algorithms that aims to uncover the underlying relationships in a set of data in a way that loosely imitates how a human brain would infer such relationships. In this sense, a convolutional neural network is an arrangement of artificial "neurons" that uses convolution matrices to break down the visual input, recognize the key "features" needed for image categorization, and draw conclusions from them.
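As a small illustration of the idea (a sketch, not the chapter's code, assuming NumPy and SciPy are available), a hand-picked "convolution matrix" can be slid over a pixel grid to produce a feature map; the Sobel-like kernel below responds to vertical edges:

import numpy as np
from scipy.signal import convolve2d

# a 3 x 3 "convolution matrix" (Sobel-like kernel) that responds to vertical edges
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])

image = np.random.rand(8, 8)                           # stand-in for a grayscale pixel grid
feature_map = convolve2d(image, kernel, mode="valid")  # the extracted "features"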
Some of the libraries utilized in computer vision are TensorFlow, Keras, MATLAB®, and so forth. These libraries, which rely heavily on GPU-accelerated primitives, deliver fast multi-GPU training. Apart from being computationally efficient and reducing input images into a form that is easier to process without losing any important feature, CNNs have the advantage of detecting and extracting important features from any visual input without human intervention. This independence from human interaction gives them an added advantage over their predecessors.
Deep learning in computer vision has risen rapidly in the evolving world. From object detection to judging whether an X-ray indicates the presence of cancer, deep learning methodologies, when appropriately implemented in the domain of computer vision, can prove greatly helpful to humankind. The chapter delves into the preliminary concepts of deep learning along with a detailed overview of the deep learning algorithms applied in computer vision. The following sections discuss miscellaneous tools, libraries, and frameworks of deep learning in computer vision. Computer vision has proved to be versatile, with applications in various industrial sectors such as transportation, manufacturing, healthcare, retail, agriculture, and construction. The penultimate section surveys these industrial applications. The chapter concludes with a few prospects for the domain and directions in which it could be further developed.
in these hidden layers and redirects the results to the output nodes. Each input node is connected to every node of the hidden layers present in the model, and every hidden-layer node is connected to each output node, forming a fully interconnected structure.
C(w, b) ≡ (1/2n) Σ_x ‖y(x) − a‖²    (5.1)
L(θ) = − Σ_{i=1}^{k} y_i log(ŷ_i)    (5.2)
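For concreteness, both losses can be written directly in Python (a sketch with NumPy; the variable names are illustrative):

import numpy as np

def quadratic_cost(targets, activations):
    # Eq. (5.1): C(w, b) = (1/2n) * sum over inputs x of ||y(x) - a||^2
    n = len(targets)
    return sum(np.sum((y - a) ** 2) for y, a in zip(targets, activations)) / (2 * n)

def cross_entropy(y_true, y_pred, eps=1e-12):
    # Eq. (5.2): L(theta) = -sum_{i=1..k} y_i * log(yhat_i); clip to avoid log(0)
    return -np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)))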
data) [10,11,12]. CNNs are used to reduce large input images into a form that computerized machines can understand and process more easily, without losing the distinctive features of the image.
A Convolutional Neural Network is built from multiple building blocks, such as convolution layers (kernels/filters), pooling layers, and fully connected layers. The amalgamation of these layers makes up the CNN, which automatically and progressively learns spatial hierarchies of the features present in the input image through a back-propagation algorithm [13,14].
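A minimal sketch of how these building blocks combine, using the Keras API mentioned above (the layer sizes, input shape, and class count are illustrative assumptions, not the chapter's exact configuration):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(64, 64, 1)),               # grayscale input; shape is illustrative
    layers.Conv2D(32, (3, 3), activation="relu"),  # convolution (kernel/filter) layer
    layers.MaxPooling2D((2, 2)),                   # pooling layer
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),          # fully connected layer
    layers.Dense(10, activation="softmax"),        # classification layer
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])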
h_t = f(h_{t−1}, x_t)    (5.3)
Despite these advantages, RNNs possess a few disadvantages, such as the vanishing gradient and exploding gradient problems. To overcome them, the LSTM (Long Short-Term Memory) network was introduced.
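Equation (5.3) leaves the function f unspecified; one common choice, shown here as a sketch (the weight shapes are illustrative), is a tanh-affine map:

import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    # Eq. (5.3): h_t = f(h_{t-1}, x_t), realized as tanh(W_h h_{t-1} + W_x x_t + b)
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)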
tasks. Computer vision and image processing are not the same thing. Image processing involves enhancing or modifying images, for example optimizing brightness or contrast, removing noise, or blurring sensitive information, to derive a new result altogether. Image processing also does not require identification of the visual content. Computer vision, on the other hand, is solely about identifying and classifying visual input so as to interpret the visual data using the evaluative information learned during the training phase.
Deep learning has been applied in a number of fields of computer vision and has emerged as one of the most rapidly developing and promising forces in the area of computer applications. Some of these fields are as follows:
5.8 CONCLUSION
Owing to the continuous expansion of vision technology, it can be said that in the near future computer vision will prove to be a primary technology offering solutions to many real-world problems. The technology is capable of optimizing businesses, strengthening security, and automating services, thus seamlessly bridging the gap between technology and the real world. Integrating deep learning methodologies into computer vision has taken vision technology to a new level that will enable the accomplishment of various difficult tasks.
REFERENCES
1. Wilson, J.N., and Ritter, G.X. Handbook of Computer Vision Algorithms in
Image Algebra. CRC Press; 2000 Sep 21.
2. Hornberg, A., editor. Handbook of Machine Vision. John Wiley; 2006
Aug 23.
3. Umbaugh, S.E. Digital Image Processing and Analysis: Human and
Computer Vision Applications with CVIPtools. CRC Press; 2010 Nov 19.
4. Tyler, C.W., editor. Computer Vision: From Surfaces to 3D Objects. CRC
Press; 2011 Jan 24.
5. Guo, Y., Liu, Y., Oerlemans, A., Lao, S., Wu, S., and Lew, M.S. Deep
Learning for Visual Understanding: A Review. Neurocomputing. 2016 Apr
26;187:27–48.
6. Debnath, S. and Changder, S. Automatic detection of regular geometrical
shapes in photograph using machine learning approach. In 2018 Tenth
International Conference on Advanced Computing (ICoAC) 2018 Dec 13
(pp. 1–6). IEEE.
7. Silaparasetty, V. Deep Learning Projects Using TensorFlow 2. Apress; 2020.
8. Hassaballah, M. and Awad, A.I., editors. Deep Learning in Computer
Vision: Principles and Applications. CRC Press; 2020 Mar 23.
9. Srivastava, R., Mallick, P.K., Rautaray, S.S., and Pandey, M., editors. Computational Intelligence for Machine Learning and Healthcare Informatics. Walter de Gruyter GmbH & Co KG; 2020 Jun 22.
10. de Campos Souza, P.V. Fuzzy neural networks and neuro-fuzzy networks: A
review the main techniques and applications used in the literature. Applied
soft computing. 2020 Jul 1;92:106275.
11. Rodriguez, L.E., Ullah, A., Espinosa, K.J., Dral, P.O., and Kananenka, A.A.
A comparative study of different machine learning methods for dissipative
quantum dynamics. arXiv preprint arXiv:2207.02417. 2022 Jul 6.
12. Debnath, S., and Changder, S. Computational approaches to aesthetic quality assessment of digital photographs: state of the art and future research directives. Pattern Recognition and Image Analysis. 2020 Oct;30(4):593–606.
13. Debnath, S., Roy, R., and Changder, S. Photo classification based on the
presence of diagonal line using pre-trained DCNN VGG16. Multimedia
Tools and Applications. 2022 Jan 8:1–22.
14. Ajjey, S.B., Sobhana, S., Sowmeeya, S.R., Nair, A.R., and Raju, M. Scalogram Based Heart Disease Classification Using Hybrid CNN-Naive Bayes Classifier. In 2022 International Conference on Wireless Communications Signal Processing and Networking (WiSPNET). 2022 Mar 24 (pp. 345–348). IEEE.
15. Debnath, S., Hossain, M.S., Changder, S. Deep Photo Classification Based
on Geometrical Shape of Principal Object Presents in Photographs via
VGG16 DCNN. In: Proceedings of the Seventh International Conference on
Mathematics and Computing. 2022 (pp. 335–345). Springer, Singapore.
16. Abdel-Basset, M., Mohamed, R., Elkomy, O.M., & Abouhawwash, M.
Recent metaheuristic algorithms with genetic operators for high-dimensional
knapsack instances: A comparative study. Computers & Industrial
Engineering. 2022 166: 107974.
Chapter 6
Handwritten equation solver using Convolutional Neural Network
6.1 INTRODUCTION
Using a Convolutional Neural Network (CNN) to create a robust handwritten equation solver is a difficult task in image processing. Handwritten mathematical expression recognition is one of the most difficult problems in the domains of computer vision and machine learning. In the field of computer vision, several alternative methods of object recognition and character recognition have been proposed. These techniques are used in many different areas, such as traffic monitoring [3], self-driving cars [9], weapon detection [17], natural language processing [11], and many more.
Deep learning is a subset of machine learning in which neural networks are used to extract increasingly complex features from datasets. The deep learning architecture is based on understanding data at multiple feature layers. The CNN is a core deep learning architecture, consisting of convolution, activation, pooling, densely connected, and classification layers. Over the past several years, deep learning has emerged as a dominant force in the field of computer vision, and CNNs have achieved the most impressive outcomes on classical image analysis problems.
Deep learning is becoming increasingly important today. Deep learning techniques are now used in several fields, such as handwriting recognition, robotics, artificial intelligence, image processing, and many others. Creating such a system necessitates feeding the machine data from which it extracts features, understands the data, and makes predictions. The correction rate of symbol segmentation and recognition cannot yet meet practical requirements because of the two-dimensional nested layout and variable sizes of handwritten symbols. The primary task in mathematical expression recognition is to segment and then classify the characters. The goal of this research is to use a CNN model that can distinguish handwritten digits, characters, and mathematical operators in an image, and then assemble the mathematical expression and compute the linear equation.
The purpose of this study lies in designing a deep learning model capable of automatically recognizing handwritten numerals, characters, and mathematical operators when presented with an image of the handwriting. In addition, the purpose extends to building a calculator that can both set up the mathematical statement and compute the linear equation.
The chapter is divided into different sections. Section 2 presents a thorough
summary of current handwritten character recognition research studies in
recent years. Section 3 goes through each component of the CNN in depth.
Section 4 describes the proposed deep learning algorithms for handwritten
equation recognition as well as the dataset used. Section 5 discusses the
comparative analysis of different technical approaches. In addition, future
scope and conclusion are provided in Section 6.
6.3.1 Convolution layer
The first layer utilized to extract various kinds of information from input images is the convolutional layer. The dot product is computed between an array of learnable weights (the kernel) and the corresponding region of the input image.
6.3.2 Pooling layer
This layer is generally used to make the feature maps smaller. It reduces
the number of training parameters, which speeds up computation. There
are mainly three kinds of pooling layers: Max Pooling. It chooses the max-
imum input feature from the feature map region as shown in Figure 6.3,
Average Pooling. It chooses the average input feature from the feature map
region, and Global Pooling. This is identical for employing a filter with the
dimensions h x w, that is, the feature map dimensions.
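A minimal max pooling sketch in NumPy (illustrative, not the chapter's implementation) makes the reduction concrete:

import numpy as np

def max_pool2d(feature_map, size=2):
    # keep only complete size x size regions, then take the maximum of each
    h, w = feature_map.shape
    trimmed = feature_map[: h - h % size, : w - w % size]
    blocks = trimmed.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))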
6.3.3 Fully connected layer
Every neuron in this layer is coupled to every other neuron in the layer below it. In our proposed method, two fully connected layers are employed in the CNN, followed by the classification layer.
6.3.4 Activation function
In simple words, the activation function, shown in Figure 6.5, activates the neurons. It helps decide whether or not a neuron should fire and determines the output of the convolution layer. The most common activation functions are Sigmoid [12], ReLU [13], Leaky ReLU [7], and Softmax [10].
Softmax(x_i) = exp(x_i) / Σ_j exp(x_j)    (6.2)
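Equation (6.2) normalizes the exponentiated scores into a probability distribution; a numerically stable sketch in Python:

import numpy as np

def softmax(x):
    # Eq. (6.2): softmax(x_i) = exp(x_i) / sum_j exp(x_j)
    e = np.exp(x - np.max(x))  # subtracting the max improves numerical stability
    return e / e.sum()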
6.4.1 Dataset preparation
The first and most important step of any research is dataset acquisition. The numerals-and-operators data and the character/variable dataset were collected from Kaggle [1, 2] and then augmented to prepare a large dataset. The dataset contains approximately 24,000 images in 16 classes: the numerals 0–9, a variable, and five basic mathematical operators/symbols, namely addition, subtraction, multiplication, equals, and division, as shown in Figure 6.6.
6.4.2 Proposed methodology
The proposed CNN model is used to recognize simple equations that consist of the arithmetic operators addition, subtraction, multiplication, and division. It is also used to recognize simple linear equations of the type x + a = b, where x is a variable and a and b are constants. The block diagram of the implemented model is illustrated in Figure 6.7.
The images were further resized to 100 × 100 for smooth training and better results.
6.4.2.2 Preprocessing
The goal of preprocessing is to improve the image quality so that it can be analyzed more effectively. Images are preprocessed using several well-known techniques; image resizing, normalization, and augmentation are a few examples.
6.4.3 Solution approach
The solution approach is shown using the flowchart in Figure 6.8. The
handwritten mathematical equation is provided by the user.
Image segmentation is the process of dividing an image into parts known as image segments, which helps to minimize computational complexity and makes further processing or analysis easier. The segmentation stage of an image analysis system is crucial because it isolates the subjects of interest for subsequent processing such as classification or detection. Image categorization is used in the application to classify image pixels more accurately. Figure 6.9 represents the actual deployment of the proposed methodology. The input image is segmented into well-defined fixed proportions. For the simple character recognition case, the image is segmented into three parts: two numerals and one operator. This case is taken as the general form of the image, and the segmentation follows a 1:3 ratio. The proposed model works under the constraint that the middle segment must be an operator and the outer segments must be numerals.
Figure 6.10 describes the steps of the proposed model's algorithm for solving equations from handwritten images. In both cases, each segment is thresholded by Otsu's algorithm. The segmented binary image is then normalized before being fed to the model for training. Each segment is bounded by four coordinates (left, right, top, bottom) and is framed into a new image named segs. Each of these segments is resized to 100 × 100; a sketch of this step follows.
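The segmentation and preprocessing steps can be sketched with OpenCV as follows (the function names and the inverse-binary flag are illustrative assumptions; only the 1:3 split, Otsu thresholding, 100 × 100 resizing, and normalization come from the text):

import cv2
import numpy as np

def split_into_thirds(gray):
    # fixed 1:3 proportions: numeral | operator | numeral
    w = gray.shape[1]
    return [gray[:, i * w // 3:(i + 1) * w // 3] for i in range(3)]

def prepare_segment(seg):
    # seg is an 8-bit grayscale image; Otsu picks the threshold automatically
    _, binary = cv2.threshold(seg, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    resized = cv2.resize(binary, (100, 100))   # match the 100 x 100 training size
    return resized.astype(np.float32) / 255.0  # normalize before feeding the model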
Now the segmented characters/variables or operators are extracted and recognized by the trained model. The end goal of training is to be able to recognize each block after analyzing an image; the model must be able to assign a class to the image. Therefore, after recognizing the characters or operators in each image segment, the equation is solved by applying the corresponding mathematical formulas.
REFERENCES
[1] Dataset. www.kaggle.com/code/rohankurdekar/handwritten-basic-math-equation-solver/data.
[2] Dataset. www.kaggle.com/datasets/vaibhao/handwritten-characters.
[3] Mahmoud Abbasi, Amin Shahraki, and Amir Taherkordi. Deep learning
for network traffic monitoring and analysis (ntma): A survey. Computer
Communications, 170:19–41, 2021.
[4] Megha Agarwal, Vinam Tomar Shalika, and Priyanka Gupta. Handwritten
character recognition using neural network and tensor flow. International
Journal of Innovative Technology and Exploring Engineering (IJITEE),
8(6S4):1445–1448, 2019.
[5] Saad Albawi, Tareq Abed Mohammed, and Saad Al-Zawi. Understanding
of a convolutional neural network. In 2017 international conference on
engineering and technology (ICET), pages 1–6. IEEE, 2017.
[6] Yellapragada, S.S., Bharadwaj, P., Rajaram, V.P., Sriram, S., Sudhakar, S., and Kolla Bhanu Prakash. Effective handwritten digit recognition using deep convolution neural network. International Journal of Advanced Trends in Computer Science and Engineering, 9(2):1335–1339, 2020.
[7] Arun Kumar Dubey and Vanita Jain. Comparative study of convolution neural network's ReLU and leaky-ReLU activation functions. In Sukumar Mishra, Yogi Sood, and Anuradha Tomar (eds), Applications of Computing, Automation and Wireless Systems in Electrical Engineering, Lecture Notes in Electrical Engineering, pages 873–880. Springer, 2019. https://doi.org/10.1007/978-981-13-6772-4_76
[8] Jitesh Gawas, Jesika Jogi, Shrusthi Desai, and Dilip Dalgade. Handwritten
equations solver using cnn. International Journal for Research in Applied
Science and Engineering Technology (IJRASET), 9:534–538, 2021.
[9] Abhishek Gupta, Alagan Anpalagan, Ling Guan, and Ahmed
Shaharyar Khwaja. Deep learning for object detection and scene per-
ception in self-driving cars: Survey, challenges, and open issues. Array,
10:100057, 2021.
[10] Ioannis Kouretas and Vassilis Paliouras. Simplified hardware implementa-
tion of the softmax activation function. In 2019 8th international confer-
ence on modern circuits and systems technologies (MOCAST), pages 1–4.
IEEE, 2019.
[11] Daniel W. Otter, Julian R. Medina, and Jugal K. Kalita. A survey of the
usages of deep learning for natural language processing. IEEE transactions
on neural networks and learning systems, 32(2):604–624, 2020.
[12] Andrinandrasana David Rasamoelina, Fouzia Adjailia, and Peter Sinčák.
A review of activation function for artificial neural network. In 2020 IEEE
18th World Symposium on Applied Machine Intelligence and Informatics
(SAMI), pp. 281–286. IEEE, 2020.
[13] Johannes Schmidt-Hieber. Nonparametric regression using deep
neural networks with relu activation function. The Annals of Statistics,
48(4):1875–1897, 2020.
[14] Chen ShanWei, Shir LiWang, Ng Theam Foo, and Dzati Athiar Ramli. A CNN based handwritten numeral recognition model for four arithmetic operations. Procedia Computer Science, 192:4416–4424, 2021.
[15] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016.
Chapter 7
Agriware: Crop suggester system by estimating the soil nutrient indicators
S. P. Gautham, H. N. Gurudeep, Pai H. Harikrishna, Jasmine Hazel Crasta, and K. Karthik
7.1 INTRODUCTION
Farming is a cornerstone of the food business and is the foundation of
a country’s economic growth. As there is an expansion in population in
countries like India, China, Syria, Niger, Angola, Benin, and Uganda, food
demand also increases, and the agriculture industry needs to be able to
meet all those demands for urban and rural areas. Decades ago, without
technological progress in the agricultural sector, unanticipated losses were
experienced because individuals had lost faith in the agricultural sector.
The conventional method of creating assumptions that could not live up to
expectations is one of the leading causes of this. Apart from these, natural
calamities poison the earth by stealing its substance. The widening spread
of contaminated climatic conditions have radically changed the air, soil,
and water. The altered soil in nature cannot thus agree with the farmer’s
hypothesis, resulting in a massive agricultural loss. Hence, to improve the
agricultural industry, precision farming must be developed to a greater
extent than traditional methods [1]. Most farmers are unfamiliar with pre-
cision farming and are unaware of the scientific agricultural techniques.
Awareness about modern farming with newer agricultural innovations
and farming practices will help farmers increase efficiency, thus reducing
the farm risk in terms of production, poor irrigation facilities, human
resources, usage of modern techniques in farming. The existing method
of soil classification and crop suggestion is manual and time-consuming,
leading to human errors when the results are ambiguous.
Using modern technologies such as artificial intelligence in the agricul-
tural sector has seen many positive impacts [2]. Types of crops that can be
grown vary in different parts of the world based on the type of the soil, its
nutrition, climatic conditions like rainfall, temperature, and so forth. The
crop yield and its growth are extremely dependent upon these factors, and
every type of crop may not be suitable for that location. Hence, building a
system that classifies the soil type and predicts the kinds of crops that can be
grown would be of great assistance for farmers. Such systems will assist in
the large production of crops in those regions [3].
A major focus of the present work is to design a model for soil classifi-
cation and to forecast the types of crops that can be grown using artificial
intelligence and computer vision techniques, thereby extending a helping
hand to the farmers. The input image is preprocessed first, then feature
extraction and optimum feature selection are performed for effective soil
classification. Convolutional neural networks have been used for feature
extraction, selection, and classification of soil images. Different parameter
values of the soil nutrients are then fed into the model to suggest suitable
crops. For the final development of the proposed model, VGG16 was used
for soil classification and Random Forest for predicting crops.
The rest of the chapter is organized as follows: Section 2 briefly reviews existing work related to soil classification and crop prediction. Section 3 details the proposed methodology for classifying the soil and predicting the possible list of crops that can be grown, along with the precise details of the model designed for the task. Section 4 describes the experimental analysis and observations, along with the results of the proposed model. A conclusion, followed by further improvements to the proposed system and future directions, is at the end.
7.2 RELATED WORK
Much research is ongoing, with researchers applying modern techniques to develop the agricultural sector for better crop production. Information exchange regarding seasonal change, demand, and cultivation has helped farmers. Climate and soil-related data are exceptionally useful for farmers in averting the misfortunes that might result from inappropriate cultivation.
Zubair et al. [4] proposed a model to predict seasonal crops based on locality; the overall process is built around the regional areas of Bangladesh. In building the model, a Seasonal Autoregressive Integrated Moving Average (SARIMA) was used for forecasting rainfall and temperature, and random forest regression for predicting crop production. The model finally suggests crops that can achieve top production given the season and location. However, the model focused only on crops that can be grown in Bangladesh and did not consider the varieties of soil.
Another approach to crop selection and yield prediction, by T. Islam et al. [5], used 46 parameters for the prediction process. Along with the deep neural network model, support vector machine, logistic regression, and random forest algorithms were considered for comparing accuracy and error rate. However, the study was again limited to the region of Bangladesh. Kedlaya et al. [6] proposed a pattern matching technique for predicting crops using historical data that relies on different parameters, such as weather conditions and soil properties. The system suggests that farmers plant an appropriate crop on the basis of the season and the area or region of sowing, where "area" refers to the place or land of sowing. Such a system was implemented only for two districts of Karnataka, and proper classification of soil was not included as part of the work.
Indian farmers face a common problem: they do not choose the right crop for their soil's requirements. A solution has been identified by S. Pudumalar et al. through precision agriculture [7]. Precision agriculture is a cutting-edge cultivation strategy that suggests the right crop based on site-specific parameters, increasing productivity and reducing wrong crop choices. S. Khaki and L. Wang [8] proposed crop yield prediction using deep neural networks and found that their model had superior prediction accuracy, with a root-mean-square error of 12 percent. They also performed feature selection using the trained DNN model, which effectively reduced the input space dimension without any drop in prediction performance. Finally, the outcome showed that environmental parameters had a strong effect on crop yield and productivity.
S. Veenadhari et al. [9] presented a software tool called "Crop Advisor," a user-friendly web application for predicting the influence of climatic factors on selected crops of the Madhya Pradesh districts. However, other agro-parameters responsible for crop yield were not included in this tool, as those inputs vary between individual fields with area and seasonal conditions. Nevavuori et al. [10] proposed a CNN-based model for soil image classification that showed outstanding performance in crop yield prediction on Normalized Difference Vegetation Index (NDVI) and Red Green Blue (RGB) data. Significantly, the CNN model performed better with RGB data than with NDVI data.
Y. J. N. Kumar et al. [11] developed a supervised machine learning approach for predicting better crop yield in the agriculture sector from historical factors, including temperature, humidity, pH, and rainfall. The model used the Random Forest algorithm to attain the most accurate crop prediction. N. Usha Rani and G. Gowthami [12] developed a smart crop suggester, an Android-based recommendation application that assists farmers in choosing a preferable crop for higher production. It has a user-friendly interface that helps farmers obtain the most suitable crop suggestion based on location, season, soil type, and rainfall, analyzed from the previous years' agricultural data.
The above studies attempted crop prediction using advanced technologies that can be applied in the agricultural sector for better crop yield, considering several parameters. In each of these research works, most of the effort was either based on region-specific conditions or included only crop suggestion or only soil classification; very limited works cover both. So, for the betterment of agriculture and to help farmers, we have developed a crop suggestion application that follows soil classification. Once a soil image is classified, different parameters based on the soil and climatic conditions are considered, and the application suggests a suitable crop that can give a better yield for the given inputs. To achieve this we have used neural network techniques and, in the next section, we discuss the proposed approach developed for soil classification and crop suggestion.
7.3 PROPOSED METHODOLOGY
A soil classification system is a machine learning- based application
designed to help farmers to classify the soil type and to predict the crops
based on different parameters. Image processing methods and convolu-
tional neural networks are combined to categorize soil images. Later, based
on the input parameters using a random forest algorithm, the type of crop
is suggested. Once a user uploads a soil image using the soil classification
model, the soil is classified into its category. After the classification result
is obtained, the user needs to enter the parameters of the soil and weather
information; the input data then gets processed and compared with the
model and, finally, suggests the crops based on the soil and the given input
parameters. The sequence of tasks involved in the overall progress of the
proposed classification and crop suggestion model using the network is
shown in Figure 7.1.
For the soil classification task, we have considered four types of soil images: Black, Clay, Red, and Sandy. The input soil image classification was carried out with the VGG16 [13] CNN architecture.
Figure 7.1 Proposed approach for soil image classification and crop prediction.
1. www.kaggle.com/datasets/prasanshasatpathy/soil-types
2. www.kaggle.com/datasets/shashankshukla9919/crop-prediction
considered from Kaggle.2 A set of sample images that are used for this crop
prediction model is shown in Table 7.1.
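A transfer-learning sketch of the VGG16-based soil classifier (the hyperparameters, input size, and frozen-base choice are illustrative assumptions, not the chapter's tuned configuration):

from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False  # reuse ImageNet features; train only the new classification head

model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(4, activation="softmax"),  # Black, Clay, Red, Sandy
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])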
Integration testing is performed by connecting the front end to the back end; individual units are merged and tested as a group. This level of testing aims to expose incompatibilities, faults, and irregularities in the interaction between the integrated modules. When an image is uploaded, it is processed by the CNN model, which classifies the soil type. After the soil parameters and seasonal data values are entered into the crop suggestion system, they are processed by the Random Forest algorithm, which suggests a suitable crop that can be grown. Test cases validated using a test image for soil classification and crop prediction are shown in Table 7.2.
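The crop suggestion step can be sketched with scikit-learn's Random Forest (the file name, column names, and sample values below are hypothetical placeholders for the soil-nutrient and climate parameters):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("crop_data.csv")  # hypothetical training file
X = df[["N", "P", "K", "temperature", "humidity", "ph", "rainfall"]]
y = df["crop"]

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X, y)

query = pd.DataFrame([[90, 42, 43, 25.0, 80.0, 6.5, 200.0]], columns=X.columns)
print(clf.predict(query))  # suggested crop for the entered parameters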
The proposed CNN models’ performance are measured using the
standard metrics like accuracy, loss, validation accuracy, and validation
92 Intelligent Systems and Applications in Computer Vision
farmers, as they can upload the soil image and enter the mandatory soil and climate parameters through our web interface. In future, the system's accuracy can be enhanced by increasing the training dataset, resolving the closed-identity problem, and finding an optimal solution. The crop recommendation system will be further developed to connect with a yield predictor, another subsystem that would allow the farmer to estimate production for the recommended crop. We likewise anticipate deploying this framework on a mobile platform for farmers.
REFERENCES
[1] Awad, M.M.: Toward precision in crop yield estimation using remote
sensing and optimization techniques. Agriculture 9(3) (2019) 54.
[2] Ben Ayed, R., Hanana, M.: Artificial intelligence to improve the food and agriculture sector. Journal of Food Quality 2021 (2021) 1–7.
[3] Waikar, V.C., Thorat, S.Y., Ghute, A.A., Rajput, P.P., Shinde, M.S.: Crop prediction based on soil classification using machine learning with classifier ensembling. International Research Journal of Engineering and Technology 7(5) (2020) 4857–4861.
[4] Zubair, M., Ahmed, S., Dey, A., Das, A., Hasan, M.: An intelligent model to suggest top productive seasonal crops based on user location in the context of Bangladesh.

Chapter 8
A machine learning based expeditious Covid-19 prediction model
8.1 INTRODUCTION
The Covid-19 virus erupted in Wuhan, China, in the last days of 2019, affecting countries around the globe. Covid-19 created tremendous pressure on health care systems in all countries because of the very high number of patients. The virus, named Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), is responsible for the loss of a huge number of lives [1]. In January 2020 the World Health Organization (WHO) declared the outbreak a Public Health Emergency of International Concern (PHEIC). In 2020 the disease was named Covid-19, and in March 2020 the WHO declared the outbreak a pandemic as it touched all corners of the world [2].
Testing for this virus is done through either viral or antibody testing procedures [3]. Testing is available in designated government or private laboratories that have the supporting equipment and procedures. In India, according to the latest guidelines of the Indian Council of Medical Research (ICMR), there should be a testing center within 250 km in the plains and 150 km in the hills. This caused a delay of around 5 hours in getting a sample to a lab, followed by a testing procedure lasting approximately 24–48 hours. As of 13 May 2020, 1.85 million samples had been tested, as per the report published by ICMR [4]. Internationally, researchers in medicine worked towards finding an appropriate drug or vaccine and developing rapid and accurate testing procedures [5]. Many rapid testing procedures failed throughout the globe, leaving the search for rapid results an open stream for researchers.
In recent years, computer-based machine intelligence algorithms have been replacing human effort in almost all fields. Particularly in medicine, machine learning based systems have proven their capability to predict and diagnose various kinds of diseases, including lung cancer, breast cancer, fetal heart related diseases, rheumatology, and so forth [6]. Machine learning algorithms map the knowledge acquired from training data to predictions on unseen cases.
8.2 LITERATURE SURVEY
Mohammad Saber [17] conducted a study on the detection of Covid-19 effects on human lungs and proposed a new hybrid approach based on neural networks, which extracts features from images using 11 layers; on that basis, a further algorithm was implemented to select valuable features and discard unrelated ones. Based on these optimal features, lung X-rays were classified using a Support Vector Machine (SVM) classifier to obtain better results. This study demonstrated that both the accuracy and the number of pertinent features extracted outperformed earlier methods on the same data.
Zeynep Gündoğar [19] reviewed deep learning models and attribute-mining techniques. Utilization of matrix partitioning in the TMEMPR [19] method provides 99.9 percent data reduction, and the Partitioned Tridiagonal Enhanced Multivariance Products Representation (PTMEMPR) method was proposed as a new attribute-mining method, used as a preprocessing step in the Covid-19 diagnosis scheme. The suggested method is compared with cutting-edge feature extraction techniques such as Singular Value Decomposition (SVD) and the Discrete Wavelet Transform (DWT).
Mokhalad Abdul [21] conducted a study using the New Caledonian crow learning algorithm. In the first stage, the best features related to COVID-19 disease are picked using the crow learning algorithm. The artificial neural network is given a set of COVID-19 patient-related features to work with in the proposed method, and only those features that are absolutely necessary for learning are chosen by the crow learning algorithm.
Ekta Gambhir et al. [23] performed a regression analysis on data from the Johns Hopkins University visual dashboard to describe the data before feature selection and extraction. The SVM method and Polynomial Regression were then used to build the model and obtain the predictions.
Table 8.1 presents a comparative study of the machine learning approaches on the basis of accuracy, sensitivity, and specificity.
Table 8.1 Summarized ML approaches and results
Yazeed Zoabi et al. [25] established a new gradient boosting machine model that uses decision trees to predict Covid-19 in patients from the answers to eight specific questions. Mahajan [24–27] proposed a new hybrid technique based on the Aquila optimizer (AO) and the arithmetic optimization algorithm (AOA). Both AO and AOA are recent meta-heuristic optimization techniques; they can be used to solve a variety of problems in image processing, machine learning, wireless networks, power systems, engineering design, and so on. The impact of varying dimensions is a standard test used in prior studies to optimize test functions; it indicates how dimensionality affects AO-AOA efficiency.
8.3 METHODOLOGY
8.3.2 Classification set up
The data obtained after the selection process are used for training and testing the prediction model, as shown in Figure 8.3.
In total, 750 samples [7] were filtered for consideration, as they contain nearly complete information for all the variables. The samples are then divided in an 80:20 ratio for training versus testing. Of these 750 samples, 602 are selected for training; among the training samples, 519 belong to patients who tested negative and 83 to patients who tested positive. In our research, various machine learning based classifiers [12–16] are implemented with this training/testing setup and their performance is evaluated. The classifiers used are Naïve Bayes (NB), Support Vector Machine (SVM), K Nearest Neighbor (KNN), and Decision Tree (DT), as shown in Figure 8.4.
8.3.3 Performance evaluation
For evaluation of developed models we have used a 10 fold cross validation
scheme with non-overlapped data. The accuracy of all the machine learning
algorithms stated in Section 2 are evaluated, and a comparison is drafted
for the same. The confusion matrix is also plotted, and a ratio of 80:20
without any overlapping case between training and testing is ensured. The
total training samples number 602, and testing samples are 148.
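A sketch of this evaluation in scikit-learn (the synthetic data below merely stand in for the 750 clinical samples and their roughly 86:14 class imbalance; the classifier settings are library defaults, not the study's tuned values):

from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the 750 filtered samples (about 86 percent negative)
X, y = make_classification(n_samples=750, n_features=20, weights=[0.86], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)  # 80:20, non-overlapping

classifiers = {"NB": GaussianNB(), "SVM": SVC(),
               "KNN": KNeighborsClassifier(), "DT": DecisionTreeClassifier()}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X_train, y_train, cv=10)  # 10-fold cross-validation
    clf.fit(X_train, y_train)
    print(name, round(scores.mean(), 3))
    print(confusion_matrix(y_test, clf.predict(X_test)))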
8.5 CONCLUSION
While medical facilities and government councils try to develop improved rapid-testing procedures for detecting Covid-19 in patients, a delay in a report may result in spreading the disease to persons who come into contact with infected patients. The machine learning based system proposed in this chapter can help predict the result from simple clinical laboratory tests of the suspected patient's blood sample, as shown in Table 8.2. Data availability is the major issue in designing such a system. Data that are true and complete in all respects help improve system accuracy; the future availability of complete data with more relevant parameters can yield highly accurate predictions and thereby reduce or minimize the number of tests performed on suspected patients.
REFERENCES
1. Pourhomayoun, Mohammad, and Mahdi Shakibi. “Predicting Mortality
Risk in Patients with COVID-19 Using Artificial Intelligence to Help Medical
Decision-Making.” medRxiv (2020).
2. www.who.int [Accessed on 2-04-20]
3. www.cdc.gov/coronavirus/2019-ncov/symptoms-testing/testing.html [Accessed on 2-04-20]
4. https://main.icmr.nic.in/content/covid-19 [Accessed on 13-5-20]
5. Burog, A. I. L. D., C. P. R. C. Yacapin, Renee Rose O. Maglente, Anna Angelica Macalalad-Josue, Elenore Judy B. Uy, Antonio L. Dans, and Leonila F. Dans. "Should IgM/IgG rapid test kit be used in the diagnosis of COVID-19?." Asia Pacific Center for Evidence Based Healthcare 4 (2020): 1–12.
6. Mlodzinski, Eric, David J. Stone, and Leo A. Celi. “Machine Learning for
Pulmonary and Critical Care Medicine: A Narrative Review.” Pulmonary
Therapy (2020): 1–11.
7. www.kaggle.com/einsteindata4u/covid19 [Accessed on 22-3-20]
8. Wu, Jiangpeng, Pengyi Zhang, Liting Zhang, Wenbo Meng, Junfeng Li,
Chongxiang Tong, Yonghong Li et al. “Rapid and accurate identification of
COVID-19 infection through machine learning based on clinical available
blood test results.” medRxiv (2020).
9. Peker, Musa, Serkan Ballı, and Ensar Arif Sağbaş. “Predicting human actions
using a hybrid of Relief feature selection and kernel-based extreme learning
machine.” In Cognitive Analytics: Concepts, Methodologies, Tools, and
Applications, pp. 307–325. IGI Global, 2020.
10. Urbanowicz, Ryan J., Melissa Meeker, William La Cava, Randal S. Olson, and Jason H. Moore. "Relief-based feature selection: Introduction and review." Journal of Biomedical Informatics 85 (2018): 189–203.
11. Uldry, Laurent and Millan, Jose del R. (2007). Feature Selection Methods
on Distributed Linear Inverse Solutions for a Non-Invasive Brain-Machine
Interface.
12. Susto, Gian Antonio, Andrea Schirru, Simone Pampuri, Seán McLoone, and
Alessandro Beghi. “Machine learning for predictive maintenance: A multiple
classifier approach.” IEEE Transactions on Industrial Informatics 11, no. 3
(2014): 812–820.
13. Lanzi, Pier L. Learning classifier systems: From foundations to applications.
No. 1813. Springer Science & Business Media, 2000.
14. Kononenko, Igor. “Semi-naive Bayesian classifier.” In European Working
Session on Learning, pp. 206–219. Springer, Berlin, Heidelberg, 1991.
15. Rueping, Stefan. “SVM classifier estimation from group probabilities.”
(2010).
16. Rish, Irina. “An empirical study of the naive Bayes classifier.” In IJCAI 2001
workshop on empirical methods in artificial intelligence, vol. 3, no. 22,
pp. 41–46. 2001.
17. Iraji, Mohammad Saber, Mohammad-Reza Feizi-Derakhshi, and Jafar Tanha. "COVID-19 detection using deep convolutional neural networks and binary differential algorithm-based feature selection from X-ray images." Complexity 2021 (2021).
18. Ozyurt, Fatih, Turker Tuncer, and Abdulhamit Subasi. “An automated
COVID-19 detection based on fused dynamic exemplar pyramid feature
extraction and hybrid feature selection using deep learning.” Computers in
Biology and Medicine 132 (2021): 104356.
19. Gündoğar, Zeynep, and Furkan Eren. “An adaptive feature extraction
method for classification of Covid-19 X-ray images.” Signal, Image and
Video Processing (2022): 1–8.
20. Ali, Rasha H., and Wisal Hashim Abdulsalam. “The Prediction of COVID
19 Disease Using Feature Selection Techniques.” In Journal of Physics:
Conference Series, Vol. 1879, no. 2, p. 022083. IOP Publishing, 2021.
21. Kurnaz, Sefer. “Feature selection for diagnose coronavirus (COVID-19) dis-
ease by neural network and Caledonian crow learning algorithm.” Applied
Nanoscience (2022): 1–16.
22. Sahlol, Ahmed T., Dalia Yousri, Ahmed A. Ewees, Mohammed A.A. Al-Qaness, Robertas Damasevicius, and Mohamed Abd Elaziz. "COVID-19 image classification using deep features and fractional-order marine predators algorithm." Scientific Reports 10, no. 1 (2020): 1–15.
23. G. Ekta, J. Ritika, G. Alankrit and T. Uma, "Regression analysis of COVID-19 using machine learning algorithms," in 2020 International Conference on Smart Electronics and Communication (ICOSEC), 2020.
24. Goodman-Meza, D. et al., "A machine learning algorithm to increase COVID-19 inpatient diagnostic capacity," PloS One, vol. 15, no. 9, p. e0239474, 2020.
25. Y. Zoabi, D.-R. Shira and S. Noam, “Machine learning-based prediction
of COVID-19 diagnosis based on symptoms,” npj digital medicine, vol. 4,
no. 1, pp. 1–5, 2021
26. Mahajan, S., Abualigah, L., Pandit, A.K. et al. Fusion of modern meta-
heuristic optimization methods using arithmetic optimization algorithm for
global optimization tasks. Soft Comput 26, 6749–6763 (2022).
27. Mahajan, S., Abualigah, L., Pandit, A.K. et al. Hybrid Aquila optimizer
with arithmetic optimization algorithm for global optimization tasks. Soft
Comput 26, 4863–4881 (2022).
28. Mahajan, S., Pandit, A.K. Hybrid method to supervise feature selection
using signal processing and complex algebra techniques. Multimed Tools
Appl (2021).
29. Mahajan, S., Abualigah, L. & Pandit, A.K. Hybrid arithmetic optimiza-
tion algorithm with hunger games search for global optimization. Multimed
Tools Appl 81, 28755–28778 (2022).
Chapter 9
Bird identification using images and audio
9.1 INTRODUCTION
Gathering data about bird species requires immense effort and is very time consuming. Bird species identification determines the specific category a bird belongs to. There are many methods of identification, for example through images, audio, or video. An audio processing technique captures the audio signals of birds; similarly, an image processing technique identifies the species from images captured under various conditions (e.g., distorted, mirrored, or HD quality).
This research focuses on identifying bird species using audio and images. Predicting bird species first requires accurate information about them, for which large datasets of both images and audio must be selected to train the neural networks. By capturing the sounds of different birds, their species can be identified with audio processing techniques, and systems for audio-based bird identification have proven particularly useful for monitoring and education. People analyze images more effectively than sounds or recordings, but in some cases audio is more effective than images, so we classify bird species using both images and audio. Many institutes study the ecological and societal consequences of biodiversity; many bird species are extinction-prone, and a few are functionally extinct. The proposed system aims to help such institutes and ornithologists who study the ecology of birds, identify key threats, and find ways of enhancing species' survival.
9.2 LITERATURE SURVEY
Mahajan Shubham et al. (2021) proposed multilevel threshold based segmentation; to determine the optimal threshold, the Type II Fuzzy Entropy technique is combined with the Marine Predators Algorithm. Rai, Bipin Kumar (2020) and his group propose a model in which the user can capture and upload an image to the system and store it in a database. A. C. Ferreira, L. R. Silva, and F. Renna, et al. (2020) described methods for automating the collection and generation of training data for individual bird species, and developed a CNN based algorithm for the classification of three small bird species.
Nadimpalli et al. (2006) use the Viola-Jones algorithm for bird detection with an accuracy of 87 percent. The automated wildlife monitoring system proposed by Hung Nguyen et al. (2017) uses a deep learning approach. Yo-Ping Huang (2021) uses a transfer learning based method with Inception-ResNet-v2 to detect and classify bird species, claiming 98 percent accuracy for the 29 bird species considered. Tejas Khare and Anuradha C. Phadke (2020) used the You Only Look Once algorithm for classifying animals in automated, computer vision based surveillance systems.
M. T. Lopes et al. (2011) focus on the automatic identification of bird species from recorded bird songs. They use 64 features, such as the means and variances of timbral features calculated over intervals for the spectral centroid, roll-off, flux, and time-domain zero crossings, including the first 12 MFCCs in each case. Kansara et al. (2016), in their paper on speaker identification, use MFCC in combination with a deep neural network. Rawat, Waseem and Wang, Zenghui (2017) review deep convolutional neural networks for image classification and discuss applications of Inception and improved Inception models.
Existing techniques deal with either an image based or an audio based approach, but adverse environmental conditions may damage the sensors if we rely on any single approach. So, to make the system robust and to cope with changes in weather and environment, we propose here a framework that identifies bird species with both approaches. The two methodologies can also be combined to corroborate the results: if one fails, the other can support it.
Section 3 explains these methodologies; Section 4 presents the system design; and Section 5 discusses the results.
9.3 METHODOLOGY
Classification of images can be done with different state-of-the-art approaches. One suitable approach for bird species identification is the Vision Transformer (ViT), which works as an alternative to convolutional neural networks. The ViT is a visual model based on the transformer architecture, which was originally designed for text-based tasks. In this model an input image is represented as a series of image patches, analogous to the series of word embeddings used when applying a transformer to text, and the model predicts class labels for the given image. When enough data are available for training, it gives extraordinary performance while requiring substantially fewer computational resources than comparable CNNs.
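As a sketch of ViT inference with a recent version of the Hugging Face transformers library (the checkpoint shown is ImageNet-pretrained and the file name is hypothetical; for the 50 bird classes here, the model would first be fine-tuned):

from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("bird.jpg")   # hypothetical input photograph
inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits  # image patches in, class scores out
print(model.config.id2label[logits.argmax(-1).item()])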
9.4 SYSTEM DESIGN
Bird species identification in the proposed work is carried out with both image based and audio based techniques. While building the datasets, bird species common to both modalities were used.
9.4.1 Dataset used
Due to the small size of the dataset (about 100 images for each class/species), image augmentation was used to increase the dataset size. Each image is given 18 different augmentations, meaning each image has a total of 19 variations. Table 9.1 shows the total number of images used for training after augmentation; the image dataset covers 50 different bird species.
Types of augmentation used:
• Original image
• Rotation: four instances, rotating the image once into each quadrant
• Flip: flipping the image upside down
• Mirror: mirror image of the original
• Illumination: changing contrast and brightness so that night (low light) images are made clear (2 instances)
• Translation: shifting the image along the x- and y-axes (4 instances)
Table 9.1 Number of files
Training images: 3342
Validation images: 409
Testing images: 432
Different augmentations used are shown in Figure 9.3. The total number of
images increased as shown in Table 9.1 after augmentation.
The dataset used for audio signal-based bird species identification consists of 50 species common to the dataset used for the image-based technique, as shown in Table 9.2.
InceptionNet was trained and validated in PyTorch for bird audio classification, with a training batch size of 16 and a validation batch size of 8. The results are discussed in the next section.
The dataset was a mixture of mono files (audio with one channel) and stereo files (audio with two channels, left and right). To standardize the dataset, all files were converted to mono. Mono requires less equipment and less space and is cheaper; with two or more speakers a stereo input gives a better experience, but through a single speaker a mono input sounds louder than stereo.
• Noise Reduction
Many signal processing devices are susceptible to noise. Noise can be
introduced by a device's mechanism or by its signal processing algorithms;
it may be random with a uniform frequency distribution (white noise) or
have a frequency-dependent distribution.
In many electronic recording devices, a major source of noise is the
"hiss" created by random electron motion due to thermal agitation, which
occurs at all temperatures above absolute zero. This agitation of electrons
rapidly adds to and subtracts from the voltage of the output signal,
producing detectable noise.
Noise reduction, the process of removing noise from a signal, is therefore
an important part of pre-processing.
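The chapter does not name a specific noise-reduction tool; as one possibility, spectral-gating noise reduction could be applied in Python with the noisereduce package (the file names here are placeholders):

```python
import librosa
import noisereduce as nr
import soundfile as sf

# Load a recording (librosa converts to mono by default)
audio, sr = librosa.load("bird_call.wav", sr=22050)

# Spectral gating: estimate a noise profile from the signal and suppress it
cleaned = nr.reduce_noise(y=audio, sr=sr)

sf.write("bird_call_clean.wav", cleaned, sr)
```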
• Split Audio
Since the audio files were not all of the same length, they were split into
clips of equal length to make the dataset consistent. Files were split into
10-second clips, and a final clip of less than 10 seconds was concatenated
with itself until it reached 10 seconds.
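A minimal NumPy sketch of this splitting-and-self-padding step:

```python
import numpy as np

def split_into_clips(audio, sr, clip_seconds=10):
    """Split a 1-D audio array into fixed-length clips; a short final
    clip is concatenated with itself until it reaches the target length."""
    clip_len = clip_seconds * sr
    clips = []
    for start in range(0, len(audio), clip_len):
        clip = audio[start:start + clip_len]
        while len(clip) < clip_len:                  # self-concatenate the tail
            clip = np.concatenate([clip, clip])
        clips.append(clip[:clip_len])
    return clips
```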
Figure 9.4 Mel-Spectrogram.
Figure 9.5 MFCC.
While obtaining MFCCs, the first step is to compute the Fourier transform
of the audio data, which converts the time-domain signal into a frequency-
domain signal. This is done with the fast Fourier transform, one of the most
important algorithms of all time.
Then the Mel spectrogram (power spectrum) is computed and the Mel
filter bank is applied to it. The Mel frequency scale relates to the perceived
frequency of an audio signal, that is, to pitch as judged by listeners. The
Mel spectrogram of a sample from our dataset is shown in Figure 9.4,
while a plot of the MFCCs is shown in Figure 9.5. Instead of converting the
MFCCs to an image, the coefficients themselves could be fed to a classifier,
as discussed in the conclusion.
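A minimal librosa sketch of this pipeline; the FFT size, hop length, and number of Mel bands and coefficients are illustrative assumptions:

```python
import librosa

audio, sr = librosa.load("clip_10s.wav", sr=22050)   # placeholder file name

# Mel spectrogram: FFT -> power spectrum -> Mel filter bank
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=128)
mel_db = librosa.power_to_db(mel)                    # log scale, as usually plotted

# MFCCs are derived from the log-Mel spectrogram
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
```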
9.6 CONCLUSION
Of the two approaches implemented for bird species identification, the
audio-based technique has shown poor accuracy, whereas the image-based
technique has demonstrated very good accuracy of 98.8 percent. The
image-based bird identification tool can thus be used as an assistant to
bird-watchers. It will also play an important role in ecology studies such
as the identification of endangered bird species. In many circumstances the
bird is not visible, and there is a need to identify the bird that is singing;
in such cases, the audio-based bird identification technique can solve the
problem. Hence, there is a need to improve the audio-based bird detection
algorithm, for example by using features such as MFCCs with a classifier
directly instead of converting them to images. The Vision Transformer is
one of the state-of-the-art techniques of machine learning and gives very
good performance for the image-based approach.
REFERENCES
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua
Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg
Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby (2021). An Image is
Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
Balint Pal Toth, Balint Czeba (2016). Convolutional Neural Networks for Large-
Scale Bird Song Classification in Noisy Environment. Conference and Labs of
the Evaluation forum (CLEF).
Ferreira, A.C., Silva, L.R, Renna F., et al. (2020) Deep learning-based methods for
individual recognition in small birds. Methods in Ecology and Evolution. 11,
1072–1085. https://doi.org/10.1111/2041-210X.13436
Hung, Nguyen, Sarah J. Maclagan, Tu Dinh Nguyen, Thin Nguyen, Paul Flemons,
Kylie Andrews, Euan G. Ritchie, and Dinh Phung (2017). Animal Recognition
and Identification with Deep Convolutional Neural Networks for Automated
Wildlife Monitoring. International Conference on Data Science and Advanced
Analytics (DSAA).
Kansara, Keshvi and Suthar, Anilkumar (2016). Speaker Recognition Using
MFCC and Combination of Deep Neural Networks. doi: 10.13140/
RG.2.2.29081.16487
M.T. Lopes, A.L. Koerich, C.N. Silla and C.A.A. Kaestner (2011). Feature set com-
parison for automatic bird species identification. IEEE International Conference
on Systems, Man, and Cybernetics, 2011, 965–970, doi: 10.1109/
ICSMC.2011.6083794
Mahajan, Shubham and Mittal, Nitin and Pandit, Amit. (2021). Image segmenta-
tion using multilevel thresholding based on type II fuzzy entropy and marine
predators algorithm. Multimedia Tools and Applications. 80. 10.1007/
s11042-021-10641-5
Rai, Bipin Kumar (2020). Image Based Bird Species Identification. International
Journal of Research in Engineering, IT and Social Sciences, Vol. 10, Issue 04,
April 2020, pp. 17–24.
Rawat, Waseem and Wang, Zenghui. (2017). Deep Convolutional Neural Networks
for Image Classification: A Comprehensive Review. Neural Computation,
29, 1–98.
10.1 INTRODUCTION
Ichthyosis Vulgaris (IV) is characterized by alterations in the filaggrin gene
(FLG). Genetic linkage analysis of four families mapped the condition to
the epidermal differentiation complex on the chromosome. More recent
genotyping has shown that loss-of-function mutations in the FLG gene are
the cause; the condition is inherited in a semi-dominant manner with 83–96
percent penetrance. The mutations produce a truncated profilaggrin protein
that cannot be processed into functional filaggrin subunits, and it seems
reasonable that mutations in related genes should likewise produce
truncated filaggrin proteins. FLG mutations occur in both Caucasian and
Asian populations [15, 16], but these groups tend to carry distinct and
sometimes mutually exclusive mutations. Even within European populations
there are regional differences: while the R501X and related mutations
account for approximately 80 percent of mutations in northern European
descendants, they are significantly less frequent in southern European
descendants. Heterozygous carriers are considered to have a tolerance
advantage over homozygous recessive or homozygous dominant genotypes,
and as there appears to be a latitude-based incidence gradient across
Europe, FLG mutations may also confer better survival rates. In Chinese
Singaporeans, eight distinct mutations account for approximately 80
percent of cases [30]. Likewise, the S2554X and 3321delA mutations are
very common in the Japanese population, although they are rarer in
Koreans. The incidence of IV in darkly pigmented populations appears to
be low [10, 11], but more extensive investigation is needed to confirm these
observations. The available estimates may well underrate the true incidence
in those populations, because FLG mutations specific to Europeans were
primarily used to determine mutation carrier frequencies in Asians.
10.2 LITERATURE SURVEY
After many investigations, we studied image processing for detecting skin
diseases; here we give a brief overview of a number of the strategies
suggested in the literature. A system has been proposed for the detection
of skin diseases using color images without the intervention of a physician
[3]. The system consists of different stages: the first uses color image
processing techniques, k-means clustering, and color gradient methods to
locate the problem area in the skin, and the second classifies the disease
type using artificial neural networks. After preparing the system for
detecting the illness, it was tested on six types of skin diseases, where the
mean accuracy of the first stage was 95.99 percent and of the second stage
94.061 percent. In this approach, the wider the variety of features extracted
from the image, the higher the accuracy of the system. Melanoma is a
disease that can result in death, as it leads to skin cancer if it is not
diagnosed at an early stage. Several segmentation strategies have been
described that can be applied to detect this cancer using image processing
[5]. The segmentation process is defined so as to fall on the boundaries of
the inflamed spot in order to extract more features. One work proposed
the development of a melanoma diagnosis system for dark skin, using a
specialized algorithm and databases containing images from many
melanoma resources. Similarly, skin diseases classified together with
melanoma include Basal Cell Carcinoma (BCC), Nevus, and Seborrheic
Keratosis (SK), using the Support Vector Machine (SVM) method, which
yields excellent accuracy compared with several other strategies.
Furthermore, the spread of persistent skin diseases to other areas of the
body may result in severe consequences; therefore, a personal computer
system has been proposed that automatically detects eczema and
determines its severity [10]. The system consists of three stages: the first
performs effective segmentation by detecting the skin; the second extracts
a fixed set of features, namely color, texture, and borders; and the third
determines the severity of the eczema using an SVM. A brand-new method
has also been proposed to detect skin diseases, one that combines computer
vision with machine learning [15]. The role of computer vision is to extract
the features from the image, while machine learning is used to detect the
skin diseases. The system was tested on six types of skin diseases with an
accuracy of 95 percent.
10.3.1 Ichthyosis Vulgaris
This common form of ichthyosis is inherited as an autosomal dominant
trait. It is rarely noticed sooner than three months after birth, and large
numbers of patients are affected throughout their lives with lesions on
their palms and legs. Further, many sufferers improve as they get older, so
that no clinical findings may be evident in the summer and the patients
appear normal. Nonetheless, repeated clinical examination will generally
show that this is not always the case, and that either the mother or the
father had some indication of the extent of the illness.
10.3.2 Hyperkeratosis
In addition to the ichthyosis, which is characterized by fine white branny
scales, it is frequently possible to observe rough elevations around the hair
follicles (keratosis pilaris), together with increased palmar and plantar
markings. A significant part of the face can be affected, while the skin of
the flexures and neck is generally normal. When the trunk is involved, the
ichthyosis is substantially less pronounced on the abdomen than on the
back, and in many cases there is limited hyperkeratosis on the elbows,
knees, and lower legs. A considerable number of patients have at least one
of the signs of atopy (asthma, eczema, and hay fever), and drying of the
palms and heels is a typical concern. Cytology may also display some
hyperkeratosis, with a faded or absent granular layer, and there is likely to
be some reduction in the number of sweat and sebaceous glands.
10.5 SYMPTOMS
Ichthyosis Vulgaris slows the skin's natural shedding process. The cause is
excessive build-up of the protein keratin in the top layer of the skin.
Symptoms include dry, scaly, tile-like skin; small scales that are white, grey,
or brown depending on skin tone; a flaky scalp; and deep, painful cracks.
The scales generally appear on the elbows and lower legs and can be thick
and dark over the shins. Most cases of Ichthyosis Vulgaris are mild, but it
can sometimes be severe. The severity of symptoms may also vary widely
among affected members of the same family. Symptoms generally worsen
or become more pronounced in cool, dry climates and tend to improve or
even resolve in mild, humid climates.
10.6 COMPLICATIONS
Some people with ichthyosis may overheat. In rare cases, the skin thickening
and scales of ichthyosis can interfere with sweating, which can impede
cooling; in some people, excess sweating (hyperhidrosis) can occur instead.
Skin splitting and cracking may also lead to secondary infection. The
prognosis for severely affected infants is very poor: most affected newborns
do not live past the first week of life. It has been reported that survival
ranges from 10 months to 25 years with supportive treatment, depending
on the severity of the condition.
10.7 DIAGNOSIS
There is no cure for inherited Ichthyosis Vulgaris; treatment mainly reduces
the scaling and dryness of the skin. Treatment plans require taking baths
frequently, since soaking promotes hydration of the skin and softens the
scale of IV. If you have open sores, your dermatologist may propose
applying petroleum jelly, or something comparable, to those sores before
entering the water; this can reduce the burning and stinging caused by the
water. Some patients report that adding sea salt (or common salt) to the
water reduces the burning and stinging, and adding salt can likewise reduce
the itching. Soaking in water softens the scale. Your dermatologist may
also recommend reducing the scale while it is softer by gently scrubbing
with a rough sponge, Buf-Puf, or pumice stone. Apply moisturizer to damp
skin within two minutes of bathing; the moisturizer can seal water from a
bath or shower into your skin. Your dermatologist may also recommend a
moisturizer that contains an active ingredient such as urea, alpha hydroxy
acid, or lactic acid; these and other such ingredients can also help to reduce
scale. Apply petroleum jelly to the worst cracks, as this can help heal them.
If you develop a skin infection, your dermatologist will prescribe a
treatment that you either take orally or apply to your skin.
10.8 METHODOLOGY
SVM is a classification technique based on statistical learning theory and
is suitable for small-sample-size problems; it minimizes the training error
while maximizing the confidence margin. Given a training set and a test
set, SVMs are used here to identify three skin diseases. A number of
samples are chosen from the main example set, features (e.g., color and
texture) are extracted for training, and the radial basis kernel function of
the support vector machine is used, so that a classification model can be
constructed. In this work the three archetypal skin conditions are herpes,
dermatitis, and psoriasis, labelled classes I, II, and III. Support Vector
Machine 1, an affected-region feature classifier, is coupled with Support
Vector Machine 2, and the learned feature model is combined with the
results; σ is the value of the radial basis function parameter. The Support
Vector Machine can be compared with a model using the K-Nearest
Neighbor classifier. Suppose we encounter an unusual cat that shares
certain features with dogs; if we require a model that can accurately decide
whether it is a cat or a dog, such a model can be built using the Support
Vector Machine. We first train the model on a large number of images of
cats and dogs so that it can learn the different characteristics of each, and
then test it on this unusual animal. Because the support vector machine
draws a decision boundary between the two classes (cat and dog) and
selects the extreme cases (the support vectors), it can handle even this
extreme example. Given the support vectors, the classifier labels a new
sample x using the standard RBF-SVM decision function

$f(x) = \mathrm{sgn}\left( \sum_j a_j y_j k(x_j, x) + b^* \right)$

where $a_j$ is a Lagrange multiplier, $b^*$ is the bias, and $k(x_1, x_2)$ is a
kernel function; $x_1$ refers to the eigenvector obtained from the characteristic
model, $y_j$ refers to the class labels, and σ is the parameter value in the
radial basis function $k(x_1, x_2) = \exp(-\lVert x_1 - x_2 \rVert^2 / (2\sigma^2))$.
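A minimal scikit-learn sketch of an RBF-kernel SVM classifier of this kind; the feature files and hyperparameter values are illustrative assumptions, since the chapter does not specify them:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Hypothetical pre-extracted color/texture feature vectors and class labels
# (I: herpes, II: dermatitis, III: psoriasis)
X = np.load("skin_features.npy")
y = np.load("skin_labels.npy")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# RBF kernel; gamma plays the role of sigma (gamma = 1 / (2 * sigma**2))
clf = SVC(kernel="rbf", gamma="scale", C=1.0)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```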
10.9 RESULTS
The images are classified using the Support Vector Machine (SVM). We
used a system with a 10th-generation Intel Core i7 processor (2.60 GHz)
and 16 GB of RAM. We also classified images on the basis of skin color,
as explained in the methodology section. The support vectors are explained
through graphs (Figure 10.1), and the classified images are shown in
Figure 10.2. Figure 10.3 shows two types of data points, circles and
rectangles: in 2-D it is difficult to separate the two classes, but after
mapping to 3-D they can easily be distinguished and, after returning to the
original space, classified.

Figure 10.2 Picture division: (a) original pictures; (b) marker-controlled
watershed division; (c) marker control + grouping.

Ichthyosis is a non-curable disease, but using the SVM methodology we
can predict it at early stages and prevent its worsening by taking
precautionary measures.
10.10 FUTURE WORK
In future we can classify images on the basis of their color, which will
make it easier to predict and classify different skin diseases. We can also
develop a website or an application through which anyone can identify the
disease. Further, we can study other diseases that are curable but become
serious due to late identification. Other technologies can also be considered
for detection.
10.11 CONCLUSION
This chapter utilizes the technique of vertical image division to recognize
ichthyosis. A few irrelevant elements can be reduced through the image
filtering, image rotation, and Euclidean distance transformation applied in
image preprocessing. A perpendicular line through each point on the
principal centerline is then determined, and the epithelium can be divided
into ten vertical image regions. The gray-level co-occurrence matrix is
adopted to extract the texture features, and the region-pixel technique is
applied to extract attributes of the affected area. Finally, the support vector
machine is applied to classify data from more than two different skin
illnesses according to the texture features and the affected area.
REFERENCES
[1] S. Salimi, M. S. Nobarian, and S. Rajebi. “Skin disease images recognition
based on classification methods,” International Journal on Technical and
Physical Problems of Engineering. 22(7), 2015.
[2] M. Ganeshkumar and J. J. B. Vasanthi. “Skin disease identification using
image segmentation,” International Journal of Innovative Research in
Computer and Communication Engineering. 5(1): 5–6, 2017.
[3] S. Kolker, D. Kalbande, P. Shimpi, C. Bapat, and J. Jatakia. “Human skin
detection using RGB, HSV and YCbCr Color models,” Adv Intell Syst Res.
137, 2016.
[4] A. L. Kotian and K. Deepa. “Detection and classification of skin diseases
by image analysis using MATLAB,” Int J Emerging Res Manage Technol.
6(5): 779–784, 2017.
[5] Kumar, S. and A. Singh. “Image processing for recognition of skin diseases,”
Int J Comp Appl. 149(3): 37–40, 2016.
[6] Mazereeuw-Hautier, J., Hernandez-Martin, A., O’Toole, E. A., Bygum,
A., Amaro, Aldwin, C. “Management of congenital ichthyoses: European
guidelines of care,” Part Two. Br J. Dermatol. 180: 484–495, 2019.
[7] Vahlquist, A., Fischer, J., Törmä, H. “Inherited nonsyndromic
ichthyoses: An update on pathophysiology, diagnosis and treatment.” Am
J Clin Dermatol. 19: 51–66, 2018.
[8] Schlipf, N. A., Vahlquist, A., Teigen, N., Virtanen, M., Dragomir, A.,
Fishman, S. “Whole-exome sequencing identifies novel autosomal reces-
sive DSG1 mutations associated with mild SAM syndrome.” Br J Dermatol
174: 444–448, 2016.
[9] Kostarnoy, A. V., Gancheva, P. G., Lepenies, B., Tukhvatulin, A. I.,
Dzharullaeva, A. S., Polyakov, N. B. “Receptor Mincle promotes skin
allergies and is capable of recognizing cholesterol sulfate,” Proc Natl Acad
Sci U S A. 114: E2758–E2765, 2017.
[10] Proksch, E. “pH in nature, humans and skin.” J Dermatol. 45: 1044–10.
2018.
[11] Bergqvist, C., Abdallah, B., Hasbani, D. J., Abbas, O., Kibbi, A. G., Hamie,
L., Kurban, M., Rubeiz, N. “CHILD syndrome: A modified pathogenesis-
targeted therapeutic approach.” Am. J. Med. Genet. 176: 733–738, 2018.
[12] McAleer, M. A., Pohler, E., Smith, F. J. D., Wilson, N. J., Cole, C.,
MacGowan, S., Koetsier, J. L., Godsel, L. M., Harmon, R. M., Gruber,
R., et al. “Severe dermatitis, multiple allergies, and metabolic wasting
syndrome caused by a novel mutation in the N-terminal plakin domain of
desmoplakin.” J. Allergy. Clin. Immunol. 136: 1268–1276, 2015.
[13] Zhang, L., Ferreyros M, Feng W, Hupe M, Crumrine DA, Chen J, et al.
“Defects in stratum corneum desquamation are the predominant effect of
impaired ABCA12 function in a novel mouse model of harlequin ichthy-
osis.” PLoS One. 11: e0161465, 2016.
[14] Chan A., Godoy-Gijon, E., Nuno-Gonzalez, A., Crumrine, D., Hupe, M.,
Choi, E. H., et al. “Cellular basis of secondary infections and impaired des-
quamation in certain inherited ichthyoses.” JAMA Dermatol. 151: 285–
292, 2015.
[15] Zhang, H., Ericsson, M., Weström, S., Vahlquist, A., Virtanen, M., Törmä,
H. “Patients with congenital ichthyosis and TGM1 mutations overexpress
other ARCI genes in the skin: Part of a barrier repair response?” Exp
Dermatol. 28: 1164–1171, 2019.
[16] Zhang, H., Ericsson, M., Virtanen, M., Weström, S., Wählby, C.,
Vahlquist, A., et al. “Quantitative image analysis of protein expression and
colocalization in skin sections.” Exp Dermatol. 27: 196–199, 2018.
[17] Honda, Y., Kitamura, T., Naganuma, T., Abe, T., Ohno, Y., Sassa, T., et al.
“Decreased skin barrier lipid acyl ceramide and differentiation-dependent
gene expression in ichthyosis gene Nipal4- knockout mice.” J Invest
Dermatol. 138: 741–749, 2018.
Chapter 11
11.1 INTRODUCTION
Chest radiography is a cost-effective, commonly accessible, and easy to
use medical imaging technology and is a form of radiological examination
that can be used for diagnosis and screening of lung diseases. A chest X-
ray image includes the chest, lung, heart, airways, and blood vessels, and
it can be used by trained radiologists for diagnosing several abnormal
conditions. X-ray imaging is inexpensive and has a simple generation pro-
cess. Computer-aided techniques have the potential to use chest X-rays to
diagnose thoracic diseases accurately and with accessibility [1]. Computers
can be made to learn features that depict the data optimally for certain
problems. Increasingly higher-level features can be learned while the input
data is being transformed to output data using models [2]. The field of deep
learning has seen significant progress in applications like classification of
natural and medical images using computer vision approaches over the past
few years. Applications of these techniques in modern healthcare services
still remain a major challenge. A majority of chest radiographs around the
world are analyzed visually, which requires expertise and is time-consuming
[3]. The introduction of the ImageNet database has improved the perform-
ance of image captioning tasks. In addition to this, improvements in deep
Convolutional Neural Networks (CNN) enable them to recognize images
effectively.
Recent studies also use Recurrent Neural Networks (RNN), using features
from the deep CNNs to generate image captions accurately [4]. These
features may also be modified and combined with other observed features
from the image to retain important details that can help form a richer and
more informative description [5]. Caption generation models must be cap-
able of detecting objects present in an image and also capture and represent
relationships between them using natural language. Attention-based models
train on locally salient features and ignore redundant noise. Localization
and recognition of regions with salient features also allow us to generate
richer and more diverse captions [6]. A detailed diagnostic medical report,
generated automatically from a chest radiograph, is the goal of the present
work.
11.2 LITERATURE REVIEW
Previous related work indicates that significant research has been done in the
domain of both feature extraction and text report generation. This litera-
ture review has been divided into two parts: the first part describes previous
work related to methods for chest X-ray image analysis, and the second part
discusses image captioning and text generation studies.
Hu, Mengjie, et al. proposed an approach for quick, efficient, and automatic
diagnosis of chest X-rays, called multi-kernel depthwise convolution
(MD-Conv). Lightweight networks can make use of MD-Conv in place of
the depthwise convolution layer, and the lightweight MobileNetV2 is used
instead of networks like ResNet50 or DenseNet121. The approach aims to
provide a foundation for later research and studies related to lightweight
networks and the possibility of identifying diseases on embedded
and mobile devices [10]. Albahli, Saleh, et al. evaluated the effectiveness of
different CNN models that use Generative Adversarial Networks (GAN) to
generate synthetic data. Data synthesis is required as over-fitting is highly
possible because of the imbalanced nature of the training data labels. Out of
the four models whose performance was evaluated for the automatic detec-
tion of cardiothoracic diseases, the best performance was observed when
ResNet152 with image augmentation was used [11]. Existing methods
generally use the global image (global features) as input for network
learning, which introduces a limitation, because thoracic diseases occur in
small localized areas and images may be misaligned. This limitation is
addressed by Guan, Qingji, et al., who propose a three-branch attention-
guided CNN (AG-CNN). This approach learns on both local and global branches by first
generating an attention heat map using the global branch. This heat map is
used for generating a mask, which is later used for cropping a discrimina-
tive region to which local attention is applied. Finally, the local and global
branches are combined to form the fusion branch. Very high accuracy is
achieved using this approach with an average value of Area Under Receiver
Operating Characteristic Curve (AUC) as 0.868 using ResNet50 CNN, and
0.871 using DenseNet121 [12].
In addition to using existing data, some studies also proposed collection
and analysis of new data. Bustos, Aurelia, et al. proposed a dataset called
PadChest, which contains labels mapped onto the standard unified medical
language system. Instead of solely relying on automatic annotation tools,
trained physicians performed the task of manually labelling the ground truth
targets. The remaining reports that were not labeled manually were then
tagged using a deep learning neural network classifier and were evaluated
using various metrics [13]. Four deep learning models were developed by
Majkowska, Anna, et al. to detect four findings on frontal chest radiographs.
The study used two datasets, the first one was obtained with approval from
a hospital group in India and the second one was the publicly available
ChestX-ray14 dataset. A natural language processing system was created for
the prediction of image labels by processing original radiology reports. The
models performed on par with board-certified radiologists. The study
performed population-adjusted performance analysis on ChestX-ray14
dataset images and released the adjudicated labels. It also aims to provide a useful
resource for further development of clinically useful approaches for chest
radiography [14].
Text generation is an important objective, and previous research indicates
that performance is dependent on the data as well as the approach. An RNN
Long Short-Term Memory (LSTM) model that takes stories as input, creates
and trains a neural network on these input stories, and then produces a new
story from the learned data is described by Pawade, D., et al. The model
understands the sequence of words and generates a new story. The network
learns and generalizes across various input sequences instead of learning
individual patterns. Finally, it has also been observed that by adjusting the
values of different parameters of the network architecture, the train loss can
be minimized [15].
Melas-Kyriazi, et al. proposed a study in which the issue of lack of diver-
sity in the generated sentences in image captioning models is addressed.
Paragraph captioning is a relatively new task compared to simple image
single-sentence captioning. Training a single-sentence model on the visual
genome dataset, which is one of the major paragraph captioning datasets,
results in generation of repetitive sentences that are unable to describe the
diverse aspects of the image. The probabilities of the words that would
result in repeated trigrams are penalized to address the problem. The results
observed indicate that self-critical sequence training methods result in lack
of diversity. Combining them with a repetition penalty greatly improves
the baseline model performance. This improvement is achieved without any
architectural changes or adversarial training [16].
Jing et al. proposed a multi-task learning framework that can simultan-
eously perform the task of tag prediction and generation of text descriptions.
This work also introduced a co-attention mechanism. This mechanism was
used for localizing regions with abnormalities and to generate descriptions
for those regions. A hierarchical LSTM was also used to generate long
paragraphs. The results observed in this work were significantly better as
compared to previous approaches with a Bilingual Evaluation Understudy
(BLEU) score of 0.517 for BLEU-1 and 0.247 for BLEU-4. The performance
observed on other metrics was also better than other approaches [17]. Xue,
Yuan, et al. tackle the problem of creating paragraphs that describe the med-
ical images in detail. This study proposed a novel generative model which
incorporates CNN and LSTM in a recurrent way. The proposed model
in this study is a multimodal recurrent model with attention and is cap-
able of generating detailed paragraphs sentence by sentence. Experiments
performed on the Indiana University chest x-ray dataset show that this
approach achieves significant improvements over other models [18].
A Co-operative Multi-Agent System (CMAS) is proposed by Jing, et al.; it
consists of three agents: the planner, which is responsible for detecting an
abnormality in an examined area; the normality writer, which describes
the observed normality; and the abnormality writer, which describes the
abnormality detected. CMAS outperforms all other
described approaches as evident from the value of many of the metrics
used for evaluation. Extensive experiments, both quantitative and quali-
tative, showed that CMAS could generate reports that were meaningful by
describing the detected abnormalities accurately [19].
11.3 PROPOSED METHODOLOGY
This section gives an outline of the deep learning algorithms, terminologies,
and information about the data used in this work. It also explains the two
proposed methods in detail.
CNNs are most commonly used for analyzing images as a part of various
computer vision tasks. CNN makes use of convolution kernels or filters
as a part of a shared weight architecture. The network generates a feature
map for the input image by extracting local features with a smaller number
of weights as compared to traditional artificial neural networks. The feature
map can be passed through pooling or sub-sampling layers to decrease its
spatial resolution. Reducing the resolution also reduces the precision with
which the position of the feature is represented in the feature map. This is
important because similar features can be detected at various positions in
different images [20][21].
Deep CNNs have been made possible recently because of improvements
in computer hardware. As the network grows deeper, the performance
improves as more features are learned. But it also introduces the problem
of vanishing gradient where the gradient becomes smaller and smaller
with each subsequent layer, making the training process harder for the
network.
$h_t = f(h_{t-1}, x_t)$  (11.1)
4. Pretrained models

MobileNetV2 adds inverted residuals with linear bottlenecks to the existing
depthwise separable convolution feature [26].
Along with the two models mentioned above, the performance of
VGG16, ResNet152V2 and InceptionV3 is also compared on the same task.
The models were loaded with weights learned on the ImageNet dataset.
The ImageNet dataset contains 1.2 million training images and 50,000
validation images, labelled with 1,000 different classes. It is
a dataset organized according to the WordNet hierarchy and aims to provide
an average of 1,000 images to represent each synset present in WordNet.
11.3.2 Data
ChestX-ray14 is a medical imaging dataset containing 112,120 frontal-view
X-ray images of 30,805 unique patients; 14 common disease labels were
text-mined from the text radiological reports via natural language pro-
cessing techniques and assigned to each of the images [27]. For training the
feature extraction models in this work, a random sample of the ChestX-
ray14 dataset (available on Kaggle) consisting of 5 percent of the full dataset
was used. The sample consists of 5,606 images and was created for use
in kernels. Additionally, the dataset also contains data for disease region
bounding boxes for some images that can be used for visualizing the disease
region.
There is also a certain amount of data bias in the ChestX-ray14 dataset.
A few disease conditions present in the dataset have less prevalence as
compared to the number of examples where no abnormality is detected.
This bias can affect the performance on unseen test examples. This problem
can be resolved by assigning class weights to each class in the dataset. These
weights are inversely proportional to the frequency of each class, so that a
minority class is assigned a higher weight, and a majority class is assigned a
lower weight. Assigning class weights lowers the training performance but
ensures that the algorithm is unbiased towards predicting the majority class.
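A minimal sketch of such inverse-frequency class weighting; the exact normalization used in the original work is not stated, so the one below is an assumption:

```python
import numpy as np

def inverse_frequency_weights(labels):
    """labels: (num_samples, num_classes) binary indicator matrix."""
    freq = labels.mean(axis=0)                 # fraction of positives per class
    weights = 1.0 / np.clip(freq, 1e-6, None)  # rarer classes get larger weights
    return weights / weights.mean()            # keep the average weight at 1 (assumed)
```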
ChestX-ray14 contains images of chest X-rays along with the detected
thoracic disease labels. However, it does not contain the original reports
from which the labels were extracted. In this work, the Indiana University
chest X-ray collection from Open-i dataset was used for the report gener-
ation task. The dataset contains 7,470 images containing frontal and lateral
chest X-rays with 3,955 reports annotated with key findings, body parts,
and diagnoses.
11.3.3 Feature extraction
Feature extraction involves generating feature maps for input images that
represent important features in the input. This task can be defined as a sub-
task of chest disease detection in the current problem since the features
extracted are finally used for predicting labels. A deep CNN can be trained
on the dataset and made to learn important features. In this work, five
different CNN architectures and pre-trained models were employed to solve
a multi-label problem defined on a subset of the ChestX-ray14 dataset.
These models were trained on the dataset by making the entire model learn
new sets of features specific to this task. In a multi-label problem, a set of
classification labels can be assigned to the sample, as opposed to a single
label out of multiple possible labels in a multi-class problem.
The default final network layer was removed and a new dense layer
with 15 activation units was added. The images in the dataset were
resized to 224 × 224 before using them as input to the initial network.
Adam optimizer was used with the value of learning rate as 0.001. The
value of the learning rate was reduced by a factor of 10 whenever the
validation loss stopped decreasing. Training was done with binary cross-
entropy loss:
$\mathrm{Loss} = -\frac{1}{N_c} \sum_{i=1}^{N_c} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$  (11.2)
where $\hat{y}_i$ is the i-th predicted scalar value, $y_i$ is the corresponding
target value, and $N_c$ is the number of output classes or output scalar
values. Since the problem is a multi-label problem, each class is determined
with a separate binary classifier, where each output node should be able to
predict whether or not the input features correspond to that particular
class, irrespective of the other class outputs. (In the case of a multi-class
problem, categorical cross-entropy loss is used.) The new final classification
layer used sigmoid activation instead of softmax.
$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$  (11.3)
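A minimal Keras sketch of this training setup, assuming the DenseNet-121 backbone that performed best in this work (the batch size and data pipeline are omitted, since the chapter does not specify them):

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers, callbacks

base = tf.keras.applications.DenseNet121(
    weights="imagenet", include_top=False,
    input_shape=(224, 224, 3), pooling="avg")

# New multi-label head: 15 sigmoid units, one independent binary output per label
outputs = layers.Dense(15, activation="sigmoid")(base.output)
model = models.Model(base.input, outputs)

model.compile(optimizer=optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(multi_label=True)])

# Reduce the learning rate by a factor of 10 when validation loss plateaus
reduce_lr = callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.1)
# model.fit(train_ds, validation_data=val_ds, callbacks=[reduce_lr])
```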
Local features of the input image can be directly extracted from the pre-
vious layers of the CNN. These features were passed through an average
pooling layer while training the CNN to decrease the feature map dimen-
sion. The feature map was then fed to the fully connected classification
layer. Global features, on the other hand, can be extracted by reducing
the spatial resolution of the image with the help of average pooling
layers.
A CNN used for feature extraction serves as an encoder for the input
data. The data is encoded into a set of features by using various operations
like convolution (for detecting features), dimensionality reduction, and so
forth. The CNN learns to detect features in the input data automatically,
and manual feature definition is not required. Feature extraction can also be
used in other tasks like image segmentation and class-activation mapping.
11.3.4 Report generation
Report generation for the input image is the next important task. In this
task, the findings part of a detailed report, which lists the observations for
each examined area, was generated. The CNN encoder can be used to infer
the image and output a set of features as discussed above. The sequence-
to-sequence model accepts these features as input. This model uses the
input features to output a sequence of words that form the report and is an
example of feature extraction where the encoder is simply used to predict
the features instead of learning them for the input images.
As mentioned above, the Open-i dataset was used for this task. Instead
of using both frontal and lateral images, only frontal images with the
corresponding findings from the report were used, as these images are used
as input for the encoder, which was originally trained on only frontal images
of chest x-rays. While the encoder is still capable of detecting features in
lateral images, these features will not accurately represent the condition
observed in the chest x-ray.
The findings for each example were pre- processed and tokenized to
remove redundant characters and words. Each tokenized word was mapped
to a 300 dimensional vector defined in the pre-trained GloVe (Global Vectors
for Word Representation) model. These vectors were then made part of an
embedding matrix that was later used to represent each word, using a 300
dimensional vector.
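A minimal sketch of building such an embedding matrix from pre-trained 300-dimensional GloVe vectors; the file name and the word-index mapping are illustrative assumptions:

```python
import numpy as np

EMBED_DIM = 300

# Load pre-trained GloVe vectors: each line holds a word and its 300 floats
glove = {}
with open("glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

def build_embedding_matrix(word_index):
    """word_index: dict mapping each vocabulary word to its integer id (>= 1)."""
    matrix = np.zeros((len(word_index) + 1, EMBED_DIM))  # row 0 kept for padding
    for word, idx in word_index.items():
        vec = glove.get(word)
        if vec is not None:            # out-of-vocabulary words stay all-zero
            matrix[idx] = vec
    return matrix
```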
Two existing methods were adopted and implemented for this task:
$\mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$  (11.4)
The categorical cross-entropy loss function was used to train the model:

$\mathrm{Loss} = -\sum_{i=1}^{N} y_i \log \hat{y}_i$  (11.5)

where $\hat{y}_i$ is the i-th predicted scalar value, $y_i$ is the corresponding
target value, and N is the output size.
The final report generation model can be evaluated by using methods
like greedy search and beam search. Greedy search chooses the word which
maximizes the conditional probability for the current generated sequence of
words. On the other hand, beam search chooses the N most likely words
for the current sequence, where N is the beam width.
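A minimal sketch of greedy decoding for such a model; the decoder interface and the special tokens are assumptions:

```python
import numpy as np

def greedy_decode(decoder, image_features, word_index, index_word, max_len=40):
    """decoder(features, seq) is assumed to return a probability
    distribution over the vocabulary for the next word."""
    seq = [word_index["<start>"]]
    for _ in range(max_len):
        probs = decoder(image_features, seq)
        next_id = int(np.argmax(probs))          # greedy: most probable next word
        if index_word[next_id] == "<end>":
            break
        seq.append(next_id)
    return " ".join(index_word[i] for i in seq[1:])
```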
2. Attention-based architecture

An increase in the length of the sequence negatively affects the performance
of the first method. The attention-based method instead focuses on
important parts of the sequence. Xu, Kelvin, et al. describe the use of a
CNN as the encoder in an attention-based encoder-decoder architecture. The
features extracted from the image by the CNN are also known as annota-
tion vectors. In order to obtain a correspondence between portions of the
input image and the encoded features, the local features were extracted from
a lower convolutional layer without pooling the outputs to get a higher
dimensional output [29].
The decoder, as introduced by Bahdanau, et al. conditions the probability
on a context vector ci for each target word yi , where ci is dependent on
the sequence of annotation vectors. These annotation vectors are generated
by the encoder as stated above. Annotation vector ai contains information
about the input, with strong focus on the parts surrounding the i -th fea-
ture extracted at a certain image location. A weighted sum of annotations
{a1 ,…, aL } is used for computing the context vector ci .
$c_i = \sum_{j=1}^{L} \alpha_{ij} a_j$  (11.6)

$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{L} \exp(e_{ik})}$  (11.7)

where

$e_{ij} = f_{att}(h_{i-1}, a_j)$  (11.8)
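A minimal NumPy sketch of equations (11.6)–(11.8), assuming an additive (Bahdanau-style) form for the scoring function f_att; the parameter shapes are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(h_prev, annotations, W_h, W_a, v):
    """h_prev: (d_h,) previous decoder state; annotations: (L, d_a) vectors a_j.
    W_h (d, d_h), W_a (d, d_a), and v (d,) are learned score parameters."""
    # e_ij = f_att(h_{i-1}, a_j): additive score for each annotation vector
    scores = np.array([v @ np.tanh(W_h @ h_prev + W_a @ a) for a in annotations])
    alpha = softmax(scores)                               # eq. (11.7)
    context = (alpha[:, None] * annotations).sum(axis=0)  # eq. (11.6)
    return context, alpha
```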
11.3.5 Evaluation metrics
This work reports the AUC for the feature extraction task. The AUC is
computed for each of the labels individually and then averaged across those
labels. A Receiver Operating Characteristic curve is a graph that plots the
classification performance of a model at all classification thresholds. Two
parameters are plotted in this curve: the true positive rate and the false
positive rate.
11.4.1 Feature extraction
Section 3.2 states that a smaller (5%) sample of the dataset was used for
training the models for feature extraction. The results indicate that des-
pite the small size of training data, the models generalize moderately well
on unseen data. The AUC values for the five models that were trained
have been compared in Figure 11.1. As mentioned in section 3.5, AUC is a
superior metric for multi-label problems compared to accuracy. The highest
AUC with a value of 0.7165 is achieved by the pre-trained DenseNet-121
model. The performance of the model in the case of MobileNetV2 indicates
overfitting according to the training and validation AUC values observed as
the performance on the training set was significantly better as compared to
the test/validation set.
Section 3.2 also discusses the data imbalance problem in the ChestX-ray14
dataset. The best performing model, DenseNet-121, was also trained on the
dataset after assigning class weights; however, the performance declines,
which is explained by the actual frequency of the classes. VGG-16 was the
worst performing model of the five, with the lowest training AUC.
11.4.2 Report generation
Table 11.1 contains the various BLEU scores and METEOR score calculated
for various models that were trained. These scores were evaluated for indi-
vidual reports and represent how accurately the model predicts a report for
an input chest x-ray image. The two approaches described in section 3.4
are compared in Table 11.1. Observations indicate that the attention-based
model performed better than the CNN-RNN model when greedy search
was used. Beam search with beam width of 5 was used for generating the
inference for the encoder-decoder model and performed significantly better
than greedy search. However, beam search used for the attention-based
model yielded poor results as compared to greedy search. As mentioned in
section 3.4, only frontal chest x-ray images from the Open-i dataset were
used for training. This reduces the number of features extracted by half, as
lateral images are not considered. The original and predicted report for a
random image is shown in Table 11.2.
11.5 CONCLUSION
In the medical field, the amount of openly available task-specific data is
limited. Data collection is difficult as precision and expertise are required
when performing tasks such as data labelling. This work touches upon the
use of deep learning techniques in solving medical problems, specifically
the problem of automated chest x-ray classification and report generation.
In this work, two important objectives were discussed along with methods
used for achieving them. The methods include comparison of various
pretrained CNN models for feature extraction and implementation of two
CNN-RNN encoder-decoder models, one with attention and the other
without.
REFERENCES
1. Dong Y, Pan Y, Zhang J, Xu W (2017) Learning to read chest X-ray images
from 16,000+ examples using CNN. In: 2017 IEEE/ACM International
Conference on Connected Health: Applications, Systems and Engineering
Technologies (CHASE), pp. 51–57.
2. Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, Van
Der Laak JA, Van Ginneken B, Sánchez CI (2017) A survey on deep learning
in medical image analysis. Medical image analysis, 42, pp. 60–88.
3. Wang H, Xia Y (2018) ChestNet: A deep neural network for classification of
thoracic diseases on chest radiography. arXiv preprint arXiv:1807.03058.
4. Shin HC, Roberts K, Lu L, Demner-Fushman D, Yao J, Summers RM
(2016) Learning to read chest x-rays: Recurrent neural cascade model for
automated image annotation. In: Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 2497–2506.
5. Ma S, Han Y. (2016) Describing images by feeding LSTM with structural
words. In: 2016 IEEE International Conference on Multimedia and Expo
(ICME), pp. 1–6.
6. Johnson J, Karpathy A, Fei-Fei L. (2016) Densecap: Fully convolutional
localization networks for dense captioning. In: Proceedings of the IEEE con-
ference on computer vision and pattern recognition, pp. 4565–4574.
7. Boag W, Hsu TMH, McDermott M, Berner G, Alesentzer E, Szolovits P.
(2020) Baselines for chest x-ray report generation. In: Machine Learning for
Health Workshop, pp. 126–140.
8. Liu G, Hsu TMH, McDermott M, Boag W, Weng WH, Szolovits P, Ghassemi
M. (2019) Clinically accurate chest x-ray report generation. In: Machine
Learning for Healthcare Conference, pp. 249–269.
9. Rahman T, Chowdhury ME, Khandakar A, Islam KR, Islam KF, Mahbub
ZB, Kadir MA, Kashem S (2020) Transfer learning with deep convolutional
neural network (CNN) for pneumonia detection using chest X-ray. Applied
Sciences, 10(9): p. 3233.
10. Hu M, Lin H, Fan Z, Gao W, Yang L, Liu C, Song Q (2020) Learning to rec-
ognize chest-Xray images faster and more efficiently based on multi-kernel
depthwise convolution. IEEE Access, 8, pp. 37265–37274.
11. Albahli S, Rauf HT, Arif M, Nafis MT, Algosaibi A (2021) Identification of
thoracic diseases by exploiting deep neural networks. Neural Networks, 5:
p. 6.
12. Guan Q, Huang Y, Zhong Z, Zheng Z, Zheng L, Yang Y (2018) Diagnose
like a radiologist: Attention guided convolutional neural network for thorax
disease classification. arXiv preprint arXiv:1801.09927.
12.1 INTRODUCTION
Image captioning is the task of describing a given image by generating
text, also known as a caption, for the image. It is easy for humans to write
short, meaningful sentences for an image by understanding the constituents
and activities in it; when the same work has to be performed automatically
by a machine, it is termed image captioning. It is a worthy challenge for
researchers to solve, with potential applications in real life. In recent years
much research has focused on object detection mechanisms, but automated
image caption generation is a far more challenging task than object
recognition, because of the additional task of detecting the actions in the
image and then converting them into a meaningful sentence based on the
extracted features of the image. As long as machines cannot talk and
behave like humans, natural image caption generation will remain a
challenge to be solved.
Andrej Karpathy, the director of AI at Tesla, worked on the image
captioning problem as part of his PhD thesis at Stanford. The problem
involves the use of computer vision (CV) and natural language processing
(NLP) to extract features from images by understanding the environment,
and then to generate captions by sequence learning. Figure 12.1 shows an
example that illustrates the problem. A human can easily visualize an
image while machines cannot easily do so, and different people can give
different captions for the same image: several alternative sentences can be
used to describe Figure 12.1.
Certainly, such alternative descriptions are all applicable to Figure 12.1.
But the argument is this: it is easy for humans to engender various
descriptions of the same image.
The image captioning problem deals with both deep learning techniques
and sequence learning, but it suffers from the high-variance problem.
Ensemble learning has therefore been used in image captioning: multiple
models are trained and finally combined to offer better results. There is
also a drawback: if the number of models in the ensemble increases, the
computational cost also increases, so it is good to use a limited number of
models in ensemble learning.
In the proposed work, deep learning is used for the generation of
descriptions for images. The image captioning problem has been solved in
three phases: (a) image feature extraction; (b) sequence processing; and
(c) decoding. For the task of image feature extraction, a 16-layer pre-trained
VGG model and an Xception model, both trained on the ImageNet dataset,
are used. The model is based on a CNN architecture. The extracted feature
vector, after dimensionality reduction, is fed into an LSTM network. The
decoder phase combines the outputs of the other two.
This chapter is divided into the following sections. Section 2 presents the
literature review. Section 3 describes the datasets, the different architectures,
and the proposed work. Section 4 contains the results. Section 5 gives a
brief discussion and future prospects. Section 6 concludes the proposed
work.
12.2 RELATED WORK
Previous approaches in visual recognition have focused on image
classification problems, and it is now easy to provide labels for a certain
number of categories of objects in an image. During the last few years,
researchers have had good success in image classification tasks using deep
learning techniques. But image classification using deep learning provides
only limited information about the items in the scene. Image captioning is
a much more multifaceted undertaking and requires far more knowledge: it
must find the relations among various objects by combining information
about their characteristics and activities. Once processing is complete, the
objects have been identified, and their various features have been
combined, the final task is to express this gathered information as a
conversational message in the form of a caption.
Caltech 101 was one of the first datasets for multiclass classification,
containing 9,146 images and 101 classes. Later, Caltech 256 increased the
number of object classes to 257 (256 object classes plus 1 clutter class) and
consists of 30,607 real-world images. The ImageNet dataset, also used for
classification, is larger in both scale and diversity than Caltech 101 and
Caltech 256.
Several algorithms and techniques have been proposed by researchers to
solve the image captioning problem; the latest approaches follow deep
learning-based architectures. Visual features can be extracted
using convolutional neural networks and, to describe the features, an LSTM
can be used to generate meaningful and conversational sentences.
Krizhevsky et al. [4] implemented the task with non-saturating neurons in
a neural network and proposed the use of dropout for regularization to
decrease overfitting; the proposed network contains convolution layers
followed by max-pooling layers and uses a softmax function for the next
predicted word. Mao et al. proposed combining the CNN and RNN
architectures: the model extracts features using a CNN and then generates
words sequentially by predicting probabilities for each next word,
producing a sentence by combining the predicted words. Xu et al. [5]
proposed a model using an attention mechanism with the LSTM network.
It focuses on salient features of the image and increases the probability of
words based on these salient features while neglecting less important ones.
The model was trained by optimizing a variational lower bound using
standard backpropagation techniques, and it learned to identify object
boundaries while still generating an effective descriptive statement. Yang
et al. [6] suggested a method for automatically generating a natural
language explanation of a picture that helps in understanding the image.
Pan et al. [7] experimented broadly with various network designs on
massive datasets containing a variety of subject formats and introduced a
distinctive model that outperformed previously proposed models in terms
of captioning accuracy. X. Zeng et al. [11] worked on region detection in
ultrasound images for medical diagnosis, dealing with gray-scale, noisy
medical images of low resolution. The authors used a VGG16 model
pretrained on the ImageNet dataset to isolate the attributes from the
images; for sequence generation an LSTM network was used, with an
alternate training method for the detection model, and performance was
evaluated using BLEU, METEOR, ROUGE-L, and CIDEr. The image
captioning problem was thoroughly researched by Karpathy and Fei-Fei [8]
in his PhD work, which proposed a multimodal RNN that learns the
alignment between image regions and sentence fragments; they used the
dataset of images to learn the interactions between objects in an input
image.
A different approach, used by Harshitha Katpally [9], is based on ensemble
learning methods to address the problem of high variance during the
training of deep networks. The proposed ensemble learning approaches
differ from conventional ones; the work analyzes earlier approaches to
find the best ensemble learning strategy and compares the results across
approaches. Deng et al. [10] introduced the ImageNet database, a
wide-ranging collection of pictures based on the structure of WordNet; the
various classes are organized by ImageNet in a densely populated semantic
hierarchy. Vinyals et al. [12] proposed a deep recurrent neural network
architecture to generate descriptions for pictures, while ensuring that the
produced sentence accurately describes the target image.
12.3.1 Data set
In the present study, the Flickr8k dataset has been used for image caption
generation. It is easy to work with this dataset, as it is relatively small and
realistic. The dataset contains a total of 8,000 images, predefined as 6,000
images for training, 1,000 images for development, and another 1,000
images for testing. In the dataset each image has 5 diverse descriptions that
provide a strong understanding of the prominent objects and actions in the
image. Each image has a predefined label/identity, so it is easy to find the
related captions in the flickr8k_text data. Other, larger datasets that can
also be used are the Flickr30k and MS COCO datasets, but these can take
days just to train the network and have yielded only marginally better
accuracy in the literature.
take an input picture, process it, allocate rank to things within the image,
and then identify and classify the different objects in the images. Figure 12.2
shows the convolution operation in the maxpooling layer in CNNs [22, 24].
The CNN architecture is composed of the following layers: (a) a convolution layer (CL); (b) a pooling layer (PL); and (c) a fully connected (FC) layer. Figure 12.3 shows the CNN architecture used for the classification. The CL is used to generate convolved representations of the input picture: matrix multiplication is executed between the filter and every single slice of the picture matrix having the same dimensions as the kernel. The PL has a similar purpose to the CL, but with the extra ability to find the more dominant features in the picture. There are two categories of PL, average pooling and max pooling, and these layers reduce the dimension of the input matrix. Max pooling is considered the better pooling method, as it contributes a de-noising component and removes the noisy features from the image. The last step is to compress all the characteristics obtained from the previous layers and feed them into a neural network (NN) for classification. In the FC layer, all the non-linear relationships among the significant characteristics are learned. This compressed vector is fed into the NN, which is trained using backpropagation and categorizes the objects via a softmax function.
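A minimal Keras sketch of the CL, PL, and FC stack just described is shown below; the input size, filter counts, and class count are illustrative placeholders, not the chapter's exact configuration.

```python
# Illustrative CNN: convolution (CL) -> max pooling (PL) -> fully connected (FC).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(96, 96, 3)),  # CL
    layers.MaxPooling2D((2, 2)),                 # PL: keeps dominant features
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                            # compress features to a vector
    layers.Dense(128, activation="relu"),        # FC: learns non-linear relations
    layers.Dense(10, activation="softmax"),      # class probabilities
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```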
CNNs extract features from the images by scanning the picture from left to right and top to bottom, and then combining all the features that are detected. In the present work, a feature vector of dimension 2048 has been extracted from the third-from-last layer of the pretrained VGG-16 network.
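The sketch below shows one common way of extracting such a feature vector from a pretrained VGG16 in Keras; note that VGG16's fully connected layers are 4096-dimensional, so the exact dimensionality depends on which layer is tapped, and the image path is a placeholder.

```python
# Feature extraction with a pretrained VGG16 (classification head removed).
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

base = VGG16(weights="imagenet")
extractor = Model(inputs=base.input, outputs=base.layers[-2].output)  # penultimate layer

img = image.load_img("example.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
features = extractor.predict(x)   # shape (1, 4096) for this layer of VGG16
```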
12.3.3 Proposed model
The proposed model has been divided into two parts. The first part uses a CNN for attribute extraction from the images, followed by an LSTM: the LSTM network uses the data from the CNN to assist in generating sentences for the image. So, in the proposed work, both these networks have been combined.
The last layer has a size equal to the vocabulary size; this is done to get a probability for each word in the sequence learning process, so that the generated sentence describes the picture that was used as input. The RNN performs a sequence learning process, forecasting the subsequent term based on the image features and on the information from the previous output; the output is then provided to the next cell as input to forecast the subsequent word. The input to the RNN is a word embedding layer generated for the words of the vocabulary.
iv. Evaluation of the model: Testing is done on 1,000 images of the Flickr8k dataset. For evaluation, the BLEU metric is used; it measures the similarity between the generated caption and the reference sentences, so a different BLEU score is obtained for each image. Figure 12.8 shows the architecture of the proposed model.
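To make the two-part design concrete, here is a hedged Keras sketch of a CNN-feature plus embedding-LSTM decoder of the kind described; vocab_size, max_len, and feat_dim are hypothetical placeholder values.

```python
# Sketch of the caption model: image features and word embeddings are merged,
# and the output layer has one unit per vocabulary word (softmax).
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_len, feat_dim = 7579, 34, 4096   # placeholder values

img_in = Input(shape=(feat_dim,))                       # CNN feature vector
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

seq_in = Input(shape=(max_len,))                        # partial caption so far
seq_vec = LSTM(256)(Embedding(vocab_size, 256, mask_zero=True)(seq_in))

merged = Dense(256, activation="relu")(add([img_vec, seq_vec]))
next_word = Dense(vocab_size, activation="softmax")(merged)  # size = vocabulary

model = Model(inputs=[img_in, seq_in], outputs=next_word)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```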
12.4.1 Evaluation metrics
a. Accuracy metrics
The precision of the model is tested on the test data. Each image in the test dataset is fed as input and an output caption is generated. The closeness of this caption to the dataset captions for the same image gives the accuracy of the model. The evaluation of generated captions can be done using metrics such as BLEU (Bilingual Evaluation Understudy), CIDEr (Consensus-based Image Description Evaluation), and METEOR (Metric for Evaluation of Translation with Explicit Ordering). Each metric gives a grade that establishes how close the evaluated text is to the reference text. In the present work, the BLEU metric has been used for the performance evaluation. BLEU scores evaluate the generated text against one or more reference sentences. For the Flickr8k dataset there are 5 references given for each input image, so the BLEU score is calculated against all of the reference sentences for the input image, and BLEU scores are calculated for 1-gram, 2-gram, 3-gram, and cumulative 4-gram matches. The BLEU score lies in the range 0 to 1, and a score near 1.0 indicates better results.
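For instance, cumulative 1- to 4-gram BLEU scores can be computed with NLTK as in the following sketch; the two reference sentences are toy examples standing in for the five Flickr8k references.

```python
# Scoring one generated caption against reference captions with BLEU.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [r.split() for r in [
    "a dog runs through the grass",
    "a brown dog is running in a field",
]]
candidate = "a dog is running in the grass".split()

smooth = SmoothingFunction().method1
weights = [(1, 0, 0, 0), (0.5, 0.5, 0, 0),
           (1/3, 1/3, 1/3, 0), (0.25, 0.25, 0.25, 0.25)]
for n, w in enumerate(weights, start=1):
    score = sentence_bleu(references, candidate, weights=w,
                          smoothing_function=smooth)
    print(f"cumulative BLEU-{n}: {score:.3f}")
```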
12.4.3 Examples
Table 12.2 Examples
12.6 CONCLUSIONS
In the present work, deep learning techniques have been used to generate captions automatically from images. This will aid visually impaired individuals in better perceiving their environments. The proposed model draws its inspiration from a hybrid CNN-RNN model: the CNN has been employed to extract the characteristics from pictures, and the obtained characteristics are fed as input to an LSTM network through a dense layer. Based on the existing research, the proposed model achieves
REFERENCES
1. Theckedath, Dhananjay and R. R. Sedamkar. “Detecting affect states using
VGG16, ResNet50 and SE-ResNet50 networks.” SN Computer Science 1
(2020): 1–7.
2. Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. “Show
and tell: A neural image caption generator.” In Proceedings of the IEEE con-
ference on computer vision and pattern recognition, pp. 3156–3164. 2015.
3. Tan, Ying Hua and Chee Seng Chan. “Phrase-based image caption generator
with hierarchical LSTM network.” Neurocomputing 333 (2019): 86–100.
4. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classifi-
cation with deep convolutional neural networks.” Advances in neural infor-
mation processing systems 25 (2012): 1097–1105.
5. Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville,
Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. “Show, attend and
tell: Neural image caption generation with visual attention.” In International
conference on machine learning, pp. 2048–2057. PMLR, 2015.
6. Yang, Zhongliang, Yu-Jin Zhang, Sadaqat ur Rehman, and Yongfeng
Huang. “Image captioning with object detection and localization.” In
International Conference on Image and Graphics, pp. 109–118. Springer,
Cham, 2017.
7. Pan, Jia-Yu, Hyung-Jeong Yang, Pinar Duygulu, and Christos Faloutsos.
“Automatic image captioning.” In 2004 IEEE International Conference on
Multimedia and Expo (ICME) vol. 3, pp. 1987–1990. IEEE, 2004.
8. Karpathy, Andrej, and Li Fei-Fei. “Deep visual-semantic alignments for gen-
erating image descriptions.” In Proceedings of the IEEE conference on com-
puter vision and pattern recognition, pp. 3128–3137. 2015.
9. Katpally, Harshitha. “Ensemble Learning on Deep Neural Networks for
Image Caption Generation.” PhD diss., Arizona State University, 2019.
10. Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei.
“Imagenet: A large-scale hierarchical image database.” In 2009 IEEE con-
ference on computer vision and pattern recognition, pp. 248–255. IEEE,
2009.
11. Zeng, Xianhua, Li Wen, Banggui Liu, and Xiaojun Qi. “Deep learning
for ultrasound image caption generation based on object detection.”
Neurocomputing 392 (2020): 132–141.
12. Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. “Show
and tell: A neural image caption generator.” In Proceedings of the IEEE con-
ference on computer vision and pattern recognition, pp. 3156–3164. 2015.
13. Elamri, Christopher and Teun de Planque. “Automated Neural Image Caption Generator for Visually Impaired People.” Department of Computer Science, Stanford University.
14. Tanti, Marc, Albert Gatt, and Kenneth P. Camilleri. “What is the role of
recurrent neural networks (rnns) in an image caption generator?.” arXiv
preprint arXiv:1708.02043 (2017).
15. Maru, Harsh, Tss Chandana, and Dinesh Naik. “Comparitive study of
GRU and LSTM cells based Video Captioning Models.” In 2021 12th
International Conference on Computing Communication and Networking
Technologies (ICCCNT), pp. 1–5. IEEE, 2021.
16. Aneja, Jyoti, Aditya Deshpande, and Alexander G. Schwing. “Convolutional
image captioning.” In Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 5561–5570. 2018.
17. Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. “Neural machine
translation by jointly learning to align and translate.” arXiv preprint
arXiv:1409.0473 (2014).
18. Karpathy, Andrej and Li Fei-Fei. “Deep visual-semantic alignments for gen-
erating image descriptions.” In Proceedings of the IEEE conference on com-
puter vision and pattern recognition, pp. 3128–3137. 2015.
19. Jonas, Jost B., Rupert RA Bourne, Richard A. White, Seth R. Flaxman, Jill
Keeffe, Janet Leasher, Kovin Naidoo et al. “Visual impairment and blindness
due to macular diseases globally: a systematic review and meta-analysis.”
American Journal of Ophthalmology 158, no. 4 (2014): 808–815.
20. Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh,
Sean Ma, Zhiheng Huang et al. “Imagenet large scale visual recognition
challenge.” International Journal of Computer Vision 115, no. 3 (2015):
211–252.
21. Abouhawwash, M., & Alessio, A. M. (2021). Multi-objective evolutionary
algorithm for pet image reconstruction: Concept. IEEE Transactions on
Medical Imaging, 40(8), 2142–2151.
22. Balan, H., Alrasheedi, A. F., Askar, S. S., & Abouhawwash, M. (2022). An
Intelligent Human Age and Gender Forecasting Framework Using Deep
Learning Algorithms. Applied Artificial Intelligence, 36(1), 2073724.
23. Abdel-Basset, M., Mohamed, R., Elkomy, O. M., & Abouhawwash, M.
(2022). Recent metaheuristic algorithms with genetic operators for high-
dimensional knapsack instances: A comparative study. Computers &
Industrial Engineering, 166, 107974.
24. Abdel-Basset, M., Mohamed, R., & Abouhawwash, M. (2022). Hybrid
marine predators algorithm for image segmentation: analysis and validations.
Artificial Intelligence Review, 55(4), 3315–3367.
Chapter 13
13.1 INTRODUCTION
Problems in datasets may lead to loss or misinterpretation of data instances. Nowadays, machine learning methods are becoming efficient enough to handle these problems. The problems in real-world datasets may be due to distribution disparity among the classes or due to missing values. Distribution disparity is also called class imbalance, which arises from an imbalance in the distribution of data instances among classes. A class imbalance problem creates an impact only when there is a need to study the behavior of minority class instances. Class imbalance problems are seen in many applications, such as fault detection
[1], bankruptcy prediction [2], natural language processing [3], credit scoring [4], Twitter spam detection [5], and so forth. A class imbalance problem combined with missing values becomes even more complicated; missing values are themselves a huge problem that hampers classification. There are
various types of missing values, such as MCAR (missing completely at
random), MAR (missing at random), NMAR (not missing at random) [6],
and so forth.
Real-world datasets contain missing values as well as class imbalance problems. So far, the techniques developed for these problems operate in separate phases, which may lead to problems in classification. Class imbalanced datasets are usually not so complicated on their own, but combined with other intrinsic difficulties, such as small disjuncts, class overlapping, and missing values, they become more complex. Many types of techniques have been developed to deal with class imbalance, such as data level class imbalance handling techniques, algorithmic level techniques, and hybrid techniques. Data level imbalance handling uses two different strategies, oversampling and undersampling. Oversampling generates synthetic instances, either by repeating examples (ROS) [7] or by using random value methods like linear interpolation (SMOTE) [8]. Undersampling removes majority class instances to create balance in the dataset.
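A minimal sketch of these data level techniques, using the imbalanced-learn package on a synthetic imbalanced problem, is given below.

```python
# Data level class imbalance handling: ROS, SMOTE, undersampling (CNN),
# and the combined SMOTE-Tomek-link.
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import CondensedNearestNeighbour
from imblearn.combine import SMOTETomek

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

X_ros, y_ros = RandomOverSampler(random_state=0).fit_resample(X, y)  # repeat minority examples
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)                # interpolate synthetic points
X_cnn, y_cnn = CondensedNearestNeighbour(random_state=0).fit_resample(X, y)  # remove majority instances
X_smt, y_smt = SMOTETomek(random_state=0).fit_resample(X, y)         # oversample, then clean Tomek links
```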
13.2 RELATED WORK
In this section, we present a review of class imbalance and of missing values in detail, and also of approaches that deal with missing values under class imbalance.
minority class are calculated using k nearest neighbors, and those instances which lie in the danger zone are induced. Other researchers create instances in the safe zone using SMOTE, a process called Safe-SMOTE, though this may lead to overfitting [23]. To remove the overgeneralization problem raised by SMOTE, various authors have developed multiple techniques: in [24], the authors provide SMOTE-style oversampling with a direction for generating instances, without considering the outliers present with the minority class in the datasets. In [25], the authors developed a radial-based oversampling technique to handle class imbalance, trying to overcome the problems of overfitting and overgeneralization, and to perform oversampling in noisy imbalanced data classification.
In this way, the method provides the required missing values for incomplete datasets.
13.3 METHODOLOGY
In this section, we discuss the effect of class imbalance and of missing value imputation techniques on datasets having both class imbalance and missing values, and we discuss techniques developed for tackling class imbalance and missing values together, such as the fuzzy information decomposition (FID) technique. For this, we experiment with different algorithms, described in the experiment and results sections. In the experiment section, we discuss the dataset descriptions and the experiments.
Datasets
In order to perform the analysis, we collected 18 class imbalanced datasets from different repositories. MC2, kc2, pc1, pc3, and pc4 are taken from the PROMISE repository; these datasets concern software defects, with instances ranging from 194 to 1,077, numbers of attributes ranging from 21 to 39, and imbalance ratios (IR) ranging from 1.88 to 10.56. The heart, Bupa, Mammographic, and ILPD datasets are taken from the UCI repository, and the rest of the datasets are taken from the KEEL dataset repository [13, 23]. Table 13.1 lists the datasets used in this chapter, where the first column gives the name of the dataset, the second the size (total number of instances), the third the number of attributes, and the last the imbalance ratio (IR). The IR (majority to minority ratio) of the datasets varies from 1.05 to 22.75, the size varies from 125 to 1,484, and the number of attributes from 5 to 39. We group the datasets according to class imbalance ratio (i.e., 1–5, 5–10, 10–22) for performance evaluation.
To perform the analysis, we create missing values randomly at different percentages, that is, 5, 10, 15, and 20 percent of each attribute, excluding the target attribute, as sketched below.
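The MCAR injection can be sketched as follows; the attribute matrix here is synthetic stand-in data.

```python
# Inject values missing completely at random (MCAR) into each attribute.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))          # stand-in attribute matrix (no target)

def inject_mcar(X, fraction):
    """Set `fraction` of the entries of every column to NaN."""
    X = X.copy()
    n_miss = int(round(fraction * X.shape[0]))
    for col in range(X.shape[1]):
        rows = rng.choice(X.shape[0], size=n_miss, replace=False)
        X[rows, col] = np.nan
    return X

X_5, X_10, X_15, X_20 = (inject_mcar(X, f) for f in (0.05, 0.10, 0.15, 0.20))
```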
Experimental setup
In this study, we analyze the effects of missing values in a class imbalance environment. To perform this, we use SMOTE [8] and ROS [8] as oversampling techniques, CNN [24] as an undersampling technique, and SMOTE-Tomek-link [25] as a combined (oversampling and undersampling) resampling technique in the missing value and class imbalance environment. For missing value imputation, we use Expectation Maximization (EM), MICE, K-nearest-neighbor (KNN), Mean, and Median. Results on the original (ORI) datasets are also calculated alongside these techniques, without using any missing value imputation or resampling technique. Finally, we perform a comparison with the Fuzzy Information Decomposition (FID) technique, used for missing value imputation and oversampling in case of
Figure 13.1 General model for missing value with class imbalance handling techniques.
considered as the training part; resampling is performed over the training part, the classifier is trained on it, and the classifier is then tested. The procedure is repeated n times, and we take the average of all performances, as in the sketch below.
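One fold of this protocol can be sketched as follows; the classifier is a placeholder choice, since the chapter does not fix one here, and imputation is fitted on the training part only to avoid leakage.

```python
# Impute on the training part, resample it, train, and score with G-mean.
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.metrics import geometric_mean_score

def evaluate(X, y, n_splits=5):
    scores = []
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train, test in folds.split(X, y):
        imputer = KNNImputer().fit(X[train])                 # training part only
        X_tr, y_tr = SMOTE(random_state=0).fit_resample(
            imputer.transform(X[train]), y[train])           # resample training part
        clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
        y_pred = clf.predict(imputer.transform(X[test]))
        scores.append(geometric_mean_score(y[test], y_pred))
    return np.mean(scores)                                   # average over folds
```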
13.4 RESULTS
Our results are divided into two main sections: an overall performance section, and a section on the relationship between class imbalance and missing values.
Figure 13.4 Comparison between FID, ORI, and missing value imputation techniques using the SMOTE technique.
Figure 13.5 Comparison between FID and ORI and missing value imputation techniques
using CNN technique.
Figure 13.6(a & b) Comparison of FID and ORI with missing value imputation techniques
using ROS and SMT.
shows the average of the missing value imputation techniques using CNN, together with the FID and ORI techniques, and the Y-axis shows the G-mean. At 0 percent missing values, only two techniques, ORI and FID, perform well. However, with an increase in missing values, the ORI technique shows poorer results compared to the other missing value imputation techniques using CNN. When the missing value percentage ranges from 5–20 percent, the KNN technique performs well in comparison with the other missing value and class imbalance handling techniques.
Likewise, Figure 13.6 (a & b) shows the comparison between the FID technique and the different missing value and class imbalance handling techniques over 18 different datasets with 0, 5, 10, 15, and 20 percent missing values. The comparison is made using the G-mean and, finally, we observe that FID performs well at 0, 5, and 10 percent missing values, but with an increase in missing values from 15–20 percent, FID and ORI show poor performance in comparison with the other techniques.
F1 Score by missing value percentage (columns are the techniques):

Missing values | ORI | FID | SM | ROS | CNN | SMT | KNN | MICE | EM | Mean | Median | Knn_SM
0%  | 0.8624 | 0.70603 | 0.853 | 0.860 | 0.630 | 0.851 | … | … | … | … | … | …
5%  | 0.8212 | 0.66951 | 0.702 | 0.8129 | 0.592 | 0.7035 | 0.80496 | 0.8373 | 0.83395 | 0.8368 | 0.8324 | 0.826
10% | 0.6827 | 0.65720 | 0.356 | 0.6869 | 0.483 | 0.3583 | 0.80707 | 0.8388 | 0.83915 | 0.8419 | 0.8360 | 0.835
15% | 0.4080 | 0.38121 | 0.167 | 0.3623 | 0.314 | 0.1605 | 0.80495 | 0.8324 | 0.83373 | 0.8385 | 0.8366 | 0.822
20% | 0.7676 | 0.76081 | 0.830 | 0.8265 | 0.819 | 0.8276 | 0.83330 | … | … | … | … | …

Missing values | MICE_SM | EM_SM | Mean_SM | Median_SM | Knn_ROS | MICE_ROS | EM_ROS | Mean_ROS | Median_ROS | Knn_CNN | Mice_CNN | EM_CNN
0%  | … | … | … | … | … | … | … | … | … | … | … | …
5%  | 0.8271 | 0.81947 | 0.832 | 0.823 | 0.834 | 0.8387 | 0.82288 | 0.8386 | 0.82837 | 0.7970 | 0.7740 | 0.779
10% | 0.8368 | 0.83000 | 0.827 | 0.835 | 0.857 | 0.8433 | 0.82606 | 0.8382 | 0.84604 | 0.8018 | 0.7949 | 0.792
15% | 0.8311 | 0.81861 | 0.828 | 0.812 | 0.842 | 0.8355 | 0.82109 | 0.8349 | 0.83005 | 0.7679 | 0.7303 | 0.770
20% | 0.8351 | 0.82211 | 0.828 | 0.828 | 0.844 | 0.8394 | 0.83273 | 0.8420 | 0.83603 | 0.7914 | 0.7648 | 0.766
observe that FID shows poor performance compared with the rest of the missing value imputation techniques using SMOTE. In particular, at class imbalance ratios of 10–22 with 20 percent missing values, we observe that FID does not perform well compared with the rest of the missing value imputation techniques using SMOTE as the class imbalance handling technique.
Moreover, Figure 13.9 shows the comparison between FID, ORI, and the missing value imputation techniques using ROS as the class imbalance handling technique. ROS handles the imbalance in the datasets by duplicating minority class instances.
In Figure 13.9, we depict three graphs of the techniques for handling missing values under class imbalance, with AUC as the metric. From these graphs, we conclude that FID performs better in comparison with ORI, but FID performance degrades with an increase in class imbalance ratio and missing value percentage. At 20 percent missing values with different class imbalance ratios, we observe that FID shows poor performance compared with the other missing value imputation techniques using ROS as the class imbalance handling technique.
Figure 13.10 compares FID and the missing value imputation techniques that use CNN as the undersampling technique for class imbalance handling. We observe that at all class imbalance ratios with missing values up to 5 percent, FID performs better, while when the missing value ratio increases from 10–20 percent, the FID technique shows unsatisfactory results in comparison with the missing value imputation techniques using CNN.
Figure 13.10 Comparison of FID, ORI and missing value imputation techniques using CNN
with different class imbalance ratio.
Figure 13.11 Comparison of FID, ORI, and missing values imputation techniques using SMT
as class imbalance handling technique.
13.5 CONCLUSION
In this study, analysis of 18 publicly available datasets with a class imbal-
ance problem and randomly generated incomplete values with varying
percentages (such as 0 percent, 5 percent, 10 percent, 15 percent, and
20 percent) was performed using FID. The remaining techniques for hand-
ling class imbalance (SMOTE, ROS, CNN, SMT), missing value imputation
(KNN, EM, MICE, Mean, Median), and combined techniques (used for
class imbalance and missing value imputation techniques) were also used.
We conclude that the FID approach performs effectively when the fraction of
missing values is lower. However, we see that the combined techniques work
well overall as the percentage of incomplete values rises (15–20 percent).
REFERENCES
[1] L. Chen, B. Fang, Z. Shang, and Y. Tang, “Tackling class overlap and
imbalance problems in software defect prediction,” Software Quality
Journal, vol. 26, pp. 97–125, 2018.
[2] V. García, A. I. Marqués, and J. S. Sánchez, “Exploring the synergetic
effects of sample types on the performance of ensembles for credit risk
and corporate bankruptcy prediction,” Information Fusion, vol. 47,
pp. 88–101, 2019.
[3] S. Maldonado, J. López, and C. Vairetti, “An alternative SMOTE oversam-
pling strategy for high-dimensional datasets,” Applied Soft Computing,
vol. 76, pp. 380–389, 2019.
[4] C. Luo, “A comparison analysis for credit scoring using bagging ensembles,”
Expert Systems, p. e12297, 2018.
[5] C. Li and S. Liu, “A comparative study of the class imbalance problem
in Twitter spam detection,” Concurrency and Computation: Practice and
Experience, vol. 30, p. e4281, 2018.
[6] A. Puri and M. Gupta, “Review on Missing Value Imputation Techniques
in Data Mining,” 2017.
[7] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano,
“RUSBoost: A hybrid approach to alleviating class imbalance,” IEEE
Transactions on Systems, Man, and Cybernetics- Part A: Systems and
Humans, vol. 40, pp. 185–197, 2010.
[8] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE:
synthetic minority over-sampling technique,” Journal of Artificial
Intelligence Research, vol. 16, pp. 321–357, 2002.
[9] W.-C. Lin, C.-F. Tsai, Y.-H. Hu, and J.-S. Jhang, “Clustering-based
undersampling in class-imbalanced data,” Information Sciences, vol. 409,
pp. 17–26, 2017.
[10] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera, “An overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes,” Pattern Recognition, vol. 44, pp. 1761–1776, 2011.
[11] L. E. Zarate, B. M. Nogueira, T. R. Santos, and M. A. Song, “Techniques
for missing value recovering in imbalanced databases: Application in a
marketing database with massive missing data,” in 2006 IEEE International
Conference on Systems, Man and Cybernetics, 2006, pp. 2658–2664.
14.1 INTRODUCTION
The term medical imaging comprises a vast area that includes x-rays, computed tomography (CT), ultrasound scans, nuclear medicine (PET scans), and MRI. These are the key sources used as input for diagnosis and detection, especially at the preliminary stage. But image analysis is quite a time-consuming and labor-intensive process, and it also suffers from inter-observer variability. So, over the last few decades, the whole process has increasingly been automated using Artificial Intelligence (AI) techniques, which have touched almost all of these fields as part of digitalization and automation. AI offers a good store of algorithms that can quickly and accurately identify abnormalities in medical images. As the number of radiologists is small compared to the number of patients, the introduction of AI techniques in medical imaging diagnosis helps the entire healthcare system in terms of time and money.
Colangelo et al. (2019) studied the impact of using AI in healthcare; some of their findings are as follows. In Nigeria, there are 60 radiologists for 190 million people; in Japan there are 36 per 1 million; and some African countries do not have any radiologists at all. The researchers concluded that most often the AI algorithms diagnose and predict disease more efficiently and effectively than do human experts. The main hurdle in machine learning techniques is the feature extraction process from the input images. Several filtering methods are available for this task, including the Gaussian filter, mean filter, histogram equalization, median filter, Laplacian filter (Papageorgiou et al. 2000), wavelet transform, and so forth.
The Gabor filter (GF) is a linear filter used for texture analysis, edge detection, and feature extraction. It is named after Dennis Gabor, a Hungarian-British electrical engineer and physicist. It has proved its excellence in various real-life applications such as facial recognition, lung cancer detection, vehicle detection, iris recognition (Zhang et al. 2019), and finger-vein recognition.
Figure 14.1 A sinusoidal wave (a) combined with a 2D Gaussian filter (b) results in a Gabor filter (c).
In this chapter, we have used a dataset of MRI brain images that can be used to train a machine learning model to predict whether or not a tumor is present in an image. In this work, we only extract the features of the images using the Gabor filter; these features can later be used in classification or segmentation models. This study also shows the superiority of this filter over other filtering and feature extraction methods.
14.2 LITERATURE REVIEW
The Gabor filter is one of the finest filters for feature extraction and can be used in both machine learning and deep learning architectures. It helps us extract features from an image at different orientations and angles, based on the content of the image and the user's requirements.
The history of AI-based automatic analysis of medical images for the
purpose of assessing health and predicting risk begins at the University
of Hawai‘i Cancer Center, the world’s first AI Precision Health Institute
(Colangelo et al. 2019). An optimized moving-vehicle detection model has been proposed by Sun et al. (2005), in which an Evolutionary Gabor Filter Optimization (EGFO) approach is used for the feature extraction task. A Genetic Algorithm (GA) is integrated with this approach, along with incremental clustering methods, so that the system can filter out the features that are specific to vehicle detection. The features obtained through this method are used as input to an SVM for further processing. A stochastic computation-based Gabor filter has been used in a digital circuit implementation, achieving a 78 percent area reduction compared to the conventional Gabor filter (Onizawa et al. 2015). With stochastic computation, data can be treated as streams of random bits. To achieve high speed, the area required for the hardware implementation of digital circuits is usually very large, but the proposed method has an area-efficient implementation as it is a combination of Gaussian and sine functions.
The Gabor filter has been used in biometric devices for human identification through iris recognition (Minhas et al. 2009). As the iris contains several discriminating features that uniquely identify a human being, it is a good choice for authenticating a person. Two methods have been used here: one collects global features from the entire image using a 2D Gabor filter, and the other applies a multi-channel Gabor filter locally on different patches of the image. The feature vector obtained in this way has shown 99.16 percent accuracy, and it correlates well with the output obtained through the Hamming distance, a metric for the similarity between two strings of the same length. Biometric systems also make use of the finger vein for
the authentication process. Khellat-Kihel et al. (2014) have presented an SVM-based classification model in which the feature extraction is performed by the Gabor filter. Here, two types of pre-processing were done: one with a median filter plus histogram equalization, and the second with the Gabor filter; of these, the second approach performs better. This work outperformed the method presented by Kuan-Quan Wang et al. (2012), in which classification is done by the same SVM but using the Gaussian filter and Local Binary Pattern Variance for feature extraction. A combination of GF and the Watershed segmentation algorithm has been used for lung cancer diagnosis (Avinash et al. 2016); it makes detection and early-stage diagnosis of the lung nodules easier.
A face recognition system based on GF and a sparse auto-encoder has been introduced by Rabah Hammouche et al. on seven existing face databases, and it outperforms other existing systems (Hammouche et al. 2022). The feature extraction is enriched using GF and utilized by the auto-encoder, and the system has been tested with different publicly available databases, namely JAFFE, AT&T, Yale, Georgia Tech, CASIA, Extended Yale, and Essex. The Gabor filter has also been used in the destriping of hyperspectral images, where stripes are created by errors of push-broom imaging devices (Zhang et al. 2021); there can be vertical, horizontal, and oblique stripes, which affect the quality of the image. The filter has also become an important aspect of the diagnosis and classification of the most pandemic disease of this century, COVID-19 (Barshooi et al. 2022). Here the authors used a chest x-ray dataset; data scarcity was resolved by combining a data augmentation technique with a GAN, and deeper feature extraction was performed using the Gabor filter, Sobel, and the Laplacian of Gaussian, of which the first shows the best accuracy. The filter has also been used along with a genetic algorithm for facial expression recognition (Boughida et al. 2022). The bottleneck of input data size in deep learning methods has been reduced by using the 3D Gabor filter as a feature extractor by Hassan Ghassemian et al. (2021).
14.3 RESEARCH METHODOLOGY
The way this filter performs image analysis has many similarities with the human visual system: some authors have described its resemblance to the simple cells in the visual cortex of mammalian brains (Marčelja 1980).
In this particular case we are considering only the 2D Gabor filter. A 2D Gabor filter is a function g(x, y; σ, θ, λ, γ, φ), which can be expressed (Wikipedia 2017) as

$$ g(x, y; \sigma, \theta, \lambda, \gamma, \varphi) = \exp\!\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \cos\!\left(2\pi\frac{x'}{\lambda} + \varphi\right), $$

with $x' = x\cos\theta + y\sin\theta$ and $y' = -x\sin\theta + y\cos\theta$.

Explanation of the terms: σ is the standard deviation of the Gaussian envelope; θ is the orientation of the filter; λ is the wavelength of the sinusoidal factor; γ is the spatial aspect ratio; and φ is the phase offset.
Figure 14.2 (a) The original image; (b) the image obtained after applying the Gabor filter; and (c) the kernel or filter applied to the image. The combination of values for this case is x, y = 20, sigma = 1, theta = 1*np.pi/4, lamda = 1*np.pi/4, gamma = 1, and phi = 0.9.
Figure 14.3 Workflow of the proposed work. A filter bank with four kernels is created; the image passed through them is filtered and shown as the filtered images.
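A possible OpenCV rendering of such a four-kernel filter bank is sketched below; the parameter values echo the Figure 14.2 example and the image path is a placeholder.

```python
# Build a bank of four Gabor kernels at different orientations and apply them.
import numpy as np
import cv2

img = cv2.imread("brain_mri.png", cv2.IMREAD_GRAYSCALE)

kernels, filtered = [], []
for theta in np.arange(0, np.pi, np.pi / 4):       # 0, 45, 90, 135 degrees
    k = cv2.getGaborKernel(ksize=(20, 20), sigma=1.0, theta=theta,
                           lambd=np.pi / 4, gamma=1.0, psi=0.9)
    kernels.append(k)
    filtered.append(cv2.filter2D(img, -1, k))      # one filtered image per kernel
```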
14.4 RESULTS
In our study, we collected an MRI dataset of brain images from a public repository, applied different filters, and compared the results. The images from the dataset were tested with different filters, namely Gaussian, median, Sobel, and Gabor, to compare their efficiency. The outputs obtained from them are displayed in Figures 14.3 and 14.4. The features extracted by these filters were used to predict whether or not a brain tumor is present in an image. We used a Random Forest classifier for the classification task, and the model that uses the Gabor filter as the feature extractor shows the best accuracy.
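A hedged sketch of this comparison step, reusing a kernel bank like the one above and summarizing each filter response into simple statistics before the Random Forest, might look as follows; the image arrays and labels are assumed to be available.

```python
# Gabor responses -> feature vectors -> Random Forest tumor classifier.
import numpy as np
import cv2
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def gabor_features(images, kernels):
    """Mean and variance of every kernel response, per image."""
    feats = []
    for img in images:
        row = []
        for k in kernels:
            resp = cv2.filter2D(img, -1, k)
            row += [resp.mean(), resp.var()]
        feats.append(row)
    return np.array(feats)

# images: list of grayscale MRI arrays; labels: 1 = tumor, 0 = no tumor
# X = gabor_features(images, kernels)
# X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
# clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
# print(accuracy_score(y_te, clf.predict(X_te)))
```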
14.5 DISCUSSION
In our study, we have taken MRI images of the brain to predict whether an image is tumorous or not. For classification purposes, we selected the Random Forest classifier, which has proven to be good for medical image classification. When we did the classification without applying any filters, it showed an accuracy of only 73 percent, but when we applied filters to the images, it showed better results. We tested the dataset with different filtering algorithms, namely Gaussian, median, Gabor, and Sobel. Among them, the Gabor filter performed best.
Figure 14.4 Output of (a) Gabor (b) Gaussian with sigma=1 (c) median filter (d) Sobel
edge detection algorithm.
14.6 CONCLUSION
Feature extraction is the most important phase in any machine learning or deep learning technique. We made a comparative study of different filters (Gaussian, median, Sobel, and Gabor) to identify the best one, and we found that the Gabor filter performs best on our dataset. Like every deep learning task, brain tumor classification suffers from data scarcity and the need for heavy computational power. We developed Random Forest models using the different filters to classify the MRI brain images, and the Gabor filter extracted the features best. This work can be repeated with a sounder dataset and can be extended to other healthcare classification and segmentation tasks. The filter bank created in this work can be customized based on the application in which these features are used.
REFERENCES
Shah, Anuj. Through the Eyes of Gabor Filter. https://medium.com/@anuj_shah/through-the-eyes-of-gabor-filter-17d1fdb3ac97, accessed June 17, 2018.
Avinash, S., Manjunath, K., & Kumar, S. S. (2016, August). An improved image pro-
cessing analysis for the detection of lung cancer using Gabor filters and water-
shed segmentation technique. In 2016 International Conference on Inventive
Computation Technologies (ICICT) (Vol. 3, pp. 1–6). IEEE.
Barshooi, A. H., & Amirkhani, A. (2022). A novel data augmentation based on
Gabor filter and convolutional deep learning for improving the classification
of COVID-19 chest X-Ray images. Biomedical Signal Processing and Control,
72, 103326.
Boughida, A., Kouahla, M. N., & Lafifi, Y. (2022). A novel approach for facial
expression recognition based on Gabor filters and genetic algorithm. Evolving
Systems, 13(2), 331–345.
Ghassemi, M., Ghassemian, H., & Imani, M. (2021). Hyperspectral image clas-
sification by optimizing convolutional neural networks based on information
theory and 3D-Gabor filters. International Journal of Remote Sensing, 42(11),
4380–4410.
Hammouche, R., Attia, A., Akhrouf, S., & Akhtar, Z. (2022). Gabor filter bank
with deep auto encoder based face recognition system. Expert Systems with
Applications, 116743.
Khellat-Kihel, S., Cardoso, N., Monteiro, J., & Benyettou, M. (2014, November).
Finger vein recognition using Gabor filter and support vector machine. In
International image processing, applications and systems conference (pp. 1–
6). IEEE.
Kuan-Quan, W., S. Krisa, Xiang-Qian Wu, and Qui-Sm Zhao. “Finger vein rec-
ognition using LBP variance with global matching.” Proceedings of the
International Conference on Wavelet Analysis and Pattern Recognition, Xian,
15–17 July, 2012.
Marčelja, S. (1980). “Mathematical description of the responses of simple cortical
cells”. Journal of the Optical Society of America. 70 (11): 1297–1300.
Margaretta Colangelo & Dmitry Kaminskiy (2019) AI in medical imaging may make
the biggest impact in healthcare. Health Management. Vol. 19 –Issue 2, 2019.
Minhas, S., & Javed, M. Y. (2009, October). Iris feature extraction using Gabor
filter. In 2009 International Conference on Emerging Technologies (pp. 252–
255). IEEE.
Onizawa, N., Katagiri, D., Matsumiya, K., Gross, W. J., & Hanyu, T. (2015). Gabor
filter based on stochastic computation. IEEE Signal Processing Letters, 22(9),
1224–1228.
Papageorgiou, C. & T. Poggio, “A trainable system for object detection,” Int.
J. Comput. Vis., vol. 38, no. 1, pp. 15–33, 2000.
Sun, Z., Bebis, G., & Miller, R. (2005). On-road vehicle detection using evolutionary
Gabor filter optimization. IEEE Transactions on Intelligent Transportation
Systems, 6(2), 125–137.
Wikipedia, the Free Encyclopedia, https://en.wikipedia.org/wiki/Gabor_filter,
Accessed February 2017
Zhang, B., Aziz, Y., Wang, Z., Zhuang, L., Ng, M. K., & Gao, L. (2021).
Hyperspectral Image Stripe Detection and Correction Using Gabor Filters
and Subspace Representation. IEEE Geoscience and Remote Sensing Letters,
19, 1–5.
Zhang, Y., Li, W., Zhang, L., Ning, X., Sun, L., & Lu, Y. (2019). Adaptive learning
Gabor filter for finger-vein recognition. IEEE Access, 7, 159821–159830.
Chapter 15
15.1 INTRODUCTION
Images are an effective and efficient medium for presenting visual data, and in present-day technology the major part of information is images. With the rapid growth of multimedia information such as audio-visual data, it becomes mandatory to organize it in an efficient manner so that it can be obtained effortlessly and quickly. Image indexing, searching, and retrieval, in other words content based image retrieval (CBIR), has become an active research area [1, 2, 3] in both industry and academia. Accurate image retrieval is achieved by classifying image features appropriately. In this chapter we propose a classifier based on a statistical approach, applied to features extracted by Zernike Moments, for selecting appropriate features of images.
Image indexing is usually done by low level visual features such as texture, color, and shape. Texture may consist of basic primitives that describe the structural arrangement of a region and its relationship with surrounding regions [4]. However, texture features hardly provide semantic information. Color is another low level feature for describing images, and it is invariant to image size and orientation. Color histograms, color correlograms, and dominant color descriptors are used in CBIR; among them, color histograms are the most commonly used. However, the color feature does not include spatial information [5]. Another visual feature, shape, is related to a specific object in an image. Therefore, the shape feature provides more semantic information than color and texture [6], and shape based retrieval of similar images has been studied extensively [6, 7, 8, 9, 10]. In this chapter we pursue a shape based image retrieval system.
Zernike Moments (ZMs) were introduced by Teague [11] and are used as a shape descriptor for similarity based image retrieval applications. ZMs are excellent in image reconstruction [12, 13] and feature representation [13, 14], and have low noise sensitivity [15]. ZMs are widely used as shape descriptors in various image retrieval applications [17, 18, 19, 20, 21, 22, 23]. ZMs provide appropriate feature extraction from images; however, a suitable classifier is needed, by which only effective features can be selected and non-effective features discarded from the feature set. Considerable research has been done on pattern classification using neural networks [24, 25, 26, 27, 28] and support vector machines [29, 30]. These techniques have their various properties, advantages, and disadvantages. In order to detect and classify images from large and complex databases, we need to select only significant features rather than incorporating all the extracted attributes. In this respect we propose a data dependent classification technique that opts for only those features that describe the image more precisely, while eradicating less significant features. It is a statistical approach that analyzes all the images of the database, emphasizing only the ZM coefficients with more discriminative power. The discriminative coefficients (DC) have small within-class variability and large between-class variability. To evaluate the performance of our approach we performed experiments on the most frequently used MPEG-7 CE Shape 1 Part-B image database for a subject test and a rotation test, and observed that ZMs with the proposed classifier outperform the traditional approach, increasing the retrieval and recognition rates by more than 3 percent. The rest of the chapter is organized as follows: Section 2 elaborates the Zernike Moments shape descriptor; Section 3 describes the proposed discriminative feature selection classifier; Section 4 provides the similarity measurement; detailed experiments are given in Section 5; and Section 6 contains discussions and conclusions.
15.2 ZERNIKE MOMENTS SHAPE DESCRIPTOR

The ZMs of order p with repetition q of an image function f(r, θ) over the unit disc are defined as

$$ Z_{pq} = \frac{p+1}{\pi} \int_0^{2\pi} \int_0^1 f(r, \theta)\, V_{pq}^{*}(r, \theta)\, r\, dr\, d\theta, \tag{15.1} $$

where the Zernike polynomial $V_{pq}(r, \theta)$ is

$$ V_{pq}(r, \theta) = R_{pq}(r)\, e^{jq\theta}, \tag{15.2} $$

and the radial polynomial $R_{pq}(r)$ is given by

$$ R_{pq}(r) = \sum_{k=0}^{(p-|q|)/2} (-1)^{k} \frac{(p-k)!}{k!\left(\frac{p+|q|}{2}-k\right)!\left(\frac{p-|q|}{2}-k\right)!}\, r^{p-2k}. \tag{15.3} $$
To make ZMs translation and scale invariant, the discrete image function is mapped onto the unit disc. The set of Zernike polynomials needs to be approximated by sampling at fixed intervals when it is applied to a discrete image space [31, 32, 33]. For an N × N discrete image, the Cartesian equivalent of Equation 15.1 is given as
$$ Z_{pq} = \frac{p+1}{\pi} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} f(x_i, y_j)\, V_{pq}^{*}(x_i, y_j)\, \Delta x_i\, \Delta y_j, \qquad x_i^2 + y_j^2 \le 1, \tag{15.4} $$

with

$$ x_i = \frac{2i+1-N}{D}, \qquad y_j = \frac{2j+1-N}{D}, \qquad i, j = 0, 1, 2, \ldots, N-1, \tag{15.5} $$

$$ \Delta x_i = \Delta y_j = \frac{2}{D}. \tag{15.7} $$
15.2.2 Orthogonality
ZMs are orthogonal and their orthogonal property makes image reconstruc-
tion or inverse transform process easier due to the individual contribution of
each order moment to the reconstruction process. The orthogonal proper-
ties of Zernike polynomials and radial polynomials are given by Equations
15.8 and 15.9 respectively.
$$ \int_0^{2\pi} \int_0^1 V_{pq}(r, \theta)\, V_{p'q'}^{*}(r, \theta)\, r\, dr\, d\theta = \frac{\pi}{p+1}\, \delta_{pp'}\, \delta_{qq'}, \tag{15.8} $$

$$ \int_0^1 R_{pq}(r)\, R_{p'q}(r)\, r\, dr = \frac{1}{2(p+1)}\, \delta_{pp'}. \tag{15.9} $$
15.2.3 Rotation invariance
The set of ZMs inherently possess a rotation invariance property. The mag-
nitude values of ZMs remain similar before and after rotation. Therefore the
magnitude values of ZMs are rotation invariant. ZMs of an image rotated
by an angle ϕ are defined as
$$ Z'_{pq} = Z_{pq}\, e^{-jq\phi}, \tag{15.10} $$

where $Z'_{pq}$ are the ZMs of the rotated image and $Z_{pq}$ are the ZMs of the original image. The rotation invariant ZMs are extracted by considering only the magnitude values:

$$ |Z'_{pq}| = |Z_{pq}\, e^{-jq\phi}|, \tag{15.11} $$

$$ |e^{-jq\phi}| = |\cos(q\phi) - j\sin(q\phi)| = 1, \tag{15.12} $$

$$ |Z'_{pq}| = |Z_{pq}|. \tag{15.13} $$

As $Z_{pq}^{*} = Z_{p,-q}$ and $|Z_{pq}| = |Z_{p,-q}|$, only the magnitudes of ZMs with $q \ge 0$ are considered [33].
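The invariance of the magnitudes can be checked numerically, for example with the mahotas implementation of Zernike moments, as in this small sketch on a synthetic shape.

```python
# ZM magnitudes of a shape and its rotated copy should (nearly) coincide.
import numpy as np
import mahotas
from scipy import ndimage

yy, xx = np.mgrid[:96, :96]
img = ((np.abs(xx - 48) < 30) & (np.abs(yy - 48) < 12)).astype(np.uint8)  # a bar
rot = (ndimage.rotate(img.astype(float), 45, reshape=False) > 0.5).astype(np.uint8)

z1 = mahotas.features.zernike_moments(img, radius=48, degree=12)
z2 = mahotas.features.zernike_moments(rot, radius=48, degree=12)
print(np.abs(z1 - z2).max())   # small, up to interpolation error
```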
pmax = 10) with higher discrimination power for each mentioned pmax, and found that retrieval accuracy increases up to pmax = 12 and subsequently diminishes. The recognition rate at various moment orders is depicted in Figure 15.1. However, as the moment order increases, time complexity also increases. To obtain the best possible trade-off, we choose pmax = 12 and, by applying the proposed classifier, select merely 34 features with higher discrimination power, without compromising system speed.
For a database of C classes with S sample images per class, the class mean, within-class variance, overall mean, between-class variance, and discrimination power of each ZM coefficient are computed as:

$$ M_{pq}^{c} = \frac{1}{S} \sum_{s=1}^{S} Z_{pq}(s, c), \qquad c = 1, 2, \ldots, C \tag{15.15} $$

$$ \mathrm{Var}_{pq}^{c} = \sum_{s=1}^{S} \left( Z_{pq}(s, c) - M_{pq}^{c} \right)^{2}, \qquad c = 1, 2, \ldots, C \tag{15.16} $$

$$ \mathrm{Var}_{pq}^{w} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{Var}_{pq}^{c} \tag{15.17} $$

$$ M_{pq} = \frac{1}{S \times C} \sum_{c=1}^{C} \sum_{s=1}^{S} Z_{pq}(s, c) \tag{15.18} $$

$$ \mathrm{Var}_{pq}^{B} = \sum_{c=1}^{C} \sum_{s=1}^{S} \left( Z_{pq}(s, c) - M_{pq} \right)^{2} \tag{15.19} $$

$$ D(p, q) = \frac{\mathrm{Var}_{pq}^{B}}{\mathrm{Var}_{pq}^{w}}, \qquad p \ge 0, \; 0 \le q \le p, \; p - q \text{ even} \tag{15.20} $$
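In code, the selection defined by Equations 15.15–15.20 reduces to a per-coefficient variance ratio, as in the following numpy sketch; the moment matrix and labels are assumed inputs.

```python
# Rank ZM coefficients by between-class / within-class variance (Eq. 15.20).
import numpy as np

def discrimination_power(Z, labels):
    """Z: (n_samples, n_coeffs) matrix of ZM features; labels: class ids."""
    classes = np.unique(labels)
    within = np.zeros(Z.shape[1])
    for c in classes:
        Zc = Z[labels == c]
        within += ((Zc - Zc.mean(axis=0)) ** 2).sum(axis=0)   # Eq. 15.16
    within /= len(classes)                                    # Eq. 15.17
    between = ((Z - Z.mean(axis=0)) ** 2).sum(axis=0)         # Eq. 15.19
    return between / within                                   # Eq. 15.20

# selected = np.argsort(discrimination_power(Z, labels))[::-1][:34]
```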
15.4 SIMILARITY MEASURE
To compute the similarity of test image to the archived images a suitable
similarity metric is required. The database image with the smallest distance
to the test image is termed as the most similar image. We apply Euclidean
distance similarity measure to evaluate the resemblance of test and training
images, given as:
$$ d(T, D) = \sqrt{ \sum_{k=0}^{M-1} \left( f_k(T) - f_k(D) \right)^{2} }, \tag{15.21} $$
where $f_k(T)$ and $f_k(D)$ represent the kth feature of the test image and the database image, respectively, and M represents the total number of features to be compared.
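In implementation terms, Equation 15.21 is a one-liner, as sketched below.

```python
# Rank database images by Euclidean distance to the query (Eq. 15.21).
import numpy as np

def retrieve(query_feat, db_feats, top_k=10):
    d = np.sqrt(((db_feats - query_feat) ** 2).sum(axis=1))
    return np.argsort(d)[:top_k]        # indices of the most similar images
```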
15.5 EXPERIMENTAL STUDY
A detailed experimental analysis is performed to evaluate the performance
of the proposed approach to the image retrieval system. The comparison
is carried out among three techniques: the proposed approach, traditional
Zernike Moments descriptor without classification, and wavelet moments
(WM) [35]. Experiments are executed on an Intel Pentium core 2 duo 2.10
GHz processor with 3 GB RAM. Algorithms are implemented in VC++9.0.
15.5.1 Experiment setup
(a) Subject database: MPEG-7 CE Shape 1 Part-B is a standard image
set which contains 1,400 images of 70 classes with 20 samples in
each class. Two images from each class are arbitrarily chosen for
a test set and the rest of the images are located in the train set.
Therefore 140 images are used as queries. All images are resized
to 96 × 96 pixels. Figure 15.3 refers to sample images from
MPEG-7 subject database.
(b) Rotation database: One image from each of the 70 classes of the MPEG-7 CE Shape 1 Part-B database is selected and rotated at angles of 0°, 15°, 30°, 45°, 60°, 75°, 90°, 105°, 120°, 135°, 150°, 165°, and 180°, thereby creating 13 samples of each class. The database contains 910 images: 70 classes of 13 samples each. Sample images from the rotation database are depicted in Figure 15.4. Two images from each class are chosen arbitrarily as query images; thus the test set contains 140 query images.
Retrieval performance is measured using precision (P) and recall (R):

$$ P = \frac{n_q}{N_q} \times 100, \qquad R = \frac{n_q}{D_q} \times 100, \tag{15.22} $$

where $n_q$ is the number of similar images retrieved from the database, $N_q$ is the total number of images retrieved, and $D_q$ is the number of images in the database similar to the query image q.
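Equation 15.22 translates directly into code:

```python
# Precision and recall for one query (Eq. 15.22), in percent.
def precision_recall(n_q, N_q, D_q):
    """n_q: similar images retrieved; N_q: images retrieved; D_q: similar in DB."""
    return 100.0 * n_q / N_q, 100.0 * n_q / D_q
```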
15.5.3 Experiment results
Initially we have analyzed the performance of proposed classifier on sub-
ject database by querying 140 images from the test set at pmax = 12 and
selecting 34 higher discriminating coefficients. Table 15.2 presents selected
coefficients of ZMs with higher discriminating power. The traditional
Zernike Moments descriptor at pmax = 10 (ZMD10) has 34 features without
classification, Wavelet moments with 126 features are used. Their perform-
ance is measured through the P-R graph shown in Figure 15.5, which signifies that the retrieval rate of the proposed ZM is higher than that of the traditional ZMD10, followed by wavelet moments. The retrieval accuracies of the proposed ZM, ZMD10, and wavelet moments are 54.21 percent, 51.07 percent, and 48.79 percent, respectively. Thus it is evident that the retrieval rate of the proposed ZM increased by more than 3 percent over the traditional ZMD10 using the same number of features. Average precision and recall for the top 40 retrievals are depicted in Figure 15.6(a) and Figure 15.6(b), which show a similar trend in accuracy. Top 10 retrieval results corresponding to a
Table 15.2 Features with higher discrimination power for pmax = 12 (subject database)
Figure 15.5 Precision and recall performance on MPEG subject database for Proposed ZM,
ZMD10, and WM.
query image for the three employed methods are shown in Figure 15.7, along with the number of images retrieved by each method. Wavelet moments show the poorest performance, retrieving merely two similar images. Comparing the retrieval performance of the proposed ZM and ZMD10, we see that the proposed ZM retrieves 3 consecutive similar images, whereas in the case of ZMD10 and WM a variation occurs at the 3rd image.
Another set of experiments is performed on the rotation database, in which 140 images are passed as queries to the proposed method, the traditional ZMD10, and WM. The selected coefficients with higher discrimination power for pmax = 12 are presented in Table 15.3. The P-R graph representing the behavior of the applied methods is shown in Figure 15.8, which gives evidence that the proposed ZM, the traditional ZMD10, and WM all give the best performance, being rotation invariant, so their graphs overlap. Average precision and average recall for the rotation database are given in Figure 15.9(a) and Figure 15.9(b), respectively. The top 10 retrievals on the rotation database are displayed in Figure 15.10, which demonstrates that the proposed ZM, ZMD10, and WM each retrieve 10 similar images from the database.
Figure 15.6 Top 40 retrievals from subject database (a) Average precision (b) Average recall.
Table 15.3 Features with higher discrimination power for pmax = 12 (rotation database)
Figure 15.7 (a) Query image (b) Top 10 retrieval results by proposed ZM, ZMD10, and WM
(subject database).
Figure 15.8 Precision and recall performance on the MPEG rotation database for proposed ZM, ZMD10, and WM.
Figure 15.9 Top 30 retrievals from rotation database (a) Average precision (b) Average
recall.
Figure 15.10 (a) Query Image (b) Top 10 retrieval results by proposed ZM, ZMD10 and
WM (rotation database).
REFERENCES
[1] Li, X., Yang, J., Ma, J. (2021). Recent developments of content-based
image retrieval (CBIR). Neurocomputing 452, 675–689.
[2] Zhu, X., Wang, H., Liu P., Yang, Z., Qian, J. (2021). Graph-based reasoning
attention pooling with curriculum design for content-based image retrieval.
Image and Vision Computing, 115, 104289.
[3] Surendranadh, J., Srinivasa Rao, Ch. (2020). Exponential Fourier
Moment- Based CBIR System: A Comparative Study. Microelectronics,
Electromagnetics and Telecommunications, 757–767.
[4] Alaeia, F., Alaei, A., Pal, U., Blumensteind, M. (2019). A comparative
study of different texture features for document image retrieval. Expert
Systems with Applications, 121, 97–114.
[5] Datta, R., Joshi, D., Li, J., Wang, J. Z. (2008). Image retrieval: ideas,
influences, and trends of the new age. ACM Computing Surveys, 40 (2),
1–60.
[6] Hu, N., An-An, H., Liu, et al. (2022). Collaborative Distribution Alignment
for 2D image-based 3D shape retrieval. Journal of Visual Communication
and Image Representation, 83, 103426.
[7] Li, H., Su, Z., Li, N., Liu, X. (2020). Non-rigid 3D shape retrieval based on
multi-scale graphical image and joint Bayesian. Computer Aided Geometric
Design, 81, 101910.
[8] Iqbal, K., Odetayo, M., O., James, A. (2002). Content- based image
retrieval approach for biometric security using colour, texture and shape
features controlled by fuzzy heuristics. Journal of Computer and System
Sciences, 35 (1), 55–67.
[9] Wang, Y. (2003). Image indexing and similarity retrieval based on spatial
relationship model. Information Sciences, 154 (1–2), 39–58.
[10] Zhou, X., Huang, T. (2002). Relevance feedback in content-based image
retrieval: some recent advances. Information Sciences, 148 (1–4), 129–137.
[11] Teague M. R. (1980). Image analysis via the general theory of moments.
Journal of Optical Society of America. 70, 920–930.
[12] Pawlak, M. (1992). On the reconstruction aspect of moment descriptors.
IEEE Transactions on Information Theory, 38 (6), 1698–1708.
[13] Liao, S.X., Pawlak, M. (1996). On image analysis by moments. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 18 (3),
254–266.
[14] Belkasim, S.,O., Shridhar, M., Ahmadi, M. (1991). Pattern recognition
with moment invariants: a comparative study and new results. Pattern
Recognition, 24 (12), 1117–1138.
[15] Mukundan R., and Ramakrishnan, K.R. (1998). Moment Functions in
Image Analysis. World Scientific Publishing, Singapore.
[16] Teh, C., H., Chin, R.,T. (1988). On image analysis by the methods of
moments, IEEE Transactions on Pattern Analysis and Machine Intelligence.
10 (4), 496–512.
[17] Li, S., Lee, M. C., Pun, C. M. (2009). Complex Zernike moments features for shape-based image retrieval. IEEE Transactions on Systems, Man, and Cybernetics. 39 (1), 227–237.
[18] Kim, H. K., Kim, J. D., Sim, D. G., Oh, D. (2000). A modified Zernike moment based shape descriptor invariant to translation, rotation and scale for similarity based image retrieval. IEEE Int. Conf. on Multimedia and Expo, 1307–1310.
[19] Kumar, Y., Aggarwal, A., Tiwari, S., Singh, K. (2018). An efficient and
robust approach for biomedical image retrieval using Zernike moments.
Biomedical Signal Processing and Control. 39, 459–473.
[20] An-Wen, D., Chih-Ying, G. (2018). Efficient computations for generalized
Zernike moments and image recovery. Applied Mathematics and
Computation. 339. 308–322.
[21] Vargas-Varga, H., José Sáez-Landete, et al. (2022). Validation of solid mechanics models using modern computation techniques of Zernike moments. Mechanical Systems and Signal Processing. 173, 109019.
[22] Kim, H., Kim, J. (2000). Region-based shape descriptor invariant to rotation, scale and translation. Signal Processing: Image Communication. 16, 87–93.
[23] Wei, C. –H., Li, Y., Chau W. –Y., Li, C. –T. (2009). Trademark image
retrieval using synthetic features for describing global shape and interior
structure. Pattern Recognition. 42, 386–394.
[24] Su, Z., Zhang H., Li, S., Ma, S. (2003). Relevance feedback in content
based image retrieval: Bayesian framework, feature subspaces, and pro-
gressive learning. IEEE Transactions on Image Processing. 12(8), 924–937.
[25] Park, S.-S., Seo, K.-K., and Jang, D.-S. (2005). Expert system based on artificial neural networks for content-based image retrieval. Expert Systems with Applications. 29(3), 589–597.
[26] Pakkanen, J., Iivarinen, J., Oja, E. (2004). The evolving tree – a novel self-organizing network for data analysis. Neural Processing Letters. 20(3), 199–211.
[27] Koskela, M., Laaksonen, J., Oja, E. (2004). Use of image subset features in
image retrieval with self-organizing maps. LNCS. 3115, 508–516.
[28] Fournier, J., Cord, M., and Philipp-Foliguet, S. (2001). Back-propagation
algorithm for relevance feedback in image retrieval. In IEEE International
conference in image processing (ICIP’01) 1, 686–689.
[29] Kumar, M. A., Gopal, M. (2009). Least squares twin support vector machines for pattern classification. Expert Systems with Applications. 36, 7535–7543.
[30] Seo, K. –K. (2007). An application of one class support vector machines
in content-based image retrieval. Expert systems with applications. 33,
491–498.
[31] Wee, C. Y., Paramseran, R. (2007). On the computational aspects of
Zernike moments. Image and Vision Computing. 25, 967–980.
[32] Xin, Y., Pawlak, M., Liao, S. (2007). Accurate calculation of moments in
polar co-ordinates. IEEE Transactions on Image Processing. 16, 581–587.
[33] Singh C., Walia, E. (2009). Computation of Zernike Moments in improved
polar configuration, IET Journal of Image Processing. 3, 217–227.
[34] Khotanzad, A., Hong, Y. H. (1990). Invariant image recognition by
Zernike moments. IEEE Transactions on Pattern Analysis and Machine
Intelligence. 12 (5), 489–497.
[35] Shen D., Ip, H. H. S. (1999). Discriminative wavelet shape descriptors for
recognition of 2-D patterns. Pattern Recognition. 32, 151–165.
Chapter 16
16.1 INTRODUCTION
Digital images are a convenient medium for describing information contained
in a variety of domains, such as medical images in medical diagnosis, architectural designs, trademark logos, fingerprints, military systems, geographical images, satellite/aerial images in remote sensing, and so forth. A typical
database may consist of hundreds of thousands of images. Therefore, an
efficient and automatic approach is required for indexing and retrieving
images from large databases. Traditionally, image annotations and labeling
with keywords heavily rely on manual labor. The keywords are inherently
subjective and not unique. As the size of the image database grows, the use
of keywords becomes cumbersome and inadequate to represent the image
content [1,2]. Hence, content based image retrieval (CBIR) has drawn sub-
stantial attention during the last decade. CBIR usually indexes images with
low level visual features such as color, texture and shape. The extraction
of good visual features, which compactly represent the image, is one of the
important tasks in CBIR. A color histogram is the most widely used color
descriptor in CBIR; while colors are easy to compute, they represent large
feature vectors that are difficult to index and have high search and retrieval
costs [3]. Texture features do not provide semantic information [4]. Shape is
considered a very important visual feature in object recognition and retrieval
system, since shape features are associated with a particular object in an
image [5,6]. A good shape representation should be compact and retain
the essential characteristics of the image. Moreover, invariance to rotation,
scale, and translation is required because such transforms are consistent
with human perception. A good method should also deal with photometric
transformations such as noise, blur, distortion, partial occlusion, JPEG com-
pression, and so forth.
Various shape representations and description techniques have been
proposed during the last decade [7]. In shape description, features are gen-
erally classified into two types: the region based descriptors and the con-
tour based descriptors. In region based descriptors, features are extracted
from the interior of the shape and represent the global aspect of the image.
The region based descriptors include geometric moments [15], moment
invariants (MI) [16], a generic Fourier descriptor (GFD) [17], Zernike
Moments descriptors (ZMD) [18]. In the contour based descriptors,
features are extracted from the shape boundary points only. The contour
based descriptors include Fourier descriptors (FD) [8], curvature scale space
[9], contour flexibility [10], shape context [11], histograms of centroid dis-
tance [12], contour point distribution histograms (CPDH) [13], histograms
of spatially distributed points and the angular radial transform [33], Weber's
local descriptors (WLD) [14], and so forth. The region and contour based
methods are complementary to each other, as one provides the global
characteristics while the other captures local changes in an image.
Therefore, we exploit both local and global features of images to propose
a novel and improved approach to an effective image retrieval system.
Teague [19] introduced the notion of orthogonal moments to recover the
image from moments based on the theory of orthogonal polynomials, using
Zernike Moments (ZMs), which are capable of reconstructing an image
and exhibit minimum information redundancy. The magnitudes of ZMs
have been used as global features in many applications [20–30] owing to
their rotation invariance. Since ZMs are inherently complex, the real and
imaginary coefficients possess significant image representation and
description capability. The phase coefficients are considered very effective
during signal reconstruction, as demonstrated by [28–31]. However, the
phase coefficients are not rotationally invariant, which is illustrated as
follows:
Let $Z_{pq}$ and $Z_{pq}^{r}$ be the ZMs of the original and rotated images, respectively, with order $p$ and repetition $q$; then the two moments are related by:

$$Z_{pq}^{r} = Z_{pq}\, e^{-jq\theta} \qquad (16.1)$$

or

$$\psi_{pq}^{r} = \psi_{pq} - q\theta \qquad (16.2)$$

$$q\theta = \psi_{pq} - \psi_{pq}^{r} \qquad (16.3)$$

where $\psi_{pq}$ and $\psi_{pq}^{r}$ are the phase coefficients of the original and rotated images, respectively. Therefore, in our approach we use the relationship given by Equation 16.3 to compute $q\theta$ from the phase coefficients of the original and rotated images.
16.2 PROPOSED DESCRIPTORS
In this section, we first introduce the invariant method for global feature extraction using ZMs, and later we propose histograms of centroid distances to linear edges, computed via the Hough transform (HT), as local features.
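The HT-based local features follow [38]; as a rough illustrative sketch of the idea (our reading, not the authors' exact procedure, and the helper name is ours), line segments can be detected with OpenCV's probabilistic Hough transform and a 10-bin normalized histogram built from centroid-to-segment distances, matching the H = 10 features used in Section 16.3:

```python
import cv2
import numpy as np

def centroid_distance_histogram(gray: np.ndarray, bins: int = 10) -> np.ndarray:
    """Normalized histogram of distances from the image centroid to
    detected line segments (illustrative HT-based local features)."""
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=40,
                            minLineLength=20, maxLineGap=5)
    if lines is None:
        return np.zeros(bins)
    m = cv2.moments(gray)                        # image centroid
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
    mids = [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in lines[:, 0]]
    d = np.array([np.hypot(mx - cx, my - cy) for mx, my in mids])
    hist, _ = np.histogram(d, bins=bins, range=(0, d.max() + 1e-9))
    return hist / hist.sum()                     # normalized histogram p_i
```

Because the distances are measured from the image centroid, the resulting histogram is translation invariant, which matches the behavior noted for the contour based features later in the chapter.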
$$Z_{pq}^{r} = Z_{pq}\, e^{-jq\theta} \qquad (16.4)$$

where $\theta$ is the angle of rotation. Since the ZMs magnitudes are rotation invariant, we have
$$\left|Z_{pq}^{r}\right| = \left|Z_{pq}\, e^{-jq\theta}\right| = \left|Z_{pq}\right| \qquad (16.5)$$

where $|Z_{pq}^{r}|$ and $|Z_{pq}|$ are the magnitudes of the rotated and original images, respectively. However, the ZMs phase coefficients are not rotation invariant and are related by:

$$\psi_{pq}^{r} = \psi_{pq} - q\theta \qquad (16.6)$$

or

$$q\theta = \psi_{pq} - \psi_{pq}^{r} \qquad (16.7)$$
$$\psi_{pq}^{r} = \tan^{-1}\frac{I(Z_{pq}^{r})}{R(Z_{pq}^{r})}, \qquad \psi_{pq} = \tan^{-1}\frac{I(Z_{pq})}{R(Z_{pq})} \qquad (16.8)$$

where $\psi_{pq}^{r}$ and $\psi_{pq}$ are the phase coefficients of the rotated and original images, respectively, and $R(\cdot)$ and $I(\cdot)$ are the real and imaginary coefficients of the ZMs. Using Equation 16.4, let $Z_{pq}^{c}$ be the corrected ZMs derived from the rotated version of the ZMs as follows:
$$Z_{pq}^{c} = Z_{pq}^{r}\, e^{jq\theta} \qquad (16.9)$$

$$R(Z_{pq}^{c}) + jI(Z_{pq}^{c}) = \left(R(Z_{pq}^{r}) + jI(Z_{pq}^{r})\right)\left(\cos(q\theta) + j\sin(q\theta)\right) \qquad (16.10)$$

$$R(Z_{pq}^{c}) + jI(Z_{pq}^{c}) = \left(R(Z_{pq}^{r}) + jI(Z_{pq}^{r})\right)\left(\cos(\psi_{pq} - \psi_{pq}^{r}) + j\sin(\psi_{pq} - \psi_{pq}^{r})\right) \qquad (16.11)$$

$$R(Z_{pq}^{c}) + jI(Z_{pq}^{c}) = \left(R(Z_{pq}^{r}) + jI(Z_{pq}^{r})\right)\left(\cos\alpha + j\sin\alpha\right) \qquad (16.12)$$

where $\alpha = \psi_{pq} - \psi_{pq}^{r}$, or

$$R(Z_{pq}^{c}) = R(Z_{pq}^{r})\cos\alpha - I(Z_{pq}^{r})\sin\alpha, \qquad I(Z_{pq}^{c}) = R(Z_{pq}^{r})\sin\alpha + I(Z_{pq}^{r})\cos\alpha \qquad (16.13)$$
If the two images are similar but rotated by an angle $\theta$, and the ZMs of the rotated image are corrected according to Equation 16.13, then we have

$$R(Z_{pq}) = R(Z_{pq}^{c}) \quad \text{and} \quad I(Z_{pq}) = I(Z_{pq}^{c}) \qquad (16.14)$$
From Equation 16.14 we obtain two corrected invariant real and
imaginary coefficients of ZMs, which we use as global features for images in
the proposed system. The real and imaginary coefficients are corrected by the
phase difference between the original and rotated images. Thus, in the proposed
system we use the real and imaginary coefficients of ZMs individually rather
than the single feature set obtained through the ZMs magnitudes. We
consider four similar images rotated at different angles and four dissimilar
images from the MPEG-7 database, given in Figure 16.1, and evaluate the
discrimination power of the real and imaginary components. It is observed from
Figures 16.2(a) and 16.2(c) that, for similar images rotated at different angles,
no variation is perceived in the real and imaginary coefficients. On the other
hand, significant variation occurs among the real and imaginary coefficients
of dissimilar images, as depicted in Figures 16.2(b) and 16.2(d).
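To make the correction step concrete, the following is a minimal NumPy sketch of Equations 16.7 and 16.9 (the function name is ours; the complex ZM arrays are assumed to be precomputed by any standard Zernike moment routine):

```python
import numpy as np

def correct_zms(Z_orig: np.ndarray, Z_rot: np.ndarray) -> np.ndarray:
    """Correct the ZMs of a rotated image using the phase difference
    q*theta = psi_pq - psi_pq^r (Equations 16.7 and 16.9).

    Z_orig, Z_rot: complex arrays of Zernike moments, one entry per
    (p, q) pair, for the original and rotated images.
    Returns the corrected moments Z_pq^c = Z_pq^r * exp(j*q*theta).
    """
    q_theta = np.angle(Z_orig) - np.angle(Z_rot)   # psi_pq - psi_pq^r
    Z_corr = Z_rot * np.exp(1j * q_theta)          # Equation 16.9
    return Z_corr

# The corrected real and imaginary parts serve as the global features:
# feats = np.concatenate([Z_corr.real, Z_corr.imag])
```

For similar images the corrected coefficients coincide with those of the original image (Equation 16.14), while for dissimilar images they differ, which is exactly the discrimination behavior illustrated in Figure 16.2.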
Figure 16.2 (a) ZMs real coefficients of similar images (b) ZMs real coefficients of dis-
similar images (c) ZMs imaginary coefficients of similar images, and (d) ZMs
imaginary coefficients of dissimilar images (M represents total number of
moments used).
The selected moment orders generate 41 moments, given in Table 16.1. The
moments Z0,0 and Z1,1 are excluded from the feature set, as Z0,0 signifies the
average gray value of the image and Z1,1 is the first order moment, which is
zero if the centroid of the image falls on the center of the disc.
16.3 SIMILARITY METRICS
Figure 16.3 (a) Normalized histograms (pi) for four images of a similar class (b) Normalized histograms (pi) for four images of a dissimilar class.

In the preceding section, methods were proposed for the extraction of global
and local features. An effective similarity measure is one that preserves the
discrimination power of the features and matches them appropriately. In
existing methods, the Euclidean Distance (ED) metric, also called the L2 norm,
is the most frequently used. In ED, distances in each dimension are squared
before summation, which puts greater emphasis on those features for which
the dissimilarity is large. To overcome this issue we suggest the city block
(CB) distance, also called the L1 norm, and the Bray-Curtis (BC) distance,
also called Sorensen's distance [37, 38]. The BC metric normalizes the feature
values by dividing the summation of absolute differences of the corresponding
feature vectors by the summation of their absolute sums. We analyze the
performance of the proposed system using the three distance metrics ED,
CB, and BC, and find that the BC similarity metric outperforms the rest.
The ED, CB, and BC metrics for the proposed region based descriptor are
given as:
$$d_{r}^{ED}(Q,D) = \sqrt{\sum_{i=0}^{M-1}\left[\left(R(Z_{i}^{Q}) - R(Z_{i}^{D})\right)^{2} + \left(I(Z_{i}^{Q}) - I(Z_{i}^{D})\right)^{2}\right]} \qquad (16.15)$$

$$d_{r}^{CB}(Q,D) = \sum_{i=0}^{M-1}\left|R(Z_{i}^{Q}) - R(Z_{i}^{D})\right| + \left|I(Z_{i}^{Q}) - I(Z_{i}^{D})\right| \qquad (16.16)$$

$$d_{r}^{BC}(Q,D) = \frac{\displaystyle\sum_{i=0}^{M-1}\left|R(Z_{i}^{Q}) - R(Z_{i}^{D})\right| + \left|I(Z_{i}^{Q}) - I(Z_{i}^{D})\right|}{\displaystyle\sum_{i=0}^{M-1}\left|R(Z_{i}^{Q}) + R(Z_{i}^{D})\right| + \left|I(Z_{i}^{Q}) + I(Z_{i}^{D})\right|} \qquad (16.17)$$
where $Z_{i}^{Q}$ and $Z_{i}^{D}$ are the ZMs features of the query and database images,
respectively, and M = 42. The ED, CB, and BC metrics for the proposed
contour based descriptor are given as:
$$d_{c}^{ED}(Q,D) = \sqrt{\sum_{i=0}^{H-1}\left[f_{i}(Q) - f_{i}(D)\right]^{2}} \qquad (16.18)$$

$$d_{c}^{CB}(Q,D) = \sum_{i=0}^{H-1}\left|f_{i}(Q) - f_{i}(D)\right| \qquad (16.19)$$

$$d_{c}^{BC}(Q,D) = \frac{\displaystyle\sum_{i=0}^{H-1}\left|f_{i}(Q) - f_{i}(D)\right|}{\displaystyle\sum_{i=0}^{H-1}\left|f_{i}(Q) + f_{i}(D)\right|} \qquad (16.20)$$
where $f_{i}(Q)$ and $f_{i}(D)$ represent the feature vectors of the query and
database images, respectively, and H is the number of contour based features
(H = 10). Since we consider both global and local features to describe the
shape, the corresponding similarity metrics above are combined to compute
the overall similarity.
Figure 16.4 The performance of similarity metrics ED, CB, and BC for (a) Kimia-99 and
(b) MPEG-7 shape databases.
$$P = \frac{n_{Q}}{T_{Q}} \times 100, \qquad R = \frac{n_{Q}}{D_{Q}} \times 100 \qquad (16.24)$$
where $n_{Q}$ represents the number of similar images retrieved from the database, $T_{Q}$ the total number of images retrieved, and $D_{Q}$ the number of images in the database similar to the query image Q. The system performance is also measured using the Bull's Eye Performance (BEP).
Figure 16.5 The performance comparison using P – R curves for Kimia-99 database with
(a) phase based methods and (b) contour and region based methods.
Figure 16.6 The performance comparison using P – R curves for Brodatz texture database
with (a) phase based methods and (b) contour and region based methods.
Figure 16.7 The performance comparison using P – R curves for COIL-100 database with
(a) phase based methods and (b) contour and region based methods.
Figure 16.5(b) depicts that MI gives the overall worst performance.
However, the proposed system outperforms all the region and contour
based descriptors, followed by CPDH and GFD, with slight variation in
their performance.
(2) Brodatz texture: The next set of experiments is performed over the
texture database, and the results for the phase based methods are given in
Figure 16.6(a). It is observed that the proposed ZMs and the optimal
similarity methods have almost similar performance, followed by the
adjacent phase and CZM methods; however, the proposed ZMs and HT
based approach outperforms all the methods. The performance of MI and
FD is lowest on the texture database, as their P–R curves do not even
emerge in the graph of Figure 16.6(b). The performances of GFD and
CPDH are greatly reduced. Nevertheless, the proposed system and ZMD
show very high performance and overlap with each other, followed
by WLD.
(3) COIL-100: This database contains 3D objects imaged at different
rotation angles, and the performance for this database is given in Figure
16.7(a) for the phase based approaches. The performance trends are
similar to those observed for the previous databases.
Figure 16.8 The performance comparison using P – R curves for MPEG-7 database with
(a) phase based methods and (b) contour and region based methods.
Figure 16.9 The performance comparison using P – R curves for rotation database with (a)
phase based methods and (b) contour and region based methods.
Figure 16.10 The performance comparison using P – R curves for scale database with
(a) phase based methods and (b) contour and region based methods.
Figure 16.11 The performance comparison using P – R curves for translation database with
(a) phase based methods and (b) contour and region based methods.
Figure 16.12 The performance comparison using P – R curves for noise database with
(a) phase based methods and (b) contour and region based methods.
The P–R curves for the phase based methods are given in Figure 16.12(a),
which shows that the proposed solution is highly robust to noise,
followed by the optimal similarity method; CZM has the lowest
performance. Figure 16.12(b) shows the P–R curves for the contour and
region based descriptors, indicating that the region based descriptors are
more robust to noise than the contour based descriptors. The proposed
system shows 100 percent robustness to noise, followed by ZMD and GFD.
Among the contour based descriptors, FD has the worst performance.
(9) Blur: Another photometric test is performed over blurred images. The
results are given in Figure 16.13(a) for the phase based methods, which
shows that all the methods, including the proposed approach, are highly
robust to the blur transformation. When the contour and region based
methods are considered, the proposed system still preserves the highest
retrieval accuracy, followed by CPDH, FD, and WLD. MI shows its worst
performance on blurred images, as can be seen from Figure 16.13(b).
Figure 16.13 The performance comparison using P – R curves for blur database with
(a) phase based methods and (b) contour and region based methods.
The BEP in percentage for the contour and region based methods is given in
Table 16.2. It shows that the proposed method outperforms all the other
methods and gives an average retrieval accuracy of 95.49 percent. When
the average BEP is compared with the phase based approaches, the proposed
ZMs based technique has a higher retrieval rate than the CZM, adjacent
phase, and optimal similarity methods, as presented in Table 16.3.
Table 16.2 The comparison of average BEP of the proposed and other contour and region based methods

Database          FD     WLD    CPDH   MI     GFD    ZMD     Proposed (ZMs+HT)
Kimia-99          63.45  59.5   78.6   8.73   77.5   74.65   99.54
Brodatz texture   9.33   93.07  43.13  7.3    41.43  98.16   98.2
COIL-100          57.94  74.98  86.95  51.43  89.62  84.01   92.71
MPEG-7            36.54  31.89  55.97  34.24  55.59  58.24   77.27
Rotation          22.79  50.14  58.69  27.69  57.71  100     100
Scale             65.77  6.85   52.93  36.62  67.17  100     100
Translation       91.09  93.19  97.83  37.79  73.98  63.09   89.01
Noise             16.66  33.55  30.23  25.73  69.06  92.05   98.2
Blur              66.93  31.71  88.86  20.58  35.42  94.39   100
JPEG              82.93  40.19  93.59  29.98  67.17  100     100
Average           51.34  51.51  68.68  28.01  63.46  86.46   95.49
Table 16.3 The comparison of average BEP of the proposed and phase based methods
The proposed method retains high accuracy on noise, blur, JPEG compressed,
and texture images. ZMD performs better than GFD and the other methods
due to the incorporation of a sinusoid function in the radial kernel, which
gives it spectral-feature properties similar to GFD [7]. On the other hand,
among the local descriptors FD has the worst performance. CPDH performs
better than WLD, and its performance is comparable to that of ZMD or
GFD on some of the databases. The performance of WLD is better for
texture images. It is worth noting that the contour based descriptors are
translation invariant, unlike the region based descriptors; this is because the
centroid of the image is used while computing the contour based features,
making them translation invariant. When the phase based methods are
compared with the proposed solution, the proposed corrected real and
imaginary coefficients of ZMs perform better. The proposed ZMs approach
eliminates the step of estimating the rotation angle in order to correct the
phase coefficients of ZMs and make them rotation invariant; instead, it
directly uses the relationship between the original and rotated phase
coefficients of ZMs to compute qθ. The value of qθ is used to correct the
real and imaginary coefficients of ZMs individually, rather than to compute
magnitudes. When both the ZMs and HT based features are combined, the
performance of the proposed system substantially surpasses the existing
approaches.
Hence, in this chapter we provide a novel solution for an image retrieval
system in which ZMs based global features and HT based local features
are utilized. The corrected real and imaginary coefficients of ZMs are used
as feature vectors representing the global aspect of images, while the
histograms of distances between linear edges and the image centroid
represent the local feature vectors. The global and local features are
combined by the Bray-Curtis similarity measure to compute the overall
similarity among images. The experimental results reveal that the proposed
ZMs and ZMs+HT methods outperform recent region and contour based
descriptors. The extensive analyses also reveal that the proposed system is
robust to geometric and photometric transformations. Averaged over all
the databases, the proposed ZMs+HT method attains a 95.49 percent and
the proposed ZMs method an 89.2 percent retrieval accuracy.
REFERENCES
[1] Alsmadi, M. K. (2020). Content-Based Image Retrieval Using Color, Shape
and Texture Descriptors and Features, Arabian Journal for Science and
Engineering. 45, 3317–3330.
[2] Zhang, X., Shen, M., Li, X., Fang, F. (2022). A deformable CNN-based
triplet model for fine-grained sketch-based image retrieval. Pattern
Recognition. 125, 108508.
[3] Datta, R., Joshi, D., Li, J., Wang, J.Z. (2008). Image retrieval: ideas,
influences, and trends of the new age. ACM Computing Surveys. 40 (2),
1–60.
[4] Chun, Y.D., Kim, N.C., Jang, I.H. (2008). Content-based image retrieval
using multiresolution color and texture features. IEEE Transactions on
Multimedia. 10 (6), 1073–1084.
[5] Yang, C., Yu, Q. (2019). Multiscale Fourier descriptor based on triangular
features for shape retrieval. Signal Processing: Image Communication, 71,
110–119.
[6] Qin, J., Yuan, S. et al. (2022). SHREC’22 track: Sketch-based 3D shape
retrieval in the wild. Computers & Graphics. 107, 104–115.
[7] Zhang, D.S., Lu, G.J. (2004). Review of shape representation and descrip-
tion techniques. Pattern Recognition. 37, 1–19.
[8] Rui, Y., She A., Huang, T.S. (1998). A modified Fourier descriptor for
shape matching in MARS. Image Databases and Multimedia Search. 8,
165–180.
[9] Mokhtarian, F., Mackworth, A.K. (1992). A theory of multiscale, curva-
ture based shape representation for planar curves. IEEE Transactions on
Pattern Analysis and Machine Intelligence. 14, 789–805.
[10] Xu, C.J., Liu, J.Z., Tang, X. (2009). 2D shape matching by contour flexi-
bility. IEEE Transactions on Pattern Analysis and Machine Intelligence. 31
(1), 180–186.
[11] Belongie, S., Malik, J., Puzicha, J. (2002). Shape matching and object rec-
ognition using shape contexts. IEEE Transactions on Pattern Analysis and
Machine Intelligence. 24 (4), 509–522.
[12] Zhang, D., Lu, G. (2002). A comparative study of Fourier descriptors
for shape representation and retrieval, in: The 5th Asian Conference on
Computer Vision (ACCV02).
[13] Shu, X., Wu, X.-J. (2011). A novel contour descriptor for 2D shape matching
and its application to image retrieval. Image and Vision Computing. 29 (4)
286–294.
[14] Chen, J., Shan, S., He, C., Zhao, G., Pietikainen, M., Chen, X., Gao, W.
(2010). WLD: A robust image local descriptor. IEEE Transactions on
Pattern Analysis and Machine Intelligence. 32 (9), 1705–1720.
[15] Sonka, M., Hlavac, V., Boyle, R. (1993). Image Processing, Analysis and
Machine Vision, Springer, NY. 193–242.
[16] Hu, M.-K. (1962). Visual pattern recognition by moment invariants. IEEE
Transactions on Information Theory. 8 (2), 179–187.
[17] Zhang, D., Lu, G. (2002). Shape-based image retrieval using generic Fourier
descriptor, Signal Processing: Image Communication. 17 (10), 825–848.
[18] Kim, W.-Y., Kim, Y.-S. (2000). A region based shape descriptor using
Zernike moments, Signal Processing: Image Communication. 16, 95–102.
[19] Teague M. R. (1980). Image analysis via the general theory of moments.
Journal of Optical Society of America. 70, 920–930.
[20] Singh, C., Pooja (2012). Local and global features based image retrieval
system using orthogonal radial moments. Optics and Lasers in Engineering.
55, 655–667.
[36] Qi, H., Li, K., Shen, Y., Qu, W. (2010). An effective solution for trade-
mark image retrieval by combining shape description and feature
matching. Pattern Recognition. 43, 2017–2027. https://doi.org/10.1016/j.
patcog.2010.01.007
[37] Kokare, M., Chatterji, B.N., Biswas, P.K. (2003). Comparison of similarity
metrics for texture image retrieval, TENCON Conference on Convergent
Technologies for Asia-Pacific Region, 2, 571–575.
[38] Singh, C., Pooja (2011). Improving image retrieval using combined
features of Hough transform and Zernike moments, Optics and Lasers in
Engineering, 49 (12), 1384–1396.
Chapter 17

Translate and recreate text in an image
17.1 INTRODUCTION
Nowadays, most people use language translation software, such as Google
Translate and Microsoft Translator, to translate texts from one language
into another. For example, suppose we were in a foreign country whose
native language we did not know: even a basic task such as ordering food
would require communicating in that language, and a translator would
have to assist in the communication.
There have been significant advances brought about by research into
neural machine translation by Google and its competitors in recent years,
providing positive prospects for the industry. As a result of recent advances
in computer vision and speech recognition, machine translation can now do
more than translate raw text, since various kinds of data (pictures, videos,
audio) are available across multiple languages. With the help of language
translator applications, the user can translate text into their own language;
however, the picture cannot be connected with the text. If the user wants the
picture with the translated text, then the user needs both a language
translator and image editing software to place the translated text back into
the image. Our proposed work presents a method for translating text from
one language to another while leaving the background image unaltered.
Essentially, this allows users to connect with the image in their native
language.
This chapter presents the design of a system that includes three basic
modules: text extraction, machine translation, and inpainting. In addition,
a spelling correction network added after the text extraction layer can solve
the problem of spelling mistakes in OCR-extracted text, since it highlights
misspelled words and corrects them before translation. In this pipeline, the
input image is sequentially processed in order to translate the Tamil text
into English text over the same background.
This work presents a method of improving image text translation for
Tamil, since OCR engines perform poorly for ancient languages like Tamil.
17.2 LITERATURE SURVEY
The method proposed in [1] separates text from a textured background
whose color is similar to the text. Experimentation is carried out on the
authors' own dataset of 300 image blocks, which poses several challenges,
such as images generated manually by adding text on top of relatively
complicated backgrounds. Compared with other methods, the proposed
method achieves more accurate results, namely a precision of 95 percent,
a recall of 92.5 percent, and an F1 score of 93.7 percent, and the algorithm
is robust to the initialized values of its variables.
Separation of text and non-text in handwritten document images using
LBP based features is proposed in [2]. Texture based features such as the
Grey Level Co-occurrence Matrix (GLCM) are proposed for classifying the
segmented regions. A detailed analysis is given of how accurately features
are extracted by different variants of the local binary pattern (LBP)
operator, and a database of 104 handwritten engineering lab copies and
class notes collected from an engineering college is used for experimentation.
For the classification of text and non-text, Naive Bayes (NB), Multi-layer
perceptron (MLP), K-nearest neighbor (KNN), Random forest (RF), and
Support vector machine (SVM) classifiers are used. It is observed that the
Rotation Invariant Uniform Local Binary Pattern (RIULBP) performed
better than the other feature extraction methods.
The authors of [4] proposed a robust Uyghur text localization method for
complex background images, which provides a CPU–GPU parallelization
system. A two stage component classification system is used to filter out
non-text components, and a connected component graph algorithm is used
to construct text lines. Experimentation is conducted on the UICBI400
dataset; the proposed algorithm achieves the best performance and is 12.5
times faster.
In the chapter, Google’s Multilingual Neural Machine Translation System
[5] proposes a Neural Machine Translation (NMT) model to translate
between multiple languages. This approach enables Multilingual NMT
systems with a single model by using a shared word piece vocabulary. The
models stated in this work can learn to perform implicit bridging between
language pairs never seen explicitly during training, which shows that
Translate and recreate text in an image 229
for other languages. This is one of the reasons for inefficient spell checkers
in Tamil, as there is no proper dataset to test and validate the accuracy of
the system.
“DDSpell, A Data Driven Spell Checker and Suggestion Generator for
the Tamil Language” [13] is an application developed using a data-driven
and language-independent approach. The proposed spell checker and
suggestion generator can be used to check misspelled words. The model
uses a dictionary of 4 million Tamil words, created from various sources, to
check spelling. Spelling correction and suggestion are done by character-level
bi-gram similarity matching, minimum edit distance measures, and word
frequencies. In addition, techniques such as hash keys and hash tables were
used to improve the processing speed of spell checking and suggestion
generation.
“Vartani Spellcheck: Automatic Spelling Error Detection and Context-
Sensitive Error Correction” [14] can be used to improve accuracy by
post-processing the text generated by OCR systems. This model uses a
context-sensitive approach for the spelling correction of Hindi text using a
state-of-the-art transformer, BERT, in conjunction with the Levenshtein
distance algorithm. It uses a lookup dictionary and context-based named
entity recognition (NER) to detect possible spelling errors in the text.
“Deep Learning Based Spell Checker for Malayalam Language” [15] is a
novel attempt, and the first of its kind, to implement a spell checker for
Malayalam using deep learning. The spell checker comprises two processes:
error detection and error correction. The error detection section employs an
LSTM based neural network trained to identify misspelled words and the
position where the error has occurred; the error detection accuracy is
measured using the F1 score. Error correction is achieved by selecting the
most probable word from the candidate word suggestions.
This study, “Systematic Review of Spell-Checkers for Highly Inflectional
Languages,” [16] analyzes articles based on certain criteria to identify the
factors that make spellchecking an effective tool. The literature analyzed
regarding spellchecking is divided into key sub-areas according to the lan-
guage in use. Each sub-area is described based on the type of spellchecking
technique in use at the time. It also highlights the major challenges faced
by researchers, along with the future areas for research in the field of spell-
checkers using technologies from other domains such as morphology, parts-
of-speech, chunking, stemming, hash table, and so forth.
In “Improvement of Extract and Recognizes Text in Natural Scene
Images” [17], spell checking is employed in the proposed system to correct
any spelling issues that may arise while optical character recognition is
taking place; spell checking is also applied to recognize unknown characters.
In the inpainting approach considered, an initial correspondence is
randomly assigned to each pixel in the patch region. Here, the PatchMatch
algorithm is applied, which accelerates the search for approximate nearest
neighbors (ANN). Next, the patch is further refined, with initialization
using an onion-peel approach. Finally, the texture features are reconstructed
in a similar manner. A pyramid scheme is implemented with multiple levels
consisting of different texture choices, from which the best one is chosen.
17.3 EXISTING SYSTEM

In traditional machine translation tools, a combination of a first-level
detector network, a second-level recognizer network, and a third-level NMT
network is employed.

Network                       Purpose
Text detector network         Region proposal network (RPN) using variants of Faster-RCNN and SSD to find character, word, and line bounding boxes.
Text extraction network       Convolutional neural network (CNN) with an additional quantized long short-term memory (LSTM) network to identify text inside the bounding box.
Neural machine translation    Sequence-to-sequence network using variants of LSTM and Transformers to translate the identified text into the target language.
17.4 PROPOSED SYSTEM
The proposed system consists of three basic modules: text extraction,
machine translation, and inpainting. In these modules, the input image is
processed sequentially.
17.4.1 Flow chart
17.4.2 Experimental setup
In terms of hardware, the following is required:
17.4.3 Dataset

PM-India corpus dataset

For this work, the Seq2Seq encoder-decoder model is trained using a corpus
of Tamil sentences. Corrupted (error) data for this training is obtained by
rendering the sentences of the PM-India corpus into images. The images are
then fed into OCR to produce the OCR-extracted text, which contains
errors owing to the low accuracy of OCR for Tamil in comparison with
Western languages. The error corpus and the error-free corpus are then used
to train the Seq2Seq auto spell-correction model.
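As a rough sketch of this data-generation loop (our illustration, not the authors' exact code), each clean sentence can be rendered to an image with Pillow and re-read with EasyOCR; the font file name is an assumption and must point to a Tamil-capable font:

```python
import easyocr
from PIL import Image, ImageDraw, ImageFont

reader = easyocr.Reader(['ta'])  # Tamil OCR model
font = ImageFont.truetype("NotoSansTamil-Regular.ttf", 32)  # assumed font file

def make_pair(sentence: str) -> tuple[str, str]:
    """Render a clean sentence to an image, OCR it back, and return the
    (noisy OCR text, clean text) training pair for the spell corrector."""
    img = Image.new("RGB", (1200, 64), "white")
    ImageDraw.Draw(img).text((10, 10), sentence, font=font, fill="black")
    img.save("tmp.png")
    ocr_text = " ".join(reader.readtext("tmp.png", detail=0))
    return ocr_text, sentence
```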
17.5 IMPLEMENTATION

We performed tests on the two models with the PM-India corpus dataset
to decide which one performs well for the Tamil language. For this test we
converted the dataset text into images (Figure 17.7) containing 1,000
random words from the dataset and experimented with both models. The
following model variants were considered:
• Simple RNN
• Embed RNN
• Bidirectional LSTM
• Encoder decoder with LSTM
• Encoder decoder with Bidirectional LSTM +Levenshtein Distance
The encoder compresses the input sequence into a single vector of fixed
length, called the “context vector.” This vector is passed from the encoder
to the decoder once the encoder has processed all the tokens. Target
sequences are predicted token by token by the decoder reading from the
context vector. Figure 17.15 shows how the vectors are built so that they
help the decoder make accurate predictions by containing the full meaning
of the input sequence.
$$
\mathrm{lev}(a,b) =
\begin{cases}
\max(a,b) & \text{if } \min(a,b) = 0,\\[4pt]
\min\begin{cases}
\mathrm{lev}(a-1,\,b) + 1\\
\mathrm{lev}(a,\,b-1) + 1\\
\mathrm{lev}(a-1,\,b-1) + \begin{cases}1 & \text{if } a \neq b\\ 0 & \text{otherwise}\end{cases}
\end{cases} & \text{otherwise}
\end{cases}
$$
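For reference, a direct dynamic-programming implementation of this recurrence (a standard textbook version, not code from the chapter):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    m, n = len(a), len(b)
    # dp[i][j] = lev(first i chars of a, first j chars of b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all i characters
    for j in range(n + 1):
        dp[0][j] = j          # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

# levenshtein("kitten", "sitting") -> 3
```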
17.5.3 Machine translation

We propose using the Python Deep Translator package for Tamil-to-English
translation. Deep Translator is a Python package that allows users to
translate between a variety of languages; it aims to integrate multiple
translators, such as Google Translator, DeepL, Pons, Linguee, and others,
into a single, comprehensive interface.
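A minimal usage sketch (the GoogleTranslator class is part of the deep-translator API; the Tamil sample string is illustrative):

```python
from deep_translator import GoogleTranslator

# Translate a Tamil string to English via the Google backend.
translator = GoogleTranslator(source='ta', target='en')
english_text = translator.translate("வணக்கம், உலகம்")  # "Hello, world"
print(english_text)
```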
17.5.4 Inpainting

In the inpainting module, a mask image is created from the input image
by masking the detected text in the image. In the next step, both the original
and masked images are used to recreate the image region in the masked
area. Finally, the translated text is superimposed onto the image at the
specific position given by the bounding box coordinates of the detected text.
The process of inserting new text into the image in place of the old text
involves four steps.
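As an illustrative sketch of this masking-and-superimposing flow (the helper name is ours; OpenCV's cv2.inpaint applies diffusion-based Telea inpainting rather than the PatchMatch variant discussed in the survey, and the bounding box is assumed to come from the text detector):

```python
import cv2
import numpy as np

def replace_text(img_bgr: np.ndarray, box: tuple, new_text: str) -> np.ndarray:
    """Erase the text inside `box` = (x, y, w, h) by inpainting,
    then draw the translated text at the same position."""
    x, y, w, h = box
    mask = np.zeros(img_bgr.shape[:2], np.uint8)
    mask[y:y + h, x:x + w] = 255                      # mark text pixels
    clean = cv2.inpaint(img_bgr, mask, 3, cv2.INPAINT_TELEA)
    cv2.putText(clean, new_text, (x, y + h),          # superimpose translation
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 0, 0), 2)
    return clean
```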
17.6 RESULT ANALYSIS
In this chapter, we explore the various variants of RNN network models in
order to find the most suitable one for OCR spell checker and correction for
Tamil. The result and analysis are mentioned below.
17.6.1 Simple RNN

The simple RNN model uses a GRU network to perform spell checking and
correction. Since the GRU does not have a separate internal memory cell, it
can only handle a limited number of datasets, which progressively decreases
its training accuracy. The training loss results are shown in the graph
(Figure 17.19).
17.6.2 Embed RNN

As the GRU does not come equipped with an internal memory cell, Embed
RNN replaces the GRU with an LSTM and adds a new input layer that
converts input sequences into a dictionary of vectors. These modifications
make the simple RNN faster, but the loss during training does not reduce,
owing to the conversion of the text into identical alphabets. The loss graph
during training is shown in Figure 17.20.
17.6.3 Bidirectional LSTM

A bidirectional LSTM model can provide a greater level of detail about
future data in order to analyze patterns. Bidirectional LSTMs are unique
in that they can receive input in two directions, whereas standard LSTMs
only allow input to flow in one direction.
$$BP = \begin{cases} 1 & \text{if } c > r,\\ e^{\,1 - r/c} & \text{if } c \leq r, \end{cases}$$

$$\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_{n} \log p_{n}\right)$$

where $c$ is the candidate translation length, $r$ the reference length, $p_{n}$ the modified $n$-gram precisions, and $w_{n}$ their weights.
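A small sketch of these two formulas (standard BLEU bookkeeping; the n-gram precisions p_n are assumed to be computed elsewhere):

```python
import math

def bleu(p: list[float], c: int, r: int) -> float:
    """BLEU from modified n-gram precisions p = [p1..pN], candidate
    length c, and reference length r, with uniform weights w_n = 1/N."""
    bp = 1.0 if c > r else math.exp(1.0 - r / c)      # brevity penalty
    log_avg = sum(math.log(pn) for pn in p) / len(p)  # sum of w_n * log p_n
    return bp * math.exp(log_avg)

# Example: p1..p4 = 0.8, 0.7, 0.6, 0.5 with c=18, r=16 gives about 0.64.
```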
As a result of the procedure mentioned above, our model has a BLEU score
of 0.67. The sample input and output are shown below.
17.7 CONCLUSION
The chapter presents a computer vision and seq2seq encoder-decoder model
used to translate the information contained in an image (Tamil) into English
and place it on the same image background. Several modules are
incorporated into the system for extracting, spell checking, translating, and
inpainting text. Machine translation is improved through modules such as
text extraction with EasyOCR and spelling correction with the Seq2Seq
model.
The system requires improvement in two areas. The first is that it
currently handles only Tamil-to-English translation; by adding several
Indian as well as Western languages, the scope of the project can be
expanded, and the PM-India dataset is already available for all Indian
languages. Furthermore, if words not seen during training are encountered
in any phase of the process, mistranslation and overcorrection will lead to
incorrect results. To improve the process, it is recommended that the dataset
be expanded so that new words become part of it.
ACKNOWLEDGMENTS
We would like to express our sincere gratitude to our respected principal-
in-charge, PSG College of Technology, Coimbatore Dr. K. Prakasan for pro-
viding us the opportunity and facilities to carry out our work.
We would also like to express our sincere thanks to Dr. G. Sudha Sadasivam,
head of the Computer Science and Engineering department, for her guidance
and support in completing our work.
Our sincere thanks to our program coordinator, Dr. G. R. Karpagam,
professor, Department of Computer Science and Engineering for guiding
and encouraging us in completing our project.
Our sincere gratitude to our mentor, Dr. S. Suriya, associate professor,
Department of Computer Science and Engineering, for her valuable
guidance, additional knowledge, and mentorship throughout the course of
our work.
We would also like to extend our gratitude to our tutor Dr. C. Kavitha,
assistant professor, Department of Computer Science and Engineering, for
her continuous support and evaluation throughout our work.
REFERENCES
[1] Shervin Minaee and Yao Wang. Text Extraction From Texture
Images Using Masked Signal Decomposition, IEEE GlobalSIP, pp. 1–5
(2017).
[2] Sourav Ghosh, Dibyadwati Lahiri, Showmik Bhowmik, Ergina
Kavallieratou, and Ram Sarkar. Text/Non-Text Separation from
Handwritten Document Images Using LBP Based Features: An Empirical
Study, MDPI, pp. 1–15 (2018).
[3] Frank D. Julca-Aguilar, Ana L. L. M. Maia and Nina S. T. Hirata.
Text/non-text classification of connected components in document images,
SIBGRAPI, pp. 1–6 (2017).
[4] Yun Song, Jianjun Chen, Hongtao Xie, Zhineng Chen, Xingyu Gao,
Xi Chen. Robust and parallel Uyghur text localization in complex background
images, Machine Vision and Applications, vol. 28, pp. 755–769 (2017).
[5] Johnson, Melvin, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu,
Zhifeng Chen, Nikhil Thorat et al. “Google’s multilingual neural machine
translation system: Enabling zero-shot translation.” Transactions of the
Association for Computational Linguistics 5 (2017).
[6] Bapna, Ankur, Naveen Arivazhagan, and Orhan Firat. “Simple, scal-
able adaptation for neural machine translation.” arXiv preprint
arXiv:1909.08478 (2019).
[7] Tan, Xu, Jiale Chen, Di He, Yingce Xia, Tao Qin, and Tie- Yan Liu.
“Multilingual neural machine translation with language clustering.” arXiv
preprint arXiv:1908.09324 (2019).
[8] Liu, Yinhan, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan
Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. “Multilingual
Chapter 18

Multi-label Indian scene text language identification
18.1 INTRODUCTION
Language identification [1] deals with predicting the script of the text in a
scene image. It is a sub-module of a scene text understanding system [2], as
depicted in Figure 18.1. It is also taken as the successor of the text detec-
tion system [3,4] as well as a predecessor module of the scene text recogni-
tion system [5,6]. As text recognition algorithms are language-dependent,
selecting a correct language model is essential, and this is where our
application comes in. Script identification is a prominent research area in
the computer vision community [7] owing to its wide range of potential
applications, such as
language translation, image-to-text conversion, assistance for tourists, scene
understanding [8], intelligent license reading systems, and product reading
assistance for specially abled people in indoor environments, and so forth.
Although two different languages can have the same script, we have used
language and script interchangeably in this chapter.
While language identification from document analysis [9–11] is a well-
explored problem, scene text language identification still remains an unex-
plored problem. Scene text comprises very few words, contrary to the
presence of longer text passages in document images. Due to huge stroke
structural differences, it is easy to classify scripts using a simple classifier in
cases such as distinguishing between English and Chinese. However, it is
quite cumbersome if the scripts have strong inter-class similarities, like
Russian and English. The existing works on script identification in the wild
are mainly dedicated to English, Chinese, Arabic, and a few East Asian
languages [12–14] and have so far been limited to video overlaid text.
The challenges associated with the scene image text language identifica-
tion task are: (1) enormous difference in the aspect ratios of the text images;
(2) close similarity among scripts in appearance such as Kannada and
Malayalam; (3) character sharing such as in English, Russian, and Greek;
(4) variability in text fonts, scene complexity, and distortion; and (5) presence
of two different languages per cropped text. Research on script identification
in scene text images is scarce and, to the best of our knowledge, mainly
dedicated to English with some Chinese [15] and Korean [12] texts. It also
mainly targets video overlaid text images [15,16], which are horizontal
and clear.

Figure 18.1 Scene text language identification module in the scene text understanding
pipeline.
India is a diverse country with 22 officially recognized scripts that are
disproportionately distributed across the country's demographics. According
to Wikipedia [17], Hindi is spoken by around 57 percent of the total
population, whereas Bengali, Kannada, and Malayalam are used by 8
percent, 4.8 percent, and 2.9 percent, respectively, of the overall population.
As a consequence of this data scarcity, there is a lack of research on
low-resource Indian languages. Though English is one of the official Indian
languages, it is not prevalent in many parts of the Indian subcontinent.
Furthermore, the existing datasets for script identification consist of a single
language per image (SIW-13 [1], MLe2e [12], CVSI [15]) but, in reality,
more than one language can be present in scene text images. Hence, this is
currently an open challenge for the research community in academia and
industry, which motivates us to focus our research on multi-label regional
Indian language identification.
To the best of our knowledge, there is not yet a work that deals with
regional Indian scripts that involve many compound characters and struc-
tural similarities (for instance, Kannada and Malayalam). Although datasets
are available for scene text script identification [12,15,18], they consist of
one word per cropped image. In contrast, two or more world languages can
occur in a scene image in a real environment.
To bridge the gap, we strive to solve the problems associated with script
identification in the wild via multi-label learning, where a text image can
be associated with multiple class labels simultaneously. We create a dataset
called IIITG-MLRIT2022 consisting of two languages of text per image in
18.2 RELATED WORKS

The work on script or language identification initially started in the printed
document text analysis domain, where there are a few established works on
Indian script identification. Ghosh et al. presented a script identification
scheme [9] that can identify 11 official Indian scripts in document text
images; in their approach, they first identify the text's skew and counter-
rotate the text to orient it correctly, and a multi-stage tree classifier is then
employed to identify the script. By contrasting the appearance of the topmost
and bottommost curvatures of lines, edge detection is employed in Phan
et al. [21] to identify scripts comprising English, Chinese, and Tamil. Singh
et al. [22] extracted gray level co-occurrence matrices from handwritten
document images and classified the scripts into Devanagari, Bangla, Telugu,
and Roman using various classifiers such as SVM (Support Vector Machine),
MLP (Multi-Layer Perceptron), Naive Bayes, Random Forest, and so forth,
and concluded that the MLP classifier performed best among them on the
dataset used. By redefining the issue as a sequence-to-label problem, Fuji
et al. [23] proposed script identification at the line level for document texts.
A conventional method for cross-lingual text identification was introduced
by Achint and Urmila [24]: given images with texts in Hindi and English
as input, the algorithm converts those texts into a single language (English).
These conventional component analysis and binarization methods appear
to be rather unfit for images of natural scenes. Moreover, document text
encompasses a series of words, whereas scene text images mostly contain
fewer than two or three words.
The concept of script identification on scene text images was presumably
first introduced by Gllavata et al. [18]. In their research, a number of
hand-crafted features were used to train an unsupervised classifier to
distinguish between Latin and Chinese scripts. Languages such as English,
Greek, Latin, and Russian share a subset of characters; as a result, it is
difficult to recognize such scripts using hand-crafted feature representations
directly. Incorporating deep convolutional features [26–30] and spatial
dependencies can help differentiate such cases.
A CNN and a Naive-Bayes Nearest Neighbour classifier are combined in
a multi-stage manner in Gomez et al. [12]. The images are first divided
into patches, and a sliding window is used to extract stroke-part features.
The features are fed into the CNN to obtain feature vectors that are further
classified using the traditional classifier. For feature representation, deep
features and mid-level representations are merged in Shi et al. [14]:
discriminative clustering is carried out to learn a discriminative pattern
called the codebook, which is optimized in a deep neural network called
the discriminative convolutional neural network. Mei et al. [13] combine a
CNN with a Recurrent Neural Network (RNN) to identify scripts. The CNN
structure, comprising convolutional and max-pooling layers without the
fully connected layer, is stacked up to extract the feature representations of
the image. These image representations are then fed to Long Short Term
Memory (LSTM) layers, whose outputs are amalgamated by average
pooling. Finally, a softmax layer is built on top of the LSTM layers to give
the normalized probabilities of each class. A novel semi-hybrid model
integrating a BoVW (bag of visual words) with convolutional features is
presented in [14]. Local convolutional features, in the form of more
discriminative triplet descriptors, are used for generating the code word
dictionary; merging strong and weak descriptors increases the strength of
the weak descriptors. Ankan et al. [32] developed script identification in
natural scene images and video frames using a Convolutional-LSTM
network. The input to the network is image patches; local and global
features are extracted using the CNN-LSTM framework and dynamically
weighted for script identification. As far as we know, Keserwani et al. [33] were
probably the first and only researchers to propose a scene text script
identification technique via few-shot learning. However, their method is
based on a multi-modal approach, where word corpora of the text scripts
are used along with text images during training to generate the global
feature and semantic embedding vectors.
So far, the works mentioned in this section for scene text script
identification consider a single class label per cropped image. In contrast,
in a real-time environment, more than one language can be present in
scene images (for example, road and railway signboards). Therefore, it
would be helpful to develop an application that can identify multiple
languages simultaneously; its output can then be passed to the corresponding
language recognition system for further processing, such as language
translation, image-to-speech processing, and so forth.
18.3 IIITG-MLRIT2022
The objective of creating the dataset is to provide a realistic and rigorous
benchmark to evaluate multi-label language identification systems,
particularly for regional Indian scripts. As mentioned in the previous
sections, the available public benchmarks [1,13,15] have concentrated
primarily on English, European, and East Asian languages/scripts, with
little mention of the diverse Indian languages. Moreover, the existing
datasets have only one language per cropped image, unlike reality, where
more than one language can exist per image. Hence, we present a novel
dataset for multi-label language identification in the wild for regional
Indian languages, called IIITG-MLRIT2022. It contains five languages
(Hindi, Bengali, Malayalam, Kannada, and English) with two scripts per
image (implying multi-linguality).
The newly created dataset is diverse in nature, containing curved,
perspective-distorted, and multi-oriented text in addition to horizontal
text. This diversity is achieved by applying various image transformation
techniques, such as affine transforms, arcs, and perspective distortion at
different angular degrees. The dataset is harvested from multiple sources:
mobile camera captures, existing datasets, and web sources. We illustrate
sample images with both regular (Rows 1 and 2) and irregular (Rows 3
and 4) text in Figure 18.2. The language combinations of the proposed
dataset are: (Bengali, Hindi); (English, Kannada); and (English, Malayalam).
The collected images are first concatenated into pairs and then resized to
a fixed dimension before being passed as input to the proposed baseline.
We show the statistical distribution of the language text pairs in Table 18.1.
The IIITG-MLRIT2022 dataset has a total of 1,385 text images cropped
from scene images. The dataset classes are slightly imbalanced owing to the
existence of more combinations of English with the other regional Indian
languages. A pie chart representing the class distribution is shown in Figure
18.3(a).

Figure 18.2 Sample images from the proposed benchmark IIITG-MLRIT2022 for regional
Indian scene text script identification.

We also illustrate the co-occurrence matrix of the data labels in Figure
18.3(b). It represents the number of times entities in a row appear in the
same contexts as each entity in the columns, and the number of times each
combination occurs is indicated via color-coding. We hope that the proposed
IIITG-MLRIT2022 will serve as a benchmark for multi-label language
identification in scene images.
18.4 PROPOSED METHODOLOGY

A CNN can be used to implement a model g(I, φ) for multi-label
classification, with an input image I and a C-dimensional score vector s as
the output. In a deep neural network, feature extraction and classification
are integrated in a single framework, thus enabling end-to-end learning.
Multi-label language identification is defined as the task of generating
sequential labels and forecasting the possible labels in a given scene image
containing text.
18.4.1 Transfer learning

The principal philosophy behind transfer learning is using knowledge
learned from performing a particular task to solve problems in a different
field (refer to Figure 18.5). If a learning task Γt is given based on a target
domain Dt, we can get help from a source domain Ds for the learning task
Γt. D(.) denotes the domain of the respective task, which is made up of two
parts: the feature space and the marginal probability distribution. A task is
represented by a pair: a label space and a target prediction function f(.).
Transfer learning tries to improve the predictive function ft(.) for the target
task by using the knowledge available in the source domain Ds.
Each layer receives the feature maps of all preceding layers (its “collective
knowledge”) and sends its own feature vectors to all the successive layers
via concatenation. This increases the efficiency of the network and lowers
the vanishing gradient problem through improved feature propagation in
the forward and backward directions.
DenseNet is composed of N layers. Every layer implements a non-linear
transformation Tn([x0, x1, ..., xn-1]), where n refers to the index of the layer,
Tn(.) is a composite function such as a combination of batch normalization,
ReLU, pooling, and convolutional layers, and [x0, x1, ..., xn-1] is the
concatenation of the feature vectors produced in layers 0 to n-1. To ease
down-sampling in the network, the entire architecture is divided into
multiple compactly connected dense blocks, and the layers between these
blocks are transition layers that perform convolution and pooling.
DenseNet121 is used in our case.
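As a sketch of this connectivity pattern (our illustration in tf.keras, not the chapter's code), each layer's output is concatenated with all earlier feature maps:

```python
from tensorflow.keras import layers

def dense_block(x, num_layers: int = 4, growth_rate: int = 32):
    """A DenseNet-style block: layer n sees the concatenation
    [x0, x1, ..., x_{n-1}] of all earlier feature maps."""
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.ReLU()(y)
        y = layers.Conv2D(growth_rate, 3, padding="same")(y)
        x = layers.Concatenate()([x, y])   # dense connectivity
    return x
```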
$$Y = Z(N \mid \theta) = z_{L}\left(\cdots z_{3}\left(z_{2}\left(N \mid \theta_{2}\right) \mid \theta_{3}\right) \cdots \mid \theta_{L}\right) \qquad (18.1)$$

where $z_{l}(\cdot \mid \theta_{l})$ is layer $l$ of the vanilla CNN. The parameters of layer $l$ are $\theta_{l} = [X, b]$, where $X$ denotes the corresponding filters and $b$ the bias vector. The convolution operation is represented as:

$$Y_{l} = z_{l}\left(N_{l} \mid \theta_{l}\right) = h\left(X * N_{l} + b\right) \qquad (18.2)$$
where * denotes the convolution operation and h(.) represents the point-wise
activation. Pooling layers aid multi-scale analysis and reduce the input
image size. Max pooling applies a max filter to (usually) non-overlapping
sub-regions of the initial representation. The ReLU activation function
returns the value provided as input, or 0 if that value is less than 0; we
chose this function as it accelerates gradient descent towards a global
minimum of the loss function.
Our CNN architecture is constructed using seven convolutional layers.
Every two convolutional layers are followed by a max-pooling layer and a
dropout rate ranging between 0.2 and 0.6. The filter size for each
convolutional layer is set to 3 x 3 with a stride of 1, and the pooling
dimension for the max-pooling layer is set to 2 x 2. In addition to these
layers, the network contains a linear layer of 1,024 neurons and a final layer
of five classes. We used the Adam optimizer for training the network.
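A possible tf.keras rendering of this architecture (our sketch; the input size, per-stage dropout values, filter counts, and the sigmoid output are assumptions the text does not fix, the last being consistent with the BCE-based loss below):

```python
from tensorflow.keras import layers, models

def build_vanilla_cnn(input_shape=(224, 224, 3), num_classes=5):
    """Seven 3x3 conv layers; a 2x2 max-pool and dropout after every
    two conv layers, then a 1,024-unit dense layer and 5 outputs."""
    m = models.Sequential()
    m.add(layers.Input(shape=input_shape))
    filters = [32, 32, 64, 64, 128, 128, 256]            # 7 conv layers
    for i, f in enumerate(filters):
        m.add(layers.Conv2D(f, 3, strides=1, padding="same",
                            activation="relu"))
        if i % 2 == 1:                                    # after every 2 convs
            m.add(layers.MaxPooling2D(2))
            m.add(layers.Dropout(0.2 + 0.05 * i))         # within 0.2-0.6
    m.add(layers.Flatten())
    m.add(layers.Dense(1024, activation="relu"))
    m.add(layers.Dense(num_classes, activation="sigmoid"))  # multi-label
    m.compile(optimizer="adam", loss="binary_crossentropy")
    return m
```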
$$s = \max_{1 \leq j \leq C} \sum_{i=1}^{N} k_{i,j} \qquad (18.3)$$

$$\mathrm{weight}_{j} = \frac{s}{\sum_{k} \mathbb{1}\{y_{k} = j\}} \qquad (18.4)$$

$$\mathrm{MWBCE} = -\frac{1}{C}\sum_{j=1}^{C} \mathrm{weight}_{j} \times \left(y_{j} \log \hat{y}_{j} + (1 - y_{j}) \log(1 - \hat{y}_{j})\right) \qquad (18.5)$$
where $y_{j}$ and $\hat{y}_{j}$ are, respectively, the target label and the predicted label
probability, and weight_j is the class weight. Frequently occurring label
samples therefore receive a smaller weight, and vice versa. This objective
function is used for training the component models separately; each
individual model's parameters are updated iteratively in a direction that
reduces this loss.
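A NumPy sketch of the class-weight computation and the weighted loss (Equations 18.3–18.5; names ours, assuming a binary label matrix Y of shape (num_samples, C)):

```python
import numpy as np

def class_weights(Y: np.ndarray) -> np.ndarray:
    """Equations 18.3-18.4: weight_j = s / count_j, where s is the count
    of the most frequent class; rare classes receive larger weights.
    Assumes every class appears at least once in Y."""
    counts = Y.sum(axis=0)   # per-class label counts
    s = counts.max()         # Equation 18.3
    return s / counts        # Equation 18.4

def mwbce(y_true: np.ndarray, y_pred: np.ndarray, w: np.ndarray) -> float:
    """Modified weighted binary cross-entropy (Equation 18.5)."""
    eps = 1e-7
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_label = w * (y_true * np.log(y_pred) +
                     (1 - y_true) * np.log(1 - y_pred))
    return float(-per_label.mean())
```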
of the three best models while utilizing the proposed objective function,
it fails to overtake the five-model combination when the ordinary BCE is
utilized. We also noticed that, overall, the mAP scores are slightly lower
than the F1-scores.
A comparison of the proposed loss function (MWBCE) with the ordinary
BCE for the CNN and MobileNetV2 components, with respect to training
and validation loss, is depicted in Figures 18.8 and 18.9, respectively. The
plots imply that the proposed objective function is more stable, the training
converges uniformly within the set number of epochs, and a much lower
loss value is obtained (Figures 18.8(b) and 18.9(b)).
For better insight into the F1-score distribution over the different class
label combinations present in IIITG-MLRIT2022, the performance of the
individual models and of the ensemble is checked for each combination,
as shown in Figure 18.10. As depicted in the plot, the English-Kannada
script combination has the highest overall accuracy scores, while the
Bengali-Hindi combination has the lowest prediction score. This may be
due to the similar appearance of Bengali and Hindi, owing to the fact that
the two scripts share the same origin. Of all the component models,
MobileNetV2 produces the lowest prediction score. It is observed from the
18.7 CONCLUSION
In this chapter we introduced a preliminary end-to-end multi-label scene
text language identification framework, which we believe is the first
research to incorporate multiple languages in one image. We created
multi-label scene text word images using two images of different languages.
The word images are classified using a majority-voting deep ensemble
architecture that achieved better prediction accuracy than the individual
component models. The ensemble includes MobileNetV2, ResNet50,
DenseNet, the Xception network, and a 7-layer vanilla CNN. We further
investigated the impact of varying the number of base learners and its
effect on the voting strategy, and found that, when the proposed weighted
objective function is applied, the F1-score of the combination of all the
base learners is superior to that of the combination of the three highest-
performing deep learning models. We have also created a multi-label
scene text language dataset, IIITG-MLRIT2022, the first of its kind based
on regional Indian languages. In the future, the IIITG-MLRIT2022 dataset
can be extended with more Indian-language scene text images. Exploring
an end-to-end multi-lingual natural scene text understanding system
emphasizing the regional Indian languages will also be a good research
direction.
REFERENCES
[1] Shi, B., Yao, C., Zhang, C., Guo, X., Huang, F., & Bai, X. (2015, August).
Automatic script identification in the wild. In 2015 13th International
Conference on Document Analysis and Recognition (ICDAR) (pp. 531–
535). IEEE.
[2] Naosekpam, V., & Sahu, N. (2022). Text detection, recognition, and script
identification in natural scene images: a Review. International Journal of
Multimedia Information Retrieval, 1–24.
[3] Naosekpam, V., Kumar, N., & Sahu, N. (2020, December). Multi-lingual
Indian text detector for mobile devices. In International Conference on
Computer Vision and Image Processing (pp. 243–254). Springer, Singapore.
[4] Naosekpam, V., Aggarwal, S., & Sahu, N. (2022). UTextNet: A UNet
Based Arbitrary Shaped Scene Text Detector. In International Conference
on Intelligent Systems Design and Applications (pp. 368–378). Springer,
Cham.
[5] Naosekpam, V., Shishir, A. S., & Sahu, N. (2021, December). Scene Text
Recognition with Orientation Rectification via IC- STN. In TENCON
2021–2021 IEEE Region 10 Conference (TENCON) (pp. 664–669). IEEE.
[6] Sen, P., Das, A., & Sahu, N. (2021, December). End-to-End Scene Text
Recognition System for Devanagari and Bengali Text. In International
Conference on Intelligent Computing & Optimization (pp. 352– 359).
Springer, Cham.
[7] Naosekpam, V., Bhowmick, A., & Hazarika, S. M. (2019, December).
Superpixel Correspondence for Non-parametric Scene Parsing of Natural
Images. In International Conference on Pattern Recognition and Machine
Intelligence (pp. 614–622). Springer, Cham.
[8] Naosekpam, V., Paul, N., & Bhowmick, A. (2019, September). Dense and
Partial Correspondence in Non-parametric Scene Parsing. In International
Conference on Machine Intelligence and Signal Processing (pp. 339–350).
Springer, Singapore.
[9] Ghosh, S., & Chaudhuri, B. B. (2011, September). Composite script
identification and orientation detection for indian text images. In 2011
International Conference on Document Analysis and Recognition (pp.
294–298). IEEE.
[10] Phan, T. Q., Shivakumara, P., Ding, Z., Lu, S., & Tan, C. L. (2011,
September). Video script identification based on text lines. In 2011
International Conference on Document Analysis and Recognition (pp.
1240–1244). IEEE.
[11] Lui, M., Lau, J. H., & Baldwin, T. (2014). Automatic detection and
language identification of multilingual documents. Transactions of the
Association for Computational Linguistics, 2, 27–40.
[12] Gomez, L., & Karatzas, D. (2016, April). A fine-grained approach to scene
text script identification. In 2016 12th IAPR workshop on document ana-
lysis systems (DAS) (pp. 192–197). IEEE.
[13] Mei, J., Dai, L., Shi, B., & Bai, X. (2016, December). Scene text script
identification with convolutional recurrent neural networks. In 2016 23rd
international conference on pattern recognition (ICPR) (pp. 4053–4058).
IEEE.
276 Intelligent Systems and Applications in Computer Vision
[14] Shi, B., Bai, X., & Yao, C. (2016). Script identification in the wild via
discriminative convolutional neural network. Pattern Recognition, 52,
448–458.
[15] Sharma, N., Mandal, R., Sharma, R., Pal, U., & Blumenstein, M. (2015,
August). ICDAR2015 competition on video script identification (CVSI
2015). In 2015 13th international conference on document analysis and
recognition (ICDAR) (pp. 1196–1200). IEEE.
[16] Naosekpam, V., & Sahu, N. (2022, April). IFVSNet: Intermediate Features
Fusion based CNN for Video Subtitles Identification. In 2022 IEEE 7th
International conference for Convergence in Technology (I2CT) (pp. 1–6).
IEEE.
[17] Wikipedia contributors. (2022, September 8). List of languages by
number of native speakers in India. In Wikipedia, The Free Encyclopedia.
Retrieved 06:04, September 12, 2022, from https://en.wikipedia.org/w/
index.php?title=List_of_languages_by_number_of_native_speakers_in_In
dia&oldid=1109262324
[18] Gllavata, J., & Freisleben, B. (2005, December). Script recognition in images
with complex backgrounds. In Proceedings of the Fifth IEEE International
Symposium on Signal Processing and Information Technology, 2005. (pp.
589–594). IEEE.
[19] Sammut, C., & Webb, G. I. (Eds.). (2011). Encyclopedia of machine
learning. Springer Science & Business Media.
[20] Matan, O. (1996, April). On voting ensembles of classifiers. In Proceedings
of AAAI-96 workshop on integrating multiple learned models (pp. 84–88).
[21] Phan, T. Q., Shivakumara, P., Ding, Z., Lu, S., & Tan, C. L. (2011,
September). Video script identification based on text lines. In 2011
International Conference on Document Analysis and Recognition (pp.
1240–1244). IEEE.
[22] Jetley, S., Mehrotra, K., Vaze, A., & Belhe, S. (2014, October). Multi-script
identification from printed words. In International Conference Image
Analysis and Recognition (pp. 359–368). Springer, Cham.
[23] Fujii, Y., Driesen, K., Baccash, J., Hurst, A., & Popat, A. C. (2017,
November). Sequence-to-label script identification for multilingual ocr. In
2017 14th IAPR international conference on document analysis and recog-
nition (ICDAR) (Vol. 1, pp. 161–168). IEEE.
[24] Kaur, A., & Shrawankar, U. (2017, February). Adverse conditions and
techniques for cross- lingual text recognition. In 2017 International
Conference on Innovative Mechanisms for Industry Applications (ICIMIA)
(pp. 70–74). IEEE.
[25] Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., & LeCun, Y.
(2015, February). The loss surfaces of multilayer networks. In Artificial
intelligence and Statistics (pp. 192–204). PMLR.
[26] Mahajan, S., Abualigah, L., & Pandit, A. K. (2022). Hybrid arithmetic
optimization algorithm with hunger games search for global optimization.
Multimedia Tools and Applications, 1–24.
[27] Mahajan, S., & Pandit, A. K. (2022). Image segmentation and optimiza-
tion techniques: a short overview. Medicon Eng Themes, 2(2), 47–49.
Multi-label Indian scene text language identification 277
[28] Mahajan, S., Abualigah, L., Pandit, A. K., & Altalhi, M. (2022). Hybrid
Aquila optimizer with arithmetic optimization algorithm for global opti-
mization tasks. Soft Computing, 26(10), 4863–4881.
[29] Mahajan, S., & Pandit, A. K. (2021). Hybrid method to supervise fea-
ture selection using signal processing and complex algebra techniques.
Multimedia Tools and Applications, 1–22.
[30] Mahajan, S., Abualigah, L., Pandit, A. K., Nasar, A., Rustom, M.,
Alkhazaleh, H. A., & Altalhi, M. (2022). Fusion of modern meta-heuristic
optimization methods using arithmetic optimization algorithm for global
optimization tasks. Soft Computing, 1–15.
[31] Chen, Z. M., Wei, X. S., Wang, P., & Guo, Y. (2019). Multi-label image
recognition with graph convolutional networks. In Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition (pp.
5177–5186).
[32] Bhunia, A. K., Konwer, A., Bhunia, A. K., Bhowmick, A., Roy, P. P., &
Pal, U. (2019). Script identification in natural scene image and video
frames using an attention based Convolutional-LSTM network. Pattern
Recognition, 85, 172–184.
[33] Keserwani, P., De, K., Roy, P. P., & Pal, U. (2019, September). Zero shot
learning based script identification in the wild. In 2019 International
Conference on Document Analysis and Recognition (ICDAR) (pp. 987–
992). IEEE.
[34] Kuncheva, L. I. (2014). Combining Pattern Classifiers: Methods and
Algorithms. John Wiley & Sons.
[35] Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable
are features in deep neural networks?. Advances in neural information pro-
cessing systems, 27.
[36] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018).
Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings
of the IEEE conference on computer vision and pattern recognition (pp.
4510–4520).
[37] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for
image recognition. In: Proceedings of the IEEE conference on computer
vision and pattern recognition (pp. 770–778).
[38] Iandola, F., Moskewicz, M., Karayev, S., Girshick, R., Darrell, T., &
Keutzer, K. (2014). Densenet: Implementing efficient convnet descriptor
pyramids. arXiv preprint arXiv:1404.1869.
[39] Chollet, F. (2017). Xception: Deep learning with depthwise separable
convolutions. In Proceedings of the IEEE conference on computer vision
and pattern recognition (pp. 1251–1258).
[40] Wang, J., Yang, Y., Mao, J., Huang, Z., Huang, C., & Xu, W. (2016).
Cnn- rnn: A unified framework for multi- label image classification. In
Proceedings of the IEEE conference on computer vision and pattern recog-
nition (pp. 2285–2294).
[41] Ganaie, M. A., & Hu, M. (2021). Ensemble deep learning: A review. arXiv
preprint arXiv:2104.02395.
Chapter 19

AI based wearables for healthcare applications

19.1 INTRODUCTION
Wrist watches have seen tremendous development towards being called "smart," mainly for use in remote health monitoring and mobile health [1]. Smart watches can be considered as important an innovation as the smartphone, since they feature continuous data monitoring for health, for example step counts, pulse monitoring, energy use, and physical activity levels [2]. They can give feedback to users, who can track their health, perform just-in-time interventions (for example, medication use based on consultations), and communicate directly with caregivers and doctors [3]. The extensive use of this technology in medical services and telemedicine is restricted by barriers specific to smart watches, such as expense, wearability, and battery life [4]. Hence, the analysis presented here studies the applications of the smart watch in the medical field and its related characteristics, leading to possible smart watch applications for monitoring health remotely. Medical services and telemedicine depend on the use of mobile phones to enable remote health monitoring of patients [5]–[12]. Instances of practical use of a smartphone in the medical sector are management of a prolonged disease at home [5], discontinuation of the smoking habit [6], family planning [7], mental health treatment [8], and various other applications in clinical studies [9]–[12].
Smartphones allow continuous interactive communication from any location, have the computing ability to support media software applications, and enable consistent observation of patients through remote sensing technologies [12]. However, smartphones cannot be worn on the body (owing to their large size) to give certain real-time pulse data, and are not generally carried during the activities of interest, for example during strenuous physical activities [13]. Smart watches, on the other hand, can be worn on the body (especially the wrist) to gather real-time pulse and activity data from the human user, which makes them ideal for medical care and associated applications [14].
A set of guiding questions helped determine whether a smart watch fits the present laboratory trial blueprint for varied medical sector applications. These questions helped in finding the areas of the medical sector that would benefit from using a smart watch in their system layout, along with determining the sensing technologies and the best-fit smart watches to be utilized. The questions were answered via an organized study, which is explained below. The analysis studied the types of applications in the medical sector, the explanations of the tests carried out, and the characteristics of this tiny device's technologies, which were the most significant of all. For analyzing the resulting innovative characteristics among the selected articles, a short search was carried out via Web search engines, so that the smart watches present in the market could be located along with their technical characteristics. Section 19.2 presents the systematic review, which is followed by a discussion in Section 19.3; finally, a conclusion is given in Section 19.4.
19.2 SYSTEMATIC REVIEW
19.2.2 Source of information
To dig out appropriate smart watch research for the medical sector, material on smart watch applications in this field was searched for in the ACM Digital Library, IEEE Xplore, Elsevier, Springer, EBSCO, and PubMed databases, covering 1998–2020. The findings in every database were restricted to English-language papers and papers translated into English. "Smart watch health care," "smart watch apps," "smart watch app," and "smart watch application" were the keywords applied to identify the required research papers. The short-listed papers from the above-mentioned databases were then filtered for relevant research papers, as per the stated criteria. The papers falling under the eligibility criteria were dispatched to two independent reviewers, who evaluated and then verified the appropriate material. Also, reference lists from the short-listed research were added to the findings, and analysis was done as per the eligibility criteria.
19.2.3 Outcomes

A total of 1391 papers were identified in the search (Figure 19.1 illustrates the PRISMA flow chart of the results), and 610 unique articles remained after duplicates were removed. After assessing the articles against the stated inclusion criteria, 161 articles were chosen for further consideration based on the above qualification measures. The screening cycle showed that 128 of these 161 papers were excluded from further investigation on the grounds that a smart watch was not utilized. After an abstract review, 33 papers were selected for full-text review. From these 33 papers, 25 were retained for the final analysis; of these, none had sufficient design homogeneity for meta-analysis. Eight papers from the final survey were excluded because they used a hybrid fitness band and smart watch (for example, the Microsoft Band), were unclear about whether they tested a smart watch or another wearable sensor, only examined the integration of the framework, or did not unambiguously clarify whether pilot data from human participants was collected. Besides, a few frameworks did not indicate which sort of technology was used in the intervention, or permitted participants to use either a fitness band, like a Fitbit, or a smart watch. Fitness bands were rejected from the survey, as they do not provide input and user interface features comparable to smart watches, such as the distinct software applications available.
19.2.4 Healthcare applications

Smart watches have been tried out in very few clinical survey applications; 25 papers were scrutinized, and a majority of the assessments focused on utilizing smart watches to monitor different activity types.
19.2.5.2 Sensors

Table 19.3 [36]–[51] shows that commercially available smart watches have a plethora of sensors. Accessible sensors include microphones, GPS, compasses, altimeters, barometers, pulse sensors, pedometers, magnetometers, proximity sensors, gyroscopes, and accelerometers. The modular smart watch portrayed in Table 19.3 is a custom smart watch based on the Android OS; it permits scientists to pick the sensing instruments they would like on the watch, which allows additional types of sensors significant for wellness surveillance. These sensors include sweat sensors, skin temperature sensors, ECG (electrocardiogram), and pulse oximeters. Such a smart watch may turn out to be especially helpful in future medical care applications, as it accommodates physiological sensors not commonly found in smart watches, allowing people to be monitored continuously in their local setting and thereby allowing more potential ailments to be checked and more interventions to be investigated.
19.3 DISCUSSION

Even though the majority of the features highlighted above are accessible on mobile phones, there were a few strong reasons to include smart watches in these investigations. First, as previously reported by researchers in multiple studies [13], [17]–[19], [21], [22], [27]–[29], [31], [33], [34], [52]–[58], smart watches with their inertial sensors can gather motion data directly from the wrist.
Table 19.2 Comparative analysis of smart watches used for different application areas in the literature

Table 19.3 Commercially available smart watches and their characteristics (excerpt)

Manufacturer             Edition                      Price (USD)  Battery (mAh)  Available sensors                                      OS            Bluetooth version
Pebble Technology [36]   Pebble Classic Smart watch   100          140            Microphone, compass, accelerometer                     Android Wear  BLE 4.0
Samsung [37]             Samsung Gear Live            100          300            Accelerometer, gyroscope, heart rate, compass, camera  Android Wear  BLE 4.0
19.4 CONCLUDING REMARKS

It has been observed that only 25 articles out of the 1391 studies on smart watch utilization are directed towards its use in medical care. Moreover, these studies had restricted applications, which included healthcare education, home-based care, nursing, self-management of chronic diseases, and activity monitoring. All the studies were regarded as feasibility or usability studies, and thus had an extremely small number of study subjects enrolled. Due to the lack of randomized clinical trial research, further examination on bigger populations is recommended. This will evaluate the adequacy of utilizing smart watches in medical service interventions and may ultimately prompt widespread adoption of the technology in this field.
REFERENCES
[1] T. C. Lu, C. M. Fu, M. H. M. Ma, C. C. Fang, and A. M. Turner, “Healthcare
Applications of Smart Watches: A Systematic Review,” Appl. Clin. Inform.,
vol. 7, no. 3, p. 850, 2016, doi: 10.4338/ACI-2016-03-RA-0042
[2] E. M. Glowacki, Y. Zhu, E. Hunt, K. Magsamen-Conrad, and J. M.
Bernhardt, “Facilitators and Barriers to Smartwatch Use Among
Individuals with Chronic Diseases: A Qualitative Study.” Presented at the
annual University of Texas McCombs Healthcare Symposium. Austin, TX,
2016.
[3] B. Reeder and A. David, “Health at hand: A systematic review of smart
watch uses for health and wellness,” J. Biomed. Inform., vol. 63, pp. 269–
276, Oct. 2016, doi: 10.1016/J.JBI.2016.09.001
[4] D. C. S. James and C. Harville, “Barriers and Motivators to Participating in
mHealth Research Among African American Men,” Am. J. Mens. Health,
vol. 11, no. 6, pp. 1605–1613, Nov. 2017, doi: 10.1177/1557988315620276
[5] T. de Jongh, I. Gurol-Urganci, V. Vodopivec-Jamsek, J. Car, and R. Atun,
“Mobile phone messaging for facilitating self-management of long-term
illnesses,” Cochrane database Syst. Rev., vol. 12, no. 12, Dec. 2012,
doi: 10.1002/14651858.CD007459.PUB2
[6] R. Whittaker, H. Mcrobbie, C. Bullen, A. Rodgers, and Y. Gu, “Mobile
phone-based interventions for smoking cessation," Cochrane database
Syst. Rev., vol. 4, no. 4, Apr. 2016, doi: 10.1002/14651858.CD006611.
PUB4
[7] C. Smith, J. Gold, T. D. Ngo, C. Sumpter, and C. Free, “Mobile phone-
based interventions for improving contraception use,” Cochrane database
Syst. Rev., vol. 2015, no. 6, Jun. 2015, doi: 10.1002/14651858.CD011159.
PUB2
[8] E. Fisher, E. Law, J. Dudeney, C. Eccleston, and T. M. Palermo,
“Psychological therapies (remotely delivered) for the management of
chronic and recurrent pain in children and adolescents,” Cochrane database
Syst. Rev., vol. 4, no. 4, Apr. 2019, doi: 10.1002/14651858.CD011118.
PUB3
[9] J. S. Marcano Belisario, J. Jamsek, K. Huckvale, J. O’Donoghue, C. P.
Morrison, and J. Car, "Comparison of self-administered survey ques-
tionnaire responses collected using mobile apps versus other methods,”
Cochrane database Syst. Rev., vol. 2015, no. 7, Jul. 2015, doi: 10.1002/
14651858.MR000042.PUB2
[10] E. Ozdalga, A. Ozdalga, and N. Ahuja, “The smartphone in medicine: a
review of current and potential use among physicians and students,” J.
Med. Internet Res., vol. 14, no. 5, 2012, doi: 10.2196/JMIR.1994
[11] T. L. Webb, J. Joseph, L. Yardley, and S. Michie, “Using the internet to
promote health behavior change: a systematic review and meta-analysis
of the impact of theoretical basis, use of behavior change techniques, and
mode of delivery on efficacy,” J. Med. Internet Res., vol. 12, no. 1, 2010,
doi: 10.2196/JMIR.1376
[12] C. Free, G. Phillips, L. Felix, L. Galli, V. Patel, and P. Edwards, "The effectiveness of M-health technologies for improving health and health services: a systematic review protocol," BMC Res. Notes, vol. 3, 2010, doi: 10.1186/1756-0500-3-250
[13] G. F. Dunton, Y. Liao, S. S. Intille, D. Spruijt-Metz, and M. Pentz,
“Investigating children’s physical activity and sedentary behavior using
ecological momentary assessment with mobile phones,” Obesity (Silver
Spring)., vol. 19, no. 6, pp. 1205– 1212, Jun. 2011, doi: 10.1038/
OBY.2010.302
[14] F. Ehrler and C. Lovis, “Supporting Elderly Homecare with
Smartwatches: Advantages and Drawbacks,” Stud. Health Technol. Inform.,
vol. 205, pp. 667–671, 2014, doi: 10.3233/978-1-61499-432-9-667
[15] S. Mann, “Wearable Computing: A First Step Toward Personal Imaging,”
Cybersquare Comput., vol. 30, no. 2, 1997, Accessed: Dec. 26, 2021.
[Online]. Available: http://wearcam.org/ieeecomputer/r2025.htm
[16] H. Ali and H. Li, "Designing a smart watch interface for a notifica-
tion and communication system for nursing homes,” in Human Aspects
of IT for the Aged Population. Design for Aging: Second International
Conference, ITAP 2016, Held as Part of HCI International 2016, Toronto,
ON, Canada, July 17–22, Proceedings, Part I 2, pp. 401–411. Springer
International Publishing.
[17] E. Årsand, M. Muzny, M. Bradway, J. Muzik, and G. Hartvigsen,
“Performance of the first combined smartwatch and smartphone diabetes
diary application study,” J. Diabetes Sci. Technol., vol. 9, no. 3, pp. 556–
563, 2015.
[18] O. Banos et al., “The Mining Minds digital health and wellness frame-
work,” Biomed. Eng. Online, vol. 15, no. 1, pp. 165–186, 2016.
[19] C. Boletsis, S. McCallum, and B. F. Landmark, “The use of smartwatches
for health monitoring in home-based dementia care,” in Human Aspects of
IT for the Aged Population. Design for Everyday Life: First International
Conference, ITAP 2015, Held as Part of HCI International 2015, Los
Angeles, CA, USA, August 2–7, 2015. Proceedings, Part II 1, pp. 15–26.
Springer International Publishing.
[20] P. Chippendale, V. Tomaselli, V. d’Alto, G. Urlini, C. M. Modena, S.
Messelodi, ... and G. M. Farinella, “Personal shopping assistance and
navigator system for visually impaired people,” in Computer Vision-ECCV
2014 Workshops: Zurich, Switzerland, September 6-7 and 12, 2014,
Proceedings, Part III 13, pp. 375–390. Springer International Publishing.
[21] H. Dubey, J. C. Goldberg, K. Mankodiya, and L. Mahler, “A multi-
smartwatch system for assessing speech characteristics of people with
dysarthria in group settings,” in 2015 17th International Conference on
E-health Networking, Application & Services (HealthCom), pp. 528–533.
IEEE.
[22] H. Dubey, J. C. Goldberg, M. Abtahi, L. Mahler, and K. Mankodiya, "EchoWear: smartwatch technology for voice and speech treatments of patients with Parkinson's disease," 2015.
[23] M. Duclos, G. Fleury, P. Lacomme, R. Phan, L. Ren, and S. Rousset, “An
acceleration vector variance based method for energy expenditure estima-
tion in real-life environment with a smartphone/smartwatch integration,”
Expert Syst. Appl., vol. 63, pp. 435–449, 2016.
[24] S. Faye, R. Frank, and T. Engel, "Adaptive activity and context recogni-
tion using multimodal sensors in smart devices,” in Mobile Computing,
Applications, and Services: 7th International Conference, MobiCASE
2015, Berlin, Germany, November 12–13, 2015, Revised Selected Papers
7, pp. 33–50. Springer International Publishing.
[25] M. Haescher, J. Trimpop, D. J. C. Matthies, G. Bieber, B. Urban, and T.
Kirste, “aHead: considering the head position in a multi-sensory setup
of wearables to recognize everyday activities with intelligent sensor
fusions,” in Human-Computer Interaction: Interaction Technologies:
17th International Conference, HCI International 2015, Los Angeles, CA,
USA, August 2–7, 2015, Proceedings, Part II 17, pp. 741–752. Springer
International Publishing.
[26] Y. Jeong, Y. Chee, Y. Song, and K. Koo, “Smartwatch app as the chest com-
pression depth feedback device,” in World Congress on Medical Physics
and Biomedical Engineering, June 7–12, 2015, Toronto, Canada, pp.
1465–1468. Springer International Publishing.
[27] H. Kalantarian and M. Sarrafzadeh, “Audio-based detection and evalu-
ation of eating behavior using the smartwatch platform,” Comput. Biol.
Med., vol. 65, pp. 1–9, 2015.
[28] J. Lockman, R. S. Fisher, and D. M. Olson, “Detection of seizure-like
movements using a wrist accelerometer,” Epilepsy Behav., vol. 20, no. 4,
pp. 638–641, Apr. 2011, doi: 10.1016/J.YEBEH.2011.01.019
[29] W. O. C. Lopez, C. A. E. Higuera, E. T. Fonoff, C. de Oliveira Souza, U. Albicker, and J. A. E. Martinez, "Listenmee® and Listenmee® smartphone application: synchronizing walking to rhythmic auditory cues to improve gait in Parkinson's disease," Hum. Mov. Sci., vol. 37, pp. 147–156, 2014.
[30] B. Mortazavi et al., “Can smartwatches replace smartphones for posture
tracking?,” Sensors, vol. 15, no. 10, pp. 26783–26800, 2015.
[31] L. de S. B. Neto, V. R. M. L. Maike, F. L. Koch, M. C. C. Baranauskas,
A. de Rezende Rocha, and S. K. Goldenstein, “A wearable face recogni-
tion system built into a smartwatch and the blind and low vision users,”
in Enterprise Information Systems: 17th International Conference, ICEIS
2015, Barcelona, Spain, April 27–30, 2015, Revised Selected Papers 17,
pp. 515–528. Springer International Publishing.
[32] C. Panagopoulos, E. Kalatha, P. Tsanakas, and I. Maglogiannis,
“Evaluation of a mobile home care platform,” in Ambient Intelligence:
12th European Conference, AmI 2015, Athens, Greece, November 11–13,
2015, Proceedings 12, pp. 328–343. Springer International Publishing.
[33] V. Sharma et al., "SPARK: personalized Parkinson disease interventions
through synergy between a smartphone and a smartwatch,” in Design,
User Experience, and Usability. User Experience Design for Everyday Life
Applications and Services: Third International Conference, DUXU 2014,
Held as Part of HCI International 2014, Heraklion, Crete, Greece, June
22–27, 2014, Proceedings, Part III 3, pp. 103–114. Springer International
Publishing.
[34] E. Thomaz, I. Essa, and G. D. Abowd, "A practical approach for recognizing eating moments with wrist-mounted inertial sensing," in
Chapter 20

Nature inspired computing for optimization

20.1 INTRODUCTION
Optimization is the process of determining the best possible solution to a difficult or time-sensitive problem. Optimization algorithms can be stochastic or deterministic in nature. Different types of approaches are used for optimization, but nature offers the best way to solve optimization problems: the mapping between natural processes and engineering problems is readily made, and because nature-inspired algorithms mimic the behavior of nature in technology, they often prove much better than the traditional or other approaches for solving complex tasks.
Nature is a marvelous teacher, and we human beings are perhaps the best learners on earth. The term "nature-inspired computing" (NIC) refers to a group of computing methodologies inspired by natural systems and processes. These systems and processes can be observed in nature and modelled for computing applications. Table 20.1 lists a few of the many computing techniques that have been inspired by nature. NIC mimics natural processes and systems to develop algorithms that computing machines can use to solve highly complex and non-linear problems. A typical NIC system is a sophisticated, autonomous computer system run by a population of independent entities within an environment. Each autonomous entity of a NIC system is made up of three components: sensors, processing elements (PEs), and effectors, and one or more of each may be present. Sensors gather data about the entity's wider and immediate surroundings; the information gathered naturally depends on the system being represented or the issue being handled. Effectors alter the internal state, display certain behaviors, and alter the environment based on the input and output of the PEs. In essence, the PEs and effectors enable information exchange between autonomous units. The NIC system provides a database of local behavior rules, and these behavior codes are essential to an autonomous unit.
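As a toy illustration of this sense-process-act architecture (the entity's state, rule base, and environment below are hypothetical, purely for exposition):

```python
from dataclasses import dataclass, field

@dataclass
class AutonomousEntity:
    """Toy NIC entity with a sensor, a processing element, and effectors."""
    state: float = 0.0
    rules: dict = field(default_factory=lambda: {"gain": 0.5})  # local behavior rules

    def sense(self, env):
        # Sensor: gather data about the immediate surroundings.
        return env["signal"] - self.state

    def process(self, percept):
        # Processing element: map the percept to an action via the rule base.
        return self.rules["gain"] * percept

    def act(self, env, action):
        # Effectors: alter the internal state and the shared environment.
        self.state += action
        env["signal"] -= 0.1 * action

env = {"signal": 1.0}
agent = AutonomousEntity()
for _ in range(3):
    agent.act(env, agent.process(agent.sense(env)))
print(round(agent.state, 3))
```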
Fuzzy logic, artificial neural networks, evolutionary computation, rough sets, granular computing, swarm intelligence, and physics- and chemistry-based methods are the principal NIC paradigms. Fuzzy logic covers any theory or method that makes use of fuzzy sets; thus we shall state that "the use of fuzzy sets in logical expressions is called fuzzy logic." Fuzzy logic based systems are modelled on the psychology and philosophy of the working of the human brain.
A computing paradigm based on fuzzy logic may be described by the block diagram of Figure 20.1. The system consists of four main parts. The fuzzification module transforms the crisp input(s) into fuzzy values. Then, using the knowledge base (rule base and procedural knowledge) provided by the domain expert, the inference engine processes these values in the fuzzy domain. Finally, the defuzzification module converts the processed output from the fuzzy domain back to the crisp domain.
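A minimal sketch of this four-part pipeline, assuming triangular membership functions, a three-rule Mamdani-style rule base, and centroid defuzzification (the temperature/fan-speed example is hypothetical):

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function; degree of membership in [0, 1]."""
    x = np.asarray(x, dtype=float)
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def fuzzy_fan_speed(temp_c):
    """Crisp temperature in, crisp fan speed out."""
    # 1) Fuzzification: crisp input -> membership degrees.
    cold = tri(temp_c, -10, 5, 20)
    warm = tri(temp_c, 15, 25, 35)
    hot  = tri(temp_c, 30, 45, 60)
    # 2) Inference with an expert rule base:
    #    IF cold THEN slow; IF warm THEN medium; IF hot THEN fast.
    speed = np.linspace(0, 100, 201)
    agg = np.maximum.reduce([
        np.minimum(cold, tri(speed, 0, 20, 40)),    # slow
        np.minimum(warm, tri(speed, 30, 50, 70)),   # medium
        np.minimum(hot,  tri(speed, 60, 80, 100)),  # fast
    ])
    # 3) Defuzzification: centroid of the aggregated fuzzy output.
    return np.sum(speed * agg) / np.sum(agg)

print(round(fuzzy_fan_speed(28.0), 1))  # a warm day yields a mid-range speed
```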
Apart from fuzzy sets [1], rough sets [4], and granular computing [5,6], perception-based computing, wisdom technology, and anticipatory computing [3] also find extensive applications in modelling solutions to complex problems.
Table 20.1 Some nature-inspired computing techniques

Atmosphere clouds model optimization; biogeography-based optimization; brain storm optimization; differential evolution; dolphin echolocation; Egyptian vulture algorithm; eco-inspired evolutionary algorithm; fish-school search; Japanese tree frogs calling algorithm; flower pollination algorithm; gene expression; great salmon run; group search optimizer; human-inspired algorithm; invasive weed optimization; OptBees; paddy field algorithm; roach infestation algorithm; queen-bee evolution; shuffled frog leaping algorithm; termite colony optimization
20.2 EVOLUTIONARY COMPUTING

These methods are being used more and more often to solve a wide range of problems, from cutting-edge scientific research to practical applications in business and industry. Evolutionary computing (EC) is the study and application of the theory of evolution in an engineering and computing context.
In EC systems we usually define five things: a phenotype, a genotype, genetic operators (such as combination/crossover and mutation), a fitness function, and selection operators. The phenotype is a solution to the problem we want to solve. The genotype is the representation of that solution, which undergoes variation and selection in the algorithm. Most often, but not always, the phenotype and the genotype are the same. Evolutionary computation [15–17] is a type of optimization approach motivated by the mechanics of biological evolution and the behaviors of living things. In general, EC algorithms include learning classifier systems (LCS), genetic algorithms (GA), evolutionary strategies, genetic programming (GP), and evolutionary programming (EP). Differential evolution (DE) and estimation of distribution algorithms (EDA) are also included in EC.
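The five ingredients above can be made concrete with a minimal genetic algorithm sketch; the bit-string genotype, the decoding into a real-valued phenotype, and the parameter choices below are illustrative assumptions, not a prescribed design:

```python
import random

# Minimal GA: maximize f(x) = -(x - 3)^2 over real x.
# Genotype: list of 16 bits; phenotype: x decoded into [-10, 10].
def decode(bits):
    return -10 + 20 * int("".join(map(str, bits)), 2) / (2**16 - 1)

def fitness(bits):                          # fitness function on the phenotype
    x = decode(bits)
    return -(x - 3) ** 2

def crossover(a, b):                        # one-point combination operator
    p = random.randrange(1, len(a))
    return a[:p] + b[p:]

def mutate(bits, rate=0.02):                # bit-flip mutation operator
    return [1 - g if random.random() < rate else g for g in bits]

pop = [[random.randint(0, 1) for _ in range(16)] for _ in range(30)]
for _ in range(100):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                      # truncation selection operator
    pop = parents + [mutate(crossover(random.choice(parents),
                                      random.choice(parents)))
                     for _ in range(20)]
print(round(decode(max(pop, key=fitness)), 2))  # converges near x = 3
```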
20.3 SWARM INTELLIGENCE

The idea of swarm intelligence was initially put forward by James Kennedy and Russell Eberhart, inspired by the swarming behaviors of ants, wasps, bees, and other organisms. Individually such agents lack intelligence, but their capacity for coordinated action in the absence of any coordinator makes them appear intelligent. The agents communicate with one another to develop "intelligence," and they do so without any centralized control or supervision. Swarm intelligence based algorithms are among the most widely used; popular examples include the bat algorithm, the firefly algorithm, artificial bee colony, cuckoo search, and particle swarm optimization.
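As a hedged sketch of the canonical particle swarm optimization update (the inertia weight and acceleration coefficients below are common textbook values, not prescriptive):

```python
import random

def pso(f, dim=2, n=20, iters=100, w=0.7, c1=1.5, c2=1.5):
    """Minimal particle swarm optimization minimizing f over [-5, 5]^dim."""
    pos = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(n)]
    vel = [[0.0] * dim for _ in range(n)]
    pbest = [p[:] for p in pos]                 # each particle's personal best
    gbest = min(pbest, key=f)[:]                # the swarm's global best
    for _ in range(iters):
        for i in range(n):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                # Velocity blends inertia, own memory, and social information.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if f(pos[i]) < f(pbest[i]):
                pbest[i] = pos[i][:]
                if f(pbest[i]) < f(gbest):
                    gbest = pbest[i][:]
    return gbest

sphere = lambda x: sum(v * v for v in x)
print(pso(sphere))   # approaches the optimum at the origin
```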
20.4 PHYSICS AND CHEMISTRY BASED ALGORITHMS

The gravitational search algorithm draws on Newton's law of gravitation, by which bodies attract each other with a force directly proportional to "the product of their masses and inversely proportional to the square of their distance from one another." The searcher agents are a group of masses that interact with one another according to the laws of motion and gravity. The performance of each agent determines its mass, and the agent is treated like an object: all objects gravitate toward objects with heavier masses due to the gravitational force. The algorithm's exploitation step is guaranteed by the slower movement of the heavier masses, which correspond to excellent solutions. In effect, the masses obey the law of gravity.
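A much-simplified single update step of such a gravity-based search might look like the sketch below; for brevity it omits the velocity memory and the decaying gravitational constant used in full formulations, and it attenuates the pull by the distance R rather than R squared, as many published variants do:

```python
import numpy as np

def gravity_step(pos, fit, G=1.0, eps=1e-9):
    """One simplified gravity-based update; lower fitness -> heavier mass."""
    worst, best = fit.max(), fit.min()
    m = (worst - fit) / (worst - best + eps)   # map fitness to raw masses
    M = m / (m.sum() + eps)                    # normalized masses
    acc = np.zeros_like(pos)
    for i in range(len(pos)):
        for j in range(len(pos)):
            if i != j:
                r = np.linalg.norm(pos[j] - pos[i]) + eps
                # Pull on agent i toward agent j, stronger for heavier j.
                acc[i] += np.random.rand() * G * M[j] * (pos[j] - pos[i]) / r
    return pos + acc

pos = np.random.uniform(-5, 5, size=(10, 2))
for _ in range(50):
    pos = gravity_step(pos, fit=np.sum(pos**2, axis=1))
print(pos.mean(axis=0))   # the swarm of masses drifts toward the origin
```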
Apart from the above-mentioned search and optimization approaches, a number of optimization approaches based on the laws of physics or chemistry are available in the literature; they are shown in Table 20.3.
20.5 CONCLUSION

This chapter presents an extensive survey of the nature-inspired search and optimization approaches available in the existing literature. These optimization approaches can further be applied successfully in different fields of engineering, such as wireless communication, control engineering, neural networks, and so forth. Depending upon the nature and requirements of the problem, any of the optimization approaches can be chosen. It has been found that these nature-inspired optimization approaches are promoted by researchers due to their better results in comparison with the classical approaches.
REFERENCES
1. Zadeh, L.A. “Fuzzy sets”, Information and Control, Vol. 8 (3), pp. 338–
353, 1965.
2. Yen, J. and Langari, R. Fuzzy Logic Intelligence, Control and Information.
Prentice Hall, Upper Saddle River, NJ, 1999, pp. 548.
3. Lavika Goel, Daya Gupta, V.K. Panchal and Ajith Abraham. “Taxonomy of
Nature Inspired Computational Intelligence: A Remote Sensing Perspective”,
Fourth World Congress on Nature and Biologically Inspired Computing
(NaBIC-2012), pp. 200–206.
4. Pawlak, Z. “Rough Sets”, International Journal of Computer and
Information Sciences, 11, pp. 341–356, 1982.
5. Bargiela, A. and Pedrycz W. Granular Computing: An Introduction, Kluwer
Academic Publishers, Boston, 2002.
6. Yao, Yiyu. “Perspectives of Granular Computing” Proceedings of IEEE
International Conference on Granular Computing, Vol. I, pp. 85–90, 2005.
7. Bishop, Chris M. “Neural Networks and their applications”, Review of
Scientific Instruments, Vol. 65, No. 6, June 1994, pp. 1803–1832.
8. Soroush, A.R., Kamal-Abadi, Nakhai, Bahreininejad, A. "Review on applications of artificial neural networks in supply chain management", World Applied Sciences Journal 6 (supplement 1), pp. 12–18, 2009.
9. Jin, Yaochu, Jingping Jin, Jing Zhu. “Neural Network Based Fuzzy
Identification and Its Applications to Control of Complex systems”, IEEE
Transactions on Systems, Man and Cybernetics, Vol. 25, No. 6, June 1995,
pp. 990–997.
10. Simon Haykin. Neural Networks: A Comprehensive Foundation, Prentice
Hall PTR, Upper Saddle River, NJ, 1994.
11. Jacek, M. Zurada. Introduction to Artificial Neural Systems, West Publishing
Co., 1992.
12. Martin T. Hagan, Howard B. Demuth, Mark H. Beale, Neural Network
Design, Martin Hagan, 2014.
13. Widrow, Bernard and Lehr, Michael A. "30 Years of Adaptive Neural Networks: Perceptron, Madaline and Back Propagation", Proceedings of the IEEE, Vol. 78, No. 9, Sep 1990, pp. 1415–1442.
14. Jain, Anil K., Mao, Jianchang and Mohiuddin, K.M. "Artificial Neural Networks: A Tutorial", IEEE Computer, Vol. 29, No. 3, March 1996, pp. 31–44.
15. Back, T. Evolutionary computation: comments on the history and current
state. IEEE Trans. Evol. Comput. 1:3–17. 1997.
16. S. Kumar, S.S. Walia, A. Kalra. “ANN Training: A Review of Soft Computing
Approaches”, International Journal of Electrical & Electronics Engineering,
Vol. 2, Spl. Issue 2, pp. 193–205. 2015.
17. A. Kalra, S. Kumar, S.S. Walia. “ANN Training: A Survey of classical and
Soft Computing Approaches”, International Journal of Control Theory and
Applications, Vol. 9, issue 34 pp. 715–736. 2016.
18. Jaspreet, A. Kalra. “Artificial Neural Network Optimization by a Hybrid
IWD- PSO Approach for Iris Classification”, International Journal of
Electronics, Electrical and Computational System IJEECS, Vol. 6, Issue 4,
pp. 232–239. 2017.
19. "Comparative Survey of Swarm Intelligence Optimization Approaches for ANN Optimization", ICICCD, Advances in Intelligent Systems and Computing book series, Springer, Vol. 624, p. 305. 2017.
20. Mahajan, S., Abualigah, L. & Pandit, A.K. Hybrid arithmetic opti-
mization algorithm with hunger games search for global optimization.
Multimed Tools Appl 81, 28755–28778. 2022. https://doi.org/10.1007/s11
042-022-12922-z
21. Mahajan, S., Abualigah, L., Pandit, A.K. et al. Hybrid Aquila opti-
mizer with arithmetic optimization algorithm for global optimization
tasks. Soft Comput 26, 4863– 4881. 2022. https://doi.org/10.1007/s00
500-022-06873-8
22. Run Ma, Shahab Wahhab Kareem, Ashima Kalra, Rumi Iqbal Doewes,
Pankaj Kumar, Shahajan Miah. “Optimization of Electric Automation
Control Model Based on Artificial Intelligence Algorithm”, Wireless
Communications and Mobile Computing, vol. 2022, 9 pages, 2022. https://
doi.org/10.1155/2022/7762493
23. Soni, M., Nayak, N.R., Kalra, A., Degadwala, S., Singh, N.K. and Singh, S.
“Energy efficient multi-tasking for edge computing using federated learning”,
International Journal of Pervasive Computing and Communications,
Emerald. 2022.
24. Wolpert, David H. and Macready, William G. "No free lunch theorems for optimization". IEEE Transactions on Evolutionary Computation, Vol. 1, No. 1, April 1997, pp. 67–82.
Chapter 21

Automated smart billing cart for fruits

21.1 INTRODUCTION
Retail outlets have taken on huge importance in everyday life. People in metropolitan areas routinely go to malls to purchase their daily necessities. In such a situation, the shopping environment ought to be hassle-free. This system is planned for edibles such as fresh produce and other consumable goods, for which barcode stickers and RFID tags cannot be used, as they would have to be stuck on every item and the quality of each item must be individually assessed. This chapter proposes a system that contains a camera that itemizes the purchase using AI techniques and a load cell that measures the weight of each item added to the shopping bag. The system also generates the shopper's final bill. The system makes use of the following technologies:
1. Machine learning
2. Internet of Things (IoT)
3. Image processing
21.2 LITERATURE SURVEY
IoT is unquestionably driving humankind to a superior world, except that
it is important to keep in mind factors like energy utilization, time required,
cost factors, and so on. This features issues in eight categories:
These topics are essential. From this we can arrive at a goal that IoT is the
vision to what is to accompany extended intricacy in distinguishing, incit-
ation, exchanges, control, and in making data from immense proportions
of data, achieving an abstractly special and more direct lifestyle than is
experienced today.
Image processing is the analysis and manipulation of digitized images, especially in order to improve their quality. Image processing has several stages: image pre-processing, image segmentation, image edge detection, feature extraction, and image recognition. The major beneficiaries of image processing are agriculture, multimedia security, remote sensing, computer vision, medical applications, and so forth. Image thresholding is done using the Otsu procedure, after which pre-processing of the image is done; further segmentation is performed using the K-Means algorithm, the Fuzzy C-Means [1] algorithm, and the TsNKM [4] method during the preparation cycle. Then, using the KNN approach, the distance between the features of a new fruit image and the features in the existing fruit image dataset is calculated using the Euclidean distance, and the image is classified using the resulting groupings.
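As a hedged sketch of this KNN step (the three-dimensional feature vectors below are hypothetical stand-ins for whatever shape and color features are actually extracted):

```python
import numpy as np

def knn_predict(train_feats, train_labels, query, k=3):
    """Classify a fruit feature vector by its k nearest training samples."""
    # Euclidean distance between the query and every stored feature vector.
    dists = np.linalg.norm(train_feats - query, axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(train_labels[nearest], return_counts=True)
    return labels[np.argmax(counts)]            # majority label among the k

# Hypothetical features: [mean hue, elongation, area]; labels are fruit names.
X = np.array([[0.08, 1.0, 0.6], [0.10, 1.1, 0.7],   # apples
              [0.15, 2.8, 0.5], [0.16, 3.0, 0.55]]) # bananas
y = np.array(["apple", "apple", "banana", "banana"])
print(knn_predict(X, y, np.array([0.14, 2.6, 0.5])))  # -> "banana"
```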
This work [3] relies on the use of the speeded-up robust features (SURF) method, which extracts local features from the filtered image and performs object recognition. The fundamental steps are as follows. First, a dataset of the images to be classified is created. Then, image pre-processing is done through various image processing techniques to improve the quality of the image, and later a few filters are used to de-stretch the image. Finally, image classifiers are utilized to determine the class. The image is converted from an RGB image to a grayscale image. Based on the speeded-up robust technique, local features are extracted and described. To describe the texture of the input image, a statistical assessment of dissimilarity is performed. Various features were extracted, for instance for object recognition, image registration, boundary detection, and image retrieval. Objects and boundaries of images are obtained by image segmentation. Then, features such as the shape, size, color, and texture of the fruit are computed. Then, for disease classification, model training is applied. The system likewise incorporates visible surface-flaw detection algorithms, not solely to identify flaws, but to prevent their variations from hindering the creation of a standard.
In this study, two-dimensional fruit images are analyzed using shape- and color-based examination systems. Using an artificial neural network (ANN), a system was developed to increase the precision of fruit quality detection. The basic first step is to obtain the image of the fruit: images of the fruit samples are obtained using a standard contemporary camera with a white backdrop and a stand. The second step is to train the neural network: the fruit images are loaded into MATLAB® to perform feature extraction on every single sample in the dataset. The third phase extracts the features of the fruit samples under test. In the fourth step, the neural network is used to classify the data. A fruit sample is chosen from the dataset for testing in the fifth phase. The ANN training module button is used to execute a sixth test under a steady condition. Finally, ANN-based outcomes are obtained, with the user having the option of selecting the instance of the fruit to be tested and ultimately purchased.
21.3 PROPOSED METHOD

Our system automatically identifies the fruit or vegetable put into the cart. We use the TensorFlow Lite object detection module for image processing. Collection of data is done with a Raspberry Pi. A load cell measures the weight of the fruits added, and a camera captures images of the fruits. A Bluetooth module sends all the data to the mobile device. After all the fruits are added to the cart, a bill is generated on screen; after the customer enters their CVV and bank details, payment is made.
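A minimal sketch of the recognition step on the Raspberry Pi, assuming a hypothetical fruits.tflite classification model and label list (the chapter's detection model would additionally output bounding boxes):

```python
import numpy as np
from tflite_runtime.interpreter import Interpreter  # or tf.lite.Interpreter

# Hypothetical model file and label list; both are assumptions.
LABELS = ["apple", "banana", "orange", "strawberry"]

interpreter = Interpreter(model_path="fruits.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def classify(frame_rgb):
    """frame_rgb: HxWx3 uint8 image already resized to the model's input size."""
    interpreter.set_tensor(inp["index"], np.expand_dims(frame_rgb, 0))
    interpreter.invoke()
    scores = interpreter.get_tensor(out["index"])[0]
    i = int(np.argmax(scores))
    return LABELS[i], float(scores[i])   # predicted fruit and its confidence
```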
21.4 IMPLEMENTATION

• The user logs in to our application using the given design. The user enters personal information such as his or her name, phone number, and email address. After successfully registering, the user can then log in to the program. All the credentials are stored in the database.
• The user places a fruit in front of the camera on the load cell. The load cell weighs the fruit. The camera then recognizes the fruit in front of it using the TensorFlow Lite model. Once the model has run, the fruit is recognized and displayed on the screen together with an image of the fruit.
• Once the fruit is recognized, the HC-05 module transfers the data from the Raspberry Pi to the mobile app (see the sketch following this list). All the fruits are then added to the cart and the bill is generated on the screen. After clicking on the proceed-to-payment button, a new window is displayed. The user must then enter their bank details, with their card number and CVV.
• A one-time password (OTP) is sent to the registered mobile number. After the CVV details and OTP are entered, a QR code is generated in the app. A bill is also received on the registered email ID.
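A minimal sketch of the HC-05 transfer step, assuming the module is paired and bound to a serial port (the port name, baud rate, and newline-delimited JSON payload format are all assumptions):

```python
import json
import serial  # pyserial; assumes the HC-05 is bound to a serial port

# Hypothetical port name; 9600 baud is the common HC-05 default.
bt = serial.Serial("/dev/rfcomm0", baudrate=9600, timeout=1)

def send_item(name, weight_g, price):
    """Push one recognized item from the Raspberry Pi to the mobile app."""
    payload = {"item": name, "weight_g": weight_g, "price": price}
    bt.write((json.dumps(payload) + "\n").encode("utf-8"))  # newline-delimited JSON

send_item("banana", 182, 0.45)
```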
1. Customer registration on the application.
2. On successful registration, login with valid credentials.
3. Add your credit/debit card details.
4. Connect to the respective trolley via the application.
5. Place a fruit/vegetable on the load cell.
6. The camera captures the image and recognizes the object.
7. Details of the object are sent to the customer's mobile phone with the help of the HC-05 (Bluetooth module).
8. The data received can be seen once the view-cart button is hit.
9. Further, proceed to checkout.
10. Enter the CVV and receive an OTP.
11. Enter the OTP; the process is completed and you receive a QR code (bill).
Image processing using the TensorFlow Lite module gives fast and accurate results. It also requires less processing power and offers lower inference time. These models are used to obtain fast and accurate results in real-time applications such as a smart cart system.
21.6 RESULTS

• The user registers on the mobile application by providing personal details. After successful registration, the user can log in to the application. All the credentials are stored in the database (see Figures 21.1 and 21.2).
• The user places a fruit in front of the camera on the load cell. The load cell weighs the fruit. The camera then recognizes the fruit in front of it using the TensorFlow Lite model. Once the model has run, the fruit is recognized and displayed on the screen along with the recognition accuracy (see Figure 21.3).
• Once the fruit is recognized, the HC-05 module transfers the data from the Raspberry Pi to the mobile app. All the fruits are then added to the cart and the bill is generated on the screen. After clicking on the proceed-to-payment button, a new window is displayed. The user must enter their bank details, with their card number and CVV.
• An OTP is sent to the registered mobile number. After the CVV details and OTP are entered, a QR code is generated in the app (see Figure 21.4). A bill is also received on the registered email ID.
21.7 CONCLUSION

This chapter explained a proposed mechanized smart trolley system that can be utilized by any shopping center and that will save time as well as reduce the number of customers queuing at the checkout. The proposed trolley is not difficult to use and promises to save time and create gains for shopping center proprietors. This framework is likewise truly convenient for customers, as it saves time and establishes a hassle-free environment. The tested recognition accuracy on this module was 65 percent for oranges, 60 percent for bananas, 70 percent for apples, and 68 percent for strawberries. This automated smart shopping cart is user friendly, and anyone can access it in supermarkets.
REFERENCES
[1] Yogesh, Iman Ali, Ashad Ahmed, “Segmentation of Different Fruits Using
Image Processing Based on Fuzzy C-means Method,” in 7th International
Conference on Reliability, Infocom Technologies and Optimization. August
2018. DOI: 10.1109/ICRITO.2018.8748554
[2] Md Khurram Monir Rabby, Brinta Chowdhury and Jung H. Kim, "A Modified Canny Edge Detection Algorithm for Fruit Detection and Classification," in International Conference on Electrical and Computer Engineering (ICECE). December 2018. DOI: 10.1109/ICECE.2018.8636811
[3] Jose Rafael Cortes Leon, Ricardo Francisco Martínez-González, Anilu Miranda Medina, "Raspberry Pi and Arduino Uno Working Together as a Basic Meteorological Station," in International Journal of Computer Science & Information Technology. October 2017. DOI: 10.5121/ijcsit.2017.9508
[4] H. Hambali, S.L.S. Abdullah, N. Jamil, H. Harun, “Intelligent segmenta-
tion of fruit images using an integrated thresholding and adaptive K-means
method (TsNKM)," Jurnal Teknologi, vol. 78, no. 6-5, pp. 13–20. 2016.
DOI: 10.11113/jt.v78.8993
[5] D. Shukla and A. Desai, “Recognition of fruits using hybrid features and
machine learning.” In International Conference on Computing, Analytics
and Security Trends (CAST), pp. 572–577. December 2016, DOI: 10.1109/
CAST.2016.7915033
[6] M. Zawbaa, M. Hazman, M. Abbass and A.E. Hassanien, “Automatic fruit
classification using random forest algorithm,” 2014, in 14th International
Conference on Hybrid Intelligent Systems, pp. 164–168. 2014. DOI: 10.1109/
HIS.2014.7086191
[7] Y. Mingqiang, K. Kpalma, and J. Ronsin, "A Survey of shape feature extrac-
tion techniques.” In Book Pattern Recognition Techniques, Technology and
Applications pp. 43–98. 2008.
[8] D.G. Lowe, "Object recognition from local scale-invariant features."
In Computer Vision, in the proceedings of the seventh IEEE inter-
national conference, Corfu, Greece. pp. 1150–1157. 1999. DOI: 10.1109/
ICCV.1999.790410