
Face Mask Detector

A Project Report submitted in partial fulfillment


of the requirement for the award of the degree of

TECHNICAL UNIVERSITY DEGREE

Submitted by

ACHRAF CHTITEH ZAKARIA ELASRI

ADNANE ELMOURABITI

Under supervision of

Mr. MRANI NABIL


ACKNOWLEDGEMENT

First of all, we would like to express our deepest gratitude to the almighty Allah for giving us the ability to work hard and to succeed. Words will never be enough to express our gratefulness.

Then, we would like to give special thanks and respect to our honorable teacher and supervisor Mr. MRANI NABIL for his constant guidance, advice, encouragement, and every possible help in the overall preparation of this report.

Finally, we would also like to thank our colleagues and everyone who has contributed directly or indirectly to completing this report.
TABLE OF CONTENTS

PROJECT INTRODUCTION
WHAT IS DEEP LEARNING
  DEFINITION OF FUNDAMENTAL CONCEPTS
  INTRODUCTION TO KERAS & TENSORFLOW & MOBILENETV2
PROJECT STRUCTURE
  OBJECTIVE
  APPROACH
THE PROPOSED METHOD
  DATA PROCESSING
BUILDING BLOCKS OF CNN ARCHITECTURE
IMPLEMENTATION OF THE MASK DETECTOR
SCREENSHOTS FROM EXECUTION
CONCLUSION
Abstract
The coronavirus COVID-19 pandemic is causing a global health crisis, and according to the World Health Organization (WHO), one of the most effective protection methods is wearing a face mask in public areas. After the outbreak of the worldwide COVID-19 pandemic, a severe need arose for protection mechanisms, the face mask being the primary one. The basic aim of the project is to detect the presence of a face mask on human faces in live streaming video as well as in images. A hybrid model using deep and classical machine learning for face mask detection will be presented. The face mask detection dataset consists of images with and without masks; we use TensorFlow to perform real-time face detection from a live stream via our webcam. We will use the dataset to build a COVID-19 face mask detector with computer vision using Python, TensorFlow, and Keras. Our goal is to identify whether the person in an image or video stream is wearing a face mask or not, with the help of computer vision and deep learning.

1 Project Introduction

The year 2020 has shown mankind a mind-boggling series of events, amongst which the COVID-19 pandemic is the most life-changing one, and it has startled the world since the year began. Affecting the health and lives of masses, COVID-19 has called for strict measures to be followed in order to prevent the spread of the disease. From the very basic hygiene standards to the treatments in hospitals, people are doing all they can for their own and society's safety; face masks are one such piece of personal protective equipment. People wear face masks once they step out of their homes, and authorities strictly ensure that people are wearing face masks while they are in groups and public places.
To monitor that people are following this basic safety principle, a strategy should be developed. A face mask detector system can be implemented to check this. Face mask detection means identifying whether a person is wearing a mask or not. The first step in recognizing the presence of a mask on the face is to detect the face, which divides the strategy into two parts: detecting faces and detecting masks on those faces. Face detection is one of the applications of object detection and can be used in many areas like security, biometrics, law enforcement, and more. There are many detector systems developed around the world and being implemented. However, they still need optimization: a better, more precise detector is required, because the world cannot afford any further increase in corona cases.
What is deep learning?

This chapter covers


◾ High-level definitions of fundamental concepts
◾ Introduction to Keras, TensorFlow, and MobileNetV2

In the past few years, artificial intelligence (AI) has been a subject of intense media hype. Machine learning, deep learning, and AI come up in countless articles, often outside of technology-minded publications. We're promised a future of intelligent chatbots, self-driving cars, and virtual assistants: a future sometimes painted in a grim light and other times as utopian, where human jobs will be scarce and most economic activity will be handled by robots or AI agents. For a future or current practitioner of machine learning, it is important to be able to recognize the signal in this noise.
1.1 Artificial intelligence, machine learning, and deep learning
First, we need to define clearly what we’re talking about when we
mention AI. What are artificial intelligence, machine learning, and deep
learning (see figure 1.1)? How do they relate to each other?

Figure 1.1 Artificial intelligence, machine learning, and deep learning: deep learning is a subfield of machine learning, which is itself a subfield of artificial intelligence.

1. Artificial intelligence
Artificial intelligence was born in the 1950s, when a handful of pioneers from the nascent field of computer science started asking whether computers could be made to "think": a question whose ramifications we're still exploring today. For decades, the dominant approach was symbolic AI, in which programmers handcraft explicit rules for manipulating knowledge.
Although symbolic AI proved suitable for solving well-defined, logical problems, such as playing chess, it turned out to be intractable to figure out explicit rules for solving more complex, fuzzy problems, such as image classification, speech recognition, and language translation. A new approach arose to take symbolic AI's place: machine learning.

2. Machine learning
Machine learning arises from this question: could a computer go
beyond “what we know how to order it to perform” and learn on its own
how to perform a specified task? Could a computer surprise us? Rather
than programmers crafting data-processing rules by hand, could a
computer automatically learn these rules by looking at data?
This question opens the door to a new programming paradigm. In
classical programming, the paradigm of symbolic AI, humans input rules
(a program) and data to be processed according to these rules, and out
come answers (see figure 1.2). With machine learning, humans input data
as well as the answers expected from the data,
and out come the rules. These rules can then be applied to new data to
produce original answers.
Figure 1.2 Machine learning: a new programming paradigm. Classical programming takes rules and data in and produces answers; machine learning takes data and answers in and produces rules.

A machine-learning system is trained rather than explicitly programmed. It’s


presented with many examples relevant to a task, and it finds statistical
structure in these examples that eventually allows the system to come up with
rules for automating the task.

3. Deep learning

Deep learning is a class of machine learning algorithms that uses multiple layers to
progressively extract higher-level features from the raw input. For example,
in image processing, lower layers may identify edges, while higher layers may
identify the concepts relevant to a human such as digits or letters or
faces. Learning can be supervised, semi-supervised or unsupervised.

1. Supervised learning
Supervised learning (SL) is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. Common supervised-learning problems include:
◾ Syntax tree prediction: given a sentence, predict its decomposition into a syntax tree.
◾ Object detection: given a picture, draw a bounding box around certain objects inside the picture. This can also be expressed as a classification problem (given many candidate bounding boxes, classify the contents of each one) or as a joint classification and regression problem, where the bounding-box coordinates are predicted via vector regression.
◾ Image segmentation: given a picture, draw a pixel-level mask on a specific object.

2. Unsupervised learning
This branch of machine learning consists of finding interesting transformations of the input data without the help of any targets. It allows the model to work on its own to discover patterns and information that were previously undetected, and it mainly deals with unlabelled data.

3. Self-supervised learning

This is a specific instance of supervised learning, but it's different enough that it deserves its own category: self-supervised learning is supervised learning without human-annotated labels.
1. Introduction to TensorFlow
TensorFlow is a free and open-source software library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library, and it is also used for machine learning applications such as neural networks. In the proposed model, the whole sequential CNN architecture (consisting of several layers) uses TensorFlow as its backend. It is also used to reshape the data (images) during data processing.

2. Introduction to Keras
Keras is a deep-learning framework for Python that provides a
convenient way to define and train almost any kind of deep-learning
model. Keras was initially developed for researchers, with the aim of
enabling fast experimentation.
Keras has the following key features:
◾ It allows the same code to run seamlessly on CPU or GPU.
◾ It has a user-friendly API that makes it easy to quickly prototype
deep-learning models.
◾ It has built-in support for convolutional networks (for computer vision), recurrent networks (for sequence processing), and any combination of both.
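As a quick sketch of how little code this API requires, the following minimal model is defined and compiled in a few lines (the layer sizes are chosen only for illustration and are not those of our final detector):

from tensorflow.keras import layers, models

# A minimal illustrative Keras model; the sizes here are arbitrary.
model = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(100,)),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()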

Keras has well over 200,000 users, ranging from academic researchers and engineers at both startups and large companies to graduate students and hobbyists. Keras is used at Google, Netflix, Uber, CERN, Yelp, Square, and hundreds of startups working on a wide range of problems.

Figure 2.1 Google web search interest for different deep-learning frameworks over time
3. Introduction to MobileNetV2
MobileNetV2 is a significant improvement over MobileNetV1 and pushes the state of the art for mobile visual recognition, including classification, object detection, and semantic segmentation. MobileNetV2 is released as part of the TensorFlow-Slim Image Classification Library, or you can start exploring MobileNetV2 right away in Colaboratory. Alternately, you can download the notebook and explore it locally using Jupyter. MobileNetV2 is also available as modules on TF-Hub, and pretrained checkpoints can be found on GitHub.

MobileNetV2 builds upon the ideas from MobileNetV1 [1], using depthwise separable convolutions as efficient building blocks. However, V2 introduces two new features to the architecture: 1) linear bottlenecks between the layers, and 2) shortcut connections between the bottlenecks. The basic structure is shown below.

Overview of MobileNetV2 Architecture. Blue blocks represent composite convolutional building blocks as shown above.
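For reference, the pretrained network can also be loaded directly from tf.keras.applications; this small sketch simply instantiates the ImageNet-pretrained classifier and prints its structure:

from tensorflow.keras.applications import MobileNetV2

# Load the full ImageNet-pretrained MobileNetV2 classifier
# (roughly 3.5M parameters, designed for mobile and embedded use).
model = MobileNetV2(weights="imagenet")
model.summary()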
Project structure

a) Objective
To identify whether a person in an image or video stream is wearing a face mask, with the help of computer vision and a deep learning algorithm, using the TensorFlow/Keras library.
b) Approach
• Process the data
• Train the deep learning model (MobileNetV2)
• Apply the mask detector over images / live video streams

The dataset/ directory contains the data described in the "Our COVID-19 face mask detection dataset" section. This dataset consists of 1,376 images belonging to two classes:
• with_mask: 690 images
• without_mask: 686 images

We'll be reviewing three Python scripts:

• train_mask_detector.py : Accepts our input dataset and fine-tunes MobileNetV2 upon it to create our mask_detector.model
• detect_mask_image.py : Performs face mask detection in static images
• detect_mask_video.py : Using your webcam, this script applies face mask detection to every frame in the stream
The Proposed Method

The proposed method consists of a cascade classifier and a pre-trained CNN which contains two 2D convolution layers connected to layers of dense neurons. The algorithm for face mask detection proceeds in two stages: first detect the faces in the frame, then classify each detected face as masked or unmasked.

The proposed method can locate the face in real time and assess how the mask is being worn, to aid the control of the pandemic in public areas.
A. Data Processing
Data preprocessing involves the conversion of data from a given format to a much more user-friendly, desired, and meaningful format. The data can be in any form: tables, images, videos, graphs, etc. The organized information fits an information model or schema and captures the relationships between different entities.

• Conversion of RGB image to gray image

Modern descriptor-based image recognition systems regularly work on grayscale images, without elaborating the method used to convert from color to grayscale. This is because the color-to-grayscale method is of little consequence when using robust descriptors.
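As a concrete sketch (assuming OpenCV is installed and a local file named image.jpg exists), the conversion is a single call:

import cv2

# Load a color image (OpenCV reads images in BGR channel order)
# and convert it to a single-channel grayscale image.
image = cv2.imread("image.jpg")       # hypothetical local file
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
print(image.shape, "->", gray.shape)  # (h, w, 3) -> (h, w)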
Building blocks of CNN architecture

A Convolutional Neural Network (ConvNet/CNN) is a Deep


Learning algorithm which can take in an input image, assign importance
(learnable weights and biases) to various aspects/objects in the image and be
able to differentiate one from the other. The pre-processing required in a
ConvNet is much lower as compared to other classification algorithms. While
in primitive methods filters are hand-engineered, with enough training,
ConvNets have the ability to learn these filters/characteristics.

What is CNN's greatest advantage?


A CNN has little dependence on preprocessing, decreasing the human effort needed to develop its functionalities. It is easy to understand and fast to implement, and it achieves among the highest accuracies of all image classification algorithms.
Anatomy of a convolutional neural network
Among neural networks, convolutional neural networks (ConvNets or CNNs) are one of the main categories used for image recognition and image classification. Object detection, face recognition, etc., are some of the areas where CNNs are widely used.

A computer sees an input image as an array of pixels, depending on the image resolution. Based on the image resolution, it will see h x w x d (h = height, w = width, d = dimension), e.g., a 6 x 6 x 3 array of RGB values for a color image (3 refers to the RGB channels) or a 4 x 4 x 1 array for a grayscale image.

Figure 3: Every image is a matrix of pixel values.

Technically, to train and test deep learning CNN models, each input image is passed through a series of convolution layers with filters (kernels), pooling, and fully connected (FC) layers, and a softmax function is applied to classify an object with probabilistic values between 0 and 1.
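A minimal sketch of that pipeline in Keras (the layer counts and sizes below are chosen only for illustration):

from tensorflow.keras import layers, models

# Convolution -> pooling -> flatten -> fully connected -> softmax,
# mirroring the pipeline described above (sizes are illustrative).
model = models.Sequential([
    layers.Conv2D(16, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(2, activation="softmax"),  # class probabilities in [0, 1]
])
model.summary()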
1. Convolution Layer

Convolution is the first layer used to extract features from an input image. Convolution preserves the relationship between pixels by learning image features using small squares of input data. It is a mathematical operation that takes two inputs: an image matrix and a filter (or kernel).

Consider a 5 x 5 image whose pixel values are 0 or 1, and a 3 x 3 filter matrix, as shown below.

The convolution of the 5 x 5 image matrix with the 3 x 3 filter matrix produces an output called the "feature map", as shown below.
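Since the original figures are not reproduced here, the following sketch computes such a feature map directly with NumPy, using an arbitrary 5 x 5 binary image and 3 x 3 filter:

import numpy as np

# An arbitrary 5x5 binary image and 3x3 filter (values are illustrative).
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

# Slide the filter over every valid 3x3 window; each output value is
# the sum of element-wise products (a "valid" convolution, no padding).
feature_map = np.zeros((3, 3), dtype=int)
for i in range(3):
    for j in range(3):
        feature_map[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(feature_map)  # the resulting 3x3 feature map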
2. Padding
Sometimes the filter does not fit the input image perfectly. We have two options:
• Pad the picture with zeros (zero-padding) so that it fits.
• Drop the part of the image where the filter does not fit. This is called valid padding, which keeps only the valid part of the image.
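In Keras these two options correspond to padding="same" and padding="valid"; a small sketch of the resulting output shapes (the 5 x 5 input is arbitrary):

import numpy as np
from tensorflow.keras import layers

x = np.random.rand(1, 5, 5, 1).astype("float32")  # one 5x5 single-channel image

same = layers.Conv2D(1, (3, 3), padding="same")(x)    # zero-padding
valid = layers.Conv2D(1, (3, 3), padding="valid")(x)  # no padding

print(same.shape)   # (1, 5, 5, 1): spatial size preserved
print(valid.shape)  # (1, 3, 3, 1): borders dropped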

Non-linearity (ReLU)


ReLU stands for Rectified Linear Unit, a non-linear operation. Its output is f(x) = max(0, x).
Why ReLU is important: ReLU's purpose is to introduce non-linearity into our ConvNet, since most of the real-world data we would want our ConvNet to learn is non-linear, while convolution itself is a linear operation.

The ReLU operation can be understood clearly from Figure 4 below. It shows the ReLU operation applied to one of the feature maps obtained above. The output feature map here is also referred to as the "rectified" feature map.

Figure 4: ReLU operation
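The operation itself is one line; this sketch applies it element-wise to a small, arbitrary feature map:

import numpy as np

# ReLU: f(x) = max(0, x), applied element-wise.
feature_map = np.array([[-3, 2, -1],
                        [ 5, -4, 6],
                        [-2, 1, -7]])
rectified = np.maximum(0, feature_map)
print(rectified)  # all negative values replaced with 0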


3. Pooling layers
Pooling layers reduce the number of parameters when the images are too large. Spatial pooling (also called subsampling or downsampling) reduces the dimensionality of each map while retaining important information. Spatial pooling can be of different types:
• Max pooling
• Average pooling
• Sum pooling
Max pooling takes the largest element from each window of the rectified feature map; average pooling instead takes the average of each window, and sum pooling takes the sum of all elements in each window.

Just like in the convolution step, the creation of the pooled feature map also makes us dispose of unnecessary information or features. In this case, we have lost roughly 75% of the original information found in the feature map, since for every 4 pixels in the feature map we kept only the maximum value and got rid of the other 3. These details are unnecessary, and without them the network can do its job more efficiently.

The reason we extract the maximum value, which is actually the point of the whole pooling step, is to account for distortions. Say we have three cheetah images, and in each image the cheetah's tear lines are at a different angle; max pooling lets the network recognize the feature regardless of such variations.
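A sketch of 2 x 2 max pooling with stride 2 on an arbitrary 4 x 4 feature map, keeping one value out of every four:

import numpy as np

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 1],
                        [3, 4, 2, 8]])

# 2x2 max pooling, stride 2: keep the maximum of each 2x2 window.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6 4]
               #  [7 9]]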
4. Flattening

After finishing the previous two steps, we're supposed to have a pooled feature map by now. As the name of this step implies, we literally flatten our pooled feature map into a single column.

The reason we do this is that we're going to need to insert this data into an artificial neural network later on.

Note that we typically have multiple pooled feature maps from the previous step. What happens after the flattening step is that you end up with a long vector of input data that you then pass through the artificial neural network to have it processed further.
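A sketch of the flattening step, stacking several pooled maps into one vector:

import numpy as np

# Three hypothetical 2x2 pooled feature maps from the previous step.
pooled_maps = np.array([[[6, 4], [7, 9]],
                        [[1, 2], [3, 4]],
                        [[5, 0], [2, 8]]])

flattened = pooled_maps.flatten()  # one long vector for the FC layers
print(flattened.shape)  # (12,)
print(flattened)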
5. Fully Connected Layer
In the layer we call the FC layer, we flatten our matrix into a vector and feed it into a fully connected layer, like a regular neural network.

Figure 9: After the pooling layer, the flattened output is fed into the FC layer.

The Full Connection Process


As we said, the input layer contains the vector of data that was created in the flattening step. The features that we distilled throughout the previous steps are encoded in this vector.

At this point, they are already sufficient for a fair degree of accuracy in recognizing
classes. We now want to take it to the next level in terms of complexity and precision.

What is the aim of this step?


The role of the artificial neural network is to take this data and combine the features into a wider variety of attributes that make the convolutional network more capable of classifying images, which is the whole purpose of creating a convolutional neural network.
The Convolution Process: A Quick Recap

Since we're now done with this section, let's make a quick recap of what we
learned about convolutional neural networks. In the diagram below, you can see
the entire process of creating and optimizing a convolutional neural network that
we covered throughout the section.

As you see and should probably remember, the process goes as follows:

• We start off with an input image.

• We apply filters (feature detectors) to the image, which gives us a convolutional layer of feature maps.

• We then break up the linearity of that output using the rectifier function.

• The output becomes ready for the pooling step, the purpose of which is to provide our convolutional neural network with the faculty of "spatial invariance", explained in more detail in the pooling step.

• After we're done with pooling, we end up with a pooled feature map.

• We then flatten our pooled feature map before inserting it into an artificial neural network.
Implementing our COVID-19 face mask detector

training script with Keras and TensorFlow

Now that we've reviewed our face mask dataset, let's see how we can use Keras and TensorFlow to train a classifier to automatically detect whether a person is wearing a mask or not.

To accomplish this task, we’ll be fine-tuning the MobileNet V2


architecture, a highly efficient architecture that can be applied to
embedded devices with limited computational capacity (e.g., Raspberry Pi, Google Coral, NVIDIA Jetson Nano, etc.).
Note: If your interest is embedded computer vision, be sure to check out
my Raspberry Pi for Computer Vision book which covers working with
computationally limited devices for computer vision and deep learning.
Deploying our face mask detector to embedded devices could reduce the cost of manufacturing such face mask detection systems, which is why we chose this architecture.

The imports for our training script may look numerous and intimidating, but our tensorflow.keras imports allow for the following (a sketch of a typical import block follows the list):


• Data augmentation
• Loading the MobileNetV2 classifier (we will fine-tune this model with pretrained ImageNet weights)
• Building a new fully-connected (FC) head
• Pre-processing
• Loading image data
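These are the standard tensorflow.keras module paths; the exact set in the report's script may differ:

# Hypothetical import block for the training script.
from tensorflow.keras.preprocessing.image import ImageDataGenerator   # data augmentation
from tensorflow.keras.applications import MobileNetV2                 # base classifier
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from tensorflow.keras.layers import AveragePooling2D, Dense, Dropout, Flatten, Input
from tensorflow.keras.models import Model                             # new FC head
from tensorflow.keras.preprocessing.image import img_to_array, load_img  # image loading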

Deep learning hyperparameters:

Here, we've specified hyperparameter constants including our initial learning rate, number of training epochs, and batch size. Later, we will be applying a learning rate decay schedule, which is why we've named the learning rate variable INIT_LR.
At this point, we're ready to load and pre-process our training data. In this block, we are:
• Grabbing all of the imagePaths in the dataset
• Initializing data and labels lists (Lines 36 and 37)
• Looping over the imagePaths and loading + pre-processing images (Lines 39-48). Pre-processing steps include resizing to 224×224 pixels, conversion to array format, and scaling the pixel intensities in the input image to the range [-1, 1] (via the preprocess_input convenience function)
• Lines 51-53 one-hot encode our class labels, meaning that our data will be in the following format:

As you can see, each element of our labels array consists of an array in which only one index is "hot" (i.e., 1).

• Appending the pre-processed image and associated label to the data and labels lists, respectively (Lines 47 and 48)
• Ensuring our training data is in NumPy array format (Lines 55 and 56)

Using scikit-learn's convenience method, Lines 58 and 59 segment our data into 80% for training and the remaining 20% for testing.
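Under the assumption that the dataset lives in a local dataset/ directory with one subdirectory per class, a condensed sketch of this whole block might look like:

import os
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from tensorflow.keras.preprocessing.image import img_to_array, load_img
from tensorflow.keras.utils import to_categorical

data, labels = [], []
for label in ("with_mask", "without_mask"):
    class_dir = os.path.join("dataset", label)           # assumed layout
    for name in os.listdir(class_dir):
        # Resize to 224x224, convert to array, scale pixels to [-1, 1].
        image = load_img(os.path.join(class_dir, name), target_size=(224, 224))
        data.append(preprocess_input(img_to_array(image)))
        labels.append(label)

data = np.array(data, dtype="float32")
lb = LabelBinarizer()                                    # one-hot encode labels
labels = to_categorical(lb.fit_transform(labels))

# 80% training / 20% testing split.
(trainX, testX, trainY, testY) = train_test_split(
    data, labels, test_size=0.20, stratify=labels, random_state=42)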
During training, we'll be applying on-the-fly mutations to our images in an effort to improve generalization. This is known as data augmentation, where the random rotation, zoom, shear, shift, and flip parameters are established on Lines 62-69. We'll use the aug object at training time.
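A sketch of that augmentation object (the parameter values here are illustrative assumptions):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# On-the-fly augmentation: random rotation, zoom, shift, shear, and flip.
aug = ImageDataGenerator(
    rotation_range=20,
    zoom_range=0.15,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.15,
    horizontal_flip=True,
    fill_mode="nearest")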

Fine-tuning setup is a three-step process:


1. Load MobileNetV2 with pre-trained ImageNet weights, leaving off the head of the network (Lines 73 and 74)
2. Construct a new FC head, and append it to the base in place of the old head (Lines 78-87)
3. Freeze the base layers of the network (Lines 91 and 92). The weights of these base layers will not be updated during backpropagation, whereas the head layer weights will be tuned.
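Those three steps might look like the following sketch (the head layer sizes are assumptions):

from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import AveragePooling2D, Dense, Dropout, Flatten, Input
from tensorflow.keras.models import Model

# 1. Load MobileNetV2 with ImageNet weights, leaving off the head.
baseModel = MobileNetV2(weights="imagenet", include_top=False,
                        input_tensor=Input(shape=(224, 224, 3)))

# 2. Construct a new FC head and place it on top of the base model.
headModel = AveragePooling2D(pool_size=(7, 7))(baseModel.output)
headModel = Flatten()(headModel)
headModel = Dense(128, activation="relu")(headModel)
headModel = Dropout(0.5)(headModel)
headModel = Dense(2, activation="softmax")(headModel)
model = Model(inputs=baseModel.input, outputs=headModel)

# 3. Freeze the base layers so only the head is updated during training.
for layer in baseModel.layers:
    layer.trainable = False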
With our data prepared and the model architecture in place for fine-tuning, we're now ready to compile and train our face mask detector network:

Lines 96-98 compile our model with the Adam optimizer, a learning rate decay schedule, and binary cross-entropy. If you're building from this training script with more than 2 classes, be sure to use categorical cross-entropy.
Face mask training is launched via Lines 102-107. Notice how our data augmentation object (aug) provides batches of mutated image data.
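Continuing the names from the sketches above, the compile-and-train step might look like this (the decay argument to Adam reflects older tf.keras versions and is an assumption):

from tensorflow.keras.optimizers import Adam

# Compile with Adam, a simple learning-rate decay, and binary cross-entropy.
opt = Adam(learning_rate=INIT_LR, decay=INIT_LR / EPOCHS)
model.compile(loss="binary_crossentropy", optimizer=opt, metrics=["accuracy"])

# Train, drawing batches of augmented images from the aug object.
H = model.fit(
    aug.flow(trainX, trainY, batch_size=BS),
    steps_per_epoch=len(trainX) // BS,
    validation_data=(testX, testY),
    epochs=EPOCHS)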

Here, Lines 111-115 make predictions on the test set, grabbing the highest probability class
label indices. Then, we print a classification report in the terminal for inspection.
Line 123 serializes our face mask classification model to disk.
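A sketch of the evaluation and serialization steps, again continuing the names defined above (the output filename is the one named earlier in the report):

import numpy as np
from sklearn.metrics import classification_report

# Predict on the test set and take the index of the highest probability.
predIdxs = model.predict(testX, batch_size=BS)
predIdxs = np.argmax(predIdxs, axis=1)

# Print a per-class precision/recall report in the terminal.
print(classification_report(testY.argmax(axis=1), predIdxs,
                            target_names=lb.classes_))

# Serialize the trained face mask classifier to disk.
model.save("mask_detector.model", save_format="h5")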
Screenshots from Execution

Here, both persons wear masks correctly.

When persons don't wear a mask or try to deceive the program.

Even if there is a group of people, the program will distinguish those who wear masks from those who do not.

Here we showed the program a monkey face, whose features are close to a human's, but the program does not consider it a human face.
CONCLUSION

To mitigate the spread of the COVID-19 pandemic, measures must be taken. We have modeled a face mask detector using deep learning methods in neural networks. To train, validate, and test the model, we used a dataset that consisted of 690 images of masked faces and 686 images of unmasked faces. These images were taken from various resources like Kaggle datasets.
The model was inferred on images and live video streams. To select a base model, we evaluated metrics like accuracy, precision, and recall, and selected the MobileNetV2 architecture for its best performance, achieving 100% precision and 99% recall.

Using MobileNetV2 also makes the model computationally efficient, which makes it easier to deploy to embedded systems. This face mask detector can be deployed in many areas, like shopping malls, airports, and other high-traffic places, to monitor the public and help avoid the spread of the disease by checking who is following basic rules and who is not.
