RetinaFaceMask: A Single Stage Face Mask Detector for Assisting Control of the COVID-19 Pandemic
III. METHODOLOGY
A. Network Architecture
The architecture of the proposed RetinaFaceMask is shown in Fig. 2. To cope with the diverse scenes in face mask detection, a strong feature extraction network, ResNet50, is used as the backbone. $C_1$, $C_2$, $C_3$, $C_4$ and $C_5$ denote the intermediate output feature maps of the backbone layers conv1, conv2_x, conv3_x, conv4_x and conv5_x in the original ResNet50 [23]. These feature maps are generated by convolutions with distinct receptive fields, allowing objects of varying sizes to be detected. At this point, we have established the general structure of our multi-scale detection model. However, one disadvantage of the shallow layers is that their outputs lack sufficient high-level semantic information, which might result in poor detection performance. To address this, an FPN is adopted, as follows. First, we apply a 3 × 3 convolution on $C_5$ to obtain $P_5$. Then, we upsample $P_5$ with nearest-neighbor interpolation to the same size as $C_4$ and merge the upsampled $P_5$ with the channel-adjusted $C_4$ by element-wise addition to obtain $P_4$. Likewise, we obtain $P_3$ from $P_4$ and $C_3$. In addition, we also propose a lightweight version of RetinaFaceMask (RetinaFaceMask-Light) that uses a MobileNetV1 backbone so that it runs efficiently on embedded devices. $C_3$, $C_4$ and $C_5$ for RetinaFaceMask-Light are taken from the last convolution blocks with the original output sizes 28 × 28, 14 × 14 and 7 × 7 in [24].
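A minimal PyTorch sketch of this FPN is given below. The ResNet50 channel widths and the 1 × 1 lateral convolutions used here to channel-adjust $C_3$ and $C_4$ are assumptions for illustration, not details taken from a released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Sketch of the FPN in Sec. III-A (channel widths are assumed)."""
    def __init__(self, c3_ch=512, c4_ch=1024, c5_ch=2048, out_ch=256):
        super().__init__()
        self.p5_conv = nn.Conv2d(c5_ch, out_ch, kernel_size=3, padding=1)  # 3x3 conv on C5 -> P5
        self.c4_lateral = nn.Conv2d(c4_ch, out_ch, kernel_size=1)          # channel-adjust C4
        self.c3_lateral = nn.Conv2d(c3_ch, out_ch, kernel_size=1)          # channel-adjust C3

    def forward(self, c3, c4, c5):
        p5 = self.p5_conv(c5)
        # nearest-neighbor upsampling followed by element-wise addition
        p4 = self.c4_lateral(c4) + F.interpolate(p5, size=c4.shape[-2:], mode="nearest")
        p3 = self.c3_lateral(c3) + F.interpolate(p4, size=c3.shape[-2:], mode="nearest")
        return p3, p4, p5
```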
B. Context Attention Module

Fig. 3. Illustration of CAM. It has a context enhancement block, a channel attention block, and a spatial attention block.

In comparison to face detection, face mask detection requires both the localization of faces and the discrimination of distinct mask wearing states. To focus on learning more discriminative features for mask wearing states, we propose a CAM, as shown in Fig. 3. First, to enhance context feature extraction, we employ three parallel subbranches consisting of one 3 × 3 convolution, two 3 × 3 convolutions and three 3 × 3 convolutions, respectively. Equivalently, these branches correspond to 3 × 3, 5 × 5 and 7 × 7 receptive fields. Then, inspired by [25], we apply channel and spatial attention to focus on the channel-wise and spatially important features associated with face mask wearing states. The channel attention block on the input $P \in \mathbb{R}^{D \times H \times W}$ can be calculated as

$$\Lambda_c = \sigma\big(F_{MLP}(H_{GAP}(P)) + F_{MLP}(H_{GMP}(P))\big) \in \mathbb{R}^{D}, \quad (1)$$

where $\Lambda_c$ is the channel attention map; the sigmoid function $\sigma$ normalizes the output to (0, 1); $F_{MLP}$ denotes a three-layer multi-layer perceptron; and $H_{GAP}$ and $H_{GMP}$ are global average pooling and global maximum pooling, respectively. Similarly, the attention map $\Lambda_s$ yielded by the spatial attention block is

$$\Lambda_s = \sigma\big(K_{3 \times 3} * (H_{CAP}(P) \oplus H_{CMP}(P))\big) \in \mathbb{R}^{H \times W}, \quad (2)$$

where $*$ denotes a 2D convolution; $K_{3 \times 3}$ is a 3 × 3 kernel; $\oplus$ stands for channel concatenation; and $H_{CAP}$ and $H_{CMP}$ are channel average pooling and channel maximum pooling, respectively.
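A rough PyTorch sketch of CAM, in the style of CBAM [25], is shown below. How the three context branches are fused (summation here), the depth and reduction ratio of the MLP, and applying the attention maps by element-wise multiplication are illustrative assumptions rather than details confirmed by the text.

```python
import torch
import torch.nn as nn

class CAM(nn.Module):
    """Sketch of the Context Attention Module (Sec. III-B); widths and ratios are assumed."""
    def __init__(self, ch=256, reduction=16):
        super().__init__()
        # context enhancement: one, two, and three stacked 3x3 convolutions
        self.branch1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.branch2 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                     nn.Conv2d(ch, ch, 3, padding=1))
        self.branch3 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                     nn.Conv2d(ch, ch, 3, padding=1),
                                     nn.Conv2d(ch, ch, 3, padding=1))
        # shared MLP for the channel attention of Eq. (1)
        self.mlp = nn.Sequential(nn.Linear(ch, ch // reduction), nn.ReLU(inplace=True),
                                 nn.Linear(ch // reduction, ch))
        # 3x3 convolution for the spatial attention of Eq. (2)
        self.spatial_conv = nn.Conv2d(2, 1, 3, padding=1)

    def forward(self, x):
        p = self.branch1(x) + self.branch2(x) + self.branch3(x)   # fuse context branches
        gap = self.mlp(p.mean(dim=(2, 3)))                        # global average pooling
        gmp = self.mlp(p.amax(dim=(2, 3)))                        # global maximum pooling
        lam_c = torch.sigmoid(gap + gmp)[:, :, None, None]        # channel attention map
        p = p * lam_c
        cap = p.mean(dim=1, keepdim=True)                         # channel average pooling
        cmp_ = p.amax(dim=1, keepdim=True)                        # channel maximum pooling
        lam_s = torch.sigmoid(self.spatial_conv(torch.cat([cap, cmp_], dim=1)))
        return p * lam_s                                          # spatially re-weighted features
```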
C. Transfer Learning

The uncontrolled and diverse in-the-wild scenes make feature learning difficult. One possible solution is to collect and annotate more data for training. In RetinaFaceMask, we instead propose to mimic the human learning process by transferring knowledge from face detection to help face mask detection. According to [26], [27], TL aids feature learning as long as the tasks are correlated. Therefore, in our work, we transfer knowledge learned on the large-scale face detection dataset Wider Face, which consists of 32,203 images and 393,703 annotated faces [28], to enhance the feature extraction ability for FMD.
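In practice, such transfer can be implemented by initializing the face mask detector from weights pretrained on Wider Face. The sketch below shows one possible way to do this under the assumption of a hypothetical checkpoint file; it is not the paper's released code.

```python
import torch
import torch.nn as nn

def init_from_face_detector(model: nn.Module, checkpoint_path: str) -> nn.Module:
    """Copy face-detection weights into the mask detector wherever tensor shapes match."""
    pretrained = torch.load(checkpoint_path, map_location="cpu")
    model_state = model.state_dict()
    transferred = {k: v for k, v in pretrained.items()
                   if k in model_state and v.shape == model_state[k].shape}
    model_state.update(transferred)      # newly added heads keep their random initialization
    model.load_state_dict(model_state)
    print(f"Transferred {len(transferred)}/{len(model_state)} tensors from the face detector.")
    return model

# usage (checkpoint name is hypothetical):
# model = init_from_face_detector(model, "retinaface_widerface_pretrained.pth")
```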
D. Training

Our network generates two matrices, the location offsets $\hat{y}_l \in \mathbb{R}^{n_p \times 4}$ and the class probabilities $\hat{y}_c \in \mathbb{R}^{n_p \times n_c}$, where $n_p$ and $n_c$ refer to the number of anchors and the number of bounding box categories, respectively. The following data are provided: the default anchors $y_{da} \in \mathbb{R}^{n_p \times 4}$, the ground truth bounding boxes $y_l \in \mathbb{R}^{n_o \times 4}$ and the true class labels $y_c \in \mathbb{R}^{n_o \times 1}$, where $n_o$ is the number of objects to be detected and varies across images.

To calculate the model's loss, we begin by selecting the top class and calculating the offset for each default anchor by matching the default anchors $y_{da}$, the ground truth bounding boxes $y_l$ and the true class labels $y_c$, obtaining the matched matrices $p_{ml} \in \mathbb{R}^{n_p \times 4}$ and $p_{mc} \in \mathbb{R}^{n_p}$, whose rows denote the coordinate offsets and the labels with the highest probability for each default anchor, respectively. Then, we obtain the positive localization predictions and the positive matched default anchors, $\hat{y}_l^+ \in \mathbb{R}^{p^+ \times 4}$ and $p_{ml}^+ \in \mathbb{R}^{p^+ \times 4}$, by selecting the foreground boxes, where $p^+$ denotes the number of default anchors with a non-zero top classification label. The smooth-$L_1$ loss $L_{loc}(\hat{y}_l^+, p_{ml}^+)$ is used to perform box coordinate regression. Following that, hard negative mining [29] is performed to obtain the sampled negative default anchors $p_{mc}^- \in \mathbb{R}^{p^-}$ and the corresponding predictions $\hat{y}_c^- \in \mathbb{R}^{p^-}$, where $p^-$ is the number of sampled negative anchors. Finally, we calculate the classification confidence loss as $L_{conf}(\hat{y}_c^-, p_{mc}^-) + L_{conf}(\hat{y}_c^+, p_{mc}^+)$.

In summary, the total loss is calculated as

$$L = \frac{1}{n_m}\Big(L_{conf}(\hat{y}_c^-, p_{mc}^-) + L_{conf}(\hat{y}_c^+, p_{mc}^+) + \alpha L_{loc}(\hat{y}_l^+, p_{ml}^+)\Big), \quad (3)$$

where $n_m$ is the number of matched default anchors and $\alpha$ is a weight for the localization loss.
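A compact sketch of the loss in Eq. (3) is given below, assuming the anchors have already been matched to targets. The 3:1 negative-to-positive sampling ratio is an assumption borrowed from SSD-style hard negative mining [15], [29], not a value stated in the text.

```python
import torch
import torch.nn.functional as F

def multibox_loss(loc_pred, cls_pred, loc_target, cls_target, alpha=1.0, neg_pos_ratio=3):
    """Sketch of Eq. (3).

    loc_pred:   (np, 4) predicted offsets     loc_target: (np, 4) matched offsets
    cls_pred:   (np, nc) class logits         cls_target: (np,)   matched labels, 0 = background
    """
    pos = cls_target > 0                                   # anchors matched to a face
    n_pos = int(pos.sum())

    # smooth-L1 regression on the positive anchors only
    loss_loc = F.smooth_l1_loss(loc_pred[pos], loc_target[pos], reduction="sum")

    # per-anchor classification loss, reused for hard negative mining
    loss_cls_all = F.cross_entropy(cls_pred, cls_target, reduction="none")

    # hard negative mining [29]: keep only the highest-loss background anchors
    neg_losses = loss_cls_all.clone()
    neg_losses[pos] = 0.0
    n_neg = min(neg_pos_ratio * n_pos, int((~pos).sum()))
    hard_neg = torch.topk(neg_losses, n_neg).indices

    loss_conf = loss_cls_all[pos].sum() + loss_cls_all[hard_neg].sum()
    return (loss_conf + alpha * loss_loc) / max(n_pos, 1)
```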
E. Inference

In the inference stage, the trained model generates the objects' localizations $\hat{y}_l \in \mathbb{R}^{n_p \times 4}$ and confidences $\hat{y}_c \in \mathbb{R}^{n_p \times 4}$, where the second column of $\hat{y}_c$, denoted $\hat{y}_n \in \mathbb{R}^{n_p}$, is the probability of the no mask wearing state; the third column, denoted $\hat{y}_{cm} \in \mathbb{R}^{n_p}$, is the confidence of the correct mask wearing state; and the fourth column, denoted $\hat{y}_{im} \in \mathbb{R}^{n_p}$, is the confidence of the incorrect mask wearing state. We remove objects with confidences lower than $t_c$ and perform NMS, suppressing boxes whose IoU exceeds $t_{nms}$, to obtain the final predictions.
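This post-processing can be sketched with torchvision's NMS as follows; the threshold values are placeholders, since the settings of $t_c$ and $t_{nms}$ are not given in this section.

```python
from torchvision.ops import nms

def postprocess(boxes, scores, t_c=0.5, t_nms=0.4):
    """Inference sketch (Sec. III-E): confidence thresholding followed by per-class NMS.

    boxes:  (np, 4) decoded boxes as (x1, y1, x2, y2)
    scores: (np, nc) class probabilities; column 0 is assumed to be background.
    """
    results = []
    for cls in range(1, scores.shape[1]):                      # skip the background column
        conf = scores[:, cls]
        keep = conf >= t_c                                      # drop low-confidence detections
        cls_boxes, cls_conf = boxes[keep], conf[keep]
        kept = nms(cls_boxes, cls_conf, iou_threshold=t_nms)    # suppress overlapping boxes
        for i in kept:
            results.append((cls, cls_conf[i].item(), cls_boxes[i].tolist()))
    return results
```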
IV. EXPERIMENT AND DISCUSSION

A. Dataset

1) AIZOO: The AIZOO Face Mask Dataset [30] has 7,959 images, in which the faces are annotated either with a mask or without a mask. The dataset is a composite of the Wider Face [28] and MAFA [21] datasets, with approximately 50% of the data from each. The predefined test set is used.

2) MAFA-FMD: As described in Section II, MAFA-FMD is a reannotated dataset with three classes: "no mask wearing", "correct mask wearing" and "incorrect mask wearing". The original test set split of MAFA is kept.

B. Experiment Setup

The model was developed in the PyTorch [31] deep learning framework and trained for 250 epochs with the stochastic gradient descent (SGD) algorithm, using a learning rate of $10^{-3}$ and a momentum of 0.9. An NVIDIA GeForce RTX 2080 Ti GPU was employed. The input image resolution is 840 × 840 for RetinaFaceMask and 640 × 640 for RetinaFaceMask-Light.
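For reference, the reported settings correspond to the following PyTorch setup; the data pipeline and any learning-rate schedule are not described in the text and are omitted here.

```python
import torch
import torch.nn as nn

def build_optimizer(model: nn.Module) -> torch.optim.Optimizer:
    # SGD with the hyperparameters reported in Sec. IV-B
    return torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

NUM_EPOCHS = 250
INPUT_SIZE = 840   # 840x840 for RetinaFaceMask; 640x640 for RetinaFaceMask-Light
```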
C. Ablation Study

We performed an ablation study to evaluate the effectiveness of CAM and TL using RetinaFaceMask on the AIZOO dataset. We used the average precision (AP) of each class and the mean average precision (mAP) as evaluation metrics [32]. AP_N and AP_M are the APs for the no mask wearing and mask wearing states, respectively. The results are summarized in Table II; the best result was obtained by combining CAM and TL. The following paragraphs discuss the effectiveness of each module.

TABLE II
ABLATION STUDY OF RETINAFACEMASK.

CAM   TL    AP_N   AP_M   mAP
 ✗     ✗    92.8   93.1   93.0
 ✓     ✗    94.2   93.6   93.9
 ✗     ✓    94.5   94.3   94.4
 ✓     ✓    95.0   94.6   94.8

1) Context Attention Module: By including CAM in the model, we observed an increase of around 1% in mAP. In particular, the AP for no mask wearing increased from 92.8% to 94.2%, and the AP for mask wearing improved from 93.1% to 93.6%. These findings indicate that CAM helps the model focus on the desired face and mask features, which can alleviate the effect of the class imbalance problem.

2) Transfer Learning: To evaluate the performance of TL with face detection knowledge, we added TL to the model. We noticed a considerable rise in mAP from 93.0% to 94.4% compared to the baseline. A possible reason is that face detection and face mask detection are highly related, so the features learned for the former are beneficial for the latter.
Fig. 4. Qualitative results on the (a) AIZOO and (b) MAFA-FMD datasets. Red boxes denote no mask wearing on both datasets; green boxes denote mask wearing on AIZOO and correct mask wearing on MAFA-FMD; yellow boxes denote incorrect mask wearing on MAFA-FMD.
D. Comparison with Other Methods

1) Comparison on AIZOO: In Table III, we compare our model's performance with that of other widely used detectors for face mask detection. SSD is the baseline approach released by the AIZOO dataset's producer [30]. YOLOv3 has been used in numerous face mask detection investigations [19], [20]. RetinaFace was also included in the comparison as an efficient face detector. We found that RetinaFaceMask outperforms YOLOv3 and RetinaFace by 1.7% and 1.8% in mAP, respectively, and obtains the state-of-the-art result. Additionally, RetinaFaceMask achieved the best APs both with and without masks. Our lite version, RetinaFaceMask-Light, which uses a significantly smaller model, achieved an acceptable result of 92.0% mAP. It should be noted that the number of parameters in RetinaFaceMask-Light is much smaller than in the other models.

TABLE III
COMPARISON WITH OTHER METHODS ON AIZOO IN PERCENTAGE.

Method                  AP_N   AP_M   mAP
SSD [15]                89.6   91.9   90.8
Faster R-CNN [17]       83.3   83.7   83.5
YOLOv3 [33]             92.6   93.7   93.1
RetinaFace [18]         92.8   93.1   93.0
RetinaFaceMask          95.0   94.6   94.8
RetinaFaceMask-Light    93.6   90.4   92.0

Additionally, we show some qualitative results on the AIZOO dataset in Fig. 4(a). As seen in the first and fourth images, the model is robust to confusing mask types. In the second and third images, faces with masks were correctly detected. We found that one of the infant's small faces was missed in the last image. One probable explanation is that the training dataset lacks small faces, and hence the model does not learn a good representation for them.

2) Comparison on MAFA-FMD: We also compared the methods on the MAFA-FMD dataset. Additional evaluation metrics, AP_CM for correct mask wearing and AP_IM for incorrect mask wearing, are included. Since we only annotated masks that can protect people in healthcare settings as valid masks, some masks that do not enclose the face are labeled as no mask wearing. This may increase the difficulty of learning, because such cases are hard to distinguish. In addition, the three-class task is likely to be harder than the two-class task. Despite this, our method still achieved state-of-the-art performance in mAP and in the APs of the individual classes, as shown in Table IV. Compared to the second best method, RetinaFace, we obtained an improvement of around 2% in mAP. However, our lightweight version, RetinaFaceMask-Light, only obtained 59.8% mAP, which may be because light and shallow models struggle to learn enough useful features.

TABLE IV
COMPARISON WITH OTHER METHODS ON MAFA-FMD IN PERCENTAGE.

Method                  AP_N   AP_CM   AP_IM   mAP
SSD [15]                46.5   80.7    17.7    48.3
Faster R-CNN [17]       55.7   86.3    43.9    62.0
YOLOv3 [33]             61.3   88.9    48.1    66.1
RetinaFace [18]         58.7   87.4    53.3    66.5
RetinaFaceMask          59.8   89.6    55.6    68.3
RetinaFaceMask-Light    55.9   88.6    34.9    59.8

Fig. 4(b) illustrates some qualitative findings on the MAFA-FMD dataset. In comparison to the second AIZOO image in Fig. 4(a), the model trained on our reannotated dataset is capable of correctly discriminating between correct and incorrect mask wearing, as demonstrated by the first three images. Additionally, the MAFA-FMD-trained model is capable of capturing small or blurred faces. However, rare failures may occur when the face is occluded by someone or something.
V. CONCLUSIONS
In this paper, we proposed a novel single stage face mask detector, RetinaFaceMask, and made the following contributions. First, we created a new face mask detection dataset, MAFA-FMD, with a more realistic and informative classification of mask wearing states. Second, we proposed a new attention module, CAM, dedicated to learning discriminative features associated with face mask wearing states. Third, we emulated humans' ability to transfer knowledge by transferring it from the face detection task to improve face mask detection. The proposed method achieved state-of-the-art results on the public face mask dataset as well as on our new dataset. In particular, compared with the baseline method on the AIZOO dataset, we improved the mAP by 4%. We therefore believe our method can benefit both the emerging field of face mask detection and public healthcare efforts to combat the spread of COVID-19. Further work may include tackling occlusions and small faces in face mask detection.
ACKNOWLEDGMENT

The authors thank Prof. H. Yan for valuable discussion.

REFERENCES
[1] World Health Organization, "Coronavirus disease 2019 (COVID-19) weekly epidemiological update - 29 December 2020," 2020.
[2] P. Tabarisaadi, A. Khosravi, and S. Nahavandi, "A deep Bayesian ensembling framework for COVID-19 detection using chest CT images," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics. IEEE, 2020, pp. 1584–1589.
[3] A. Shamsi, H. Asgharnezhad, S. S. Jokandan, A. Khosravi, P. M. Kebria, D. Nahavandi, S. Nahavandi, and D. Srinivasan, "An uncertainty-aware transfer learning-based framework for COVID-19 diagnosis," IEEE Transactions on Neural Networks and Learning Systems, 2021.
[4] A. Kunjir, D. Joshi, R. Chadha, T. Wadiwala, and V. Trikha, "A comparative study of predictive machine learning algorithms for COVID-19 trends and analysis," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics. IEEE, 2020, pp. 3407–3412.
[5] A. M. Rafi, S. Rana, R. Kaur, Q. J. Wu, and P. M. Zadeh, "Understanding global reaction to the recent outbreaks of COVID-19: Insights from Instagram data analysis," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics. IEEE, 2020, pp. 3413–3420.
[6] Y. Cheng, N. Ma, C. Witt, S. Rapp, P. S. Wild, M. O. Andreae, U. Pöschl, and H. Su, "Face masks effectively limit the probability of SARS-CoV-2 transmission," Science, 2021.
[7] S. Feng, C. Shen, N. Xia, W. Song, M. Fan, and B. J. Cowling, "Rational use of face masks in the COVID-19 pandemic," The Lancet Respiratory Medicine, 2020.
[8] Y. Fang, Y. Nie, and M. Penny, "Transmission dynamics of the COVID-19 outbreak and effectiveness of government interventions: A data-driven analysis," Journal of Medical Virology, vol. 92, no. 6, pp. 645–659, 2020.
[9] A. Kumar, A. Kaur, and M. Kumar, "Face detection techniques: a review," Artificial Intelligence Review, vol. 52, no. 2, pp. 927–948, 2019.
[10] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3212–3232, 2019.
[11] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1. IEEE, 2001, pp. I–I.
[12] P. Felzenszwalb, D. McAllester, and D. Ramanan, "A discriminatively trained, multiscale, deformable part model," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–8.
[13] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen, "Deep learning for generic object detection: A survey," International Journal of Computer Vision, vol. 128, no. 2, pp. 261–318, 2020.
[14] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[16] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[17] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[18] J. Deng, J. Guo, E. Ververas, I. Kotsia, and S. Zafeiriou, "RetinaFace: Single-shot multi-level face localisation in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 5203–5212.
[19] C. Li, J. Cao, and X. Zhang, "Robust deep learning method to detect face masks," in Proceedings of the International Conference on Artificial Intelligence and Advanced Manufacture, 2020, pp. 74–77.
[20] X. Ren and X. Liu, "Mask wearing detection based on YOLOv3," in Journal of Physics: Conference Series, vol. 1678, no. 1. IOP Publishing, 2020, pp. 1–6.
[21] S. Ge, J. Li, Q. Ye, and Z. Luo, "Detecting masked faces in the wild with LLE-CNNs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2682–2690.
[22] Centers for Disease Control and Prevention, "Types of masks," https://www.cdc.gov/coronavirus/2019-ncov/prevent-getting-sick/types-of-masks.html, 2021.
[23] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[24] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[25] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," 2018.
[26] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese, "Taskonomy: Disentangling task transfer learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3712–3722.
[27] X. Fan, R. Qureshi, A. R. Shahid, J. Cao, L. Yang, and H. Yan, "Hybrid separable convolutional inception residual network for human facial expression recognition," in 2020 International Conference on Machine Learning and Cybernetics. IEEE, 2020, pp. 21–26.
[28] S. Yang, P. Luo, C.-C. Loy, and X. Tang, "Wider Face: A face detection benchmark," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5525–5533.
[29] A. Shrivastava, A. Gupta, and R. Girshick, "Training region-based object detectors with online hard example mining," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 761–769.
[30] D. Chiang, "Detect faces and determine whether people are wearing mask," https://github.com/AIZOOTech/FaceMaskDetection, 2020.
[31] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
[32] R. Padilla, S. L. Netto, and E. A. da Silva, "A survey on performance metrics for object-detection algorithms," in Proceedings of the International Conference on Systems, Signals and Image Processing. IEEE, 2020, pp. 237–242.
[33] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.