RetinaFaceMask: A Single Stage Face Mask Detector for Assisting Control of the COVID-19 Pandemic
III. METHODOLOGY
A. Network Architecture
The architecture of the proposed RetinaFaceMask is shown in Fig. 2. To cope with the diverse scenes in face mask detection, a strong feature extraction network, ResNet50, is used as the backbone. $C_1$, $C_2$, $C_3$, $C_4$ and $C_5$ denote the intermediate output feature maps of the backbone layers conv1, conv2_x, conv3_x, conv4_x and conv5_x in the original ResNet50 [23]. These feature maps are generated by convolutions with distinct receptive fields, allowing objects of varying sizes to be detected. At this point, we have established the general structure of our multi-scale detection model. However, one disadvantage of the shallow layers is that their outputs lack sufficient high-level semantic information, which might result in poor detection performance. To address this, an FPN is adopted, as follows. First, we apply a 3 × 3 convolution on $C_5$ to obtain $P_5$. Then, we upsample $P_5$ with nearest-neighbor interpolation to the same size as $C_4$ and merge the upsampled $P_5$ with the channel-adjusted $C_4$ by element-wise addition to obtain $P_4$. Likewise, we obtain $P_3$ from $P_4$ and $C_3$. In addition, we also propose a lightweight version of RetinaFaceMask (RetinaFaceMask-Light) that uses a MobileNetV1 backbone so that it runs efficiently on embedded devices. $C_3$, $C_4$ and $C_5$ for RetinaFaceMask-Light are taken from the last convolution blocks with the original output sizes 28 × 28, 14 × 14 and 7 × 7 in [24].
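A minimal PyTorch sketch of this FPN is given below. The ResNet50 channel widths and the 1 × 1 lateral convolutions used here to channel-adjust $C_3$ and $C_4$ are assumptions for illustration, not details taken from a released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Sketch of the FPN in Sec. III-A (channel widths are assumed)."""
    def __init__(self, c3_ch=512, c4_ch=1024, c5_ch=2048, out_ch=256):
        super().__init__()
        self.p5_conv = nn.Conv2d(c5_ch, out_ch, kernel_size=3, padding=1)  # 3x3 conv on C5 -> P5
        self.c4_lateral = nn.Conv2d(c4_ch, out_ch, kernel_size=1)          # channel-adjust C4
        self.c3_lateral = nn.Conv2d(c3_ch, out_ch, kernel_size=1)          # channel-adjust C3

    def forward(self, c3, c4, c5):
        p5 = self.p5_conv(c5)
        # nearest-neighbor upsampling followed by element-wise addition
        p4 = self.c4_lateral(c4) + F.interpolate(p5, size=c4.shape[-2:], mode="nearest")
        p3 = self.c3_lateral(c3) + F.interpolate(p4, size=c3.shape[-2:], mode="nearest")
        return p3, p4, p5
```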
B. Context Attention Module

Fig. 3. Illustration of CAM. It has a context enhancement block, a channel attention block, and a spatial attention block.

In comparison to face detection, face mask detection requires both the localization of faces and the discrimination of distinct mask wearing states. To focus on learning more discriminative features for mask wearing states, we propose a CAM, as shown in Fig. 3. First, to enhance context feature extraction, we employ three parallel subbranches consisting of one 3 × 3 convolution, two 3 × 3 convolutions and three 3 × 3 convolutions, respectively. Equivalently, these branches correspond to 3 × 3, 5 × 5 and 7 × 7 receptive fields. Then, inspired by [25], we apply channel and spatial attention to focus on the channel-wise and spatially important features associated with face mask wearing states. The channel attention block on the input $P \in \mathbb{R}^{D \times H \times W}$ can be calculated as

$$\Lambda_c = \sigma\big(F_{MLP}(H_{GAP}(P)) + F_{MLP}(H_{GMP}(P))\big) \in \mathbb{R}^{D}, \quad (1)$$

where $\Lambda_c$ is the channel attention map; the sigmoid function $\sigma$ normalizes the output to (0, 1); $F_{MLP}$ denotes a three-layer multi-layer perceptron; and $H_{GAP}$ and $H_{GMP}$ are global average pooling and global maximum pooling, respectively. Similarly, the attention map $\Lambda_s$ yielded by the spatial attention block is

$$\Lambda_s = \sigma\big(K_{3 \times 3} * (H_{CAP}(P) \oplus H_{CMP}(P))\big) \in \mathbb{R}^{H \times W}, \quad (2)$$

where $*$ denotes a 2D convolution; $K_{3 \times 3}$ is a 3 × 3 kernel; $\oplus$ stands for channel concatenation; and $H_{CAP}$ and $H_{CMP}$ are channel average pooling and channel maximum pooling, respectively.
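A rough PyTorch sketch of CAM, in the style of CBAM [25], is shown below. How the three context branches are fused (summation here), the depth and reduction ratio of the MLP, and applying the attention maps by element-wise multiplication are illustrative assumptions rather than details confirmed by the text.

```python
import torch
import torch.nn as nn

class CAM(nn.Module):
    """Sketch of the Context Attention Module (Sec. III-B); widths and ratios are assumed."""
    def __init__(self, ch=256, reduction=16):
        super().__init__()
        # context enhancement: one, two, and three stacked 3x3 convolutions
        self.branch1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.branch2 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                     nn.Conv2d(ch, ch, 3, padding=1))
        self.branch3 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                     nn.Conv2d(ch, ch, 3, padding=1),
                                     nn.Conv2d(ch, ch, 3, padding=1))
        # shared MLP for the channel attention of Eq. (1)
        self.mlp = nn.Sequential(nn.Linear(ch, ch // reduction), nn.ReLU(inplace=True),
                                 nn.Linear(ch // reduction, ch))
        # 3x3 convolution for the spatial attention of Eq. (2)
        self.spatial_conv = nn.Conv2d(2, 1, 3, padding=1)

    def forward(self, x):
        p = self.branch1(x) + self.branch2(x) + self.branch3(x)   # fuse context branches
        gap = self.mlp(p.mean(dim=(2, 3)))                        # global average pooling
        gmp = self.mlp(p.amax(dim=(2, 3)))                        # global maximum pooling
        lam_c = torch.sigmoid(gap + gmp)[:, :, None, None]        # channel attention map
        p = p * lam_c
        cap = p.mean(dim=1, keepdim=True)                         # channel average pooling
        cmp_ = p.amax(dim=1, keepdim=True)                        # channel maximum pooling
        lam_s = torch.sigmoid(self.spatial_conv(torch.cat([cap, cmp_], dim=1)))
        return p * lam_s                                          # spatially re-weighted features
```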
C. Transfer Learning

The uncontrolled and diverse in-the-wild scenes make feature learning difficult. One possible solution is to collect and annotate more data for training. In RetinaFaceMask, we instead propose to mimic the human learning process by transferring knowledge from face detection to help face mask detection. According to [26], [27], TL aids feature learning as long as the tasks are correlated. Therefore, in our work, we transfer knowledge learned on the large-scale face detection dataset Wider Face, which consists of 32,203 images and 393,703 annotated faces [28], to enhance the feature extraction ability for FMD.
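In practice, such transfer can be implemented by initializing the face mask detector from weights pretrained on Wider Face. The sketch below shows one possible way to do this under the assumption of a hypothetical checkpoint file; it is not the paper's released code.

```python
import torch
import torch.nn as nn

def init_from_face_detector(model: nn.Module, checkpoint_path: str) -> nn.Module:
    """Copy face-detection weights into the mask detector wherever tensor shapes match."""
    pretrained = torch.load(checkpoint_path, map_location="cpu")
    model_state = model.state_dict()
    transferred = {k: v for k, v in pretrained.items()
                   if k in model_state and v.shape == model_state[k].shape}
    model_state.update(transferred)      # newly added heads keep their random initialization
    model.load_state_dict(model_state)
    print(f"Transferred {len(transferred)}/{len(model_state)} tensors from the face detector.")
    return model

# usage (checkpoint name is hypothetical):
# model = init_from_face_detector(model, "retinaface_widerface_pretrained.pth")
```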
D. Training

Our network generates two matrices, the location offsets $\hat{y}_l \in \mathbb{R}^{n_p \times 4}$ and the class probabilities $\hat{y}_c \in \mathbb{R}^{n_p \times n_c}$, where $n_p$ and $n_c$ refer to the number of anchors and the number of bounding box categories, respectively. The following data are provided: the default anchors $y_{da} \in \mathbb{R}^{n_p \times 4}$, the ground truth bounding boxes $y_l \in \mathbb{R}^{n_o \times 4}$ and the true class labels $y_c \in \mathbb{R}^{n_o \times 1}$, where $n_o$ is the number of objects to be detected and varies across images.

To calculate the model's loss, we begin by selecting the top class and calculating the offset for each default anchor by matching the default anchors $y_{da}$, the ground truth bounding boxes $y_l$ and the true class labels $y_c$, obtaining the matched matrices $p_{ml} \in \mathbb{R}^{n_p \times 4}$ and $p_{mc} \in \mathbb{R}^{n_p}$, whose rows denote the coordinate offsets and the labels with the highest probability for each default anchor, respectively. Then, we obtain the positive localization predictions and the positive matched default anchors, $\hat{y}_l^+ \in \mathbb{R}^{p^+ \times 4}$ and $p_{ml}^+ \in \mathbb{R}^{p^+ \times 4}$, by selecting the foreground boxes, where $p^+$ denotes the number of default anchors with a non-zero top classification label. The smooth-$L_1$ loss $L_{loc}(\hat{y}_l^+, p_{ml}^+)$ is used to perform box coordinate regression. Following that, hard negative mining [29] is performed to obtain the sampled negative default anchors $p_{mc}^- \in \mathbb{R}^{p^-}$ and the corresponding predictions $\hat{y}_c^- \in \mathbb{R}^{p^-}$, where $p^-$ is the number of sampled negative anchors. Finally, we calculate the classification confidence loss as $L_{conf}(\hat{y}_c^-, p_{mc}^-) + L_{conf}(\hat{y}_c^+, p_{mc}^+)$.

In summary, the total loss is calculated as

$$L = \frac{1}{n_m}\Big(L_{conf}(\hat{y}_c^-, p_{mc}^-) + L_{conf}(\hat{y}_c^+, p_{mc}^+) + \alpha L_{loc}(\hat{y}_l^+, p_{ml}^+)\Big), \quad (3)$$

where $n_m$ is the number of matched default anchors and $\alpha$ is a weight for the localization loss.
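A compact sketch of the loss in Eq. (3) is given below, assuming the anchors have already been matched to targets. The 3:1 negative-to-positive sampling ratio is an assumption borrowed from SSD-style hard negative mining [15], [29], not a value stated in the text.

```python
import torch
import torch.nn.functional as F

def multibox_loss(loc_pred, cls_pred, loc_target, cls_target, alpha=1.0, neg_pos_ratio=3):
    """Sketch of Eq. (3).

    loc_pred:   (np, 4) predicted offsets     loc_target: (np, 4) matched offsets
    cls_pred:   (np, nc) class logits         cls_target: (np,)   matched labels, 0 = background
    """
    pos = cls_target > 0                                   # anchors matched to a face
    n_pos = int(pos.sum())

    # smooth-L1 regression on the positive anchors only
    loss_loc = F.smooth_l1_loss(loc_pred[pos], loc_target[pos], reduction="sum")

    # per-anchor classification loss, reused for hard negative mining
    loss_cls_all = F.cross_entropy(cls_pred, cls_target, reduction="none")

    # hard negative mining [29]: keep only the highest-loss background anchors
    neg_losses = loss_cls_all.clone()
    neg_losses[pos] = 0.0
    n_neg = min(neg_pos_ratio * n_pos, int((~pos).sum()))
    hard_neg = torch.topk(neg_losses, n_neg).indices

    loss_conf = loss_cls_all[pos].sum() + loss_cls_all[hard_neg].sum()
    return (loss_conf + alpha * loss_loc) / max(n_pos, 1)
```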
E. Inference

In the inference stage, the trained model generates the objects' localizations $\hat{y}_l \in \mathbb{R}^{n_p \times 4}$ and confidences $\hat{y}_c \in \mathbb{R}^{n_p \times 4}$, where the second column of $\hat{y}_c$, denoted $\hat{y}_n \in \mathbb{R}^{n_p}$, is the probability of the no mask wearing state; the third column, denoted $\hat{y}_{cm} \in \mathbb{R}^{n_p}$, is the confidence of the correct mask wearing state; and the fourth column, denoted $\hat{y}_{im} \in \mathbb{R}^{n_p}$, is the confidence of the incorrect mask wearing state. We remove objects with confidences lower than $t_c$ and perform NMS, suppressing boxes whose IoU exceeds $t_{nms}$, to obtain the final predictions.
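This post-processing can be sketched with torchvision's NMS as follows; the threshold values are placeholders, since the settings of $t_c$ and $t_{nms}$ are not given in this section.

```python
from torchvision.ops import nms

def postprocess(boxes, scores, t_c=0.5, t_nms=0.4):
    """Inference sketch (Sec. III-E): confidence thresholding followed by per-class NMS.

    boxes:  (np, 4) decoded boxes as (x1, y1, x2, y2)
    scores: (np, nc) class probabilities; column 0 is assumed to be background.
    """
    results = []
    for cls in range(1, scores.shape[1]):                      # skip the background column
        conf = scores[:, cls]
        keep = conf >= t_c                                      # drop low-confidence detections
        cls_boxes, cls_conf = boxes[keep], conf[keep]
        kept = nms(cls_boxes, cls_conf, iou_threshold=t_nms)    # suppress overlapping boxes
        for i in kept:
            results.append((cls, cls_conf[i].item(), cls_boxes[i].tolist()))
    return results
```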
IV. EXPERIMENT AND DISCUSSION

A. Dataset

1) AIZOO: The AIZOO Face Mask Dataset [30] has 7,959 images, in which the faces are annotated either with a mask or without a mask. The dataset is a composite of the Wider Face [28] and MAFA [21] datasets, with approximately 50% of the data from each. The predefined test set is used.

2) MAFA-FMD: As described in Section II, MAFA-FMD is a reannotated dataset with three classes: "no mask wearing", "correct mask wearing" and "incorrect mask wearing". The original test set split of MAFA is kept.

B. Experiment Setup

The model was developed in the PyTorch [31] deep learning framework and trained for 250 epochs with the stochastic gradient descent (SGD) algorithm, using a learning rate of $10^{-3}$ and a momentum of 0.9. An NVIDIA GeForce RTX 2080 Ti GPU was employed. The input image resolution is 840 × 840 for RetinaFaceMask and 640 × 640 for RetinaFaceMask-Light.
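For reference, the reported settings correspond to the following PyTorch setup; the data pipeline and any learning-rate schedule are not described in the text and are omitted here.

```python
import torch
import torch.nn as nn

def build_optimizer(model: nn.Module) -> torch.optim.Optimizer:
    # SGD with the hyperparameters reported in Sec. IV-B
    return torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

NUM_EPOCHS = 250
INPUT_SIZE = 840   # 840x840 for RetinaFaceMask; 640x640 for RetinaFaceMask-Light
```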
C. Ablation Study

We performed an ablation study to evaluate the effectiveness of CAM and TL using RetinaFaceMask on the AIZOO dataset. We used the average precision (AP) of each class and the mean average precision (mAP) as evaluation metrics [32]. AP_N and AP_M are the APs for the no mask wearing and mask wearing states, respectively. The results are summarized in Table II; the best result was obtained by combining CAM and TL. The following paragraphs discuss the effectiveness of each module.

TABLE II
ABLATION STUDY OF RETINAFACEMASK.

CAM   TL    AP_N   AP_M   mAP
 ✗     ✗    92.8   93.1   93.0
 ✓     ✗    94.2   93.6   93.9
 ✗     ✓    94.5   94.3   94.4
 ✓     ✓    95.0   94.6   94.8

1) Context Attention Module: By including CAM in the model, we observed an increase of around 1% in mAP. In particular, the AP for no mask wearing increased from 92.8% to 94.2%, and the AP for mask wearing improved from 93.1% to 93.6%. These findings indicate that CAM helps the model focus on the desired face and mask features, which can alleviate the effect of the class imbalance problem.

2) Transfer Learning: To evaluate the performance of TL with face detection knowledge, we added TL to the model. We noticed a considerable rise in mAP from 93.0% to 94.4% compared to the baseline. A possible reason is that face detection and face mask detection are highly related, so the features learned for the former are beneficial for the latter.
Fig. 4. Qualitative results on the (a) AIZOO and (b) MAFA-FMD datasets. Red boxes denote no mask wearing on both datasets; green boxes denote mask wearing on AIZOO and correct mask wearing on MAFA-FMD; yellow boxes denote incorrect mask wearing on MAFA-FMD.
D. Comparison with Other Methods

1) Comparison on AIZOO: In Table III, we compare our model's performance with that of other widely used detectors for face mask detection. SSD is the baseline approach released by the AIZOO dataset's producer [30]. YOLOv3 has been used in numerous face mask detection investigations [19], [20]. RetinaFace was also included in the comparison as an efficient face detector. We found that RetinaFaceMask outperforms YOLOv3 and RetinaFace by 1.7% and 1.8% in mAP, respectively, and obtains the state-of-the-art result. Additionally, RetinaFaceMask achieved the best APs both with and without masks. Our lite version, RetinaFaceMask-Light, which uses a significantly smaller model, achieved an acceptable result of 92.0% mAP. It should be noted that the number of parameters in RetinaFaceMask-Light is much smaller than in the other models.

TABLE III
COMPARISON WITH OTHER METHODS ON AIZOO IN PERCENTAGE.

Method                  AP_N   AP_M   mAP
SSD [15]                89.6   91.9   90.8
Faster R-CNN [17]       83.3   83.7   83.5
YOLOv3 [33]             92.6   93.7   93.1
RetinaFace [18]         92.8   93.1   93.0
RetinaFaceMask          95.0   94.6   94.8
RetinaFaceMask-Light    93.6   90.4   92.0

Additionally, we show some qualitative results on the AIZOO dataset in Fig. 4(a). As seen in the first and fourth images, the model is robust to confusing mask types. In the second and third images, faces with masks were correctly detected. We found that one of the infant's small faces was missed in the last image. One probable explanation is that the training dataset lacks small faces, and hence the model does not learn a good representation for them.

2) Comparison on MAFA-FMD: We also compared the methods on the MAFA-FMD dataset. Additional evaluation metrics, AP_CM for correct mask wearing and AP_IM for incorrect mask wearing, are included. Since we only annotated masks that can protect people in healthcare settings as valid masks, some masks that do not enclose the face are labeled as no mask wearing. This may increase the difficulty of learning, because such cases are hard to distinguish. In addition, the three-class task is likely to be harder than the two-class task. Despite this, our method still achieved state-of-the-art performance in mAP and in the APs of the individual classes, as shown in Table IV. Compared to the second best method, RetinaFace, we obtained an improvement of around 2% in mAP. However, our lightweight version, RetinaFaceMask-Light, only obtained 59.8% mAP, which may be because light and shallow models struggle to learn enough useful features.

TABLE IV
COMPARISON WITH OTHER METHODS ON MAFA-FMD IN PERCENTAGE.

Method                  AP_N   AP_CM   AP_IM   mAP
SSD [15]                46.5   80.7    17.7    48.3
Faster R-CNN [17]       55.7   86.3    43.9    62.0
YOLOv3 [33]             61.3   88.9    48.1    66.1
RetinaFace [18]         58.7   87.4    53.3    66.5
RetinaFaceMask          59.8   89.6    55.6    68.3
RetinaFaceMask-Light    55.9   88.6    34.9    59.8

Fig. 4(b) illustrates some qualitative findings on the MAFA-FMD dataset. In comparison to the second AIZOO image in Fig. 4(a), the model trained on our reannotated dataset is capable of correctly discriminating between correct and incorrect mask wearing, as demonstrated by the first three images. Additionally, the MAFA-FMD-trained model is capable of capturing small or blurred faces. However, rare failures may occur when the face is occluded by someone or something.
V. CONCLUSIONS
In this paper, we proposed a novel single stage face mask detector, RetinaFaceMask, and made the following contributions. First, we created a new face mask detection dataset, MAFA-FMD, with a more realistic and informative classification of mask wearing states. Second, we proposed a new attention module, CAM, dedicated to learning discriminative features associated with face mask wearing states. Third, we emulated humans' ability to transfer knowledge by transferring it from the face detection task to improve face mask detection. The proposed method achieved state-of-the-art results on the public face mask dataset as well as on our new dataset. In particular, compared with the baseline method on the AIZOO dataset, we improved the mAP by 4%. We therefore believe our method can benefit both the emerging field of face mask detection and public healthcare efforts to combat the spread of COVID-19. Further work may include tackling occlusions and small faces in face mask detection.
ACKNOWLEDGMENT

The authors thank Prof. H. Yan for valuable discussion.

REFERENCES
[1] World Health Organization, "Coronavirus disease 2019 (COVID-19) weekly epidemiological update - 29 December 2020," 2020.
[2] P. Tabarisaadi, A. Khosravi, and S. Nahavandi, "A deep Bayesian ensembling framework for COVID-19 detection using chest CT images," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics. IEEE, 2020, pp. 1584–1589.
[3] A. Shamsi, H. Asgharnezhad, S. S. Jokandan, A. Khosravi, P. M. Kebria, D. Nahavandi, S. Nahavandi, and D. Srinivasan, "An uncertainty-aware transfer learning-based framework for COVID-19 diagnosis," IEEE Transactions on Neural Networks and Learning Systems, 2021.
[4] A. Kunjir, D. Joshi, R. Chadha, T. Wadiwala, and V. Trikha, "A comparative study of predictive machine learning algorithms for COVID-19 trends and analysis," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics. IEEE, 2020, pp. 3407–3412.
[5] A. M. Rafi, S. Rana, R. Kaur, Q. J. Wu, and P. M. Zadeh, "Understanding global reaction to the recent outbreaks of COVID-19: Insights from Instagram data analysis," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics. IEEE, 2020, pp. 3413–3420.
[6] Y. Cheng, N. Ma, C. Witt, S. Rapp, P. S. Wild, M. O. Andreae, U. Pöschl, and H. Su, "Face masks effectively limit the probability of SARS-CoV-2 transmission," Science, 2021.
[7] S. Feng, C. Shen, N. Xia, W. Song, M. Fan, and B. J. Cowling, "Rational use of face masks in the COVID-19 pandemic," The Lancet Respiratory Medicine, 2020.
[8] Y. Fang, Y. Nie, and M. Penny, "Transmission dynamics of the COVID-19 outbreak and effectiveness of government interventions: A data-driven analysis," Journal of Medical Virology, vol. 92, no. 6, pp. 645–659, 2020.
[9] A. Kumar, A. Kaur, and M. Kumar, "Face detection techniques: a review," Artificial Intelligence Review, vol. 52, no. 2, pp. 927–948, 2019.
[10] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3212–3232, 2019.
[11] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1. IEEE, 2001, pp. I–I.
[12] P. Felzenszwalb, D. McAllester, and D. Ramanan, "A discriminatively trained, multiscale, deformable part model," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–8.
[13] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen, "Deep learning for generic object detection: A survey," International Journal of Computer Vision, vol. 128, no. 2, pp. 261–318, 2020.
[14] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[16] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[17] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[18] J. Deng, J. Guo, E. Ververas, I. Kotsia, and S. Zafeiriou, "RetinaFace: Single-shot multi-level face localisation in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 5203–5212.
[19] C. Li, J. Cao, and X. Zhang, "Robust deep learning method to detect face masks," in Proceedings of the International Conference on Artificial Intelligence and Advanced Manufacture, 2020, pp. 74–77.
[20] X. Ren and X. Liu, "Mask wearing detection based on YOLOv3," in Journal of Physics: Conference Series, vol. 1678, no. 1. IOP Publishing, 2020, pp. 1–6.
[21] S. Ge, J. Li, Q. Ye, and Z. Luo, "Detecting masked faces in the wild with LLE-CNNs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2682–2690.
[22] Centers for Disease Control and Prevention, "Types of masks," https://www.cdc.gov/coronavirus/2019-ncov/prevent-getting-sick/types-of-masks.html, 2021.
[23] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[24] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[25] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," 2018.
[26] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese, "Taskonomy: Disentangling task transfer learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3712–3722.
[27] X. Fan, R. Qureshi, A. R. Shahid, J. Cao, L. Yang, and H. Yan, "Hybrid separable convolutional inception residual network for human facial expression recognition," in 2020 International Conference on Machine Learning and Cybernetics. IEEE, 2020, pp. 21–26.
[28] S. Yang, P. Luo, C.-C. Loy, and X. Tang, "Wider Face: A face detection benchmark," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5525–5533.
[29] A. Shrivastava, A. Gupta, and R. Girshick, "Training region-based object detectors with online hard example mining," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 761–769.
[30] D. Chiang, "Detect faces and determine whether people are wearing mask," https://github.com/AIZOOTech/FaceMaskDetection, 2020.
[31] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
[32] R. Padilla, S. L. Netto, and E. A. da Silva, "A survey on performance metrics for object-detection algorithms," in Proceedings of the International Conference on Systems, Signals and Image Processing. IEEE, 2020, pp. 237–242.
[33] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.