PR-YOLO: Improved YOLO for Fast Protozoa Classification
Research Article
DOI: https://doi.org/10.21203/rs.3.rs-3199595/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
ABSTRACT Protozoa, such as Ceratium and Paramecium, play a fundamental role in establishing sustainable ecosystems. The distribution and classification of certain protozoa and their species are informative indicators for evaluating environmental quality. However, protozoa analysis is traditionally performed by molecular biological (DNA, RNA) or morphological methods, which are time-consuming and require an experienced laboratory operator. In this work, we adopt a deep learning-based network to solve the protozoa classification task. This method uses microscope images to help researchers analyse protozoa populations and species, reducing the cost of experimental sample storage and relieving the burden on laboratory operators. However, the shape and size of protozoa vary greatly, which places a heavy burden on the optimization of DCNN feature distillation, so building a fast and precise protozoa analysis model is a great challenge. We present an improved version of YOLOv5 with better performance, extended with instance segmentation, called PR-YOLO. Building on the original YOLOv5, we added two extra parallel branches to PR-YOLO that perform different segmentation subtasks: (1) one branch generates a set of prototype masks (images); (2) the other branch predicts a set of mask coefficients, corresponding to the prototype masks, for generating each instance mask. Then, to improve the classification accuracy, we introduced transformer encoder blocks and lightweight Convolutional Block Attention Modules (CBAMs) to exploit the predictive potential of the self-attention mechanism. To quantitatively evaluate the performance of PR-YOLO, a comprehensive experiment was carried out on hand-segmented microscopic protozoa images. Our model obtained the best results, with an average classification accuracy of 96.83% and a mean Average Precision (mAP) of 86.92% at 25.2 fps, demonstrating the method's high robustness in this application field.
I. Introduction
A. Attention Mechanism
Building on the original YOLOv5, to improve the classification accuracy, we introduced transformer encoder blocks and lightweight Convolutional Block Attention Modules (CBAMs) to exploit the predictive potential of the attention mechanism. As shown in Fig. 4, the red blocks indicate where we replaced the original convolution blocks with transformer and CBAM modules. The integration of transformer modules in a network can have a significant impact on its accuracy. Transformers are primarily known for their success in natural language processing tasks but have also shown promising results in computer vision applications. By incorporating transformer modules into a network, we can leverage their ability to capture long-range dependencies and contextual information, leading to improved performance in various tasks. One key advantage of transformer modules is their capability to capture global information and context effectively, while the CBAMs are integrated into PR-YOLO to explore local information in the feature maps.

FIGURE 5. Instance segmentation results of protozoa images

FIGURE 6. Mask Coefficients Generation for proto images

FIGURE 7. Proto images generation
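To make the attention modules concrete, the following is a minimal PyTorch sketch of a CBAM block in the spirit of Woo et al. [15]. It is an illustrative re-implementation, not the exact module used in PR-YOLO: the reduction ratio (16) and the 7x7 spatial kernel are the defaults from the CBAM paper, and the class names are ours.

```python
# Illustrative CBAM block (channel attention followed by spatial attention).
# Hyperparameters are the CBAM paper's defaults, not PR-YOLO's reported values.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP applied to both the average-pooled and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling -> [b, c]
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling -> [b, c]
        scale = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * scale


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)    # channel-wise average -> [b, 1, h, w]
        mx = x.amax(dim=1, keepdim=True)     # channel-wise max -> [b, 1, h, w]
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale


class CBAM(nn.Module):
    """Channel attention first, then spatial attention, as in the original paper."""

    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.sa(self.ca(x))
```

Such a block can be dropped in after a convolution stage and refines the feature map with negligible parameter overhead, which is why it is described as lightweight.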
B. Proto Mask Generation
The proto mask branch (head) generates a set of 32 prototype masks. We implement the protonet as a CNN-Transformer hybrid structure (see Figure 6); its last layer has 32 channels, and each channel generates one prototype mask. Similar to YOLACT, we do not directly include the generation of the prototype masks in the loss function but calculate losses after the prototype masks and mask coefficients are combined to generate the final instance masks. Inspired by the Vision Transformer, we add several Transformer Encoder Blocks to the proto mask branch. We follow two important principles when choosing the input feature map of the prototype branch: take the protonet input from deeper layers, and extract prototype results at as high a resolution as possible. The former produces higher-quality masks, and the latter performs better on smaller objects. Thus, we use the largest of FPN's feature layers (in our case P4, with shape [batch, 80, 80, 256]; see Figure 5), which is also the deepest. To enhance the detection of small objects, we upsample the feature map from [80, 80] to [160, 160]. In addition, we find that the protonet's output value is unbounded, so allowing overpowering activations in the proto head is very important. Thus, we have the option of following the protonet with either a ReLU or no nonlinearity; we choose ReLU, as it generates more interpretable prototype masks. The feature map generated by the FPN is passed through the transformer blocks of the proto branch and the multilevel upsampling modules, finally producing 32 proto mask (image) binary images, as shown in Figure 7.

We added the transformer blocks in front of YOLACT's original multilevel upsampling structure to improve the branch's sensitivity to feature-relevant areas. Our two additional mask branches and YOLOv5's object detection branch jointly reuse the same transformer blocks to reduce the impact on the processing speed of the YOLO model.
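A minimal sketch of such a CNN-Transformer proto head is given below, assuming PyTorch. The class name ProtoHead, the number of encoder layers, and the attention head count are illustrative assumptions rather than the paper's reported configuration; only the input shape (P4, [batch, 256, 80, 80]), the upsampling from 80x80 to 160x160, the 32-channel output, and the ReLU tail follow the description above.

```python
# Illustrative CNN-Transformer proto head: transformer encoder blocks applied to
# the flattened P4 feature map, upsampling 80x80 -> 160x160, and a 32-channel
# ReLU-bounded output (one prototype mask per channel).
import torch
import torch.nn as nn


class ProtoHead(nn.Module):
    def __init__(self, in_channels: int = 256, num_prototypes: int = 32):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=in_channels, nhead=8, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.up = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(in_channels, num_prototypes, 1),
            nn.ReLU(inplace=True),  # ReLU tail keeps prototypes non-negative
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, 256, 80, 80] (FPN P4)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)      # [batch, 80*80, 256]
        tokens = self.encoder(tokens)              # self-attention over positions
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.up(x)                          # [batch, 32, 160, 160]
```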
C. Mask Coefficients Generation
To reduce the calculation cost, we simplify the structure of the coefficient head into several convolution layers and a tanh activation tail. As described at the beginning of Section 3, corresponding to the 32 general proto images, the mask coefficient heads provide a set of 32 weight coefficients for each surviving bounding box. We first filter k surviving bounding boxes from the output of the three detection heads (the output shapes are [batch, w, h, 3 * (1+4+cls)]; see Fig. 3) by confidence score and NMS. These k bounding boxes are considered to contain k objects, and we finally generate k instance masks for the k objects. As shown in Fig. 3, there are three mask coefficient heads (with shapes [batch, w, h, 32]) corresponding to the three object detection heads (with shapes [batch, w, h, 3 * (1+4+cls)]); we eventually collect the k mask coefficients from these three heads (the shape of the coefficients is [batch, k, 32]). The three coefficient heads generate mask coefficients from feature maps at different resolutions (corresponding to large, medium, and small protozoa), enabling our model to adapt to protozoa at different scales.

A mask matrix of size [h, w] is produced to describe each instance. In these mask matrices, we set a binary value of 0 or 1 to distinguish the background and instance areas (0 denotes the background, and 1 denotes the instance area). Finally, we use BCE to evaluate the similarity between the output mask tensor Mo and the target mask tensor Mt.

We realize that it is unreasonable to take all parts of the two tensors Mo and Mt into the loss calculation (especially the background area outside of the bounding box). Excessive attention to the background may hinder loss convergence in gradient descent. That is, for BCE, setting the instance area values to 1 has the same priority as adjusting the background area (outside the bounding box) values to 0, yet the segmentation quality depends only on the area inside the bounding box. Excessive loss contribution from the background area outside the bounding box increases the randomness of the loss regression during gradient descent, which is reflected in two aspects: (1) the extra background area brings redundant parameters into the loss calculation; (2) the loss overly rewards setting the binary value of the background to 0 while neglecting the smaller but more important instance area, whose values should be 1. We therefore improved the original BCE mask loss function.
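As a sketch of the two steps just described (assembling instance masks from prototypes and coefficients, and confining the BCE loss to the area inside each bounding box), consider the PyTorch fragment below. The mask assembly follows the YOLACT-style linear combination that the text builds on; the exact weighting of PR-YOLO's improved loss is not reproduced here, so the box-restricted BCE only illustrates the idea of ignoring the background outside the box, and the function names are ours.

```python
# Illustrative mask assembly and box-restricted BCE loss. The assembly follows
# the YOLACT-style prototype/coefficient combination; the box-restricted BCE is
# a sketch of the idea behind the improved loss, not PR-YOLO's exact formula.
import torch
import torch.nn.functional as F


def assemble_masks(protos: torch.Tensor, coeffs: torch.Tensor) -> torch.Tensor:
    """protos: [32, H, W] prototype masks; coeffs: [k, 32] tanh coefficients.
    Returns k instance masks of shape [k, H, W] with values in [0, 1]."""
    masks = torch.einsum("kc,chw->khw", coeffs, protos)  # linear combination
    return torch.sigmoid(masks)


def box_restricted_bce(pred: torch.Tensor, target: torch.Tensor,
                       boxes: torch.Tensor) -> torch.Tensor:
    """pred, target: [k, H, W] float masks; boxes: [k, 4] pixel (x1, y1, x2, y2).
    BCE is averaged only over pixels inside each instance's bounding box, so the
    background far from the object cannot dominate the gradient. Assumes valid,
    non-empty boxes."""
    losses = []
    for m_pred, m_true, box in zip(pred, target, boxes):
        x1, y1, x2, y2 = [int(v) for v in box]
        crop_pred = m_pred[y1:y2, x1:x2]
        crop_true = m_true[y1:y2, x1:x2]
        losses.append(F.binary_cross_entropy(crop_pred, crop_true))
    return torch.stack(losses).mean()
```

With coefficients of shape [k, 32] from the coefficient heads and 32 prototypes of shape [160, 160] from the proto branch, assemble_masks yields the k instance masks whose loss is then computed only inside their boxes.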
AP50
Classification               YOLO-V5    YOLACT     PR-YOLO
Ceratium                     94.76%     99.29%     98.42%
Colsterium_ehrenberg         87.62%     94.60%     99.99%
Collodictyon                 99.99%     90.08%     99.99%
Didinium                     99.63%     99.89%     92.31%
Dinobryon                    67.87%     92.86%     98.59%
Lepocinclis_spirogyroides    98.53%     99.39%     98.20%
Pinnularia_neomajor          95.52%     99.93%     97.30%
Pleurotaenium_ehrenberg      89.04%     94.05%     83.57%
Pyrocystis_lunula            95.00%     99.69%     96.08%
Micrasterias_rotata          94.05%     99.87%     98.37%
Paramecium_bursaria          99.49%     86.86%     99.99%
Peridinium_spec              94.38%     71.50%     99.21%
Total                        92.99%     93.78%     96.83%

AP75
Classification               YOLO-V5    YOLACT     PR-YOLO
Ceratium                     93.70%     77.37%     98.42%
Colsterium_ehrenberg         83.24%     94.57%     93.85%
Collodictyon                 96.54%     65.55%     99.99%
Didinium                     98.43%     99.99%     92.31%
Dinobryon                    33.58%     41.84%     53.34%
Lepocinclis_spirogyroides    91.23%     99.08%     91.81%
Pinnularia_neomajor          93.42%     99.99%     88.10%
Pleurotaenium_ehrenberg      80.85%     31.23%     55.00%
Pyrocystis_lunula            68.18%     99.42%     80.72%
Micrasterias_rotata          94.05%     88.12%     98.37%
Paramecium_bursaria          97.24%     49.76%     99.99%
Peridinium_spec              87.91%     37.59%     99.21%
Total                        85.74%     73.32%     87.59%
TABLE 2. Segmentation results of YOLACT and PR-YOLO

Classification               YOLACT     PR-YOLO
Ceratium                     0.9814     0.8034
Colsterium_ehrenberg         0.9575     0.9357
Collodictyon                 0.9048     0.8776
Didinium                     0.9979     0.9528
Dinobryon                    0.9147     0.7149
Lepocinclis_spirogyroides    0.9705     0.7177
Pinnularia_neomajor          0.9983     0.9323
Pleurotaenium_ehrenberg      0.8319     0.9212
Pyrocystis_lunula            0.9965     0.8877
Micrasterias_rotata          0.9960     0.9141
Paramecium_bursaria          0.8572     0.9594
Peridinium_spec              0.7402     0.9249
Total                        0.9289     0.8692
REFERENCES
[1] Rui Xu, Miaomiao Zhang, Hanzhi Lin, Pin Gao, Zhaohui Yang, Dongbo Wang, Xiaoxu Sun, Baoqin Li, Qi Wang, and Weimin Sun. "Response of soil protozoa to acid mine drainage in a contaminated terrace", Journal of Hazardous Materials 421 (2022): 126790.
[2] Lydia Teel, Adam Olivieri, Richard Danielson, Blaga Delić, Brian Pecson, James Crook, and Krishna Pagilla. "Protozoa reduction through secondary wastewater treatment in two water reclamation facilities", Science of the Total Environment 807 (2022): 151053.
[3] Jesus Ruiz-Santaquiteria, Gloria Bueno, Oscar Deniz, Noelia Vallez, and Gabriel Cristobal. "Semantic versus instance segmentation in microscopic algae detection", Engineering Applications of Artificial Intelligence 87 (2020).
[4] Zhenni Shang, Xiangnan Wang, Yu Jiang, Zongjun Li, and Jifeng Ning. "Identifying rumen protozoa in microscopic images of ruminant with improved YOLACT instance segmentation", Biosystems Engineering 215 (2022): 156-169.
[5] Joseph Redmon and Ali Farhadi. "YOLOv3: An incremental improvement", arXiv preprint arXiv:1804.02767, 2018. [Online]. Available: http://arxiv.org/abs/1804.02767
[6] Xingkui Zhu, Shuchang Lyu, Xu Wang, and Qi Zhao. "TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios", IEEE/CVF International Conference on Computer Vision Workshops (2021).
[7] Fahad Jubayer, Janibul Alam Soeb, Abu Naser Mojumder, Mitun Kanti Paul, Pranta Barua, Shahidullah Kayshar, Syeda Sabrina Akter, Mizanur Rahman, and Amirul Islam. "Detection of mold on the food surface using YOLOv5", Current Research in Food Science 4 (2021): 724-728.
[8] Jennifer N. Hird, Alessandro Montaghi, Gregory J. McDermid, Jahan Kariyeva, Brian J. Moorman, Scott E. Nielsen, and Anne C. S. McIntosh. "Use of unmanned aerial vehicles for monitoring recovery of forest vegetation on petroleum well sites", Remote Sensing 9.5 (2017): 413.
[9] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. "Scaled-YOLOv4: Scaling Cross Stage Partial Network", Computer Vision and Pattern Recognition (2021): 13029-13038.
[10] Qi Zhao, Binghao Liu, Shuchang Lyu, Chunlei Wang, and Hong Zhang. "TPH-YOLOv5++: Boosting Object Detection on Drone-Captured Scenarios with Cross-Layer Asymmetric Transformer", Remote Sensing 15.6 (2023).
[11] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. "Focal loss for dense object detection", IEEE International Conference on Computer Vision (2017): 2980-2988.
[12] Mingxing Tan, Ruoming Pang, and Quoc V. Le. "EfficientDet: Scalable and efficient object detection", arXiv preprint arXiv:1911.09070, 2019.
[13] Hei Law and Jia Deng. "CornerNet: Detecting objects as paired keypoints", European Conference on Computer Vision (2018): 734-750.
[14] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. "Multiscale Vision Transformers", IEEE International Conference on Computer Vision (2021): 6804-6815.
[15] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. "CBAM: Convolutional Block Attention Module", European Conference on Computer Vision (2018): 3-19.
[16] Han Zhang, Ian J. Goodfellow, Dimitris N. Metaxas, and Augustus Odena. "Self-Attention Generative Adversarial Networks", arXiv preprint arXiv:1805.08318 (2019).
[17] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. "mixup: Beyond empirical risk minimization", arXiv preprint arXiv:1710.09412, 2017.
[18] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. "CutMix: Regularization strategy to train strong classifiers with localizable features", IEEE/CVF International Conference on Computer Vision (2019): 6023-6032.
[19] Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. "Exploring Self-attention for Image Recognition", Computer Vision and Pattern Recognition (2020): 10073-10082.
[20] Ali Hebbal, Loic Brevault, Mathieu Balesdent, El-Ghazali Talbi, and Nouredine Melab. "Multi-Fidelity Modeling With Different Input Domain Definitions Using Deep Gaussian Processes", Structural and Multidisciplinary Optimization 63.5 (2021): 2267-2288.
[21] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. "SSD: Single shot multibox detector", arXiv preprint arXiv:1512.02325, 2015.
[22] Glenn Jocher, Alex Stoken, Jirka Borovec, NanoCode012, Ayush Chaurasia, TaoXie, Liu Changyu, Abhiram V, Laughing, tkianai, yxNONG, Adam Hogan, lorenzomammana, AlexWang1900, Jan Hajek, Laurentiu Diaconu, Marc, Yonghye Kwon, oleg, wanghaoyang0106, Yann Defretin, Aditya Lohia, ml5ah, Ben Milanko, Benjamin Fineran, Daniel Khromov, Ding Yiwei, Doug, Durgesh, and Francisco Ingham. ultralytics/YOLOv5: v5.0 - YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations, Apr. 2021.
[23] Zixuan Xu, Banghuai Li, Ye Yuan, and Miao Geng. "AnchorFace: An Anchor-Based Facial Landmark Detector Across Large Poses", AAAI Conference on Artificial Intelligence 35.4 (2021): 3092-3100.
[24] Ross B. Girshick. "Fast R-CNN", IEEE International Conference on Computer Vision (2015): 1440-1448.
[25] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", IEEE Transactions on Pattern Analysis and Machine Intelligence 39.6 (2017): 1137-1149.
[26] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. "Mask R-CNN", arXiv preprint arXiv:1703.06870, 2017.
[27] Fang Peng, Zheng Miao, Fei Li, and Zhenbo Li. "S-FPN: A shortcut feature pyramid network for sea cucumber detection in underwater images", Expert Systems with Applications 182 (2021): 115306.
[28] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V. Le. "NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection", IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019): 7036-7045.
[29] Abhronil Sengupta, Yuting Ye, Robert Wang, Chiao Liu, and Kaushik Roy. "Going Deeper in Spiking Neural Networks: VGG and Residual Architectures", Frontiers in Neuroscience 13 (2019).
[30] Zifeng Wu, Chunhua Shen, and Anton van den Hengel. "Wider or Deeper: Revisiting the ResNet Model for Visual Recognition", Pattern Recognition 90.1 (2019): 119-133.
[31] Chongke Bi, Jiamin Wang, Yulin Duan, Baofeng Fu, Jia-Rong Kang, and Yun Shi. "MobileNet Based Apple Leaf Diseases Identification", Mobile Networks and Applications (2020): 1-9.
[32] Xuelong Hu, Yang Liu, Zhengxi Zhao, Jintao Liu, Xinting Yang, Chuanheng Sun, Shuhan Chen, Bin Li, and Chao Zhou. "Real-Time Detection of Uneaten Feed Pellets in Underwater Images for Aquaculture Using an Improved YOLO-V4 Network", Computers and Electronics in Agriculture 185 (2021): 106135.
[33] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. "Path Aggregation Network for Instance Segmentation", Computer Vision and Pattern Recognition (2018): 8759-8768.
[34] Petr Hurtik, Stefania Tomasiello, Jan Hula, and David Hynar. "Binary cross-entropy with dynamical clipping", Neural Computing and Applications 34.14 (2022): 12029-12041.
[35] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. "YOLACT++: Better Real-Time Instance Segmentation", IEEE Transactions on Pattern Analysis and Machine Intelligence 44.2 (2022): 9156-9165.
[36] Ruohao Guo, Dantong Niu, Liao Qu, and Zhenbo Li. "SOTR: Segmenting Objects with Transformers", arXiv preprint arXiv:2108.06747 (2021).
[37] Duolikun Dilixiati, Tai-hong Zhang, and Xiang-ping Feng. "Design and Implementation of LabelMe Label Checking System", Computer Technology and Development 32.3 (2022): 214-220.
[38] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. "Microsoft COCO: Common Objects in Context", European Conference on Computer Vision (2014).

SUNYANG CHEN received the B.Eng. degree in software engineering from Zhejiang University of Science and Technology in 2017. He is currently pursuing the master's degree in computer science with Zhejiang University of Technology. His research interests include computer vision and natural language processing.

GUANLIN CHEN received the B.S. and Ph.D. degrees in computer science and technology from Zhejiang University, Hangzhou, China, in 2000 and 2013, respectively. He is currently a professor in the School of Computer and Computing Science, Hangzhou City University. His research interests include artificial intelligence and smart city.