
PR-YOLO: Improved YOLO for fast protozoa classification and segmentation


WUJIAN YANG
Hangzhou City University
SUNYANG CHEN
Hangzhou City University
GUANLIN CHEN
Hangzhou City University
QIHAO SHI (  [email protected] )
Hangzhou City University

Research Article

Keywords: Yolov5, Transformers, Object Detection, Instance Segmentation

Posted Date: July 28th, 2023

DOI: https://doi.org/10.21203/rs.3.rs-3199595/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Additional Declarations: No competing interests reported.


Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.

PR-YOLO: Improved YOLO for fast protozoa


classification and segmentation
WUJIAN YANG1,2, SUNYANG CHEN1,2, GUANLIN CHEN1, AND QIHAO SHI1
1 School of Computer and Computing Science, Hangzhou City University, Hangzhou, 310015, China
2 School of Computer and Computing Science, Zhejiang University of Technology, Hangzhou, 310023, China
Corresponding author: Qihao Shi (e-mail: [email protected]).
This work was supported in part by the Zhejiang Science and Technology Plan Project of China under Grant 2020C03091 and in part by the
Zhejiang Provincial Central Government Guided Local Science and Technology Development Project under Grant 2020ZY1010.

ABSTRACT Protozoa, such as Ceratium and Paramecium, play a fundamental role in establishing sustainable ecosystems. The distribution and classification of certain protozoa and their species are informative indicators for evaluating environmental quality. However, protozoa analysis is traditionally performed by molecular biological (DNA, RNA) or morphological methods, which are time-consuming and require an experienced laboratory operator. In this work, we adopt a deep learning-based network to solve the protozoa classification task. This method uses microscope images to help researchers analyze protozoa populations and species, reducing the cost of experimental sample storage and relieving the burden on laboratory operators. However, the shape and size of protozoa vary greatly, which places a heavy burden on the optimization of DCNN feature distillation, so building a fast and precise protozoa analysis model is a great challenge. We present PR-YOLO, an improved version of YOLOv5 with better performance that is extended with instance segmentation. Building on the original YOLOv5, we added two extra parallel branches to PR-YOLO, which perform different segmentation subtasks: (1) one branch generates a set of prototype masks (images); (2) the other branch predicts a set of mask coefficients corresponding to the prototype masks for each instance mask generation. Then, to improve the classification accuracy, we introduced transformer encoder blocks and lightweight Convolutional Block Attention Modules (CBAMs) to explore the prediction potential of the self-attention mechanism. To quantitatively evaluate the performance of PR-YOLO, a comprehensive experiment was carried out on hand-segmented microscopic protozoa images. Our model obtained the best results, with an average classification accuracy of 96.83% and a mean Average Precision (mAP) of 86.92% at a speed of 25.2 fps, which proves that the method has high robustness in this application field.

INDEX TERMS Yolov5, Transformers, Object Detection, Instance Segmentation

I. Introduction

WITH the continuous growth of the Earth's population and ongoing industrial development, the consumption of water resources by humans, especially freshwater resources, has multiplied. Simultaneously, industrial wastewater from large-scale industrial production, agricultural wastewater from farming activities, and domestic wastewater from daily human life are often discharged directly into natural water bodies without proper treatment. Due to the characteristics of human settlements and industrial and agricultural production, wastewater is discharged in large quantities and accumulates in specific rivers or lakes, leading to a decline in water quality, foul odors, and bacterial growth. In the field of biology, using protozoa as indicators of water quality [1], [2] is a common practice, and it offers several advantages:
1) Protozoa are widely distributed globally. As important members of the plankton community, they share similar indicator species in terms of morphology, ecology, and genetics, enabling global comparisons of experimental results.
2) Protozoa are eukaryotic unicellular organisms without cell walls, and some species have shells that expose them directly to the water environment. Their cell membranes come into direct contact with the surrounding environment or pollutants, making protozoa more sensitive to environmental changes than microorganisms.
3) Protozoa are small and have short life cycles, with some species completing their life cycle in just a few hours. Compared to larger eukaryotes such as crustaceans and fish, protozoa provide quicker monitoring results, making them ideal for a sensitive early warning system. The selection of specific protozoa should consider habitat characteristics and research objectives, as different groups of protozoa exhibit distinct indicative characteristics that cannot replace one another.
The advantages of protozoa in water environment assessment and monitoring strongly support their widespread application in various water bodies. The automatic analysis of protozoa in different environments poses a challenging problem with a significant impact on ecosystem assessment [3], [4]. To efficiently assess protozoa in water samples, many researchers have focused on the development of automatic tools based on computer vision and machine learning techniques. Numerous recent studies have addressed automatic protozoa classification and detection, where models attempt to predict accurate bounding boxes and classifications from image samples containing protozoa. However, traditional object detection methods may not be suitable for the morphological diversity and short life cycles of protozoa.

FIGURE 1. Microbial images at different magnifications.

In this paper, we present PR-YOLO, a fast end-to-end instance segmentation system for protozoa images, designed to assist taxonomists in accurately analyzing the living conditions of protozoa. Our model primarily performs three tasks: microscopic protozoa classification, bounding box prediction, and instance segmentation. We chose to focus on YOLO [5] due to its excellent performance in various computer vision tasks. The YOLO series has been widely utilized in object detection applications, including drone capture [6], food science [7], and environmental protection [8]. However, few studies have explored its performance in instance segmentation. This paper specifically investigates microscopic protozoa image instance segmentation models using YOLO. We encountered three main challenges:
1) CNN structures such as the YOLO series [5], [9] lack feature coherence in advanced visual semantic information, limiting their ability to associate instances. This arises from their limited receptive fields, resulting in suboptimal results when dealing with large objects.
2) The majority of microorganism samples are collected from complex environments with high levels of impurities (see Fig. 1). Due to this noise, many protozoa images cannot be used in CNN datasets. We face a dataset deficiency issue, where collecting sufficient images for training an optimized DCNN with numerous parameters is challenging.
3) Traditional object detection methods utilizing bounding boxes are not well suited for complex microbial scenarios. However, introducing a complex segmentation structure may impose a heavy burden on network optimization.
YOLOv5 is excellent in terms of processing speed, so it is a good choice to meet the real-time requirements of microbial detection. On the other hand, the accuracy of YOLOv5 [10] lags behind the RetinaNet [11], EfficientDet [12], and CornerNet [13] detectors. To fill the gap in the field of instance segmentation and increase the precision, we improve the original network structure of YOLOv5. Our contributions are as follows:
We construct an end-to-end instance segmentation network called PR-YOLO for accurately distilling pixel-level features from microscopic protozoa images.
We decompose the instance segmentation task into two subtasks, and two additional branches are added to the network to perform them:
1) The proto mask (image) branch generates a set of general prototype images.
2) The mask coefficient branch provides a set of weight coefficients corresponding to the proto images for each generated instance mask.
We replace some convolutional layers of YOLOv5 with Transformer Encoder Blocks [14] and Convolutional Block Attention Modules (CBAM [15]). Models combining CNN architectures with these self-attention modules [16] can easily capture global features and contextual semantic dependencies from feature maps. In particular, the self-attention mechanism of transformers broadly aggregates both feature and positional information from the whole input domain [20].
We use data augmentation [17], [18] to expand the dataset and conducted a composite experiment with two indicators on the SinfNet protozoa dataset. The average classification result of 96.83% and the mean Average Precision (mAP) result of 86.92% reflect that the performance of PR-YOLO is not inferior, and in several respects superior, to models such as YOLOv5 or YOLACT.

II. Related Work
A. Transformer Encoder Block
Taking inspiration from Vision Transformer [14] and TPH-YOLOv5 [10], we introduce Transformer Encoder Blocks and Convolutional Block Attention Modules (CBAM [15]) to replace some of the original YOLOv5 convolutional blocks and CSP residual modules. While the residual module in CSPDarknet53 captures global feature context, it may overlook the finer details of the feature map. To address this limitation, we adopt a CNN-transformer hybrid network that combines detail feature representation and global feature representation at different resolutions. The structure of the transformer encoder blocks is illustrated in Figure 2. In Vision Transformer, each Transformer Encoder Block consists of two sublayers: the first sublayer is the multihead attention layer, and the second is the fully connected layer (MLP). These two sublayers are interconnected to form a residual network. The transformer encoder block enables the model to effectively capture global information and contextual cues while leveraging the potential of the self-attention mechanism [19]. To mitigate the computational and storage costs, we selectively connect the transformer blocks and CBAM with three specific types of layers in our network based on YOLOv5:
1) the concat layers that merge feature maps of different resolutions in the Feature Pyramid Network (FPN);
2) the tail of the backbone;
3) the front of the different functional heads.
This targeted connectivity approach helps optimize computational efficiency and resource utilization while leveraging the benefits of the transformer blocks and CBAM within our network architecture.

FIGURE 2. The structure of the transformer encoder block contains two main blocks, a multihead attention block and a feed-forward neural network (MLP).
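To make the block structure concrete, the following is a minimal PyTorch sketch of an encoder block of the kind described above (multihead self-attention followed by an MLP, each wrapped in a residual connection). The class name, norm placement and layer sizes are illustrative assumptions, not the exact PR-YOLO implementation.

import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """Illustrative encoder block: multihead self-attention + MLP, each with a residual connection."""
    def __init__(self, dim: int, num_heads: int = 4, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, channels, height, width] feature map from a CNN stage.
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)           # -> [batch, h*w, channels]
        q = self.norm1(tokens)
        attn_out, _ = self.attn(q, q, q)
        tokens = tokens + attn_out                      # residual around attention
        tokens = tokens + self.mlp(self.norm2(tokens))  # residual around MLP
        return tokens.transpose(1, 2).reshape(b, c, h, w)

# Example: apply the block at the tail of the backbone on a [batch, 1024, 20, 20] map.
feats = torch.randn(2, 1024, 20, 20)
print(TransformerEncoderBlock(dim=1024)(feats).shape)  # torch.Size([2, 1024, 20, 20])

Because the block operates on flattened h*w tokens, it is inserted only at the low-resolution layers listed above to keep the attention cost manageable.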

FIGURE 3. The structure of our network.

B. Object Detection
At present, anchor-based [21]–[23] object detection methods can be divided into two types: two-stage and one-stage. Two-stage methods such as Fast R-CNN [24], Faster R-CNN [25] and Mask R-CNN [26] first determine the most likely object regions within a controllable range by using a proposal mechanism based on regions of interest; classification and regression results are then obtained from the candidate regions. One-stage methods, such as SSD variants [21], YOLO variants [6], [7], [10], [22] and RetinaNet [11], aim to predict the classification and regression results directly from the CNN feature maps without an extra proposal stage. From the perspective of components, both types generally consist of two parts: (1) a CNN-based backbone used to extract the image feature map, and (2) functional heads used to predict object classification and bounding box parameters. In addition, object detectors developed in recent years often insert some layers between the backbone and the head, usually called the Feature Pyramid Network (FPN) [27], [28]. We choose the efficient one-stage YOLOv5 module as the starting point of our work. Next, we separately introduce the three structural parts of YOLOv5.
Backbone. Commonly used backbones include VGG [29], ResNet [30] and MobileNet [31]. We choose CSPDarknet53 [32] as the backbone feature extraction network of our YOLOv5 module, whose feature extraction capabilities have proven strong on other detection problems. The main structure of the model is shown in the left half of Fig. 3. The CSPDarknet53 backbone introduces the CSP structure on the basis of the DarkNet53 network of YOLOv3. This structure enhances the learning ability of the convolutional neural network, removes the computational bottleneck and speeds up the inference of the network. In CSPDarknet53, the backbone is composed of multiple CSP blocks. These block layers consist of a convolution layer and a CSP layer: the former iteratively extracts more complex features from low-level features, and the latter alleviates the gradient vanishing problem caused by increasing depth in deep neural networks. On this basis, we add transformer encoder blocks at the tail of CSPDarknet53 to capture the sensitive areas of image features and explore the potential of the feature representation. Finally, three feature maps with different scales of [80, 80, 256], [40, 40, 512] and [20, 20, 1024] are output.
FPN. The Feature Pyramid Network (FPN) is mainly used to strengthen the feature extraction ability of the network. It reprocesses and rationally reuses the feature maps extracted by the backbone at different stages. Usually, an FPN consists of several bottom-up paths and several top-down paths, and it is a key link in the object detection framework. The PANet [33] structure is still used in YOLOv5: both upsampling and downsampling operations are used to achieve feature fusion. We also apply CBAM and transformer blocks to improve YOLOv5's original FPN network (red modules in the middle of Figure 3). To utilize the spatial attention and channel attention mechanisms of CBAM, we add most CBAM attention modules to the concat layers used to gather feature maps of different resolutions. Finally, we construct three effective feature layers of different scales to ensure the prediction accuracy of the detection head. Matching the shape of the backbone output, the FPN eventually generates three feature map sizes: [batch, 80, 80, 256], [batch, 40, 40, 512], and [batch, 20, 20, 1024].
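As a companion to the encoder block sketch above, the following is a minimal PyTorch sketch of a CBAM-style module of the kind attached to the concat layers: channel attention followed by spatial attention, each applied multiplicatively to the feature map. The reduction ratio, kernel size and class name are illustrative assumptions rather than the exact PR-YOLO configuration.

import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Illustrative CBAM: channel attention, then spatial attention, applied multiplicatively."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention: shared MLP over average- and max-pooled channel descriptors.
        avg = self.channel_mlp(x.mean(dim=(2, 3)))
        mx = self.channel_mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: convolution over the channel-wise mean and max maps.
        spatial = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(spatial))

# Example: refine a fused FPN feature map of shape [batch, 512, 40, 40].
fused = torch.randn(2, 512, 40, 40)
print(CBAM(512)(fused).shape)  # torch.Size([2, 512, 40, 40])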
Detect Head. The detection head of a convolutional neural network is responsible for performing the object detection task. It is an essential component that processes the features extracted by the backbone and produces bounding box predictions and class probabilities for the objects present in an image. Overall, the detection head plays a critical role in transforming the extracted features into meaningful predictions, enabling the CNN to detect and localize objects accurately, and its design heavily influences the performance and efficiency of object detection models. YOLOv5 uses only one type of detection head to accomplish bounding box prediction and object classification simultaneously. The original CNN+MLP detection head of YOLOv5 is relatively simple; its speed advantage is obvious, but its accuracy is lower. We add transformer blocks in front of the original detection heads, which noticeably improves the precision (red dotted box in Figure 3).

C. Object Detection Losses
YOLOv5's object detection loss function consists of three parts: the bounding box regression loss, the object confidence loss and the object classification loss. The losses of these three parts are only calculated for the positive samples matched on the feature maps, and the final loss is the sum of the three terms:

L_{V5\text{-}obj}(t_p, t_{gt}) = \alpha_{box}\sum_{i=0}^{S^2}\sum_{j=0}^{B} F^{obj}_{k_i^j} L_{GIoU} + \alpha_{obj}\sum_{i=0}^{S^2}\sum_{j=0}^{B} F^{obj}_{k_i^j} L_{obj} + \alpha_{cls}\sum_{i=0}^{S^2}\sum_{j=0}^{B} F^{obj}_{k_i^j} L_{cls} \qquad (1)

where k, S^2, and B represent the k-th feature map, the cells of size S × S and the B anchors each cell owns, respectively, and \alpha_{box}, \alpha_{obj}, and \alpha_{cls} are the weighting ratios of the corresponding loss terms. k_i^j represents the j-th anchor belonging to the i-th cell of the k-th feature map. If F^{obj}_{k_i^j} = 1, then k_i^j is a positive anchor; otherwise, if F^{obj}_{k_i^j} = 0, then k_i^j is a negative anchor. t_p and t_{gt} are the prediction tensor and the ground truth tensor, respectively, and their shapes are both [batch, h, w, c]. L_{GIoU}, L_{obj}, and L_{cls} originate from the predicted bounding box offsets, the object existence confidence score, and the classification score of the predicted object, respectively.

\mathrm{GIoU} = \mathrm{IoU} - \frac{|C| - |A \cup B|}{|C|} \qquad (2)

For a predicted box A and a ground truth box B, C is the smallest rectangular box enclosing both A and B. GIoU first calculates the ratio between the area occupied by C excluding A and B and the total area occupied by C. This ratio is a normalized measure that focuses on the empty area between A and B. Finally, GIoU is obtained by subtracting this ratio from the IoU value [30].

L_{obj}(P_o, P_t) = \mathrm{BCE}_{obj}(P_o, P_t) \qquad (3)

In L_{obj}, P_o is the confidence score denoting the predicted probability that a bounding box contains an object, and P_t denotes whether the ground truth box contains an object. We use BCE [34] (Binary Cross Entropy) as the loss function to qualify the prediction accuracy of the object existence confidence.

L_{cls}(C_p, C_t) = \mathrm{BCE}_{cls}(C_p, C_t) \qquad (4)

L_{cls} is similar to L_{obj} and uses BCE [34] to calculate the classification loss. C_p and C_t are two vectors: each element of the former denotes the classification confidence score of the predicted object for one class, and the latter indicates the true class of the predicted object.

\mathrm{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log p(y_i) + (1 - y_i)\log\big(1 - p(y_i)\big) \right] \qquad (5)

Binary Cross Entropy (BCE) [34] is used to evaluate the quality of the prediction results. N is the number of possible classes, y_i indicates whether the true class is class i (true = 1, false = 0), and p(y_i) is the predicted confidence score that the predicted object belongs to class i.
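For clarity, the following is a minimal sketch of how the terms in Eq. (1) through Eq. (5) can be combined for one feature map, assuming a boolean positive-anchor mask and per-anchor predictions are already available. The helper names and weighting ratios are illustrative assumptions, and the objectness term is computed over all anchors, as in common YOLOv5 implementations, rather than only over positives.

import torch
import torch.nn.functional as F

def giou(box_p: torch.Tensor, box_t: torch.Tensor) -> torch.Tensor:
    """GIoU for [N, 4] boxes in (x1, y1, x2, y2) form, as in Eq. (2)."""
    x1 = torch.max(box_p[:, 0], box_t[:, 0]); y1 = torch.max(box_p[:, 1], box_t[:, 1])
    x2 = torch.min(box_p[:, 2], box_t[:, 2]); y2 = torch.min(box_p[:, 3], box_t[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (box_p[:, 2] - box_p[:, 0]) * (box_p[:, 3] - box_p[:, 1])
    area_t = (box_t[:, 2] - box_t[:, 0]) * (box_t[:, 3] - box_t[:, 1])
    union = area_p + area_t - inter + 1e-7
    # Smallest enclosing box C.
    cw = torch.max(box_p[:, 2], box_t[:, 2]) - torch.min(box_p[:, 0], box_t[:, 0])
    ch = torch.max(box_p[:, 3], box_t[:, 3]) - torch.min(box_p[:, 1], box_t[:, 1])
    c_area = cw * ch + 1e-7
    return inter / union - (c_area - union) / c_area

def detection_loss(pred_box, gt_box, pred_obj, gt_obj, pred_cls, gt_cls, pos_mask,
                   w_box=0.05, w_obj=1.0, w_cls=0.5):
    """Composite loss in the spirit of Eq. (1): GIoU + objectness BCE + classification BCE."""
    l_box = (1.0 - giou(pred_box[pos_mask], gt_box[pos_mask])).mean()
    l_obj = F.binary_cross_entropy_with_logits(pred_obj, gt_obj)                       # Eq. (3)
    l_cls = F.binary_cross_entropy_with_logits(pred_cls[pos_mask], gt_cls[pos_mask])   # Eq. (4)
    return w_box * l_box + w_obj * l_obj + w_cls * l_cls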

III. Method
Inspired by YOLACT [35] and SOTR [36], we add two additional branches to the YOLOv5 object detection network, making it competent for complex instance segmentation tasks. The segmentation result is shown in Figure 5. To do this, we decompose the complex instance segmentation task into two parallel CNN-Transformer hybrid branches and combine their results into the final instance masks. The first, the proto mask branch, generates a set of 32 prototype images (masks); these proto images do not depend on any instance or bounding box. For each surviving bounding box considered to contain an object, the other, the mask coefficient branch, generates 32 weight coefficients corresponding to the prototype images (masks). We filter the k surviving bounding boxes with two object detection filters: the object existence confidence threshold and NMS. Finally, we combine the general proto images and the weight coefficients into an instance mask by linear matrix multiplication.

FIGURE 4. Network structure of EM-YOLO.

A. Attention Mechanism
Building on the original YOLOv5, to improve the classification accuracy, we introduce transformer encoder blocks and lightweight Convolutional Block Attention Modules (CBAMs) to explore the prediction potential of the attention mechanism. As shown in Fig. 4, the red blocks indicate where we replaced the original convolution blocks with Transformer and CBAM modules. The integration of transformer modules in a network can have a significant impact on its accuracy. Transformers are primarily known for their success in natural language processing tasks but have also shown promising results in computer vision applications. By incorporating transformer modules into a network, we can leverage their ability to capture long-range dependencies and contextual information, leading to improved performance in various tasks. One key advantage of transformer modules is their capability to capture global information and context effectively, while the CBAMs are integrated into PR-YOLO to explore the local information of the feature maps.

FIGURE 5. Instance segmentation results of protozoa images.
FIGURE 6. Mask coefficients generation for proto images.
FIGURE 7. Proto images generation.

B. Proto Mask Generation
The proto mask branch (head) generates a set of 32 prototype masks. We implement the protonet as a CNN-Transformer hybrid structure (see Figure 6); its last layer has 32 channels, and each channel generates one prototype mask. Similar to YOLACT, we do not directly include the generation of the prototype masks in the loss function but calculate losses after the proto masks and mask coefficients are combined to generate the final instance masks. Inspired by the Vision Transformer, we add some Transformer Encoder Blocks to the proto mask branch. We follow two important principles when choosing the input feature map of the prototype branch: take the protonet input from deeper layers and extract prototypes at as high a resolution as possible. The former produces higher quality masks, and the latter performs better on smaller objects. Thus, we use the FPN's largest feature layer (in our case P4, with shape [batch, 80, 80, 256]; see Figure 5), which is also the deepest. To enhance the detection effect on small objects, we upsample the feature map from [80, 80] to [160, 160]. In addition, the protonet's output is left unbounded, which allows the network to produce overpowering activations for prototypes it is confident about, so the choice of activation for the proto head is important. We therefore have the option of following the protonet with either a ReLU or no nonlinearity, and we choose ReLU to generate more interpretable prototype masks. The feature map generated by the FPN passes through the transformer blocks of the proto branch and the multilevel upsampling modules, finally generating 32 proto mask (image) binary images, as shown in Figure 7.
We add transformer blocks in front of YOLACT's original multilevel upsampling structure to improve the branch's sensitivity to feature-sensitive areas. Our two additional mask branches and YOLOv5's object detection branch jointly reuse the same transformer blocks to reduce the impact on the processing speed of the YOLO model.
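The following is a minimal sketch of a proto head of the kind described above: it takes the [batch, 256, 80, 80] FPN feature map, applies a few convolutions (standing in for the transformer blocks and multilevel upsampling modules of Figure 7), upsamples to 160 × 160, and ends with a 32-channel 1 × 1 convolution followed by ReLU. The layer choices are illustrative assumptions.

import torch
import torch.nn as nn

class ProtoHead(nn.Module):
    """Illustrative proto head: produces 32 prototype masks at 160x160 from the P4 FPN map."""
    def __init__(self, in_channels: int = 256, mid_channels: int = 256, num_protos: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 3, padding=1), nn.SiLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),  # 80x80 -> 160x160
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.SiLU(inplace=True),
            nn.Conv2d(mid_channels, num_protos, 1),
            nn.ReLU(inplace=True),  # keep prototype activations non-negative, as chosen in the text
        )

    def forward(self, p4: torch.Tensor) -> torch.Tensor:
        return self.net(p4)  # [batch, 32, 160, 160]

protos = ProtoHead()(torch.randn(2, 256, 80, 80))
print(protos.shape)  # torch.Size([2, 32, 160, 160])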
C. Mask Coefficients Generation
To reduce the calculation cost, we simplify the structure of the coefficient head into several convolution layers and a tanh activation tail. As described at the beginning of Section III, corresponding to the 32 general proto images, the mask coefficient heads provide a set of 32 weight coefficients for each surviving bounding box. We first filter k surviving bounding boxes from the outputs of the three detection heads (the output shapes are [batch, w, h, 3 * (1+4+cls)]; see Fig. 3) by confidence score and NMS. These k bounding boxes are considered to contain k objects, and we finally generate k instance masks for the k objects. As shown in Fig. 3, there are three mask coefficient heads (with shapes [batch, w, h, 32]) corresponding to the three object detection heads (with shapes [batch, w, h, 3 * (1+4+cls)]). We eventually collect k mask coefficient vectors from the three coefficient heads (the shape of the coefficients is [batch, k, 32]). As shown in Figure 3, we set three coefficient heads in our network; these heads generate mask coefficients from feature maps of different resolutions (corresponding to large, medium and small protozoa), enabling our model to adapt to protozoa at different scales.

D. Instance Mask Generation
Instead of using a complex dynamic convolution operation such as SOTR [36] to generate the final instance masks, we choose a simple method that combines the proto masks and the mask coefficients by linear matrix multiplication to obtain the final instance masks:

M = \sigma\left(P C^{\top}\right) \qquad (6)

P is a tensor generated by the proto head; the shape of P is [batch, 160, 160, 32], where 32 represents the 32 proto masks (images) of shape [160, 160] distilled from one original microbial image (see Fig. 5b). C is a tensor generated by the coefficient heads; the shape of C is [batch, k, 32], where k represents the k bounding boxes that survive the object detection filter, and the 32 entries form a set of weight coefficients corresponding to the proto masks of P. We perform linear matrix multiplication on P and the transposition of C and normalize the result with the sigmoid function to obtain the instance masks (of shape [160, 160, k]). Multilevel upsampling and reshape operations are then applied to the instance masks to adjust the final size to [k, h, w].

E. Instance Mask Generation Loss
In the preceding subsections, we use matrix multiplication to obtain the instance masks. The original instance mask loss formula we use is as follows:

L_{mask}(M_o, M_t) = \mathrm{BCE}_{mask}(M_o, M_t) \qquad (7)

M_o is the predicted instance mask tensor with shape [k, h, w], and M_t is the ground-truth instance mask tensor. k represents the predicted number of instances (objects) in one image, and a mask matrix of size [h, w] is produced to describe each instance. In these mask matrices, we set a binary value of 0 or 1 to distinguish the background and instance areas (0 denotes the background, and 1 denotes the instance area). Finally, we use BCE to evaluate the similarity between M_o and M_t.
We realize that it is unreasonable to take all parts of the two tensors M_o and M_t into the loss calculation (especially the background area outside of the bounding box). Excessive attention to the background may hinder loss convergence in gradient descent. That is, for BCE, setting the instance area values to 1 has the same priority as adjusting the background area (outside the bounding box) values to 0, but the segmentation quality only depends on the area inside the bounding box. Excessive loss calculation on the background area outside of the bounding box increases the randomness of the loss regression in gradient descent, which is reflected in two aspects: (1) the extra background area brings redundant parameters into the loss calculation; (2) the optimization excessively pursues setting the binary values of the background to 0 while neglecting the more important but smaller instance area, whose values should be 1. We therefore improved the original BCE mask loss by restricting it with the binary ground-truth bounding box mask M_{gt}:

L_{mask}(M_o, M_t) = \mathrm{BCE}_{mask}(M_{gt} \odot M_o,\; M_{gt} \odot M_t) \qquad (8)
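Putting Sections III-C to III-E together, the following sketch assembles instance masks from the prototypes and coefficients as in Eq. (6) and evaluates the box-restricted BCE of Eq. (8). The tensor layouts and the box-to-mask helper are illustrative assumptions rather than the exact PR-YOLO code.

import torch
import torch.nn.functional as F

def assemble_masks(protos: torch.Tensor, coeffs: torch.Tensor) -> torch.Tensor:
    """Eq. (6): M = sigmoid(P C^T). protos: [32, 160, 160], coeffs: [k, 32] -> masks: [k, 160, 160]."""
    p = protos.reshape(protos.shape[0], -1)                     # [32, 160*160]
    return torch.sigmoid(coeffs @ p).reshape(coeffs.shape[0], *protos.shape[1:])

def box_mask(boxes: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Binary masks of the ground-truth boxes (x1, y1, x2, y2 in pixels): [k, h, w]."""
    ys = torch.arange(h).view(1, h, 1)
    xs = torch.arange(w).view(1, 1, w)
    x1, y1, x2, y2 = boxes[:, 0:1, None], boxes[:, 1:2, None], boxes[:, 2:3, None], boxes[:, 3:4, None]
    return ((xs >= x1) & (xs < x2) & (ys >= y1) & (ys < y2)).float()

def mask_loss(pred: torch.Tensor, target: torch.Tensor, gt_boxes: torch.Tensor) -> torch.Tensor:
    """Eq. (8): BCE restricted to the ground-truth box area via an element-wise box mask."""
    m_gt = box_mask(gt_boxes, *pred.shape[-2:])
    return F.binary_cross_entropy(pred * m_gt, target * m_gt)

# Example: 5 surviving boxes combined with 32 prototypes.
protos = torch.rand(32, 160, 160)
coeffs = torch.randn(5, 32)
masks = assemble_masks(protos, coeffs)   # [5, 160, 160]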

IV. Experiments
To evaluate our model, we built our own COCO-format segmentation dataset based on SinfNet protozoa images.1 2
1 https://github.com/sarisabban/SinfNet
2 The dataset authors provide protozoa images with object detection annotations; the hand-segmentation annotations can be obtained by contacting our corresponding author.

A. Implementation
We conducted experiments on the 12-class microbe images of the SinfNet protozoa dataset. We used the labelme annotation tool to generate segmentation annotations [37]. Through data augmentation, the number of images was expanded to 1,500, and the number of protozoa instances exceeded 10,000. We divided the train, validation, and test sets in a ratio of [0.8:0.1:0.1]. All comparison networks are pretrained on the MS COCO dataset [38]. We qualify the bounding box precision and segmentation quality by mean average precision (mAP) and mean intersection over union (MIoU).

B. Object Detection Comparison
We use two indicators, mean accuracy and mAP, to evaluate the classification and detection performance of PR-YOLO, the original YOLOv5 and YOLACT. The three curves in Fig. 8 show the AP results of YOLOv5, PR-YOLO and YOLACT (the IoU thresholds range over [0.5:0.95]). With the proposed enhancements, PR-YOLO obtains a clear performance boost over YOLOv5 and YOLACT.

FIGURE 8. The mAP results over IoU thresholds [0.5:0.95].

We use the precision and recall under different score thresholds of YOLOv5 and PR-YOLO to depict PR curves. As shown in Fig. 9, the overall performance of the PR-YOLO classifier is better than that of YOLOv5.

FIGURE 9. The precision and recall curves of YOLO-v5 and PR-YOLO.

We trained the three models for 200 epochs and report the 12-class mAP results at AP50 and AP75. Benefiting from the extra transformer blocks and CBAM modules in our object detection network, the object detection performance of our model is better than that of YOLOv5 and YOLACT. Table 1 shows the classification mAP results at AP50 and AP75.

C. Mask Quality Comparison
To compare segmentation quality, we first filter positive instances by object detection (IoU threshold = 0.5 and confidence score = 0.5). We choose the mean intersection over union (MIoU) as our evaluation indicator, which denotes the proportion of pixels correctly marked in the ground truth area. As shown in Table 2, the segmentation result of our PR-YOLO appears inferior to that of YOLACT. Two factors are considered to influence our mask quality: (1) we give higher ratios to the classification loss and positioning loss, which reduces the pixel-level expressiveness at the details of the segmentation result; (2) considering the processing speed of the network, the number of mask coefficient heads is smaller than that of YOLACT.

D. Classification Results
It can be seen in Table 1 and Table 2 that although YOLACT shows good performance in segmentation, the AP value of PR-YOLO is much better than that of YOLACT in object detection and classification; under the AP75 setting, the total result of the former is markedly higher than that of the latter. We counted the ratio of correctly predicted protozoa in each class as the classification indicator. The classification results in Fig. 10 intuitively show the advantage of our PR-YOLO: most of the per-class classification accuracies of PR-YOLO are higher than those of YOLACT.

FIGURE 10. The classification results of PR-YOLO and YOLACT.

V. Conclusion
In this paper, we have made several improvements to YOLOv5, including transformer encoder blocks and CBAM self-attention modules [16], two extra segmentation branches and loss function adjustments. With the proposed enhancements, it forms a network that is not inferior, and in several respects superior, to YOLACT in terms of segmentation speed, object detection accuracy and classification accuracy. This detector is good at instance segmentation and classification in microbial scenes. We conducted experiments on a hand-segmented protozoa dataset, and the model showed advanced performance on microbial image datasets. To raise the processing speed of PR-YOLO to 25 fps and improve the classification accuracy, we adjusted the number of detection heads from 5 to 3 and reduced the weight of the segmentation part in the loss function. These adjustments led to suboptimal segmentation results of PR-YOLO when detecting small protozoa. In future work, we will further optimize the segmentation quality on small protozoa. We hope that this report can help developers and researchers gain better experience in analysing protozoa detection scenarios.

VI. Acknowledgements
The authors would like to thank the Supercomputing Center of Hangzhou City University for providing advanced computing resources.
TABLE 1. Object detection mAP results at AP50 and AP75

Classification              | AP50: YOLO-V5 / YOLACT / PR-YOLO | AP75: YOLO-V5 / YOLACT / PR-YOLO
Ceratium                    | 94.76% / 99.29% / 98.42%         | 93.70% / 77.37% / 98.42%
Colsterium_ehrenberg        | 87.62% / 94.60% / 99.99%         | 83.24% / 94.57% / 93.85%
Collodictyon                | 99.99% / 90.08% / 99.99%         | 96.54% / 65.55% / 99.99%
Didinium                    | 99.63% / 99.89% / 92.31%         | 98.43% / 99.99% / 92.31%
Dinobryon                   | 67.87% / 92.86% / 98.59%         | 33.58% / 41.84% / 53.34%
Lepocinclis_spirogyroides   | 98.53% / 99.39% / 98.20%         | 91.23% / 99.08% / 91.81%
Pinnularia_neomajor         | 95.52% / 99.93% / 97.30%         | 93.42% / 99.99% / 88.10%
Pleurotaenium_ehrenberg     | 89.04% / 94.05% / 83.57%         | 80.85% / 31.23% / 55.00%
Pyrocystis_lunula           | 95.00% / 99.69% / 96.08%         | 68.18% / 99.42% / 80.72%
Micrasterias_rotata         | 94.05% / 99.87% / 98.37%         | 94.05% / 88.12% / 98.37%
Paramecium_bursaria         | 99.49% / 86.86% / 99.99%         | 97.24% / 49.76% / 99.99%
Peridinium_spec             | 94.38% / 71.50% / 99.21%         | 87.91% / 37.59% / 99.21%
Total                       | 92.99% / 93.78% / 96.83%         | 85.74% / 73.32% / 87.59%

TABLE 2. Segmentation results (MIoU) of YOLACT and PR-YOLO

Classification              | YOLACT | PR-YOLO
Ceratium                    | 0.9814 | 0.8034
Colsterium_ehrenberg        | 0.9575 | 0.9357
Collodictyon                | 0.9048 | 0.8776
Didinium                    | 0.9979 | 0.9528
Dinobryon                   | 0.9147 | 0.7149
Lepocinclis_spirogyroides   | 0.9705 | 0.7177
Pinnularia_neomajor         | 0.9983 | 0.9323
Pleurotaenium_ehrenberg     | 0.8319 | 0.9212
Pyrocystis_lunula           | 0.9965 | 0.8877
Micrasterias_rotata         | 0.9960 | 0.9141
Paramecium_bursaria         | 0.8572 | 0.9594
Peridinium_spec             | 0.7402 | 0.9249
Total                       | 0.9289 | 0.8692

REFERENCES
[1] Rui Xu, Miaomiao Zhang, Hanzhi Lin, Pin Gao, Zhaohui Yang, Dongbo Wang, Xiaoxu Sun, Baoqin Li, Qi Wang, and Weimin Sun. "Response of soil protozoa to acid mine drainage in a contaminated terrace", Journal of Hazardous Materials 421 (2022): 126790.
[2] Lydia Teel, Adam Olivieri, Richard Danielson, Blaga Delić, Brian Pecson, James Crook, and Krishna Pagilla. "Protozoa reduction through secondary wastewater treatment in two water reclamation facilities", Science of the Total Environment 807 (2022): 151053.
[3] Jesus Ruiz-Santaquiteria, Gloria Bueno, Oscar Deniz, Noelia Vallez, and Gabriel Cristobal. "Semantic versus instance segmentation in microscopic algae detection", Engineering Applications of Artificial Intelligence 87 (2020).
[4] Zhenni Shang, Xiangnan Wang, Yu Jiang, Zongjun Li, and Jifeng Ning. "Identifying rumen protozoa in microscopic images of ruminant with improved YOLACT instance segmentation", Biosystems Engineering 215 (2022): 156-169.
[5] J. Redmon and A. Farhadi. "YOLOv3: An incremental improvement", arXiv:1804.02767, 2018. [Online]. Available: http://arxiv.org/abs/1804.02767
[6] Xingkui Zhu, Shuchang Lyu, Xu Wang, and Qi Zhao. "TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-Captured Scenarios", ICCV Workshops, 2021.
[7] Fahad Jubayer, Janibul Alam Soeb, Abu Naser Mojumder, Mitun Kanti Paul, Pranta Barua, Shahidullah Kayshar, Syeda Sabrina Akter, Mizanur Rahman, and Amirul Islam. "Detection of mold on the food surface using YOLOv5", Current Research in Food Science 4 (2021): 724-728.
[8] Jennifer N. Hird, Alessandro Montaghi, Gregory J. McDermid, Jahan Kariyeva, Brian J. Moorman, Scott E. Nielsen, and Anne C. S. McIntosh. "Use of unmanned aerial vehicles for monitoring recovery of forest vegetation on petroleum well sites", Remote Sensing 9.5 (2017): 413.
[9] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. "Scaled-YOLOv4: Scaling Cross Stage Partial Network", Computer Vision and Pattern Recognition (2021): 13029-13038.
[10] Qi Zhao, Binghao Liu, Shuchang Lyu, Chunlei Wang, and Hong Zhang. "TPH-YOLOv5++: Boosting Object Detection on Drone-Captured Scenarios with Cross-Layer Asymmetric Transformer", Remote Sensing 15.6 (2023).
[11] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. "Focal loss for dense object detection", IEEE International Conference on Computer Vision (2017): 2980-2988.
[12] Mingxing Tan, Ruoming Pang, and Quoc V. Le. "EfficientDet: Scalable and efficient object detection", arXiv:1911.09070, 2019.
[13] Hei Law and Jia Deng. "CornerNet: Detecting objects as paired keypoints", European Conference on Computer Vision (2018): 734-750.
[14] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. "Multiscale Vision Transformers", IEEE International Conference on Computer Vision (2021): 6804-6815.
[15] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. "CBAM: Convolutional Block Attention Module", European Conference on Computer Vision 11211 (2018): 3-19.
[16] Han Zhang, Ian J. Goodfellow, Dimitris N. Metaxas, and Augustus Odena. "Self-Attention Generative Adversarial Networks", arXiv:1805.08318 (2019).
[17] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. "mixup: Beyond empirical risk minimization", arXiv:1710.09412, 2017.
[18] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. "CutMix: Regularization strategy to train strong classifiers with localizable features", IEEE/CVF International Conference on Computer Vision (2019): 6023-6032.
[19] Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. "Exploring Self-attention for Image Recognition", Computer Vision and Pattern Recognition (2020): 10073-10082.
[20] Ali Hebbal, Loic Brevault, Mathieu Balesdent, El-Ghazali Talbi, and Nouredine Melab. "Multi-Fidelity Modeling With Different Input Domain Definitions Using Deep Gaussian Processes", Structural and Multidisciplinary Optimization 63.5 (2021): 2267-2288.
[21] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. "SSD: Single shot multibox detector", arXiv, 2015.
[22] Glenn Jocher, Alex Stoken, Jirka Borovec, NanoCode012, Ayush Chaurasia, TaoXie, Liu Changyu, Abhiram V, Laughing, tkianai, yxNONG, Adam Hogan, lorenzomammana, AlexWang1900, Jan Hajek, Laurentiu Diaconu, Marc, Yonghye Kwon, oleg, wanghaoyang0106, Yann Defretin, Aditya Lohia, ml5ah, Ben Milanko, Benjamin Fineran, Daniel Khromov, Ding Yiwei, Doug, Durgesh, and Francisco Ingham. "ultralytics/YOLOv5: v5.0 - YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations", Apr. 2021.
[23] Zixuan Xu, Banghuai Li, Ye Yuan, and Miao Geng. "AnchorFace: An Anchor-Based Facial Landmark Detector Across Large Poses", AAAI Conference on Artificial Intelligence 35.4 (2021): 3092-3100.

[24] Ross B. Girshick. "Fast R-CNN", IEEE International Conference on Computer Vision (2015): 1440-1448.
[25] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", IEEE Transactions on Pattern Analysis and Machine Intelligence 39.6 (2017): 1137-1149.
[26] K. He, G. Gkioxari, P. Dollár, and R. Girshick. "Mask R-CNN", arXiv, 2017.
[27] Fang Peng, Zheng Miao, Fei Li, and Zhenbo Li. "S-FPN: A shortcut feature pyramid network for sea cucumber detection in underwater images", Expert Systems with Applications 182 (2021): 115306.
[28] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V. Le. "NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection", IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019): 7036-7045.
[29] Abhronil Sengupta, Yuting Ye, Robert Wang, Chiao Liu, and Kaushik Roy. "Going Deeper in Spiking Neural Networks: VGG and Residual Architectures", Frontiers in Neuroscience 13 (2019).
[30] Zifeng Wu, Chunhua Shen, and Anton van den Hengel. "Wider or Deeper: Revisiting the ResNet Model for Visual Recognition", Pattern Recognition 90.1 (2019): 119-133.
[31] Chongke Bi, Jiamin Wang, Yulin Duan, Baofeng Fu, Jia-Rong Kang, and Yun Shi. "MobileNet Based Apple Leaf Diseases Identification", Mobile Networks and Applications (2020): 1-9.
[32] Xuelong Hu, Yang Liu, Zhengxi Zhao, Jintao Liu, Xinting Yang, Chuanheng Sun, Shuhan Chen, Bin Li, and Chao Zhou. "Real-Time Detection of Uneaten Feed Pellets in Underwater Images for Aquaculture Using an Improved YOLO-V4 Network", Computers and Electronics in Agriculture 185 (2021): 106135.
[33] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. "Path Aggregation Network for Instance Segmentation", Computer Vision and Pattern Recognition (2018): 8759-8768.
[34] Petr Hurtik, Stefania Tomasiello, Jan Hula, and David Hynar. "Binary cross-entropy with dynamical clipping", Neural Computing and Applications 34.14 (2022): 12029-12041.
[35] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. "YOLACT++: Better Real-Time Instance Segmentation", IEEE Transactions on Pattern Analysis and Machine Intelligence 44.2 (2022): 9156-9165.
[36] Ruohao Guo, Dantong Niu, Liao Qu, and Zhenbo Li. "SOTR: Segmenting Objects with Transformers", arXiv:2108.06747 (2021).
[37] Duolikun Dilixiati, Tai-hong Zhang, and Xiang-ping Feng. "Design and Implementation of LabelMe Label Checking System", Computer Technology and Development 32.3 (2022): 214-220.
[38] Tsung-Yi Lin, M. Maire, Serge J. Belongie, James Hays, P. Perona, D. Ramanan, Piotr Dollár, and C. L. Zitnick. "Microsoft COCO: Common Objects in Context".

WUJIAN YANG received the B.S. and M.S. degrees from Zhejiang University, Hangzhou, China, in 2000 and 2004. He is currently serving as an Associate Professor at the School of Computer and Computing Science, Zhejiang University City College. His research interests primarily focus on artificial intelligence and data science.

SUNYANG CHEN received the B.Eng. degree in software engineering from Zhejiang University of Science and Technology in 2017. He is currently pursuing the master's degree in computer science with Zhejiang University of Technology. His research interests include computer vision and natural language processing.

GUANLIN CHEN received the B.S. and Ph.D. degrees in computer science and technology from Zhejiang University, Hangzhou, China, in 2000 and 2013, respectively. He is currently a professor in the School of Computer and Computing Science, Hangzhou City University. His research interests include artificial intelligence and smart cities.

QIHAO SHI received the Ph.D. degree in computer science from Zhejiang University and the B.S. degree from Nanjing Normal University of China in 2020 and 2014, respectively. He is currently an associate professor in the School of Computing and Computer Science, Zhejiang University City College. His main research topics are social and information networks, algorithmic game theory and Internet economics.
