An improved human-body-segmentation algorithm with attention-based feature fusion and a refined corner-based feature-point design with sub-pixel stereo matching for the stereovision-based anthropometric system are proposed in this paper. The proposed human-body-segmentation algorithm aims to improve the segmentation accuracy and reduce the number of parameters of the model. The proposed feature-point design aims to improve the stereo-matching accuracy and reduce the matching complexity.
In the semantic segmentation process, the girth region is segmented to confine the subsequent stereo matching to a smaller area, so as to increase the matching accuracy and efficiency. The higher the segmentation precision, the better the matching effect. Therefore, the semantic segmentation network PSPNet can be further improved to enhance the performance. In this paper, the feature extraction of human-body contour and semantic information is optimized by CBAM. Four CBAMs were added to the middle convolution layers of ResNet101 to refine the features of human-body segmentation. Moreover, in the residual blocks of ResNet101, the group convolution was chosen to replace the common convolution, so as to reduce the computational overhead.
In the girth fitting process, the feature points rotating along with the turntable are reversely rotated to their initial positions; then, polynomial with intermediate variable curve fitting (PIVCF) is used to achieve anthropometry.
3.1. A Human-Body-Segmentation Algorithm Based on a CBAM Attention Mechanism
To increase the segmentation accuracy, it is necessary to focus on the human-body region to be segmented and suppress useless information as much as possible. Due to the fixed distance of the camera and the predetermined posture of the subject, the same category of region to be segmented is located at almost the same position in the image. Therefore, the semantic segmentation network should have strong spatial perception. What is more, different categories of regions to be segmented are similar in size and prone to mis-segmentation. Thus, the network should have strong semantic-information perception and cross-channel context-information fusion ability [42]. The CBAM attention mechanism can focus on the spatial and channel information at the same time, realize the feature fusion of space and channel, enhance the network's perception of spatial and semantic information, and improve the segmentation performance. Hence, CBAM was selected in this paper to further enhance the segmentation performance of PSPNet.
In CBAM [41], as shown in Figure 3, the channel attention module performs maximum pooling and average pooling on the input feature map F to obtain two 1D vectors, which represent the channel information of F in terms of local and global features, respectively, while also aggregating the spatial information. Then, the two 1D vectors are fed into a multi-layer perceptron (MLP) for interaction, and the two perceived 1D vectors are added element by element. Finally, a 1D channel attention map $M_c(F)$ is generated through the sigmoid activation function and is multiplied with the input feature map F to obtain the channel-refined feature map $F'$, as expressed in Equation (1):

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) \oplus \mathrm{MLP}(\mathrm{MaxPool}(F))\big), \qquad F' = M_c(F) \otimes F \tag{1}$$

wherein ⊗ denotes element-wise multiplication, $\sigma$ denotes the sigmoid activation function and ⊕ denotes element-wise addition.
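For concreteness, a minimal PyTorch sketch of Equation (1) is given below; the module name, the reduction ratio of 16 (CBAM's default) and the implementation details are our assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention of CBAM, Equation (1)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP applied to both pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f.shape
        avg = self.mlp(f.mean(dim=(2, 3)))   # global average pooling -> MLP
        mx = self.mlp(f.amax(dim=(2, 3)))    # global max pooling -> MLP
        m_c = torch.sigmoid(avg + mx)        # 1D channel attention map M_c(F)
        return f * m_c.view(b, c, 1, 1)      # channel-refined feature map F'
```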
The spatial attention module performs maximum pooling and average pooling along the channel axis on the channel-refined feature map $F'$ to obtain two 2D maps, which represent the spatial information of $F'$ in terms of local and global features. Then, the two 2D maps are cascaded and convolved. Finally, a 2D spatial attention map $M_s(F')$ is generated through the sigmoid activation function and is multiplied with $F'$ to obtain the space- and channel-refined feature map $F''$, as expressed in Equation (2):

$$M_s(F') = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F');\, \mathrm{MaxPool}(F')])\big), \qquad F'' = M_s(F') \otimes F' \tag{2}$$

wherein ⊗ denotes element-wise multiplication, $\sigma$ denotes the sigmoid activation function, $f^{7 \times 7}$ denotes the convolution layer with a 7 × 7 convolution kernel and [;] denotes cascade (channel-wise concatenation).
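Continuing the sketch above (reusing its imports and the ChannelAttention module), Equation (2) and the full CBAM might be written as follows; the 7 × 7 kernel size follows the CBAM paper [41].

```python
class SpatialAttention(nn.Module):
    """Spatial attention of CBAM, Equation (2)."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        avg = f.mean(dim=1, keepdim=True)        # average pooling along channels
        mx, _ = f.max(dim=1, keepdim=True)       # max pooling along channels
        m_s = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return f * m_s                           # space- and channel-refined F''

class CBAM(nn.Module):
    """Channel attention followed by spatial attention."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)     # from the previous sketch
        self.sa = SpatialAttention()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return self.sa(self.ca(f))
```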
ResNet101 consists of three parts: the input part, the middle convolution part (layer 1–4) and the output part. The middle convolution part is constructed from residual blocks, among which there are 3 residual blocks in layer1, 4 residual blocks in layer2, 23 residual blocks in layer3, and 3 residual blocks in layer4.
Figure 4 shows the specific embedded positions of CBAMs in the middle convolution layers of the backbone network (ResNet101) of PSPNet. A CBAM is embedded in the output of each of the four layers.
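As a rough illustration of these embedding positions (our reading of Figure 4, not the authors' code), the four CBAMs can be appended to the stage outputs of a torchvision ResNet101, whose stages layer1 to layer4 output 256, 512, 1024 and 2048 channels, respectively:

```python
import torch.nn as nn
from torchvision.models import resnet101  # torchvision >= 0.13 API

backbone = resnet101(weights=None)
# CBAM is the module sketched above; one is appended per middle stage.
for name, channels in [("layer1", 256), ("layer2", 512),
                       ("layer3", 1024), ("layer4", 2048)]:
    stage = getattr(backbone, name)
    setattr(backbone, name, nn.Sequential(stage, CBAM(channels)))
```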
Figure 5 shows the visualization comparison of feature maps between the backbone network of PSPNet and that of CBAM-PSPNet. The visualization of six feature maps in the feature extraction stage is compared, corresponding to the outputs of Conv1, MaxPool, Layer1, Layer2, Layer3 and Layer4 in
Figure 4. According to the visual effect, there is a significant improvement in the extraction of low-level edge information, i.e., human-contour information, for CBAM-PSPNet in the feature extraction stages of Conv1, MaxPool, Layer1 and Layer2. Moreover, there is a moderate improvement in the extraction of high-level semantic information, i.e., richer semantic information, for CBAM-PSPNet in the feature extraction stages of Layer3 and Layer4. Therefore, the improved CBAM-PSPNet can achieve adaptive feature refinement of the input feature map, along with better spatial perception and cross-channel context-information fusion.
Furthermore, to reduce the computational cost of the network, the common convolution in the residual blocks of the backbone network is replaced by group convolution, exploiting the fact that the number of parameters in the model decreases as the number of groups increases. Assume that the size of an input feature map is $H \times W \times C_1$ and the size of an output feature map is $H' \times W' \times C_2$. For common convolution, there are $C_2$ convolution kernels of size $k \times k \times C_1$, and the parameter number $N_1$ can be calculated by Equation (3):

$$N_1 = k \times k \times C_1 \times C_2 \tag{3}$$

For group convolution, assuming $g$ groups, there are $C_2/g$ convolution kernels of size $k \times k \times (C_1/g)$ in each group, and the parameter number $N_2$ can be calculated by Equation (4) [43]:

$$N_2 = g \times k \times k \times \frac{C_1}{g} \times \frac{C_2}{g} = \frac{1}{g} \times k \times k \times C_1 \times C_2 = \frac{N_1}{g} \tag{4}$$

As shown in Equation (4), the parameter number of the group convolution is $1/g$ of that of the common convolution, which reduces the number of parameters in the model and improves the segmentation efficiency.
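This $1/g$ saving can be verified directly with PyTorch's grouped convolution; the shapes below anticipate the residual block of Figure 6 (64 input and output channels, 3 × 3 kernels, g = 4):

```python
import torch.nn as nn

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

common = nn.Conv2d(64, 64, kernel_size=3, bias=False)             # N1 = 3*3*64*64
grouped = nn.Conv2d(64, 64, kernel_size=3, groups=4, bias=False)  # N2 = N1 / 4

print(n_params(common), n_params(grouped))  # 36864 9216
```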
Figure 6 is the structural chart of the residual block from the common convolution to the group convolution. For a 256-d input feature map, the output is obtained by processing the input through two branches, a linear branch and a shortcut branch. The sixty-four common convolution kernels of size 3 × 3 in the second layer of the residual block are replaced by four groups of convolution kernels; each group has 16 convolution kernels of size 3 × 3. Then, the outputs of the four groups are concatenated. Following Equation (4), the parameter number of the second layer of the residual block is reduced from $3 \times 3 \times 64 \times 64 = 36{,}864$ to $3 \times 3 \times 64 \times 64 / 4 = 9{,}216$.
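Under these assumptions, the modified residual block could be sketched as below; the 1 × 1, 3 × 3, 1 × 1 layout follows the standard ResNet bottleneck, and the grouped middle layer replaces the explicit split-and-concatenate drawn in Figure 6 (PyTorch's groups argument is mathematically equivalent):

```python
import torch
import torch.nn as nn

class GroupedBottleneck(nn.Module):
    """Residual block with a grouped 3x3 middle layer (our reading of Figure 6)."""
    def __init__(self, channels: int = 256, width: int = 64, groups: int = 4):
        super().__init__()
        self.body = nn.Sequential(                       # linear branch
            nn.Conv2d(channels, width, 1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(x + self.body(x))               # add shortcut branch
```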
Figure 7 shows the schematic diagram of CBAM-PSPNet. Firstly, a feature extraction module extracts the contour features, position features, etc., of the human-body parts from the input image and generates a feature map containing both channel and spatial attention, which improves the segmentation accuracy. The feature extraction module is improved by embedding a CBAM at the end of each layer (1–4) of the backbone network and substituting group convolutions in the second layer of each residual block in each layer. Then, the pyramid pooling module extracts the context information of the generated feature map. The pyramid pooling kernels have four levels, that is, 1 × 1, 2 × 2, 3 × 3 and 6 × 6, in which the global and local features of different scales are extracted. Next, the features extracted at the four levels and the input features are fused to form a composite feature map which contains both global and local context information. Finally, the human-body segmentation is achieved by the convolution of the input feature map with the composite feature map.
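A compact sketch of such a pyramid pooling module is given below, assuming the standard PSPNet construction (adaptive average pooling per level, 1 × 1 convolutions, and bilinear upsampling back to the input size):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """PSPNet-style pyramid pooling with 1x1, 2x2, 3x3 and 6x6 bins."""
    def __init__(self, in_channels: int, bins=(1, 2, 3, 6)):
        super().__init__()
        out_channels = in_channels // len(bins)
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_channels, out_channels, 1, bias=False))
            for b in bins)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[2:]
        pooled = [F.interpolate(stage(x), size=(h, w), mode="bilinear",
                                align_corners=False) for stage in self.stages]
        return torch.cat([x] + pooled, dim=1)  # composite global + local map
```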
Table 1 shows a comparison of the number of parameters and computational cost between the improved ResNet101 and the original ResNet101. For the same input feature map, the number of parameters in ResNet101 is 42.50 million, and the computational cost is 7.84 billion FLOPs. The number of parameters in the improved ResNet101 is 32.52 million, a reduction of 23.5%, and the computational cost is 5.94 billion FLOPs, a reduction of 24.2%. The reductions in the number of parameters and computational cost are mainly attributed to the group-convolution substitution, and the experimental data are consistent with the theoretical analysis above.
To verify the performance of CBAM-PSPNet, 15,795 human-body images were selected as the training set and 4513 human-body images were selected as the test set.
Table 2 shows the performance comparison between CBAM-PSPNet and PSPNet. The pixel accuracy (PA) of PSPNet was 98.36%, the mean pixel accuracy (MPA) was 88.25% and the mean intersection over union (MIOU) was 82.30%. The PA of CBAM-PSPNet was 98.39%, an increase of 0.03%; the MPA was 92.28%, an increase of 4.03%; and the MIOU was 83.11%, an increase of 0.81%. The increases in accuracy can be mainly attributed to the embedding of CBAMs, which helps to generate feature maps that simultaneously fuse channel attention and spatial attention, so as to improve the segmentation accuracy.
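The paper does not give its evaluation code; one common way to compute these three metrics from a per-pixel confusion matrix (rows: ground truth, columns: prediction) is sketched below:

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """PA, MPA and MIoU from a (num_classes x num_classes) confusion matrix.
    Assumes every class occurs at least once in the ground truth."""
    tp = np.diag(conf).astype(float)
    pa = tp.sum() / conf.sum()                        # pixel accuracy
    mpa = np.mean(tp / conf.sum(axis=1))              # mean per-class accuracy
    miou = np.mean(tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp))
    return pa, mpa, miou
```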
3.2. Refined Corner-Based Stereo-Matching Scheme Working at the Sub-Pixel Level
The feature-point design directly affects the matching accuracy, and the matching accuracy directly determines the anthropometry accuracy.
Figure 2 showed the checkerboard corner design proposed in this paper for optimizing the anthropometry accuracy.
Figure 8 shows the schematic diagram of the refined stereo-matching scheme that works at the sub-pixel level based on the corner design in
Figure 2. In the anthropometry of this paper, firstly, the left-view and right-view girth regions of the human body were segmented by CBAM-PSPNet. Next, the Shi–Tomasi corner detection algorithm was used to extract the feature-point information at the sub-pixel level in the girth region. Then, a regional constraint was applied to the extracted feature-point set of corners according to the characteristics of the color markers and the checkerboard. Finally, refined stereo matching on a baseline in the region was realized according to the characteristics of the corner coordinates, and refined stereo matching on multiple lines in the region was achieved according to the characteristics of the checkerboard, so as to further improve the accuracy of human-body girth measurement.
In the anthropometric systems in references [17,18], color markers are used for stereo matching, and the matching point pair closest to the center of the marker is reserved for spatial-coordinate calculation. In the anthropometric system in this paper, corners are used for stereo matching.
Figure 9 shows the pixel number comparison between the color markers and the corners in the same shooting conditions and with the same magnification.
Figure 9a is the segmented image of human-body parts in references [17,18], and
Figure 9b is a partial, enlarged view of the color markers.
Figure 9c is the segmented image of the same part in this paper, and
Figure 9d shows the partial, enlarged view of the corners. Since the feature-point matching is carried out within the range of the color marker or the corner, the sizes of the color marker and the corner determine the search range for feature-point matching. As shown in
Figure 9b,d, a color marker contains hundreds of pixels, whereas a corner only includes four pixels. Therefore, the corner design proposed in this paper can greatly reduce the search range of feature-point matching and achieve fast and accurate matching.
Figure 10 shows the result of SURF matching [44] on the corner-based segmented images. Due to the high similarity between the detected feature points on the checkerboard, there are inevitably many mismatches in SURF matching. For example, in Figure 10, a total of 38 pairs of matching points exist, among which 29 pairs are mismatched and only 9 pairs are correctly matched. The mismatching rate is 76.3%, which is too high for the mismatched points to be eliminated reliably. Moreover, the SURF-detected feature points are mostly not the checkerboard corners, which is not beneficial for accurate girth measurement. Therefore, SURF matching is no longer suitable for feature-point matching in this paper, and a more effective matching method for the checkerboard corners is necessary. As shown in
Figure 8, a refined stereo-matching method that works at a sub-pixel level based on the characteristics of corners is proposed in this paper.
For the left-view and right-view human-body regions segmented by the CBAM-PSPNet human-body-segmentation algorithm, the checkerboard corners need to be detected as accurately as possible. The commonly used corner-detection methods include the Harris and Shi–Tomasi methods [45]. The Shi–Tomasi detector [46] has a gradient-based mathematical foundation similar to that of the Harris detector [47], but with higher accuracy, faster speed and fewer parameters. Therefore, the Shi–Tomasi corner detection algorithm was chosen to accurately locate the corners according to the characteristic gray-value variation in the corner neighborhood.
Figure 11 shows the detection result by the Shi–Tomasi corner detection algorithm. The hollow blue dots in
Figure 11 represent the positions of the detected corners. Not only could all corners be detected, but the detection accuracy reached the sub-pixel level, which can greatly improve the accuracy of the subsequent stereo matching.
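A minimal OpenCV sketch of this sub-pixel corner extraction follows; the parameter values are illustrative, not the paper's.

```python
import cv2
import numpy as np

def detect_corners_subpixel(gray: np.ndarray, max_corners: int = 500) -> np.ndarray:
    """Shi-Tomasi detection refined to sub-pixel accuracy."""
    corners = cv2.goodFeaturesToTrack(gray, maxCorners=max_corners,
                                      qualityLevel=0.01, minDistance=5)
    # Iteratively refine each corner inside a 5x5 search window.
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)
    cv2.cornerSubPix(gray, corners, winSize=(5, 5), zeroZone=(-1, -1),
                     criteria=criteria)
    return corners.reshape(-1, 2)   # (N, 2) array of sub-pixel (x, y) corners
```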
Next, according to the characteristics of checkerboard corners, the complexity of stereo matching is reduced by regional constraint. A few color markers were preset at the girth measurement region to assist the regional constraint.
Figure 12 shows examples of the preset color markers in the waist region. L1, L2, L3 and L4 are the left-view images of the waist region captured from four different rotation angles of the turntable, respectively; and R1, R2, R3 and R4 are the corresponding right-view images. In each segmented image, a red marker and a cyan marker are shown. A total of four markers were preset to ensure that each image would contain one red marker and one cyan marker. The horizontal distances were 8, 7, 8 and 7 checkerboard intervals from L1(R1) to L4(R4), and the vertical distances were −1, +1, −1 and +1 checkerboard intervals, so that the rectangular area determined by the two markers would contain the same baseline for girth measurement.
In the segmented image, there are four colors, namely, red, cyan, black and white. All pixels in the segmented image constitute a dataset $D = \{p_i \mid i = 1, 2, \dots, N\}$, wherein $p_i$ represents a pixel and $N$ is the total number of pixels in the segmented image. Each pixel $p_i$ can be expressed as $(h_i, s_i, v_i)$ in the HSV color space and $(x_i, y_i)$ in the 2D coordinates of the segmented image. Table 3 shows the HSV ranges corresponding to the four colors. If $v_i$ is greater than 46, $s_i$ is greater than 43, and $h_i$ is greater than 0 but less than 10 or greater than 156 and less than 180, the color of $p_i$ is red. If $v_i$ is greater than 46, $s_i$ is greater than 43 and $h_i$ is greater than 78 but less than 99, the color of $p_i$ is cyan. If $v_i$ is greater than 221 and $s_i$ is less than 30, the color of $p_i$ is white. If $v_i$ is less than 46, the color of $p_i$ is black. Thus, the pixel set of the red marker in the segmented image is extracted from $D$ as a smaller dataset $D_R$, and the pixel set of the cyan marker in the segmented image is also extracted from $D$ as another smaller dataset $D_C$, wherein the subscripts R and C stand for red and cyan, respectively. Taking the waist segmentation images L1 and R1 as examples, a total of four pixel sets of the red and cyan markers for the left and right views are obtained, denoted as $D_{lR}$, $D_{lC}$, $D_{rR}$ and $D_{rC}$, wherein the subscripts l and r represent the left view and right view, respectively, and the subscripts R and C denote red and cyan, respectively, as expressed in Equation (5):
$$D_{lR} = \{p_{lR,i} = (x_{lR,i},\, y_{lR,i}) \mid i = 1, \dots, N_{lR}\}, \qquad D_{lC} = \{p_{lC,i} = (x_{lC,i},\, y_{lC,i}) \mid i = 1, \dots, N_{lC}\},$$
$$D_{rR} = \{p_{rR,i} = (x_{rR,i},\, y_{rR,i}) \mid i = 1, \dots, N_{rR}\}, \qquad D_{rC} = \{p_{rC,i} = (x_{rC,i},\, y_{rC,i}) \mid i = 1, \dots, N_{rC}\} \tag{5}$$

wherein $p_{lR,i}$, $p_{lC,i}$, $p_{rR,i}$ and $p_{rC,i}$ represent the pixels in $D_{lR}$, $D_{lC}$, $D_{rR}$ and $D_{rC}$, respectively; $N_{lR}$, $N_{lC}$, $N_{rR}$ and $N_{rC}$ are the total numbers of pixels in $D_{lR}$, $D_{lC}$, $D_{rR}$ and $D_{rC}$, respectively. Specifically, $(x_{lR,i}, y_{lR,i})$ are the 2D coordinates of the pixel $p_{lR,i}$ in the pixel set $D_{lR}$; $(x_{lC,i}, y_{lC,i})$ are the 2D coordinates of the pixel $p_{lC,i}$ in the pixel set $D_{lC}$; $(x_{rR,i}, y_{rR,i})$ are the 2D coordinates of the pixel $p_{rR,i}$ in the pixel set $D_{rR}$; and $(x_{rC,i}, y_{rC,i})$ are the 2D coordinates of the pixel $p_{rC,i}$ in the pixel set $D_{rC}$. Thus, the central points of the red and cyan markers in L1 and R1, that is, $O_{lR}$, $O_{lC}$, $O_{rR}$ and $O_{rC}$, are calculated by averaging all the pixels in the respective pixel sets $D_{lR}$, $D_{lC}$, $D_{rR}$ and $D_{rC}$, as shown in Equation (6):
$$\bar{x}_m = \frac{1}{N_m}\sum_{i=1}^{N_m} x_{m,i}, \qquad \bar{y}_m = \frac{1}{N_m}\sum_{i=1}^{N_m} y_{m,i}, \qquad m \in \{lR,\ lC,\ rR,\ rC\} \tag{6}$$

wherein $(\bar{x}_{lR}, \bar{y}_{lR})$, $(\bar{x}_{lC}, \bar{y}_{lC})$, $(\bar{x}_{rR}, \bar{y}_{rR})$ and $(\bar{x}_{rC}, \bar{y}_{rC})$ are the 2D coordinates of the central points $O_{lR}$, $O_{lC}$, $O_{rR}$ and $O_{rC}$ for the pixel sets $D_{lR}$, $D_{lC}$, $D_{rR}$ and $D_{rC}$, respectively.
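Assuming the segmented images are BGR arrays and the Table 3 thresholds follow OpenCV's H ∈ [0, 180] convention, the marker extraction and centroid computation of Equations (5) and (6) could be sketched as follows (the function name is ours):

```python
import cv2
import numpy as np

def marker_center(bgr: np.ndarray, color: str):
    """Centroid of a red or cyan marker using the HSV ranges of Table 3."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    if color == "red":   # H in [0, 10] or [156, 180], S > 43, V > 46
        mask = (cv2.inRange(hsv, (0, 43, 46), (10, 255, 255)) |
                cv2.inRange(hsv, (156, 43, 46), (180, 255, 255)))
    else:                # cyan: H in [78, 99], S > 43, V > 46
        mask = cv2.inRange(hsv, (78, 43, 46), (99, 255, 255))
    ys, xs = np.nonzero(mask)
    return xs.mean(), ys.mean()   # Equation (6): average over all marker pixels
```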
By the Shi–Tomasi corner detection algorithm, the corner sets in the segmented images L1 and R1 are extracted at the sub-pixel level, denoted as $C_l = \{c_{l,i} = (x_{l,i}, y_{l,i}) \mid i = 1, \dots, N_l\}$ and $C_r = \{c_{r,j} = (x_{r,j}, y_{r,j}) \mid j = 1, \dots, N_r\}$, wherein $c_{l,i}$ and $c_{r,j}$ represent the extracted corners from L1 and R1; $N_l$ and $N_r$ are the total numbers of corners in L1 and R1; $(x_{l,i}, y_{l,i})$ are the 2D coordinates of the corner $c_{l,i}$ in the corner set $C_l$; and $(x_{r,j}, y_{r,j})$ are the 2D coordinates of the corner $c_{r,j}$ in the corner set $C_r$.
A rectangular region can be determined according to the central-point coordinates of the red and cyan markers calculated above. Figure 13 shows an example of the corner matching by the regional constraint of markers. In L1, with the central points of markers $O_{lR}$ and $O_{lC}$ as the regional constraint, a smaller corner set $C_l'$ in the rectangular region defined by the red and cyan markers can be obtained, as expressed in Equation (7). In R1, with the central points of markers $O_{rR}$ and $O_{rC}$ as the regional constraint, another smaller corner set $C_r'$ in the rectangular region defined by the red and cyan markers can be obtained in the same way, as expressed in Equation (8):

$$C_l' = \big\{ c_{l,i}' = (x_{l,i}', y_{l,i}') \in C_l \;\big|\; \min(\bar{x}_{lR}, \bar{x}_{lC}) \le x_{l,i}' \le \max(\bar{x}_{lR}, \bar{x}_{lC}),\ \min(\bar{y}_{lR}, \bar{y}_{lC}) \le y_{l,i}' \le \max(\bar{y}_{lR}, \bar{y}_{lC}) \big\} \tag{7}$$

$$C_r' = \big\{ c_{r,j}' = (x_{r,j}', y_{r,j}') \in C_r \;\big|\; \min(\bar{x}_{rR}, \bar{x}_{rC}) \le x_{r,j}' \le \max(\bar{x}_{rR}, \bar{x}_{rC}),\ \min(\bar{y}_{rR}, \bar{y}_{rC}) \le y_{r,j}' \le \max(\bar{y}_{rR}, \bar{y}_{rC}) \big\} \tag{8}$$

wherein $c_{l,i}'$ and $c_{r,j}'$ represent the corners in the rectangular regions of L1 and R1, respectively; $(x_{l,i}', y_{l,i}')$ are the 2D coordinates of $c_{l,i}'$; $(\bar{x}_{lR}, \bar{y}_{lR})$ and $(\bar{x}_{lC}, \bar{y}_{lC})$ are the 2D coordinates of the central points of the red and cyan markers in L1; $(x_{r,j}', y_{r,j}')$ are the 2D coordinates of $c_{r,j}'$; and $(\bar{x}_{rR}, \bar{y}_{rR})$ and $(\bar{x}_{rC}, \bar{y}_{rC})$ are the 2D coordinates of the central points of the red and cyan markers in R1. The numbers of corners in $C_l'$ and $C_r'$ can be denoted as $N_l'$ and $N_r'$, wherein $N_l' \le N_l$, $N_r' \le N_r$ and $N_l' = N_r'$.
The corner sets $C_l'$ and $C_r'$ in the left- and right-view images for the same baseline are acquired through the regional constraint, wherein $C_l' \subseteq C_l$ and $C_r' \subseteq C_r$. According to the characteristic of the checkerboard, the x coordinates of the corners on the same line increase successively. Therefore, the corners in the corner sets $C_l'$ and $C_r'$ are ordered by the x coordinate, as expressed in Equations (9) and (10), and the pixels of the same corner in the left- and right-view images correspond in order:

$$\tilde{C}_l = \{\tilde{c}_{l,1}, \tilde{c}_{l,2}, \dots, \tilde{c}_{l,N_l'}\} \quad \text{with} \quad \tilde{x}_{l,1} < \tilde{x}_{l,2} < \dots < \tilde{x}_{l,N_l'} \tag{9}$$

$$\tilde{C}_r = \{\tilde{c}_{r,1}, \tilde{c}_{r,2}, \dots, \tilde{c}_{r,N_r'}\} \quad \text{with} \quad \tilde{x}_{r,1} < \tilde{x}_{r,2} < \dots < \tilde{x}_{r,N_r'} \tag{10}$$

That is, the ordered corners $\tilde{c}_{l,i}$ and $\tilde{c}_{r,i}$ with the same $i$ correspond to the same corner in 3D space and form a stereo-matching point pair. Thus, refined stereo matching at the sub-pixel level can be achieved with less complexity. Algorithm 1 summarizes the refined stereo-matching process described above.
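A sketch of the regional constraint and the order-based pairing of Equations (7)–(10) is given below; the function and variable names are ours, the corner sets are assumed to be (N, 2) arrays of sub-pixel (x, y) coordinates, and the marker centers are (x, y) tuples:

```python
import numpy as np

def match_corners(corners_l, corners_r, red_l, cyan_l, red_r, cyan_r):
    """Pair left/right corners on one baseline by regional constraint and rank."""
    def constrain(corners, red, cyan):
        corners = np.asarray(corners, dtype=float)
        x0, x1 = sorted((red[0], cyan[0]))       # rectangle spanned by the
        y0, y1 = sorted((red[1], cyan[1]))       # red and cyan marker centers
        inside = ((corners[:, 0] >= x0) & (corners[:, 0] <= x1) &
                  (corners[:, 1] >= y0) & (corners[:, 1] <= y1))
        kept = corners[inside]
        return kept[np.argsort(kept[:, 0])]      # Eqs. (9)/(10): order by x

    cl = constrain(corners_l, red_l, cyan_l)     # C'_l, Eq. (7)
    cr = constrain(corners_r, red_r, cyan_r)     # C'_r, Eq. (8)
    n = min(len(cl), len(cr))                    # ideally equal on both views
    return list(zip(cl[:n], cr[:n]))             # rank-i corners form a pair
```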
To further increase the anthropometry accuracy, multiple measurements can be carried out on the same girth so that the optimal value can be selected from the multiple measurement results. Hence, it is necessary to match multiple lines of corners precisely and simply. The central points of the red and cyan markers are moved up or down along the y direction in a step $\Delta y$, wherein $\Delta y$ is the pixel difference corresponding to one checkerboard interval in the image. $\Delta y$ is inversely proportional to the shooting distance $d$ (m), and the relationship is shown in Equation (11):

$$\Delta y = \frac{\kappa}{d} \tag{11}$$

wherein $\kappa$ is a constant determined by the camera's focal length and the physical checkerboard interval. In the experiment, the shooting distance $d$ (m) was fixed, and the corresponding step $\Delta y$ (pixels) followed from Equation (11). The y coordinates of the central points $O_{lR}$, $O_{lC}$, $O_{rR}$ and $O_{rC}$ in the left- and right-view images were increased or decreased by the step $\Delta y$ to obtain the shifted central points. Then, accurate matching of the other two lines of corners in the same segmented region was achieved in the way described above.
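Continuing the sketch above, the multi-line extension only shifts the marker centers by ±Δy and reuses the same routine; dy is assumed to hold the step of Equation (11):

```python
# dy: pixel step of Equation (11), assumed already computed.
for k in (-1, +1):
    shift = lambda p: (p[0], p[1] + k * dy)   # move a marker center up/down
    pairs_k = match_corners(corners_l, corners_r,
                            shift(red_l), shift(cyan_l),
                            shift(red_r), shift(cyan_r))
    # pairs_k: stereo-matching point pairs of the neighbouring corner line.
```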
By using the binocular calibration parameters, the 3D coordinates of each line of stereo-matching corner pairs were calculated; then the corners were reversely rotated back to their initial positions according to the rotation angle of the turntable. Next, the PIVCF curve-fitting method was used to achieve human-body girth fitting, and finally, the human-body parameter measurement data of multiple lines in the same region were calculated. According to GB/T 16160-2017 [48], "Anthropometric Definitions and Methods for Garment", the maximum girth among the three lines is output as the final girth measurement of the bust, hip and thigh, and the minimum girth is output as the final girth measurement of the waist. Moreover, by moving the measurement line of the bust or thigh down by $2\Delta y$, the girth of the third line is output as the final girth measurement of the under-bust or mid-thigh, respectively.
Algorithm 1 The refined stereo-matching process.
Input: Segmented images L1 and R1. Output: Stereo-matching point pairs.
1: Extract the pixel sets of the red and cyan markers according to the H, S and V components: $D_{lR}$ and $D_{lC}$ for L1, $D_{rR}$ and $D_{rC}$ for R1;
2: Calculate the central points $O_{lR}$, $O_{lC}$, $O_{rR}$ and $O_{rC}$ for $D_{lR}$, $D_{lC}$, $D_{rR}$ and $D_{rC}$, respectively;
3: Extract the corner sets $C_l$ and $C_r$ for L1 and R1 by the Shi–Tomasi corner detection algorithm;
4: Get a smaller corner set $C_l'$ constrained by $O_{lR}$ and $O_{lC}$ from $C_l$, and another smaller corner set $C_r'$ constrained by $O_{rR}$ and $O_{rC}$ from $C_r$;
5: Order the corners in $C_l'$ and $C_r'$ separately according to their x coordinates;
6: return The stereo-matching point pairs $(\tilde{c}_{l,i}, \tilde{c}_{r,i})$, $i = 1, \dots, N_l'$.