This section first presents the research ideas from the perspective of visual perception and gives an overview of the proposed metric; the proposed metric is then described in detail.
2.3. Monocular Perception Module (MPM) for Distorted HSOI
This subsection extracts the monocular perceptual features of the distorted HSOI from three aspects: global color information, symmetric/asymmetric encoding distortion perception, and scene distortion perception. Among them, color information is an overall subjective percept, while viewport images would first have to be reconstructed for viewing; therefore, the CMP format of IHSOI is used directly for global color feature extraction. For symmetric/asymmetric encoding distortion perception and scene distortion perception, the perceptual features are extracted from the viewport images.
- (1)
Global color feature extraction
As the output at the client of the HSOV system, the distorted HSOI may contain encoding distortion, TM distortion, or a mixture of the two. Compared with the ERP representation of IHSOI, its CMP representation describes the monocular distortion of the HSOI more easily. Furthermore, numerous studies have shown that color information is processed with color opponency in the human visual system. Therefore, the global color features are extracted in the spatial domain and the DCT domain, respectively, based on the color-opponency space.
According to the work of Hasler et al. [43], the color statistics of an image can be used to quantitatively characterize its visual color impact. Here, take the left view IL of IHSOI as an example, with IL = (RL, GL, BL). Its CMP format can be expressed as IL = {ILi, i = 1, 2,…, 6}, where ILi is the i-th of the six faces in the CMP format. First, IL is converted from RGB space to the red–green and yellow–blue opponency channels, denoted as ρLrg and ρLyb, with ρLrg = RL − GL and ρLyb = 0.5(RL + GL) − BL. Let μLrg and μLyb denote the means of ρLrg and ρLyb, and σLrg and σLyb denote their variances; then, two statistical features, μ2Lrg−Lyb and σ2Lrg−Lyb, of ρLrg and ρLyb are expressed as follows:
Most intuitive TM operators (TMOs) change the mean of the pixel value distribution and, in turn, the degree of dispersion of the pixel values. Therefore, a joint statistical measure JLrg−Lyb of ρLrg and ρLyb is defined to describe the spatial color feature of IL, and is calculated as follows [43]:
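The statistical features and the joint measure above are not reproduced in this text. Based on the colorfulness statistics of Hasler and Süsstrunk [43], they plausibly take the following form (a hedged reconstruction rather than a quotation of the original equations):

```latex
\mu^{2}_{Lrg-Lyb} = \mu^{2}_{Lrg} + \mu^{2}_{Lyb}, \qquad
\sigma^{2}_{Lrg-Lyb} = \sigma^{2}_{Lrg} + \sigma^{2}_{Lyb}, \qquad
J_{Lrg-Lyb} = \sqrt{\sigma^{2}_{Lrg-Lyb}} + 0.3\,\sqrt{\mu^{2}_{Lrg-Lyb}}
```

That is, the joint measure combines the dispersion and the mean magnitude of the two opponent channels, so a TMO that shifts or spreads the pixel distribution also changes its value.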
Similarly, for the right view IR of IHSOI, its joint statistical measure JRrg−Ryb can also be obtained. Thus, the global spatial color feature FCS is defined for the distorted HSOI, FCS = (JLrg−Lyb, JRrg−Ryb).
Then, the color features in the transform domain are extracted. Taking the CMP format of IL as an example, for its two color-opponency channels, ρLrg and ρLyb, their non-overlapping Nu × Nv blocks are transformed with the DCT. Let {ξL,k(u, v); u = 1, 2,…, Nu, v = 1, 2,…, Nv} denote the DCT coefficients of a block, where k indicates the color-opponency channel, ρLrg or ρLyb; here, Nu and Nv are set to 5. For {ξL,k(u, v)}, the DC component is discarded, and the AC components are divided into three frequency bands: low frequency (LF), middle frequency (MF) and high frequency (HF), as shown in Figure 3. The variance of each of the three frequency bands of each image block is calculated as the band energy, and the mean of each band's variances over all image blocks is taken as the final energy feature of the corresponding frequency band; then, the energy features of the six faces of the CMP format of IL are averaged. For the two color-opponency channels, ρLrg and ρLyb, of IL, 6-dimensional features are thus extracted. Similarly, for ρRrg and ρRyb of IR, 6-dimensional energy features can also be obtained, which together constitute the 12-dimensional DCT-domain color features, FCD, of the HSOI.
Finally, the global color features of HSOI are denoted as FCE, FCE = (FCS, FCD).
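As a concrete illustration of the DCT-domain color features, the following Python sketch computes the per-band energies of one opponent channel and averages them over the six CMP faces. The grouping of AC coefficients into LF/MF/HF bands by their diagonal index is an assumption standing in for Figure 3, and scipy is used for the block DCT; this is a sketch of the described procedure, not the authors' implementation.

```python
import numpy as np
from scipy.fft import dctn

def dct_band_energy(channel, n=5):
    """Per-channel DCT-domain band energies (LF, MF, HF) averaged over all
    non-overlapping n x n blocks of one opponent-colour channel.
    The LF/MF/HF split below groups AC coefficients by their diagonal
    index u + v (an assumption standing in for Figure 3)."""
    h, w = channel.shape
    h, w = h - h % n, w - w % n                      # crop to a multiple of n
    blocks = channel[:h, :w].reshape(h // n, n, w // n, n).swapaxes(1, 2)

    u, v = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    diag = u + v
    masks = [(diag >= 1) & (diag <= 2),              # low frequency (assumed)
             (diag >= 3) & (diag <= 5),              # middle frequency (assumed)
             diag >= 6]                              # high frequency (assumed)

    energies = []
    for mask in masks:
        band_var = []
        for block in blocks.reshape(-1, n, n):
            coeff = dctn(block, norm="ortho")        # 2-D DCT of the block
            band_var.append(np.var(coeff[mask]))     # variance = band energy
        energies.append(np.mean(band_var))           # average over all blocks
    return energies                                  # [LF, MF, HF]

def cmp_color_dct_features(faces_rg, faces_yb):
    """Average the per-face band energies over the six CMP faces for the two
    opponent channels, giving a 6-dimensional feature vector per view."""
    f_rg = np.mean([dct_band_energy(f) for f in faces_rg], axis=0)
    f_yb = np.mean([dct_band_energy(f) for f in faces_yb], axis=0)
    return np.concatenate([f_rg, f_yb])
```

Repeating the same computation for the right view yields the 12-dimensional FCD described above.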
- (2)
Symmetric/asymmetric encoding distortion measure
Different from 2D image coding, a stereoscopic image can be encoded asymmetrically by using different quantization parameters (QPs) for its left and right views, so as to improve the encoding efficiency by exploiting the binocular masking effect of the human eyes. The difference in distortion level between the left and right views has a great impact on the user's quality of experience of the encoded stereoscopic image. Here, a correlation measure between the left and right views is designed to evaluate the information difference caused by different distortion levels of the two views.
For IL and IR, viewport sampling is first performed to obtain the corresponding left and right viewport image sets, {VL,m} and {VR,m}, respectively, where m = 1, 2,…, M + 2. According to the multi-resolution perception of the human visual system, as the image resolution is gradually reduced from high to low, the focus of the human eyes shifts from fine textures to rough structures. MSR decomposition [44] is therefore used for image preprocessing in this work; the complementary information of different scales can effectively reveal image content that is difficult to detect at a single scale.
For a given image, the illumination component can be estimated by MSR decomposition. Taking the viewport's left view VL,m as an example, its illumination component at a given scale can be calculated as follows:
where ⨂ denotes the convolution operation, and g(x, y) is the Gaussian function, g(x, y) = Ng·exp(−(x² + y²)/η), where Ng is a normalization factor and η is the scale parameter of the Gaussian function. When the value of η is large, the detail recovery is coarse; when it is small, the detail recovery is fine. Here, in order to reflect the multiscale characteristics, three scale factors with significant differences (small, medium and large) are used, and η is set to one element of {η1, η2, η3}.
In MSR decomposition, illumination features are described by filtering the image at three different scales and then taking a weighted sum. Here, the gray-scale images of the viewport's left and right views are directly processed to obtain the corresponding illumination components for each η (η ∈ {η1, η2, η3}).
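A minimal Python sketch of this illumination estimation is given below, assuming the Gaussian-surround form in which the illumination at scale η is the convolution of the gray-scale view with gη. Since g(x, y) = Ng·exp(−(x² + y²)/η) is an isotropic Gaussian with σ = sqrt(η/2), scipy's separable Gaussian filter is used as an equivalent implementation; this illustrates the described step and is not the authors' code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def msr_illumination(gray_view, etas=(25, 100, 240)):
    """Estimate the illumination component of a gray-scale viewport image
    at each MSR scale eta by Gaussian surround filtering.
    The kernel g(x, y) = Ng * exp(-(x^2 + y^2) / eta) is equivalent to an
    isotropic Gaussian with sigma = sqrt(eta / 2) (an assumption about the
    exact convolution used in the paper)."""
    gray_view = gray_view.astype(np.float64)
    illum = {}
    for eta in etas:
        sigma = np.sqrt(eta / 2.0)
        illum[eta] = gaussian_filter(gray_view, sigma=sigma, mode="nearest")
    return illum  # {eta: illumination map}

# Usage: apply to the gray-scale left and right viewport views separately,
# e.g. illum_left = msr_illumination(v_left_gray); illum_right = msr_illumination(v_right_gray)
```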
Figure 4 shows an example of the MSR decomposition of a distorted HSOI in the HDR stereoscopic omnidirectional image database (HSOID) [45] at the client of the HSOV system (here, η1 = 25, η2 = 100, η3 = 240). The original HSOI at the server of the HSOV system is encoded with the asymmetric encoding distortion level (L1, L3), i.e., the encoding distortion level of the left view is L1 and that of the right view is L3. To visualize the compressed HSOI on an HMD with SDR, DurandTMO [46] is used for TM processing. The following can be observed:
- (i)
For the left and right illumination components, the MSR decomposition with η1 shows more details than the decompositions with η2 and η3, especially in the window regions in Figure 4. This indicates that the MSR decompositions with different η values contain different information, and the three-scale decompositions complement each other.
- (ii)
Compared with the viewport's left view, the viewport's right view has a lower distortion level and a weaker block effect, and its illumination components show clearer textures than those of the left view, especially in the ceiling and ground regions. This indicates that MSR decomposition can, to a certain extent, reflect the different distortion characteristics of left and right views encoded with different distortion levels.
The feature maps of the left and right illumination components are then calculated. First, de-mean normalization is performed on them. Second, a local derivative pattern (LDP) [47] is used to measure their texture information after computing their second derivatives in the four directions {0°, 45°, 90°, 135°}. The LDP map in each direction is quantized into a 10-dimensional histogram according to the rotation-invariant uniform local binary pattern. After the above operations, the quantized LDP histograms of the left and right illumination components are obtained for each η ∈ {η1, η2, η3}. The correlation coefficient can be used to measure the degree of correlation between two random variables; its value range is [−1, 1], and the larger its absolute value, the higher the correlation between the two. Then, based on the left and right LDP histograms, the absolute correlation coefficient CA and the correlation distance CD of each pair of 10-dimensional histograms are calculated as the similarity features of the left and right views; CA is computed with the correlation coefficient function corr(·) together with the absolute value operation |·|, and CD is computed with the correlation distance function pdist(·).
Finally, the absolute correlation coefficients and correlation distances obtained with the three Gaussian scale parameters are taken as the symmetric/asymmetric encoding distortion feature vector fcorr.
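The sketch below illustrates how CA and CD can be computed from a pair of quantized LDP histograms and stacked over the three scales. The mapping of corr(·) to the Pearson correlation coefficient and of pdist(·) to the correlation distance (1 − correlation) is an assumption, and the helper names are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import correlation as corr_distance

def lr_histogram_similarity(hist_left, hist_right):
    """Similarity features between the quantized LDP histograms of the left
    and right illumination components at one MSR scale.
    CA is the absolute Pearson correlation coefficient; CD is the
    correlation distance (1 - correlation)."""
    hist_left = np.asarray(hist_left, dtype=np.float64)
    hist_right = np.asarray(hist_right, dtype=np.float64)
    ca = abs(np.corrcoef(hist_left, hist_right)[0, 1])   # |correlation|
    cd = corr_distance(hist_left, hist_right)            # correlation distance
    return ca, cd

def encoding_distortion_features(hists_left, hists_right):
    """Concatenate CA and CD over the three Gaussian scales to form the
    symmetric/asymmetric encoding distortion feature vector f_corr."""
    feats = []
    for h_l, h_r in zip(hists_left, hists_right):         # one histogram per eta
        feats.extend(lr_histogram_similarity(h_l, h_r))
    return np.array(feats)
```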
Taking the scenarios in the HSOID [45] as examples, Table 1 shows CA and CD of the distorted viewport's left and right views processed by DurandTMO and encoded with 9 distortion levels (the first 5 are asymmetric encoding distortion levels, and the last 4 are symmetric encoding distortion levels). In addition, Table 2 shows the relationship between CA and the correlation degree, which reflects the correlation degree of left and right views with different distortion levels. The following can be observed:
- (i)
The correlation degrees of the distorted viewport's left and right views encoded with the 5 asymmetric distortion levels are obviously different. Although the asymmetric distortion levels (L1, L3), (L3, L4) and (L2, L3) can all be judged as very strong correlation, their absolute correlation coefficients CA fall into different ranges; the first two lie in the range of 0.8 to 0.9, while the latter lies in the range of 0.9 to 1.0. This shows that CA and CD can effectively distinguish different levels of asymmetric distortion.
- (ii)
The correlation degrees of the distorted viewport’s left and right views, encoded with 4 symmetrical distortion levels, are all very strongly correlated, and generally their CA values tend to be larger as the distortion level is lower.
- (iii)
The CA values of the distorted viewports, encoded with the symmetrical distortion levels, are generally larger than those with the asymmetrical distortion levels; it indicates that CA and CD can effectively distinguish the types of symmetrical/asymmetrical encoding distortion and their degree of distortion to a certain extent.
- (3)
Feature extraction with scene analysis
The HSOID [45] includes indoor, outdoor, daytime and night scenes. In the imaging of indoor scenes, most of the light usually comes from ceiling lights, which may be less sufficient than the light in outdoor scenes; at the same time, most indoor scenes contain window regions whose brightness is relatively high, so detail is easily lost in their imaging. In the imaging of outdoor scenes during the daytime, the light is relatively sufficient and the contrast is relatively high. Outdoor images may contain a large sky region; because the sky region is relatively flat, the block effect caused by encoding distortion is more perceptible there. Especially when the HSOI is viewed through an HMD, near-eye perception makes the distortion in this region more likely to affect subjective quality. Night scenes are generally dark, with relatively low contrast and fuzzy structure.
In summary, based on contrast, detail and structure of an image, feature extraction can be performed to synthesize perceptual distortion features of various scenes. Among them, the details and structures can be represented by the image’s detail layer and base layer in combination with the idea of image decomposition.
To generate the base and detail layers of an image, Laplacian pyramid decomposition [48] can be used. Taking a distorted viewport's left view VL,m as an example, it is decomposed by a Laplacian pyramid at three scales, and a total of three detail layers and three base layers are obtained.
For VL,m with the resolution of NH × NV, its 3-layer detail layer set is expressed as DL,m = {D1L,m, D2L,m, D3L,m}, and the corresponding resolutions are NH × NV, (0.5NH) × (0.5NV) and (0.25NH) × (0.25NV), respectively. The 3-layer base layer set of VL,m is expressed as BL,m = {B1L,m, B2L,m, B3L,m}, and their resolutions are (0.5NH) × (0.5NV), (0.25NH) × (0.25NV) and (0.125NH) × (0.125NV), respectively. Considering the resolution of the viewport image, three layers of detail layers and the first two layers of the base layers are used as the layers after Laplacian pyramid decomposition, that is, the set of the base layers is denoted as B′L,m = {B1L,m, B2L,m}. The base layers of VL,m retain most of the information of VL,m, while the detail layers of VL,m show the detail information; and the higher the resolution, the finer the level of detail displayed.
After the above preprocessing of the distorted viewport's left and right views, the detail layer sets, DL,m and DR,m, and the base layer sets, B′L,m and B′R,m, are obtained. First, the detail features are extracted based on DL,m and DR,m. A local binary pattern (LBP) [49] can describe spatial-domain information by encoding the spatial position relationship between the center pixel and its neighboring pixels within a certain radius, and different patterns can characterize structures such as points, lines and edges; therefore, a contrast-weighted LBP (CLBP) is adopted in combination with the contrast information. Taking DL,m as an example, its LBP can be expressed as follows:
where P is the number of neighbors and R is the neighborhood radius; P and R are set to 8 and 1, respectively, as in [49]. DL,m,c and DL,m,i denote the value of the center pixel in DL,m and the values of the pixels in the neighborhood centered on it, respectively. Th(·) is the threshold function, expressed as follows:
The rotation-invariant uniform LBP is expressed as follows:
where the superscript 'riu2' indicates the use of rotation-invariant uniform patterns whose u value is less than 2, and u(·) is the uniformity measure, expressed as follows:
Generally, the rotation-invariant uniform LBP has P + 2 modes, and each mode represents different image content information. Let k be the mode index. The histogram of the CLBP is expressed as follows:
where N is the number of pixels of an image in DL,m, C is the contrast map of VL,m, C = σc/(μc + ε), and Ci is the i-th pixel's value in C, with C = {Ci}; μc and σc are the mean map and standard deviation map of VL,m, and ε is a constant that prevents the denominator from being 0 and is set to 0.00001.
Because P and R are set to 8 and 1, respectively, 30-dimensional histogram features of DL,m are obtained after the above operations. Similarly, 30-dimensional histogram features of DR,m can also be obtained. Finally, the detail features of the distorted left and right views are expressed as fCLBP.
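The following sketch shows one way to build the contrast-weighted riu2 LBP histogram of a single detail layer, assuming the contrast map of the viewport has been resampled to the layer's resolution. The local window size of the contrast map and the use of skimage's LBP implementation are assumptions, not settings quoted from the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter
from skimage.feature import local_binary_pattern

EPS = 1e-5  # the paper's constant preventing division by zero

def contrast_map(viewport_gray, win=7):
    """Local contrast C = sigma_c / (mu_c + eps) computed from local mean
    and standard deviation maps (the window size is an assumption)."""
    img = viewport_gray.astype(np.float64)
    mu = uniform_filter(img, win)
    mu_sq = uniform_filter(img ** 2, win)
    sigma = np.sqrt(np.maximum(mu_sq - mu ** 2, 0.0))
    return sigma / (mu + EPS)

def clbp_histogram(detail_layer, contrast, P=8, R=1):
    """Contrast-weighted rotation-invariant uniform LBP histogram of one
    detail layer: each pixel votes with its contrast value into the bin of
    its LBP code, normalized by the pixel count (contrast must have the
    same resolution as the detail layer)."""
    codes = local_binary_pattern(detail_layer, P, R, method="uniform")
    n_bins = P + 2                          # riu2 has P + 2 modes
    hist = np.zeros(n_bins)
    for k in range(n_bins):
        hist[k] = contrast[codes == k].sum()
    return hist / detail_layer.size
```

Concatenating the 10-bin histograms of the three detail layers gives the 30-dimensional feature per view described above.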
Then, structural features are extracted from the base layer sets B′L,m and B′R,m. Here, the global structural tensor is used to extract the base layers' structural features. Taking B1L,m in B′L,m = {B1L,m, B2L,m} as an example, a 2D structural tensor transformation is performed; the structural tensor matrix is decomposed, and its two eigenvalues, λL1 and λL2, are used to describe the structural features of B1L,m. For B2L,m, the eigenvalues of its structural tensor matrix are likewise obtained as λL3 and λL4.
Similarly, for B′R,m, the corresponding eigenvalues of the structural tensor matrix can be obtained as λR1, λR2, λR3 and λR4. Then, the structural features of the distorted left and right views are denoted as fst.
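A minimal sketch of the global structural tensor eigenvalue computation is given below; the gradient operator (np.gradient) is an assumption, since the paper does not specify one.

```python
import numpy as np

def global_structure_tensor_eigs(base_layer):
    """Eigenvalues of the global 2-D structural tensor of one base layer,
    built from the averaged outer products of the image gradients."""
    gy, gx = np.gradient(base_layer.astype(np.float64))
    tensor = np.array([[np.mean(gx * gx), np.mean(gx * gy)],
                       [np.mean(gx * gy), np.mean(gy * gy)]])
    lam_small, lam_large = np.linalg.eigvalsh(tensor)   # ascending order
    return lam_large, lam_small

def structural_features(base_layers_left, base_layers_right):
    """Collect the structural tensor eigenvalues of the retained base
    layers of the left and right views into the feature vector f_st."""
    feats = []
    for layer in list(base_layers_left) + list(base_layers_right):
        feats.extend(global_structure_tensor_eigs(layer))
    return np.array(feats)
```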
Finally, the feature set extracted by the MPM consists of the global color feature FCE, the symmetric/asymmetric encoding distortion feature fcorr, the detail feature fCLBP and the structural feature fst.
- (4)
Feature aggregation with viewport significance
When viewing omnidirectional visual content, users are usually guided by saliency to select viewports as regions of interest. The contributions of different viewports to the overall quality therefore need to be weighted according to their saliency. For the binocular product saliency map SLR, viewport sampling is performed first to obtain a series of saliency viewport maps. The viewport-normalized saliency value WS = {Wm; m = 1, 2,…, M + 2} is then calculated as the significance weight to express the user's preference for different viewport images; Wm is computed from the pixel values of the m-th saliency viewport map at positions p as follows:
For the extracted features fcorr, fCLBP and fst, Equation (12) is used to aggregate them across viewports; the aggregated features are finally expressed as Fcorr, FCLBP and Fst.
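Neither the normalization of Wm nor Equation (12) is reproduced in this text; a plausible form consistent with the description (an assumption rather than the original equations) is

```latex
W_m = \frac{\sum_{p} S_m(p)}{\sum_{m'=1}^{M+2} \sum_{p} S_{m'}(p)}, \qquad
F = \sum_{m=1}^{M+2} W_m \, f_m ,
```

where Sm denotes the m-th saliency viewport map and fm the viewport-level feature vector being aggregated.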
2.4. Binocular Perception Module (BPM) for Distorted HSOI
Generally, human binocular perception involves three aspects: the initial stage is binocular fusion; information that cannot be fused leads to binocular competition; and, during this competition, if one view becomes completely dominant, binocular suppression occurs. Binocular perception is a complex physiological process in which both the user's eyes and brain play a role, and it is difficult for traditional signal-processing-based methods to model this process through mathematical formulas and derivation. Therefore, current research generally expresses the processes of fusion and competition by simulating biological mechanisms to establish binocular effects for perceptual quality evaluation. Based on joint image filtering, this subsection designs new binocular mechanism modeling schemes and combines the perceptual characteristics of the V1, V2 and V4 areas of the cerebral cortex for feature extraction.
- (1)
Joint image filtering
Previous studies [50] showed that the content difference between the left and right views of a stereoscopic image is due to parallax, but the final result of human binocular perception, after its fluctuations, is a stable stereoscopic image. Therefore, it can be inferred that there is an interactive filtering effect between the left and right views.
- (2)
Binocular fusion map and feature extraction
Traditional binocular fusion maps are mainly computed from gray-scale images. However, Den Ouden et al. [51] demonstrated the contribution of color information to binocular matching; this indicates that color information helps solve the binocular matching problem for complex images, and that color and brightness information make independent contributions. Thus, a distorted viewport image pair (VL,m, VR,m) is first converted from RGB space to YUV space. Joint image filtering fg(·) is then performed on the Y, U and V channels, respectively; the filtered left view of the distorted viewport image is denoted as ΦL,m = fg(VL,m, VR,m), with ΦL,m = {ΦYL,m, ΦUL,m, ΦVL,m}. Similarly, the filtered right view of the distorted viewport image is denoted as ΦR,m = fg(VR,m, VL,m), with ΦR,m = {ΦYR,m, ΦUR,m, ΦVR,m}.
A log-Gabor filter can simulate the multiscale and multi-directional selectivity of the receptive fields of the visual cortex. Therefore, the log-Gabor filter amplitude responses of ΦL,m and ΦR,m are calculated as energy factors to further fuse their information. The log-Gabor filter with scale s and direction o is defined in the frequency domain, where θ is the orientation angle, δs and δo are the filter strengths, and ω and ωs are the normalized radial frequency and the corresponding filter center frequency, respectively.
Let the set of log-Gabor filter responses over different directions and scales be given; then, the output amplitude A is expressed as the sum of the responses over all scales and directions, as follows:
The log-Gabor filtering output amplitudes of ΦL,m and ΦR,m are calculated and denoted as AL,m and AR,m, respectively. Then, their energy weighting factors, EL,m and ER,m, are expressed as follows:
ΦL,m and ΦR,m are further weighted by EL,m and ER,m, respectively. Thus, the fusion image Φm is expressed as follows:
For Φm = {ΦYm, ΦUm, ΦVm}, the final binocular fusion map Φ′m is generated by using the Y, U and V channels of Φm.
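The sketch below illustrates the fusion step on one YUV channel: a log-Gabor amplitude map is computed for each filtered view, and the two views are combined with energy weights. The filter-bank parameters and the weighting form E = A/(AL + AR) are assumptions, since the corresponding equations are not reproduced above; this is not the authors' implementation.

```python
import numpy as np

def log_gabor_amplitude(img, n_scales=4, n_orients=4,
                        min_wavelength=6, mult=2.0,
                        sigma_r=0.65, sigma_theta=0.52):
    """Sum of log-Gabor amplitude responses over all scales and orientations
    (the filter-bank parameters are assumptions, not the paper's settings)."""
    rows, cols = img.shape
    y = np.arange(rows) - rows // 2
    x = np.arange(cols) - cols // 2
    x, y = np.meshgrid(x, y)
    radius = np.sqrt((x / cols) ** 2 + (y / rows) ** 2)
    radius[rows // 2, cols // 2] = 1.0                  # avoid log(0) at DC
    theta = np.arctan2(-y, x)

    spectrum = np.fft.fftshift(np.fft.fft2(img.astype(np.float64)))
    amplitude = np.zeros((rows, cols))
    for s in range(n_scales):
        omega_s = 1.0 / (min_wavelength * mult ** s)     # centre frequency
        radial = np.exp(-(np.log(radius / omega_s) ** 2) /
                        (2 * np.log(sigma_r) ** 2))
        radial[rows // 2, cols // 2] = 0.0               # zero DC response
        for o in range(n_orients):
            angle = o * np.pi / n_orients
            d_theta = np.arctan2(np.sin(theta - angle), np.cos(theta - angle))
            angular = np.exp(-(d_theta ** 2) / (2 * sigma_theta ** 2))
            response = np.fft.ifft2(np.fft.ifftshift(spectrum * radial * angular))
            amplitude += np.abs(response)
    return amplitude

def binocular_fusion(phi_left, phi_right, eps=1e-12):
    """Energy-weighted binocular fusion of one YUV channel pair; the weight
    E = A / (A_L + A_R) is an assumed form of the energy factors."""
    a_l = log_gabor_amplitude(phi_left)
    a_r = log_gabor_amplitude(phi_right)
    e_l = a_l / (a_l + a_r + eps)
    e_r = a_r / (a_l + a_r + eps)
    return e_l * phi_left + e_r * phi_right
```

Applying the same weighting to the Y, U and V channels and recombining them yields the fusion map Φ′m.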
Figure 5 shows an example of a binocular fusion map obtained by taking Figure 4 as the test viewport images, including the single-channel fusion maps of the Y, U and V channels and the fusion map Φ′m after combining the three channels. Because the YUV space is used in Figure 5, the map in Figure 5d is rendered in pseudo-color and differs from an RGB image: the red parts of Figure 5d are high-brightness regions, corresponding to the white highlight parts of a typical RGB image, and the green parts correspond to low-brightness regions of the RGB image. The Y, U and V channels display different information; after combining the three channels, both the color information and the fused image information are displayed.
Figure 6a1–a5 show the binocular fusion maps of the viewport processed by five TM operators (from left to right: DurandTMO [46], Khan [52], KimKautz [53], Reinhard02 [54] and Reinhard05 [55]), where the high-brightness and low-brightness regions are marked with red and orange rectangular boxes, respectively. Figure 6b1–b5 and Figure 6c1–c5 show the corresponding locally enlarged maps. It can be seen that different TM operators retain different amounts of information in the global color and in the local high/low-brightness regions, especially in the window regions of Figure 6b1–b5. Accordingly, brightness-based segmentation can be performed, and differentiated feature extraction can be carried out according to the characteristics exhibited by the different brightness regions.
To sum up, brightness segmentation based on the maximum entropy threshold [56] is performed on the fusion map Φ′m, and the high-brightness region Φ′m,H, the low-brightness region Φ′m,L and the middle-brightness region Φ′m,M are obtained. Maximum entropy threshold segmentation separates the three brightness regions relatively completely and is in line with the subjective brightness perception of the human eyes.
Some studies have shown that the visual perception of bright/dark information is unbalanced [57], so different feature extraction schemes should be designed for the high/low-brightness regions of TM-distorted images, for example, by drawing on the functions of cone and rod cells among the retinal photoreceptors: cone cells mainly work in bright regions and can recognize texture information, while rod cells work in dark regions and recognize contour features. Thus, texture features are extracted for the high-brightness region, and contour features are extracted for the low-brightness region. To ensure information integrity, chrominance features are extracted in the middle-brightness region. In particular, the process from brightness segmentation to the extraction of texture, contour and chromaticity features is in line with the human visual system, in which the V1 area perceives primary luminance features, the V2 area perceives higher-level features such as texture and shape, and the V4 area handles the delivery of color information.
For the high-brightness region Φ′m,H of the binocular fusion map Φ′m, its gray-gradient co-occurrence matrix (GGCM) is calculated to characterize its texture features. The GGCM combines the image's gray-scale and gradient elements; it can clearly describe the statistical characteristics of the gray-scale value and gradient of each pixel in an image and the spatial position relationship between each pixel and its neighboring pixels. Here, the gradient is added to the gray-level co-occurrence matrix to make it more accurate in describing texture. Let Y denote the gray-scale map of Φ′m,H, and GY denote the gradient map of Y. First, gradient normalization is performed on GY to obtain G′Y as follows:
where INT(·) is the rounding function, (i, j) is the pixel position in GY, GYmax and GYmin are the maximum and minimum values of GY, respectively, and Lg is set to 32.
Similarly, gray-scale normalization is performed on Y to obtain Y′ as follows:
where Ymax and Ymin are the maximum and minimum values of Y, and Lx is set to 32.
Let P denote a GGCM; then, its normalized GGCM, PN, can be expressed as follows:
where a and b are the pixel values at the same position in Y′ and G′Y, respectively.
Based on the GGCM, a series of statistical features can be derived to describe the image's texture. Here, five important statistical measures are adopted to describe texture features, namely, the gray mean square deviation T1, gradient mean square deviation T2, gray entropy T3, gradient entropy T4 and mixed entropy T5, calculated as follows:
where μY′ and μG′ are the mean values of Y′ and G′Y, respectively.
Then, T1, T2, T3, T4 and T5 are taken as the texture features fGGCM of the high brightness region Φ′m,H.
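Since the formulas of T1–T5 are not reproduced above, the sketch below computes them from the normalized GGCM using common definitions of these statistics (marginal deviations and entropies); it should be read as an assumed reconstruction of the step, not as the paper's exact equations.

```python
import numpy as np

def normalize_map(img, levels=32):
    """Linear quantization of a map to integer levels 0 .. levels-1."""
    img = img.astype(np.float64)
    span = img.max() - img.min()
    if span == 0:
        return np.zeros(img.shape, dtype=np.int64)
    return np.floor((img - img.min()) / span * (levels - 1)).astype(np.int64)

def ggcm_texture_features(gray, grad, levels=32, eps=1e-12):
    """Five GGCM-based texture statistics of the high-brightness region:
    gray/gradient mean square deviations, gray/gradient entropies and
    mixed entropy, computed over the normalized GGCM."""
    y_q = normalize_map(gray, levels)
    g_q = normalize_map(grad, levels)

    # Normalized gray-gradient co-occurrence matrix P_N(a, b)
    ggcm = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(ggcm, (y_q.ravel(), g_q.ravel()), 1.0)
    ggcm /= ggcm.sum()

    p_y = ggcm.sum(axis=1)                 # marginal over gray levels
    p_g = ggcm.sum(axis=0)                 # marginal over gradient levels
    a = np.arange(levels)
    mu_y = np.sum(a * p_y)
    mu_g = np.sum(a * p_g)

    t1 = np.sqrt(np.sum((a - mu_y) ** 2 * p_y))   # gray mean square deviation
    t2 = np.sqrt(np.sum((a - mu_g) ** 2 * p_g))   # gradient mean square deviation
    t3 = -np.sum(p_y * np.log(p_y + eps))         # gray entropy
    t4 = -np.sum(p_g * np.log(p_g + eps))         # gradient entropy
    t5 = -np.sum(ggcm * np.log(ggcm + eps))       # mixed entropy
    return np.array([t1, t2, t3, t4, t5])
```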
For the low-brightness region Φ′m,L of Φ′m, its narrow-sense contour feature is extracted. The contour feature is usually an efficient representation of the shape of an object in an undistorted image; however, the encoding distortion in the low-brightness region appears as a block effect, which visually presents rectangular outline information. The narrow-sense contour feature is therefore defined here to describe the visual distortion of both the object shapes and the block effect. Because different TMOs change the image information in their own ways, brightness segmentation of TM images first produces inconsistent segmentation edges; second, different TMOs cause the encoding distortion in the low-brightness region to have different visual effects. In Figure 6, the encoding distortion of DurandTMO in Figure 6c1 is the most obvious in the low-brightness region, and its block-effect outline is the most evident, resulting in drastic changes in the gray-scale values of edge pixels; the block effect caused by Reinhard05 in Figure 6c2 is the second most obvious, while the results of the other three TMOs in Figure 6c1–c5 are visually similar. Based on this, the energy of gradient (EoG) function is used to measure this change. Let Eog denote the EoG of Φ′m,L; then, it is calculated as follows:
Here, the average value of Eog is defined as the narrow-sense contour feature. Considering that the block effect may differ at different resolutions, Φ′m,L is down-sampled at three scales, and the values at the four scales are taken as the final narrow-sense contour feature fEoG of the low-brightness region Φ′m,L. Obviously, fEoG is a multiscale feature vector.
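A short sketch of this multiscale EoG computation is given below; averaging (rather than summing) the squared forward differences and using simple dyadic decimation for the down-sampling are assumptions consistent with the description of the average Eog value.

```python
import numpy as np

def energy_of_gradient(region):
    """Mean energy of gradient of the low-brightness region: the average of
    squared forward differences in the horizontal and vertical directions."""
    f = region.astype(np.float64)
    dx = f[:, 1:] - f[:, :-1]          # horizontal forward difference
    dy = f[1:, :] - f[:-1, :]          # vertical forward difference
    return np.mean(dx ** 2) + np.mean(dy ** 2)

def narrow_sense_contour_features(region, n_scales=4):
    """Multiscale narrow-sense contour feature f_EoG: EoG at the original
    resolution and at three dyadically down-sampled copies."""
    feats = []
    current = region.astype(np.float64)
    for _ in range(n_scales):
        feats.append(energy_of_gradient(current))
        current = current[::2, ::2]    # 2x down-sampling
    return np.array(feats)
```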
Table 3 lists the average Eog values of the five TMOs at the four scales of Φ′m,L. Obviously, the values for DurandTMO [46] and Reinhard05 [55] are the two largest, followed by Reinhard02 [54], while Khan [52] and KimKautz [53] are numerically similar. According to Figure 6c1–c5, DurandTMO and Reinhard05 lead to a visually obvious block effect; for Reinhard02, a small amount of block effect can be observed, while the block effect can hardly be observed for Khan [52] and KimKautz [53]. This means that using the EoG to measure the narrow-sense contour feature is effective and consistent with subjective perception.
For the middle-brightness region Φ′m,M, chrominance statistical features are extracted. Image distortion changes the natural scene statistical distribution of the mean subtracted contrast normalized (MSCN) coefficients; the asymmetric generalized Gaussian distribution (AGGD) model can fit this distribution, and differences in the fitting parameters reflect the changes in the statistical distribution. Thus, the four parameters obtained by AGGD fitting, that is, the mean δm, the shape parameter θm, and the left and right scale parameters, are used as chrominance statistical features. The chrominance statistical features of the U and V channels are taken as the natural scene statistical features fAGGD of the middle-brightness region Φ′m,M.
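For reference, a compact sketch of the MSCN computation that precedes the AGGD fitting is given below; the Gaussian window width and the stabilizing constant are conventional choices rather than values taken from the paper, and the AGGD parameters would then be fitted to the returned coefficients (e.g., by moment matching, as in BRISQUE-style models).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn_coefficients(channel, sigma=7 / 6, c=1.0):
    """Mean subtracted contrast normalized (MSCN) coefficients of one
    chrominance channel of the middle-brightness region; the AGGD is then
    fitted to their distribution."""
    channel = channel.astype(np.float64)
    mu = gaussian_filter(channel, sigma)                         # local mean
    sigma_map = np.sqrt(np.abs(gaussian_filter(channel ** 2, sigma) - mu ** 2))
    return (channel - mu) / (sigma_map + c)                      # normalized coefficients
```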
In summary, the features extracted above are the texture features fGGCM, the narrow-sense contour features fEoG and the natural scene statistical features fAGGD. These features are all based on viewport images, so, according to Equation (12), they are aggregated with the viewport saliency to obtain FGGCM, FEoG and FAGGD, respectively. Thus, the final features extracted by the binocular fusion model are Ffus = {FGGCM, FEoG, FAGGD}.
- (3)
Binocular difference map and feature extraction
Because the left and right views of the HSOI's viewports are processed by the same TMO for viewing the HSOI on the user's HMD with SDR, generally speaking, no new color difference is introduced between the left and right views after TM. Therefore, the binocular difference information is described directly in the gray-scale channel. Based on joint image filtering, the jointly filtered gray-scale channels of the viewport's left and right views can be regarded as the image content that can be initially fused during binocular matching. The absolute difference maps obtained by subtracting the jointly filtered viewport images from the distorted viewport images (VL,m, VR,m) are taken as the left and right monocular difference maps (MDL,m, MDR,m). The monocular difference maps represent the information that cannot be fused between the left and right views, and this non-fusible information may lead to binocular competition.
Related studies [58] have shown that binocular competition occurs at all contrasts, and the higher the contrast of a monocular stimulus, the stronger its dominance in perception. Therefore, a contrast map is calculated as the competition factor, which weights the left and right monocular difference maps (MDL,m, MDR,m) to obtain the binocular difference map BDm. As mentioned above, the contrast map is expressed as C = σe/(μe + ε). Let CEL,m and CER,m be the contrast maps of MDL,m and MDR,m, respectively; then, the binocular difference map BDm is computed as follows:
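The equation for BDm is not reproduced in this text; one plausible contrast-weighted combination (an assumption, not the original formula) is

```latex
BD_m = \frac{CE_{L,m} \odot MD_{L,m} + CE_{R,m} \odot MD_{R,m}}{CE_{L,m} + CE_{R,m} + \varepsilon},
```

where ⊙ denotes element-wise multiplication and ε prevents division by zero.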
Considering that the binocular difference map mainly represents contour information dominated by structure, discrete multidimensional differentiators [59] are used to characterize the binocular difference map BDm; five types of derivative maps of BDm are computed, namely, the first-order horizontal derivative map gx, the first-order vertical derivative map gy, the second-order horizontal derivative map gxx, the second-order vertical derivative map gyy and the second-order mixed derivative map gxy.
Figure 7a shows the MSCN coefficient distribution curves of the five derivative maps of an HSOI that is first compressed by JPEG XT and then processed by DurandTMO. Figure 7b shows the MSCN distribution curves of gx of the HSOI first compressed by JPEG XT and then processed by each of the five TMOs. To describe their MSCN coefficient distributions, the generalized Gaussian distribution (GGD) model is used for fitting, with its shape and variance parameters characterizing the distribution.
For BDm, the shape and variance parameters of the GGD models of its five types of derivative maps are extracted as the binocular difference features, expressed as fdif. Using Equation (12), fdif is further weighted by the viewport significance, and the aggregated features are generated as Fdif.