Image quality assessment (IQA) plays an essential role in various image processing and visual communication systems. In the literature, many traditional 2D IQA metrics have been proposed [55, 56, 7, 58, 57, 63, 51]. Owing to the rapid development of 3D/stereoscopic technologies, quality issues for 3D media are attracting increasing attention in both academia and industry. During acquisition, compression, transmission, display, and other processing stages, original reference stereoscopic images usually suffer perceptual quality degradation caused by diverse distortion types and degrees. As a result, there is an urgent demand for effective assessment of the perceptual quality of stereoscopic images.
Compared with 2D IQA, stereoscopic image quality assessment (SIQA) is more challenging owing to a wide variety of 2D and 3D influential factors, such as image spatial artifacts, depth perception, and visual comfort. These factors have different effects on the quality of experience (QoE) for stereoscopic images. With the exception of work that considers image distortion and depth perception quality simultaneously, existing research mainly focuses on modeling each individual factor of 3D QoE [44, 43, 54, 8]. In this paper, we study the assessment of the visually perceptual quality of stereoscopic images.
Similar to 2D IQA, according to the availability of original reference stereoscopic images, SIQA models are typically divided into three categories: full-reference (FR) [6, 5, 62, 15, 10, 26], reduced-reference (RR) [38, 30, 27], and blind/no-reference (NR) [1, 41, 9, 48, 35, 60, 64] SIQA metrics.
For FR SIQA algorithms, full information of the reference image is assumed to be available. The earliest FR SIQA model investigated some off-the-shelf 2D IQA metrics, such as the structural similarity index (SSIM), the universal quality index (UQI), and C4, to assess stereoscopic image quality. Further, disparity information was integrated into the 2D IQA metrics to predict the 3D image quality [5, 62]. Apart from incorporating depth cues, the binocular vision characteristics of the human visual system (HVS) were combined with 2D IQA algorithms. For example, Gorley and Holliman proposed a new stereo band limited contrast (SBLC) metric based on the HVS sensitivity to contrast changes in high frequency regions. Moreover, Chen et al. proposed cyclopean images according to the binocular rivalry in the human eyes. In addition, Lin and Wu developed a binocular integration based computational model for evaluating the perceptual quality of stereoscopic images.
As for RR SIQA approaches, only part of the original non-distorted image data is available. Qi et al. utilized binocular perceptual information (BPI) to perform RR SIQA. By characterizing the statistical properties of stereoscopic images in the reorganized discrete cosine transform (RDCT) domain, Ma et al. presented an RR SIQA method. Furthermore, a new RR SIQA metric based on natural scene statistics (NSS) and structural degradation was proposed.
However, in most practical applications, the original pristine image is not accessible. Therefore, NR SIQA is inevitably required, and several NR SIQA methods have been studied. Akhter et al. extracted local features of artifacts and disparity to evaluate the perceptual quality of stereoscopic images. Sazzad et al. also exploited perceptual differences of local features for NR SIQA. However, these methods are distortion-specific NR SIQA approaches, which are only suitable for JPEG coded stereoscopic image pairs. Thus, several general-purpose NR SIQA metrics have emerged. Chen et al. extracted 2D and 3D NSS features from stereoscopic image pairs. The shape-parameter features were regressed onto subjective quality scores by using the well-known support vector regression (SVR). Su et al. proposed the stereoscopic/3D blind image naturalness quality (S3D-BLINQ) index by constructing a convergent cyclopean image and extracting bivariate and correlation NSS features in the spatial and wavelet domains.
A blind deep quality evaluator was also proposed for assessing stereoscopic image quality with monocular and binocular interactions based on a deep belief network (DBN). The deep NR stereoscopic/3D image quality evaluator (DNR-S3DIQE) was proposed, which extracts local abstractions and then aggregates them into global features by employing an aggregation layer. In addition, Yang et al. took into account the deep perception map and a binocular weight model along with the DBN to predict perceived stereoscopic image quality. A deep edge and color signal integrity evaluator (DECOSINE) was proposed based on the whole visual perception route from the eyes to the frontal lobe. Besides, Zhou et al. proposed a dual-stream interactive network called stereoscopic image quality assessment network (StereoQA-Net) for NR SIQA. However, the above-mentioned algorithms consider little prior domain knowledge of the 3D HVS, and thus have difficulty in accurately predicting the perceptual quality of stereoscopic images with various distortion types and levels.
Binocular rivalry is an important phenomenon in 3D vision for SIQA, in which perception alternates between the two views when the two eyes see different scenes. From the conventional perspective, binocular rivalry is modeled as low-level competition between the input stimuli and is related to the energy of the stimuli [36, 25, 13]. Recently, a growing body of literature has tried to explain binocular rivalry by predictive coding theory [12, 19], a popular theory about how the brain processes visual stimuli. According to predictive coding theory, the cognition system tries to match bottom-up visual signals with top-down predictions. Different from the traditional view that binocular rivalry is low-level inter-ocular competition in the early visual cortex, the binocular rivalry mechanism based on predictive coding theory (BRM-PC) stresses both low-level and high-level competition. Moreover, the BRM-PC is guided by the HVS and is more in line with the human cognition system. Therefore, we believe that introducing BRM-PC will be beneficial to SIQA.
In this paper, we propose a generic architecture called Predictive Auto-encoDing Network (PAD-Net), which is an end-to-end network for general-purpose NR SIQA. The contributions of the proposed method are summarized as follows:
We propose a novel predictive auto-encoding network inspired by the binocular rivalry mechanism based on predictive coding theory, which helps to explain the binocular rivalry phenomenon in 3D vision. The source code of PAD-Net is available online for public research usage at http://staff.ustc.edu.cn/~chenzhibo/resources.html.
According to the predictive coding theory of the human brain, we adopt the encoder-decoder architecture to reconstruct the sensory input and further exploit the Siamese network to generate the corresponding likelihood as well as prior maps for modeling binocular rivalry in the cognition system.
We demonstrate that the human brain can perceive the differences of fusion information under various distortion types and levels for symmetrically and asymmetrically distorted stereoscopic image pairs.
Compared with state-of-the-art SIQA metrics, the proposed PAD-Net provides more precise quality estimation, especially for stereopairs under asymmetric distortions, thanks to its careful consideration of binocular rivalry based on predictive coding theory.
The remaining sections of this paper are organized as follows. Section II explains the proposed Predictive Auto-encoDing Network (PAD-Net) for NR SIQA in detail. Section III presents the experimental results and analysis. In Section IV, we conclude the paper with an outlook on future work.
II Proposed Method
Inspired by the binocular rivalry mechanism based on predictive coding theory (BRM-PC), we design the no-reference PAD-Net to automatically predict the perceptual quality of stereoscopic images. The proposed PAD-Net is a Siamese-based end-to-end network, including auto predictive coding and quality regression modules. A distorted 3D image pair is first divided into sub-images of size 256x256; the quality scores of these sub-images are then estimated by PAD-Net and aggregated into a final score, as done in prior work. We pre-train the proposed 'auto-encoder' and the quality regression module on the Waterloo Exploration Database and the LIVE 2D Database, respectively. After that, the entire network is jointly optimized using 3D IQA databases [33, 10] to generate more accurate predictions.
In this section, the architecture of the proposed PAD-Net, shown in Fig. 1, is described first. Then, we introduce the Siamese encoder-decoder module and the quality regression module in detail. Finally, the training and testing methods of PAD-Net are presented.
The architecture of the PAD-Net is depicted in Fig. 1, which contains a Siamese encoder-decoder module and a quality regression module. The Siamese encoder-decoder module represents the processing of the left and right view images. It is inspired by the BRM-PC, in which the human brain tries to match bottom-up visual stimuli with top-down predictions, and the view with the larger posterior probability is dominant during the rivalry and inhibits the other one. In prior work, an empirical Bayesian framework based on BRM-PC was proposed, in which the posterior probability is calculated from the likelihood and the prior probability. In the proposed PAD-Net, the likelihood and prior are inferred through the Siamese encoder-decoder and correspond to the likelihood and prior in the BRM-PC.
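The Bayesian view of rivalry described above can be illustrated with a small numerical sketch. It assumes that the unnormalized posterior of each view is simply the product of its likelihood and prior, normalized across the two views. The function name `view_posteriors` and the example numbers are hypothetical and are not taken from the paper.

```python
def view_posteriors(likelihood_left, prior_left, likelihood_right, prior_right):
    """Return normalized posteriors (left, right) for two competing views.

    Assumption: posterior is proportional to likelihood * prior (Bayes' rule
    with a shared evidence term), then normalized over the two views.
    """
    unnorm_left = likelihood_left * prior_left
    unnorm_right = likelihood_right * prior_right
    total = unnorm_left + unnorm_right
    if total == 0:
        # Neither hypothesis has support; treat the views as equally likely.
        return 0.5, 0.5
    return unnorm_left / total, unnorm_right / total


# Hypothetical numbers: the left view has the higher likelihood,
# while both views are equally plausible a priori.
post_left, post_right = view_posteriors(0.8, 0.5, 0.2, 0.5)
dominant = "left" if post_left > post_right else "right"
print(dominant, round(post_left, 3), round(post_right, 3))
```

In this toy setting the view with the larger posterior is labeled dominant, mirroring the statement that it inhibits the other view during rivalry.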
For each branch of the Siamese encoder-decoder, four down-convolution and four up-convolution layers with a stride of 2 pixels are applied to reconstruct the distorted images. Non-linear activation layers follow the first three down-convolution and up-convolution layers to enhance the representation ability of the network. We square the error between the input and the reconstructed image to obtain the residual map and use this map for the likelihood calculation. Moreover, the high-level features are convolved with a stride of one pixel to change the channel size and are upscaled to the same resolution as the input image. In the PAD-Net, we use the reconstruction error to obtain the likelihood probability map and the high-level representation after the four down convolutions to generate the prior probability map. The reasons are given in Section II-B.
Once we have the left and right view images as well as their likelihood and prior probability maps, we feed them into the quality regression module and compute the final quality score. It can be seen from the bottom of Fig. 1 that the quality regression module is composed of 1) fusion of the input images and probability maps, 2) ResNet-18 without its last two layers as the feature extractor, and 3) the final max pooling and fully connected layer. Note that the fusion in Fig. 1 includes a one-stride convolution layer and an activation layer to match the input size of ResNet-18. The output of the proposed PAD-Net is the predicted quality score of the input stereoscopic image.
II-B Siamese Encoder-Decoder Module
According to predictive coding theory, the core task of the brain is to represent the environmental causes of its sensory input. Generative model based hierarchical Bayesian inference is used to provide constraints on the mapping between the input and its causes (hypotheses). Different from the conventional understanding that competition exists in the low-level inter-ocular visual cortex, the binocular rivalry mechanism based on predictive coding theory stresses both low-level and high-level competition. Therefore, we believe the BRM-PC is guided by the HVS and able to generate more reliable and interpretable results, which is verified in Section III.
To learn from the detailed predictive coding theory in the visual cortex, the principle of human brain cognition is illustrated in Fig. 2. In the hierarchical predictive coding model, the feedback pathways carry the predictions for the lower level, and the feedforward pathways return the prediction errors between the prediction and the sensory input to update the prediction and obtain the best hypothesis. To simulate this process, we adopt the encoder-decoder network structure for image compression from [3, 4], which is a hierarchical structure. In Fig. 2, the prediction is adjusted with the residual errors to obtain a better hypothesis. Thus, we employ the squared error between the predicted (decoded) image and the input image as the loss function $l_1$ to pre-train the encoder-decoder network as follows:

$$ l_1 = \frac{1}{MNCK} \sum_{k=1}^{K} \left\| \hat{I}^{(k)} - I^{(k)} \right\|_2^2, \qquad \hat{I}^{(k)} = f_1\left(I^{(k)}; w_1\right) $$

where $I^{(k)}$ and $\hat{I}^{(k)}$ represent the $k$-th input and predicted image. $M$, $N$ and $C$ are the width, height and number of channels of $I^{(k)}$ and $\hat{I}^{(k)}$. $K$ is the batch size of a mini-batch of training data. $\hat{I}^{(k)}$ is estimated via the encoder-decoder network $f_1$ with weight $w_1$. During the training stage, the loss, which can be considered as the prediction error, is backpropagated through the network. Then, we use the gradient descent algorithm to update $w_1$ and generate a better prediction. Finally, the prediction error converges to a stable value and the decoded image no longer changes greatly. Thus, the updating policy of the encoder-decoder network is similar to the predictive coding theory of the human brain. Note that the encoder-decoder network is pre-trained on the Waterloo Exploration Database; therefore, the training data includes both reference and distorted images.
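As a concrete reading of the pre-training objective, the sketch below computes the mean squared reconstruction error over a mini-batch, that is, the sum of squared differences divided by the number of pixels, channels and images. Images are represented as flat lists of pixel values; this representation and the function name `reconstruction_loss` are assumptions made for illustration only.

```python
def reconstruction_loss(inputs, predictions):
    """Mean squared error over a mini-batch of flattened images.

    inputs, predictions: lists of K images, each a flat list of M*N*C values.
    Returns the sum of squared differences divided by (M*N*C*K).
    """
    if len(inputs) != len(predictions) or not inputs:
        raise ValueError("batches must be non-empty and of equal size")
    total = 0.0
    count = 0
    for image, decoded in zip(inputs, predictions):
        if len(image) != len(decoded):
            raise ValueError("input and prediction must have the same size")
        total += sum((d - x) ** 2 for x, d in zip(image, decoded))
        count += len(image)
    return total / count


# Hypothetical mini-batch of two tiny "images" with four values each.
batch = [[0.0, 0.5, 1.0, 0.5], [1.0, 1.0, 0.0, 0.0]]
decoded = [[0.0, 0.5, 0.5, 0.5], [1.0, 0.5, 0.0, 0.0]]
print(reconstruction_loss(batch, decoded))
```

A perfect reconstruction gives a loss of zero, which corresponds to the converged prediction error described above.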
| Module | Layer | Input shape (C, H, W) | Output shape (C, H, W) | Kernel size/stride/padding |
|---|---|---|---|---|
| Encoder | Conv1+GDN1 | 3, 256, 256 | 128, 128, 128 | 5x5/2/2 |
| | Conv2+GDN2 | 128, 128, 128 | 128, 64, 64 | 5x5/2/2 |
| | Conv3+GDN3 | 128, 64, 64 | 128, 32, 32 | 5x5/2/2 |
| | Conv4 | 128, 32, 32 | 192, 16, 16 | 5x5/2/2 |
| Decoder | Unconv1+IGDN1 | 192, 16, 16 | 128, 32, 32 | 5x5/2/2 |
| | Unconv2+IGDN2 | 128, 32, 32 | 128, 64, 64 | 5x5/2/2 |
| | Unconv3+IGDN3 | 128, 64, 64 | 128, 128, 128 | 5x5/2/2 |
| | Unconv4 | 128, 128, 128 | 3, 256, 256 | 5x5/2/2 |

Unconv: Fractionally-strided convolution
GDN: Generalized divisive normalization
IGDN: Inverse generalized divisive normalization
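The spatial sizes in the table can be checked with the usual output-size formulas for strided and fractionally-strided convolutions. The sketch below is only a consistency check, not the authors' implementation. The `output_padding=1` value for the up convolutions is an assumption needed to double the size exactly; the table does not state it.

```python
def conv_out(size, kernel=5, stride=2, padding=2):
    """Output length of a strided convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def unconv_out(size, kernel=5, stride=2, padding=2, output_padding=1):
    """Output length of a fractionally-strided (transposed) convolution.

    output_padding=1 is an assumption; without it the size would be 2*size - 1.
    """
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# Trace the spatial size through four down convolutions ...
encoder_sizes = [256]
for _ in range(4):
    encoder_sizes.append(conv_out(encoder_sizes[-1]))

# ... and back through four up convolutions.
decoder_sizes = [encoder_sizes[-1]]
for _ in range(4):
    decoder_sizes.append(unconv_out(decoder_sizes[-1]))

print(encoder_sizes)
print(decoder_sizes)
```

Under these assumptions the trace halves the resolution four times and then doubles it four times, matching the input and output shapes listed for each layer.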
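The encoder-decoder network is built from convolution and fractionally-strided convolution layers, the generalized divisive normalization (GDN) transform and the inverse generalized divisive normalization (IGDN) transform. GDN is inspired by neuron models in the biological visual system and has proved effective in density estimation, image compression and image quality assessment. The GDN and IGDN operations are given by:

$$ y_i = \frac{x_i}{\left(\beta_i + \sum_j \gamma_{ij}\, x_j^2\right)^{1/2}} $$

$$ \hat{x}_i = \hat{y}_i \left(\hat{\beta}_i + \sum_j \hat{\gamma}_{ij}\, \hat{y}_j^2\right)^{1/2} $$

where $x_i$ means the $i$-th channel of $x$ and is the input of the GDN transform. $y_i$ is the $i$-th channel of the normalized activation feature map and is the output of the GDN operation. Moreover, $\beta$ and $\gamma$ are the parameters to be updated in the GDN function. Likewise, $\hat{x}_i$, $\hat{y}_i$, $\hat{\beta}$ and $\hat{\gamma}$ share the same meaning as $x_i$, $y_i$, $\beta$ and $\gamma$ for the IGDN transform. The goal of our encoder-decoder module is to reconstruct the sensory input instead of compressing it. Therefore, we remove the quantization step of the original compression network for better reconstruction.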
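To make the normalization concrete, the following sketch applies GDN and IGDN to the channel vector at a single spatial position. It assumes the form commonly used in the image compression literature, where GDN divides each channel by the square root of `beta_i + sum_j gamma_ij * x_j^2` and IGDN multiplies by the corresponding term. The parameter values are hand-picked for illustration and are not learned values from the paper.

```python
import math


def gdn(x, beta, gamma):
    """GDN at one position: y_i = x_i / sqrt(beta_i + sum_j gamma_ij * x_j^2)."""
    return [
        x_i / math.sqrt(beta[i] + sum(g * x_j ** 2 for g, x_j in zip(gamma[i], x)))
        for i, x_i in enumerate(x)
    ]


def igdn(y, beta, gamma):
    """IGDN at one position: x_i = y_i * sqrt(beta_i + sum_j gamma_ij * y_j^2)."""
    return [
        y_i * math.sqrt(beta[i] + sum(g * y_j ** 2 for g, y_j in zip(gamma[i], y)))
        for i, y_i in enumerate(y)
    ]


# Hypothetical two-channel example.
x = [3.0, 4.0]
beta = [1.0, 1.0]
gamma = [[1.0, 0.0], [0.0, 0.5]]
y = gdn(x, beta, gamma)
print(y)
```

Note that IGDN has its own parameters and is an approximate, learned inverse; applying `igdn` with the same `beta` and `gamma` does not in general recover `x` exactly.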
So far, we have obtained the high-level representation and the reconstructed image from the sensory input. Then, according to the BRM-PC, the view with the higher posterior probability wins the rivalry and inhibits the other view, as shown in Fig. 3. From the Bayesian perspective, the posterior probability is related to the prior and the likelihood. In other words, the likelihood describes how well the hypothesis predicts the input, and the prior describes how probable the hypothesis is and is concerned with empirical knowledge. To calculate the likelihood, we first obtain the $k$-th squared residual error map $E^{(k)}$ in a mini-batch of size $K$ as follows:

$$ E^{(k)} = \frac{1}{C} \sum_{c=1}^{C} \left( I_c^{(k)} - \hat{I}_c^{(k)} \right)^2 $$

where $I_c^{(k)}$ and $\hat{I}_c^{(k)}$ denote the $c$-th channel of the input and predicted image. $C$ equals 3 in Eq. 6. The likelihood is used to measure the similarity between $I^{(k)}$ and $\hat{I}^{(k)}$ and is inversely proportional to the errors. The training stage of the encoder-decoder network can be regarded as a procedure of prediction error minimization.
In the Siamese encoder-decoder module, the prior is modeled with the high-level representation of the sensory input, since the prior comes from high levels in the cognition system, as assumed in empirical Bayes [19, 50]. Thus, the high-level features are utilized to generate the $k$-th prior map $P^{(k)}$. Before being fed into the quality regression module, the squared error map and the prior map are normalized between the left and right views as follows:

$$ P_{nl}^{(k)} = \frac{P_l^{(k)}}{P_l^{(k)} + P_r^{(k)}}, \qquad P_{nr}^{(k)} = \frac{P_r^{(k)}}{P_l^{(k)} + P_r^{(k)}} $$

$$ L_{nl}^{(k)} = \frac{E_r^{(k)}}{E_l^{(k)} + E_r^{(k)}}, \qquad L_{nr}^{(k)} = \frac{E_l^{(k)}}{E_l^{(k)} + E_r^{(k)}} $$

where $P_l^{(k)}$, $P_r^{(k)}$, $E_l^{(k)}$ and $E_r^{(k)}$ are the prior and error maps for the $k$-th left view and right view images in a mini-batch. $P_{nl}^{(k)}$, $P_{nr}^{(k)}$, $L_{nl}^{(k)}$ and $L_{nr}^{(k)}$ indicate the normalized prior and likelihood probability maps. Note that the error is opposite to the likelihood: if the error is large, the likelihood is small. Therefore, when computing the likelihood map for the left view, the error map of the right view is used as the numerator, and vice versa.
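The normalization between views can be sketched pixel by pixel as follows. The maps are represented as flat lists of non-negative values, and the small constant `eps` is an assumption added to guard against division by zero; neither detail comes from the paper. The key point shown is the swap: the likelihood of one view is computed from the error map of the other view.

```python
def normalize_between_views(prior_left, prior_right, error_left, error_right, eps=1e-12):
    """Normalize prior and error maps between the two views, pixel by pixel.

    A larger reconstruction error means a smaller likelihood, so the
    likelihood of one view uses the error map of the other view.
    eps is an assumption that avoids division by zero.
    """
    prior_nl, prior_nr, like_nl, like_nr = [], [], [], []
    for p_l, p_r, e_l, e_r in zip(prior_left, prior_right, error_left, error_right):
        p_sum = p_l + p_r + eps
        e_sum = e_l + e_r + eps
        prior_nl.append(p_l / p_sum)
        prior_nr.append(p_r / p_sum)
        like_nl.append(e_r / e_sum)  # left likelihood from right-view error
        like_nr.append(e_l / e_sum)  # right likelihood from left-view error
    return prior_nl, prior_nr, like_nl, like_nr


# Hypothetical pixel: the right view is reconstructed three times worse.
p_nl, p_nr, l_nl, l_nr = normalize_between_views([1.0], [1.0], [0.1], [0.3])
print(p_nl, p_nr, l_nl, l_nr)
```

In this example the left view, which has the smaller error, receives the larger normalized likelihood, and the two normalized maps sum to approximately one at each pixel.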
| Module | Layer | Input shape (C, H, W) | Output shape (C, H, W) | Kernel size/stride/padding |
|---|---|---|---|---|
| Prior | Softplus4 | 192, 16, 16 | 192, 16, 16 | - |
| | Conv5+Softplus5 | 192, 16, 16 | 1, 16, 16 | 1x1/1/0 |
| | Up-sample6 | 1, 16, 16 | 1, 256, 256 | - |
| | Square7a | 1, 256, 256 | 1, 256, 256 | - |
| | Normalization8a | 1, 256, 256 | 1, 256, 256 | - |
| Likelihood | Square7b | 1, 256, 256 | 1, 256, 256 | - |
| | Normalization8b | 1, 256, 256 | 1, 256, 256 | - |

Normalization: Normalization between left view and right view
The detailed structure of the prior and likelihood generation is given in Table II. We employ the Softplus activation function in the prior generation so that the subsequent squaring does not turn negative values into positive ones. It is defined as:

$$ \mathrm{Softplus}(x) = \log\left(1 + e^{x}\right) $$

The Softplus function can be regarded as a smooth version of the ReLU function, which is similar to the way cerebral neurons are activated.
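A minimal sketch of the Softplus function next to ReLU is given below. The numerically stable branch for positive inputs is an implementation choice of this sketch, not something stated in the paper.

```python
import math


def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    # For large positive x, exp(x) would overflow; use x + log1p(exp(-x)) instead.
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def relu(x):
    """Rectified linear unit, shown for comparison."""
    return max(0.0, x)


for value in (-5.0, -1.0, 0.0, 1.0, 5.0):
    # softplus is non-negative and approaches relu away from zero.
    print(value, round(softplus(value), 4), relu(value))
```

Because the output is never negative, squaring it afterwards preserves the ordering of the activations, which is the property the prior generation relies on.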
II-C Quality Regression Module
Based on the distorted stereoscopic images as well as the prior and likelihood probability maps obtained from the Siamese encoder-decoder module, we fuse them into a 3-channel feature map and feed it into the ResNet-18 quality regression network to extract discriminative features for quality estimation. ResNet-18 is chosen for its excellent feature extraction ability [17, 18]. Its last two layers, namely the average pooling and fully connected layers, are removed so that the feature map can be regressed into a quality score by the final max pooling and fully connected layer. Table III illustrates the architecture of the quality regression network. The input stem and basic block of the ResNet-18 structure are shown in Fig. 1.
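| Module | Layer | Input shape (C, H, W) | Output shape (C, H, W) | Kernel size/stride/padding |
|---|---|---|---|---|
| Fusion | Conv9+GDN9 | 10, 256, 256 | 3, 256, 256 | 1x1/1/0 |
| ResNet18 Regression | ResNet-18 | 3, 256, 256 | 512, 8, 8 | - |

GDN: Generalized divisive normalization
Fc: Fully connected layer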
II-D Training and Testing
Owing to the limited size of the available 3D image quality assessment databases, we train the PAD-Net on sub-image pairs with a resolution of 256x256. The MOS value of the entire image is assumed to be the quality score of each of its sub-images, as done in prior work. Thus, sub-images coming from the same test image share the same label. Moreover, transfer learning is adopted to alleviate the lack of labeled data and enhance the prediction accuracy of the network [37, 61].
In blind stereoscopic image quality assessment, it is difficult to predict the MOS value precisely. Therefore, we divide the training stage into three steps: 1) pre-training of the encoder-decoder on pristine and distorted 2D images; 2) pre-training of the ResNet-18 regression on pristine and distorted 2D images; 3) joint optimization on the 3D IQA database.
Firstly, the encoder-decoder is trained to minimize the difference between the predicted and input images, i.e., the reconstruction loss $l_1$ described in Section II-B. Then, we get the weight $w_1'$ for the Siamese encoder-decoder as follows:

$$ w_1' = \arg\min_{w_1} l_1 $$

Secondly, we utilize the original and distorted 2D images along with the associated MOS scores to pre-train the ResNet-18 regression network, which aims to map a 2D image to a quality score. In addition, ResNet-18 with weights pre-trained on ImageNet is adopted for better initialization. Then, the loss function $l_2$ for the second pre-training step is defined as:

$$ l_2 = \frac{1}{K} \sum_{k=1}^{K} \left\| q_{2d}^{(k)} - \hat{q}_{2d}^{(k)} \right\|_2^2 $$

where $q_{2d}^{(k)}$ and $\hat{q}_{2d}^{(k)}$ indicate the real MOS and the predicted score for the $k$-th input 2D sub-image in a mini-batch. The weight $w_2$ for the ResNet-18 regression network is updated by minimizing $l_2$ as follows:

$$ w_2' = \arg\min_{w_2} l_2 $$

Finally, the Siamese encoder-decoder and the quality regression module are jointly optimized using stereo image pairs. Since the ultimate purpose of PAD-Net is to estimate the perceptual quality of 3D images, we again adopt the $l_2$-norm between the subjective MOS value and the predicted score as the loss function:

$$ l_3 = \frac{1}{K} \sum_{k=1}^{K} \left\| q_{3d}^{(k)} - \hat{q}_{3d}^{(k)} \right\|_2^2, \qquad \hat{q}_{3d}^{(k)} = f\left(I_l^{(k)}, I_r^{(k)}; w_1, w_2, w_3\right) $$

where $I_l^{(k)}$ and $I_r^{(k)}$ represent the input 3D sub-image pair. $f$ indicates the PAD-Net with encoder-decoder weight $w_1$, ResNet-18 regression weight $w_2$ and weight $w_3$ trained from scratch. At the joint optimization step, $w_1$ and $w_2$ are initialized with the pre-trained weights $w_1'$ and $w_2'$ and updated together with $w_3$ through the final loss minimization:

$$ w_1^{*}, w_2^{*}, w_3^{*} = \arg\min_{w_1, w_2, w_3} l_3 $$
In the testing stage, the stereo image is divided into sub-image pairs with a stride of $U$ to cover the whole content. The predicted qualities of all sub-image pairs are averaged to compute the final perceptual quality score.
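The test-time tiling and averaging can be sketched as follows. The image size of 640x360 is an assumption made for this example, and the strides of 192 (width) and 104 (height) are the values reported for the LIVE databases in the experimental section. The extra border patch added when the stride does not land on the image border is also an assumption of this sketch.

```python
def tile_origins(length, patch, stride):
    """Top-left coordinates of patches along one axis, covering [0, length)."""
    if length < patch:
        raise ValueError("image is smaller than the patch")
    origins = list(range(0, length - patch + 1, stride))
    # If the stride does not land exactly on the border, add a final patch
    # flush with the border so no content is missed (an assumption).
    if origins[-1] + patch < length:
        origins.append(length - patch)
    return origins


def predict_image_quality(patch_scores):
    """Average the per-patch predictions into one image-level score."""
    return sum(patch_scores) / len(patch_scores)


# Assumed 640x360 views, 256x256 patches, strides 192 (width) and 104 (height).
xs = tile_origins(640, 256, 192)
ys = tile_origins(360, 256, 104)
print(xs, ys, len(xs) * len(ys))
```

With these numbers the patches overlap slightly and end exactly at the image borders, so every pixel contributes to at least one sub-image prediction.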
III Experimental Results and Analysis
In this section, we first introduce the databases and performance measures used in our experiments. Then, the experimental results of the proposed PAD-Net on the entire LIVE databases and on individual distortion types are presented. Meanwhile, visualization results are provided for better explanation. Finally, we conduct an ablation study to verify the effectiveness of each component in our model and measure the computational complexity.
III-A Databases and Performance Measures
LIVE Phase I: This database contains 20 original and 365 symmetrically distorted stereo image pairs. Five distortion types are included in this database, namely JPEG2000 compression (JP2K), JPEG compression (JPEG), additive white noise (WN), Gaussian blur (BLUR) and Rayleigh fast fading channel distortion (FF). A subjective differential mean opinion score (DMOS) is provided for each degraded stereo image. A higher DMOS value means lower visual quality.
LIVE Phase II [10, 9]: It includes 120 symmetrically and 240 asymmetrically distorted stereopairs derived from 8 reference images. This database contains the same distortion types as LIVE Phase I. For each distortion type, every pristine image pair is degraded into 3 symmetrically and 6 asymmetrically distorted image pairs. Subjective scores are also recorded in the form of DMOS.
Waterloo IVC Phase I: This database originates from 6 pristine stereoscopic image pairs. The reference images are altered by three types of distortions, namely WN, BLUR, and JPEG. Altogether, there are 78 symmetrically and 252 asymmetrically distorted stereopairs. The subjective mean opinion score (MOS) and individual scores are provided for each stereoscopic image in this database, and a higher MOS value means better visual quality.
Performance Measure: Three commonly used criteria are utilized in our experiment for performance evaluation, including Spearman's rank order correlation coefficient (SROCC), Pearson's linear correlation coefficient (PLCC) and root mean squared error (RMSE). SROCC is a non-parametric measure and is independent of any monotonic mapping. PLCC and RMSE evaluate the prediction accuracy. Higher SROCC and PLCC and lower RMSE indicate better correlation with human judgements. Before calculating PLCC and RMSE, a five-parameter logistic function is applied to maximize the correlation between subjective ratings and objective metrics as follows:

$$ q = \beta_1 \left( \frac{1}{2} - \frac{1}{1 + \exp\left(\beta_2 (o - \beta_3)\right)} \right) + \beta_4 o + \beta_5 $$

where $o$ indicates the predicted score of the objective metric and $q$ represents the mapped output. $\beta_1$ to $\beta_5$ are the five parameters to be fitted.
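The three criteria can be computed with a few lines of plain Python, as sketched below. The logistic mapping uses the five-parameter form that is standard in IQA evaluation; fitting its parameters (normally done by nonlinear least squares) is omitted, and the parameter values and scores in the example are hypothetical.

```python
import math


def logistic5(score, b1, b2, b3, b4, b5):
    """Five-parameter logistic mapping commonly used in IQA evaluation."""
    return b1 * (0.5 - 1.0 / (1.0 + math.exp(b2 * (score - b3)))) + b4 * score + b5


def pearson(a, b):
    """Pearson linear correlation coefficient (PLCC)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def ranks(values):
    """Ranks starting at 1, with ties receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            result[order[k]] = average
        start = end + 1
    return result


def spearman(a, b):
    """Spearman rank order correlation coefficient (SROCC)."""
    return pearson(ranks(a), ranks(b))


def rmse(a, b):
    """Root mean squared error."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Hypothetical subjective (DMOS) and objective scores for five images.
dmos = [10.0, 20.0, 35.0, 50.0, 70.0]
predicted = [12.0, 18.0, 30.0, 55.0, 66.0]
# Parameters chosen so that the mapping reduces to the identity.
mapped = [logistic5(s, 0.0, 1.0, 0.0, 1.0, 0.0) for s in predicted]
print(spearman(dmos, predicted), pearson(dmos, mapped), rmse(dmos, mapped))
```

Because SROCC depends only on ranks, it is computed on the raw predictions, whereas PLCC and RMSE are computed after the logistic mapping.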
One of the main issues of PLCC and SROCC is that they neglect the uncertainty of the subjective scores. Thus, we also employ the Krasula methodology, which can better assess the capabilities of objective metrics by considering the statistical significance of the subjective scores and dispensing with mapping functions.
The basic idea of this model is to determine the reliability of objective models by checking whether they are capable of 1) distinguishing significantly different stimuli from similar ones, and 2) indicating whether one stimulus is of better or worse quality than the other. To this end, in the Krasula framework, pairs of stimuli are selected from the database to compute the area under the ROC curve of the 'Different vs. Similar' category (AUC-DS), the area under the ROC curve of the 'Better vs. Worse' category (AUC-BW), and the percentage of correct classification (CC). Higher AUC-DS and AUC-BW mean a greater capability to indicate different/similar and better/worse pairs. Higher CC represents better prediction accuracy. Please refer to the original publication for more details.
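The AUC values used in this kind of analysis can be illustrated with the pairwise definition of the area under the ROC curve. The sketch below is not the reference implementation of the Krasula methodology; it only shows how an AUC is obtained once stimulus pairs have been labeled, here with hypothetical absolute differences of objective scores as the classifier output.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical absolute differences of objective scores for stimulus pairs.
# "different" pairs were judged significantly different by the subjects,
# "similar" pairs were not.
different = [0.9, 0.7, 0.6, 0.4]
similar = [0.1, 0.3, 0.5]
print(auc(different, similar))
```

An AUC close to 1 means the metric assigns larger score differences to pairs that subjects perceive as different, while 0.5 corresponds to chance level.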
III-B Performance Evaluation
In the experiment, the distorted stereo pairs are randomly split into an 80% training set and a 20% testing set, following prior work. We adopt the Adam algorithm in the pre-training and joint optimization steps. During pre-training, the learning rate is lowered by a factor of 10 every 50 epochs, and the pre-trained weights are obtained after 100 epochs. Since the encoder-decoder should retain its function, the learning rate for the encoder-decoder weights is kept low to avoid drastic changes during joint optimization. Moreover, the learning rate for the ResNet-18 regression weights is set to half of that for the weights trained from scratch, which is scaled by 0.25 every 50 epochs. The learning rates remain unchanged after 200 epochs. We apply data augmentation by random cropping and horizontal and vertical flipping in the training stage. The results are obtained after 300 epochs. During testing, the stride U is set to 192 for width and 104 for height, in a slightly overlapping manner, to cover the whole resolution of the images in the LIVE databases, as shown in Fig. 4.
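The step-wise learning rate schedules described above can be written as a small helper. The base learning rate `BASE_LR` is a placeholder and not a value from the paper, and the sketch assumes that the decay due at epoch 200 is still applied before the rate is frozen.

```python
def step_decay(base_lr, epoch, factor, step, hold_after=None):
    """Learning rate after `epoch` epochs of step decay.

    The rate is multiplied by `factor` every `step` epochs. If `hold_after`
    is given, no further decay is applied beyond that epoch.
    """
    effective_epoch = epoch if hold_after is None else min(epoch, hold_after)
    return base_lr * factor ** (effective_epoch // step)


BASE_LR = 1.0  # placeholder, not a value from the paper

# Pre-training: lowered by a factor of 10 every 50 epochs.
pretrain = [step_decay(BASE_LR, e, 0.1, 50) for e in (0, 49, 50, 99)]

# Joint optimization: scaled by 0.25 every 50 epochs, unchanged after 200 epochs.
joint = [step_decay(BASE_LR, e, 0.25, 50, hold_after=200) for e in (0, 50, 199, 200, 299)]
print(pretrain, joint)
```

Using separate base rates for the encoder-decoder, the ResNet-18 regression and the layers trained from scratch would correspond to calling the helper once per parameter group.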
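| Metric | LIVE Phase I SROCC | LIVE Phase I PLCC | LIVE Phase I RMSE | LIVE Phase II SROCC | LIVE Phase II PLCC | LIVE Phase II RMSE |
|---|---|---|---|---|---|---|
| Cyclopean MS-SSIM | 0.916 | 0.917 | 6.533 | 0.889 | 0.900 | 4.987 |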
We compare the proposed PAD-Net with several classic FR, RR and NR SIQA metrics on the LIVE Phase I and II databases. The competing FR and RR models include Gorley's method, You's method, Benoit's method, Lin's method, Cyclopean MS-SSIM, RR-BPI, RR-RDCT and Ma's method. For NR metrics, hand-crafted feature based algorithms including Akhter's method, Sazzad's method, Chen's method, S3D-BLINQ and DECOSINE, and deep neural network based models including Shao's method, CNN, DNR-S3DIQE, DBN and StereoQA-Net are considered in the performance comparison. Note that CNN is computed for the left and right view images separately, and the scores of both views are then averaged. The SROCC, PLCC and RMSE performance of the above metrics and the proposed PAD-Net is listed in Table IV, where the best results are highlighted in bold. It can be observed from the table that the proposed method outperforms state-of-the-art SIQA metrics, especially on the LIVE Phase II database. Since there are more asymmetrically distorted images in LIVE Phase II, this indicates that the proposed PAD-Net is more effective for the challenging asymmetric distortion, which is explained in Section III-C.
To employ the Krasula methodology, a significance analysis of the subjective scores is required. Among the three considered 3D IQA databases [33, 10, 9, 53], only Waterloo IVC Phase I is equipped with individual scores. Furthermore, the executables and source code of most state-of-the-art NR 3D metrics have not been released. Therefore, we can only conduct the Krasula analysis on the Waterloo IVC Phase I dataset and compare the proposed PAD-Net with StereoQA-Net, which obtains the best performance among the competing metrics on both the LIVE Phase I and LIVE Phase II [10, 9] databases. Table V lists the results of the SROCC, PLCC, RMSE and Krasula performance evaluation. It can be observed from the table that the proposed PAD-Net achieves the best performance in terms of SROCC, PLCC, RMSE, AUC-DS, AUC-BW and CC. The results of the Krasula performance criteria demonstrate that PAD-Net is the most promising metric for distinguishing stereo images of different qualities.
| Metric | Cyclopean MS-SSIM | CNN | StereoQA-Net | Proposed PAD-Net |
|---|---|---|---|---|
| Cyclopean MS-SSIM | 0 | -1 | -1 | -1 |

Results of the T-Test on the LIVE Phase I Database.

| Metric | Cyclopean MS-SSIM | CNN | StereoQA-Net | Proposed PAD-Net |
|---|---|---|---|---|
| Cyclopean MS-SSIM | 0 | 1 | -1 | -1 |
Moreover, we conduct significance t-tests using the PLCC values of 100 runs to verify whether our proposed model is statistically better than other metrics. Tables VI and VII list the results of the t-tests on LIVE Phase I and II, where '1' or '-1' indicates that the metric in the row is statistically superior or inferior to the competing metric in the column. The number '0' means that the two metrics are statistically indistinguishable. From Tables VI and VII, we can see that our proposed metric is statistically better than the other metrics on both LIVE Phase I and II.
III-C Performance Evaluation for Symmetric/Asymmetric Distortion
Our proposed PAD-Net is based on predictive coding theory and applies deep neural networks to model the binocular rivalry mechanism for better prediction of stereo image quality. Binocular rivalry seldom happens under symmetrical distortion but plays an important role in the quality assessment of asymmetrically distorted images. Table VIII presents the SROCC performance for symmetrically and asymmetrically distorted images in the LIVE Phase II and Waterloo IVC Phase I databases. PAD-Net demonstrates a strong ability to predict the perceived quality of asymmetrically distorted stereo pairs by mimicking the visual mechanism of binocular rivalry.
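| Metric | LIVE Phase II Symmetric | LIVE Phase II Asymmetric | Waterloo IVC Phase I Symmetric | Waterloo IVC Phase I Asymmetric |
|---|---|---|---|---|
| Cyclopean MS-SSIM | 0.923 | 0.842 | 0.924 | 0.643 |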
We provide some visualization results of the PAD-Net for better explanation. In Fig. 5(a), the left view is undamaged while the right view is degraded with white noise. The reconstructed stereo pairs are similar to the input ones. In the normalized prior and likelihood maps, a lighter color indicates a higher probability. For an asymmetrically noised image, the view with heavier noise is harder to reconstruct and thus has a lower likelihood probability globally. However, at strong edges that indicate structural information, the undamaged view may be allocated a smaller probability, as shown in the magnified likelihood map of Fig. 5(a). Moreover, the discrepancy between the normalized left and right view prior maps is not very large, since the scene can still be recognized in both views. It has been reported that, for noise contamination, 3D image quality is more affected by the poor quality view. As reflected by the magnified likelihood map of Fig. 5(a), this tendency appears around regions containing obvious edges, which means that subjective rating is more affected by structural information.
Fig. 5(b) presents an asymmetrically blurred image, in which the left view is undamaged while the right view is blurred. Contrary to white noise, the view with the heavier blurring effect is easier to reconstruct and thus has a larger likelihood probability. In addition, strong edges in the undistorted view tend to gain more probability in binocular rivalry, as indicated in the magnified likelihood map of Fig. 5(b). For the same reason that both views are comprehensible, the normalized prior maps do not show large differences. For image blur, it is reported in [53, 31] that 3D image quality is more affected by the high quality view. In the magnified likelihood map of Fig. 5(b), strong edges in the high quality view have higher probability, which again demonstrates that subjective judgments are more affected by structural information. In Fig. 5(c), the stereo pairs are symmetrically distorted. As a result, the normalized prior and likelihood maps of both views share similar probabilities.
Besides, the fusion maps of the distorted images and the normalized prior and likelihood maps from the left and right views are depicted in Fig. 6. The colors of the fusion maps for symmetrically and asymmetrically distorted images are easy to distinguish. The fusion maps of symmetrically distorted images have a gray tone, while those of asymmetrically distorted images appear green or pink. This is caused by the different probabilities of the left and right view images during fusion. Moreover, for asymmetrically distorted images, white noise differs from the other four distortion types, since noise tends to introduce high frequency information while the other four distortion types tend to remove details that correspond to high frequency information.
|LIVE Phase I||LIVE Phase II|
|Cyclopean MS-SSIM ||0.888||0.530||0.948||0.925||0.707||0.814||0.843||0.940||0.908||0.884|
| Metrics | LIVE Phase I | | | | | LIVE Phase II | | | | |
| Cyclopean MS-SSIM | 0.912 | 0.603 | 0.942 | 0.942 | 0.776 | 0.834 | 0.862 | 0.957 | 0.963 | 0.901 |
III-D Performance Evaluation on Individual Distortion Type
We further investigate the capacity of the proposed PAD-Net for each distortion type; the SROCC and PLCC performance is reported in Tables IX and X. The best-performing results across the listed metrics are highlighted in boldface. As shown in Tables IX and X, the proposed model achieves competitive performance for most distortion types. In addition, the scatter plots of DMOS values versus the objective scores predicted by PAD-Net for each distortion type on LIVE Phase I and II are presented in Fig. 7(a) and 7(b). The linear correlation between DMOS values and predicted scores demonstrates the monotonicity and accuracy of PAD-Net. The DMOS range of JPEG-compressed images is narrower than those of the other distortion types, which makes the perceptual image quality more difficult to estimate. Thus, the PLCC and SROCC performance for JPEG distortion is generally lower than for the other four.
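For readers who want to reproduce this kind of per-distortion evaluation, the following is a minimal sketch of how SROCC, PLCC and RMSE can be computed from paired subjective and predicted scores. It is an illustration rather than the evaluation code used in this paper: the function names are hypothetical, and no nonlinear mapping is applied to the predictions before PLCC and RMSE are computed, which is an assumption of this sketch.

```python
import math


def plcc(x, y):
    """Pearson linear correlation coefficient between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def srocc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))


def rmse(x, y):
    """Root mean squared error between two score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical DMOS values and predictions for one distortion type.
dmos = [10.0, 20.0, 30.0, 40.0, 50.0]
pred = [12.0, 18.0, 33.0, 39.0, 55.0]
print(srocc(dmos, pred), plcc(dmos, pred), rmse(dmos, pred))
```

Because SROCC depends only on rank order, it measures prediction monotonicity, whereas PLCC and RMSE are sensitive to the actual score values and therefore reflect prediction accuracy.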
|Metrics||Train LIVE II/Test LIVE I||Train LIVE I/Test LIVE II|
III-E Cross Database Tests
We conduct cross-database tests to verify the generalization ability of the proposed PAD-Net. The compared models are trained on one database and tested on the other. Table XI presents the PLCC performance for cross-database validation. Although PAD-Net does not show the best performance when trained on LIVE Phase II and tested on LIVE Phase I, it outperforms the other metrics in the second round, which is the more challenging task: LIVE Phase I consists only of symmetrical distortions, while more than half of the 3D pairs in LIVE Phase II are asymmetrically distorted. PAD-Net trained on LIVE Phase I is thus able to handle asymmetrical distortions in LIVE Phase II that it has never encountered. The PLCC performance on LIVE Phase II not only demonstrates the generalization and robustness of PAD-Net but also shows the effectiveness of the binocular rivalry mechanism based on predictive coding theory for asymmetric distortion in the proposed method.
III-F Effects of Network Structure
To explore the influence of different network structures as the quality regression network, VGG-16 , ResNet-18, 34 and 50  are adopted for comparison. The SROCC, PLCC and RMSE performance on LIVE Phase I and II is reported in Table XII. Firstly, ResNet has a superior capability to extract discriminative features for quality prediction compared with the VGG structure. Moreover, the performance does not improve as the depth of ResNet increases. A possible explanation is that the limited training data favors a shallower architecture. Generally, very deep networks need a huge amount of training data to achieve high performance. However, there are only hundreds of distorted images in LIVE Phase I and II; even with data augmentation, this is far from enough for deeper networks, which may therefore over-fit. As a result, ResNet-18 is chosen in this paper to reach a better tradeoff.
III-G Ablation Study
Furthermore, an ablation study is conducted to verify the effectiveness of each component in PAD-Net. We first feed the distorted left- and right-view images into the quality regression module as the baseline. Then, normalized likelihood and prior maps are introduced to provide additional information for computing the rivalry dominance of both views. Moreover, we compare different fusion methods.
As shown in Fig. 8, simply fusing the left and right views achieves promising performance on LIVE Phase I, which consists only of symmetrically distorted pairs. However, the performance degrades seriously on LIVE Phase II owing to the existence of asymmetric distortion. According to the BRM-PC, prior and likelihood probability maps are necessary for 3D image quality estimation. The performance improvement on LIVE Phase II verifies the effectiveness of the prior and likelihood probability maps obtained through the Siamese encoder-decoder network and further demonstrates the superiority of the HVS-guided binocular rivalry mechanism based on predictive coding theory. In addition, we compare the Conv+GDN fusion method with the intuitive addition+multiplication method. Note that Conv+GDN fusion means and addition+multiplication represents . It is shown in Fig. 8 that the proposed method benefits considerably from the Conv+GDN fusion method, since the parameters of the fusion operation are updated during training to generate the most discriminative feature maps for quality prediction. Therefore, the HVS-guided Siamese encoder-decoder module that generates the prior and likelihood maps and the Conv+GDN fusion method are the keys to the success of PAD-Net.
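To make the learnable part of the Conv+GDN fusion more concrete, the following minimal sketch applies the GDN nonlinearity in its commonly used form, y_i = x_i / sqrt(beta_i + sum_j gamma_ij x_j^2), across the channels at a single spatial position. It illustrates only the normalization step and is not PAD-Net's fusion operator itself: the function name is hypothetical and the parameter values are illustrative, whereas in a trained network beta and gamma are learned.

```python
import math


def gdn(x, beta, gamma):
    """Generalized divisive normalization over the channels of one position.

    x:     list of channel responses
    beta:  list of per-channel offsets (positive in practice)
    gamma: square matrix of non-negative cross-channel weights
    """
    n = len(x)
    out = []
    for i in range(n):
        # Each channel is divided by a weighted energy of all channels.
        energy = beta[i] + sum(gamma[i][j] * x[j] ** 2 for j in range(n))
        out.append(x[i] / math.sqrt(energy))
    return out


# Illustrative values only: two channels with uniform cross-channel weights.
responses = [3.0, 4.0]
print(gdn(responses, beta=[1.0, 1.0], gamma=[[0.5, 0.5], [0.5, 0.5]]))
```

With gamma set to zero and beta set to one, the transform reduces to the identity, so the learned parameters control how strongly each channel is normalized by the others.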
III-H Computation Complexity
A good metric for blind SIQA should have high prediction accuracy as well as low computational cost. In the experiment, the models are tested on an NVIDIA GTX 1080 Ti GPU with 11 GB of memory. The running times of the proposed PAD-Net and the other metrics are listed in Table XIII. Note that we record the time for predicting the quality scores of 50 stereo images with the resolution of and then average to obtain the time for each 3D image. The results in Table XIII show that PAD-Net needs only around 0.906 seconds per image, which is significantly lower than the other metrics.
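The averaging protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmarking code used for Table XIII: `predict` stands in for any quality model, the function name is hypothetical, and on a GPU the asynchronous execution would have to be synchronized before the clock is read, which is omitted here.

```python
import time


def mean_time_per_item(predict, items):
    """Time one pass of `predict` over `items` and return seconds per item."""
    if not items:
        raise ValueError("at least one item is required")
    start = time.perf_counter()
    for item in items:
        predict(item)
    elapsed = time.perf_counter() - start
    return elapsed / len(items)


def dummy_predict(stereo_pair):
    """Hypothetical stand-in for a quality model; returns a constant score."""
    return 0.0


# 50 placeholder stereo pairs, mirroring the 50-image protocol in the text.
pairs = [("left_%d" % i, "right_%d" % i) for i in range(50)]
print(mean_time_per_item(dummy_predict, pairs))
```

Timing the whole batch and dividing by its size averages out per-call clock jitter, which matters when a single prediction takes well under a second.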
In this paper, we explore a novel deep learning approach for blind stereoscopic image quality assessment according to the binocular rivalry mechanism based on predictive coding theory. The proposed predictive auto-encoding network is an end-to-end architecture inspired by the cognition process of the human brain. Specifically, we adopt a Siamese encoder-decoder module to reconstruct the binocular counterparts and generate the corresponding likelihood and prior maps. Moreover, we incorporate a quality regression module to obtain the final estimated perceptual quality score. The experimental results demonstrate that the proposed PAD-Net correlates well with subjective ratings. In addition, the proposed method outperforms state-of-the-art algorithms for distorted stereoscopic images under a variety of distortion types, especially for those with asymmetric distortions. Furthermore, we show that PAD-Net has a promising generalization ability and can achieve lower time complexity. In future work, we intend to extend the method to blind stereoscopic video quality assessment. Beyond image visual quality, we also plan to investigate other 3D quality dimensions such as depth perception and visual comfort.
-  (2010) No-reference stereoscopic image quality assessment. In Stereoscopic Displays and Applications XXI, Vol. 7524, pp. 75240T. Cited by: §I, §I, §III-B, TABLE X, TABLE IV, TABLE VIII, TABLE IX.
-  (2015) Density modeling of images using a generalized normalization transformation. arXiv preprint arXiv:1511.06281. Cited by: §II-B.
-  (2016) End-to-end optimized image compression. arXiv preprint arXiv:1611.01704. Cited by: §II-B.
-  Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436. Cited by: §II-B, §II-B.
-  (2009) Quality assessment of stereoscopic images. EURASIP journal on image and video processing 2008 (1), pp. 659024. Cited by: §I, §I, §III-B, TABLE X, TABLE IV, TABLE VIII, TABLE IX.
-  (2007) Stereoscopic images quality assessment. In 2007 15th European Signal Processing Conference, pp. 2110–2114. Cited by: §I, §I.
-  (2003) An image quality assessment method based on perception of structural information. In Proceedings 2003 International Conference on Image Processing (Cat. No. 03CH37429), Vol. 3, pp. III–185. Cited by: §I, §I.
-  (2017) Visual discomfort prediction on stereoscopic 3d images without explicit disparities. Signal Processing: Image Communication 51, pp. 50–60. Cited by: §I.
-  (2013) No-reference quality assessment of natural stereopairs. IEEE Transactions on Image Processing 22 (9), pp. 3379–3391. Cited by: §I, §I, §III-A, §III-A, §III-B, §III-B, TABLE X, TABLE XI, TABLE IV, TABLE VIII, TABLE IX.
-  (2013) Full-reference quality assessment of stereopairs accounting for rivalry. Signal Processing: Image Communication 28 (9), pp. 1143–1155. Cited by: §I, §I, §I, §II, §III-A, §III-A, §III-B, §III-B, TABLE X, TABLE IV, TABLE VI, TABLE VII, TABLE VIII, TABLE IX.
-  (2017) Blind stereoscopic video quality assessment: from depth perception to overall experience. IEEE Transactions on Image Processing 27 (2), pp. 721–734. Cited by: §I, §I.
-  (1998) A hierarchical model of binocular rivalry. Neural Computation 10 (5), pp. 1119–1135. Cited by: §I, §II-B.
-  (2006) A gain-control theory of binocular combination. Proceedings of the National Academy of Sciences 103 (4), pp. 1141–1146. Cited by: §I.
-  Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 315–323. Cited by: §II-B.
-  (2008) Stereoscopic image quality metrics and compression. In Stereoscopic Displays and Applications XIX, Vol. 6803, pp. 680305. Cited by: §I, §I, §III-B, TABLE X, TABLE IV, TABLE VIII, TABLE IX.
-  (2003) Final report from the video quality experts group on the validation of objective models of video quality assessment, phase ii. 2003 VQEG. Cited by: §III-A.
-  Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §I, §II-C, §III-F.
-  (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §II-C.
-  (2008) Predictive coding explains binocular rivalry: an epistemological review. Cognition 108 (3), pp. 687–701. Cited by: 1st item, §I, Fig. 3, §II-A, §II-B, §II-B, §II-B, §II.
-  (2003) A treatise of human nature. Courier Corporation. Cited by: §II-B.
-  (2014) Convolutional neural networks for no-reference image quality assessment. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1733–1740. Cited by: §III-B, TABLE X, TABLE XI, TABLE XIII, TABLE IV, TABLE VI, TABLE VII, TABLE VIII, TABLE IX.
-  (2004) Object perception as bayesian inference. Annu. Rev. Psychol. 55, pp. 271–304. Cited by: §II-B.
-  (2016) On the accuracy of objective image and video quality models: new methodology for performance evaluation. In 2016 Eighth International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–6. Cited by: §III-A, §III-A, §III-B.
-  (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §I.
-  (1965) On binocular rivalry. Ph.D. Thesis, Van Gorcum Assen. Cited by: §I.
-  (2014) Quality assessment of stereoscopic 3d image compression by binocular integration behaviors. IEEE transactions on Image Processing 23 (4), pp. 1527–1542. Cited by: §I, §I, §III-B, TABLE X, TABLE IV, TABLE VIII, TABLE IX.
-  (2017) Reduced-reference stereoscopic image quality assessment using natural scene statistics and structural degradation. IEEE Access 6, pp. 2768–2780. Cited by: §I, §I, §III-B, TABLE X, TABLE IV, TABLE IX.
-  (2016) Waterloo exploration database: new challenges for image quality assessment models. IEEE Transactions on Image Processing 26 (2), pp. 1004–1016. Cited by: §II-B, §II.
-  (2017) End-to-end blind image quality assessment using deep neural networks. IEEE Transactions on Image Processing 27 (3), pp. 1202–1213. Cited by: §II-B, §II-D, §II-D, §II.
-  (2016) Reorganized dct-based image representation for reduced reference stereoscopic image quality assessment. Neurocomputing 215, pp. 21–31. Cited by: §I, §I, §III-B, TABLE X, TABLE IV, TABLE IX.
-  (2001) Unequal weighting of monocular inputs in binocular combination: implications for the compression of stereoscopic imagery.. Journal of Experimental Psychology: Applied 7 (2), pp. 143. Cited by: §III-C.
-  (2018) Data augmentation for improving deep learning in image classification problem. In 2018 international interdisciplinary PhD workshop (IIPhDW), pp. 117–122. Cited by: §III-B.
-  (2013) Subjective evaluation of stereoscopic image quality. Signal Processing: Image Communication 28 (8), pp. 870–883. Cited by: §II, §III-A, §III-A, §III-B.
-  Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814. Cited by: §II-B.
-  (2017) Blind deep s3d image quality evaluation via local to global feature aggregation. IEEE Transactions on Image Processing 26 (10), pp. 4923–4936. Cited by: §I, §I, §III-B, §III-B, TABLE X, TABLE IV, TABLE IX.
-  (1998) Mechanisms of stereoscopic vision: the disparity energy model. Current opinion in neurobiology 8 (4), pp. 509–515. Cited by: §I.
-  (2009) A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22 (10), pp. 1345–1359. Cited by: §II-D.
-  (2015) Reduced reference stereoscopic image quality assessment based on binocular perceptual information. IEEE Transactions on multimedia 17 (12), pp. 2338–2344. Cited by: §I, §I, §III-B, TABLE X, TABLE IV, TABLE IX.
-  (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §II-B.
-  (1999) Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience 2 (1), pp. 79. Cited by: Fig. 2, §II-B.
-  (2012) Objective no-reference stereoscopic image quality prediction based on 2d image features and relative disparity. Advances in Multimedia 2012, pp. 8. Cited by: §I, §I, §III-B, TABLE X, TABLE IV, TABLE IX.
-  (2012) Subjective methods for the assessment of stereoscopic 3dtv systems. Cited by: §I.
-  (2018) Multistage pooling for blind quality prediction of asymmetric multiply-distorted stereoscopic images. IEEE Transactions on Multimedia 20 (10), pp. 2605–2619. Cited by: §I.
-  (2016) Toward a blind deep quality evaluator for stereoscopic images based on monocular and binocular interactions. IEEE Transactions on Image Processing 25 (5), pp. 2059–2074. Cited by: §I, §I, §III-B, TABLE XI, TABLE IV.
-  (2006) A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on image processing 15 (11), pp. 3440–3451. Cited by: §II, §III-A.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §I, §III-F.
-  (2017) A review of predictive coding algorithms. Brain and cognition 112, pp. 92–97. Cited by: §I, §II-A.
-  (2015) Oriented correlation models of distorted natural images with application to natural stereopair quality evaluation. IEEE Transactions on image processing 24 (5), pp. 1685–1699. Cited by: §I, §I, §III-B, TABLE X, TABLE IV, TABLE VIII, TABLE IX.
-  (2015) Visual quality assessment of stereoscopic image and video: challenges, advances, and future trends. In Visual Signal Quality Assessment, pp. 185–212. Cited by: §I.
-  (2005) Mistaking a house for a face: neural correlates of misperception in healthy humans. Cerebral Cortex 16 (4), pp. 500–508. Cited by: §II-B.
-  (2018) Nima: neural image assessment. IEEE Transactions on Image Processing 27 (8), pp. 3998–4011. Cited by: §I.
-  (2006) Neural bases of binocular rivalry. Trends in cognitive sciences 10 (11), pp. 502–511. Cited by: §II-A.
-  (2015) Quality prediction of asymmetrically distorted stereoscopic 3d images. IEEE Transactions on Image Processing 24 (11), pp. 3400–3414. Cited by: §III-A, §III-A, §III-B, §III-C, §III-C.
-  (2016) Perceptual depth quality in distorted stereoscopic images. IEEE Transactions on Image Processing 26 (3), pp. 1202–1215. Cited by: §I.
-  (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §I, §I.
-  (2002) A universal image quality index. IEEE signal processing letters 9 (3), pp. 81–84. Cited by: §I, §I.
-  (2003) Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, Vol. 2, pp. 1398–1402. Cited by: §I.
-  (2005) Reduced-reference image quality assessment using a wavelet-domain natural image statistic model. In Human Vision and Electronic Imaging X, Vol. 5666, pp. 149–159. Cited by: §I, §I.
-  A blind stereoscopic image quality evaluator with segmented stacked autoencoders considering the whole visual perception route. IEEE Transactions on Image Processing 28 (3), pp. 1314–1328. Cited by: §I, §III-B, TABLE XI, TABLE IV.
-  (2019) Blind assessment for stereo images considering binocular characteristics and deep perception map based on deep belief network. Information Sciences 474, pp. 1–17. Cited by: §I, §I, §III-B, TABLE X, TABLE XI, TABLE IV, TABLE IX.
-  (2014) How transferable are features in deep neural networks?. In Advances in neural information processing systems, pp. 3320–3328. Cited by: §II-D.
-  (2010) Perceptual quality assessment for stereoscopic images based on 2d image quality metrics and disparity analysis. In Proc. Int. Workshop Video Process. Quality Metrics Consum. Electron, Vol. 9, pp. 1–6. Cited by: §I, §I, §III-B, TABLE X, TABLE IV, TABLE VIII, TABLE IX.
-  (2011) FSIM: a feature similarity index for image quality assessment. IEEE transactions on Image Processing 20 (8), pp. 2378–2386. Cited by: §I.
-  (2019) Dual-stream interactive networks for no-reference stereoscopic image quality assessment. IEEE Transactions on Image Processing. Cited by: §I, §I, §III-B, §III-B, §III-B, TABLE X, TABLE XI, TABLE XIII, TABLE IV, TABLE VI, TABLE VII, TABLE VIII, TABLE IX.