Image quality assessment (IQA) plays an essential role in various image processing and visual communication systems. In the literature, many traditional 2D IQA metrics have been proposed [62, 63, 7, 65, 64, 70, 57]. Due to the unprecedented development of 3D/stereoscopic technologies, quality issues for 3D media are attracting more and more attention in both academia and industry. During acquisition, compression, transmission, display, etc., originally pristine stereoscopic images usually suffer from perceptual quality degradation caused by diverse distortion types and degrees. As a result, there is an urgent demand for effective perceptual quality assessment of stereoscopic images.
Compared with the 2D IQA case, stereoscopic image quality assessment (SIQA) is more challenging owing to a wide variety of 2D and 3D influential factors, such as image spatial artifacts, depth perception, and visual comfort. These factors have different effects on the quality of experience (QoE) for stereoscopic images. Except for work that considers image distortion and depth perception quality simultaneously, existing research mainly focuses on modeling each individual factor for 3D QoE [50, 49, 61, 8]. In this paper, we aim to study the visually perceptual quality assessment of stereoscopic images.
Similar to 2D IQA, according to the availability of original reference stereoscopic images, SIQA models are typically divided into three categories: full-reference (FR) [6, 5, 69, 17, 10, 31, 30], reduced-reference (RR) [44, 35, 32], and blind/no-reference (NR) [1, 47, 9, 54, 41, 67, 71] SIQA metrics.
For FR SIQA algorithms, full information of the reference image is assumed to be available. The earliest FR SIQA models applied off-the-shelf 2D IQA metrics, such as the structural similarity index (SSIM) , universal quality index (UQI) , C4 , and so on, to assess stereoscopic image quality . Further, disparity information was integrated into 2D IQA metrics to predict 3D image quality [5, 69]. Apart from incorporating depth cues, the binocular vision characteristics of the human visual system (HVS) were combined with 2D IQA algorithms. For example, Gorley and Holliman proposed a stereo band limited contrast (SBLC) metric based on the HVS sensitivity to contrast changes in high-frequency regions . Moreover, Chen et al.  proposed cyclopean images according to the binocular rivalry of the human eyes. In addition, Lin and Wu developed a binocular-integration-based computational model for evaluating the perceptual quality of stereoscopic images .
As for RR SIQA approaches, only part of the original non-distorted image data is available. Qi et al. utilized binocular perceptual information (BPI) to perform RR SIQA . By characterizing the statistical properties of stereoscopic images in the reorganized discrete cosine transform (RDCT) domain, Ma et al. presented an RR SIQA method . Furthermore, an RR SIQA metric based on natural scene statistics (NSS) and structural degradation was proposed in .
However, in most practical applications, the original pristine image is not accessible. Therefore, NR SIQA is inevitably required, and several research works have studied it. Akhter et al. extracted local features of artifacts and disparity to evaluate the perceptual quality of stereoscopic images . Sazzad et al. also exploited perceptual differences of local features for NR SIQA . However, these are distortion-specific NR SIQA approaches, which are only suitable for JPEG-coded stereoscopic image pairs. Thus, several general-purpose NR SIQA metrics have emerged. Chen et al. extracted 2D and 3D NSS features from stereoscopic image pairs; the shape-parameter features were regressed onto subjective quality scores using the well-known support vector regression (SVR). Su et al. proposed the stereoscopic/3D blind image naturalness quality (S3D-BLINQ) index by constructing a convergent cyclopean image and extracting bivariate and correlation NSS features in the spatial and wavelet domains .
A blind deep quality evaluator with monocular and binocular interactions based on the deep belief network (DBN) was proposed for assessing stereoscopic image quality . The deep NR stereoscopic/3D image quality evaluator (DNR-S3DIQE) was then proposed, which extracted local abstractions and aggregated them into global features by employing an aggregation layer . In addition, Yang et al. took into account the deep perception map and a binocular weight model along with the DBN to predict perceived stereoscopic image quality . A deep edge and color signal integrity evaluator (DECOSINE) was also proposed based on the whole visual perception route from the eyes to the frontal lobe . Besides, Zhou et al. proposed a dual-stream interactive network called the stereoscopic image quality assessment network (StereoQA-Net) for NR SIQA . However, the above-mentioned algorithms consider little prior domain knowledge of the 3D HVS, and thus have difficulty in accurately predicting the perceptual quality of stereoscopic images with various distortion types and levels.
Binocular vision is crucial to quality assessment for stereoscopic images and can be mainly classified into three categories: binocular fusion, rivalry, and suppression. Firstly, if the retinal regions of the left and right eyes receive the same or similar visual contents, binocular fusion happens and the two views are integrated into a single, stable binocular perception . Secondly, binocular rivalry is a phenomenon in 3D vision in which perception alternates between different views when the two eyes see different scenes . Thirdly, binocular suppression occurs because the HVS cannot tolerate binocular rivalry for a long time; during binocular suppression, one view may be entirely inhibited by the other . Existing research considers either one or multiple binocular vision mechanisms to assist SIQA; in this paper, we primarily focus on modeling binocular rivalry for SIQA. In the conventional perspective, binocular rivalry is simulated by low-level competition between the input stimuli and is related to the energy of the stimulus [42, 29, 14]. Recently, more studies have tried to explain binocular rivalry with the predictive coding theory [13, 21], a popular theory about how the brain processes sensory visual stimuli. According to the predictive coding theory, the cognition system tries to match bottom-up visual signals with top-down predictions . Different from the traditional statement that binocular rivalry is low-level inter-ocular competition in the early visual cortex, the binocular rivalry mechanism based on predictive coding theory (BRM-PC) stresses high-level competition . Moreover, the BRM-PC is HVS-guided and more in line with the human cognition system . Therefore, we believe that introducing the BRM-PC will be beneficial to SIQA.
In this paper, we propose a generic architecture called Predictive Auto-encoDing Network (PAD-Net), an end-to-end network for general-purpose NR SIQA. The contributions of the proposed method are summarized as follows:
We propose a biologically plausible and explicable predictive auto-encoding network by combining the network design with the binocular rivalry mechanism based on predictive coding theory, which helps to explain the binocular rivalry phenomenon in 3D vision. The source code of PAD-Net is available online for public research usage: http://staff.ustc.edu.cn/~chenzhibo/resources.html.
According to the predictive coding theory in human brain, we adopt the encoder-decoder architecture to reconstruct the sensory input and further exploit the Siamese network to generate the corresponding likelihood as well as prior maps for modeling binocular rivalry in the cognition system.
We demonstrate that the obtained fusion information reflects the perceived differences between symmetrically and asymmetrically distorted stereoscopic image pairs under various distortion types and levels.
Compared with state-of-the-art SIQA metrics, the proposed PAD-Net provides more precise quality estimation, especially for stereopairs under asymmetric distortions, thanks to the thorough consideration of binocular rivalry based on the predictive coding theory.
The remainder of this paper is organized as follows. Section II describes the binocular rivalry mechanism based on predictive coding theory. Section III explains the proposed Predictive Auto-encoDing Network (PAD-Net) for NR SIQA in detail. Section IV presents the experimental results and analysis. Finally, we conclude the paper with an outlook on future work.
II. Binocular Rivalry Mechanism Based on Predictive Coding Theory
In this section, we first describe the predictive coding theory and then introduce the binocular rivalry mechanism based on predictive coding theory (BRM-PC) in detail.
II-A. Predictive Coding Theory
According to , the core task of the brain is to represent the environmental causes of its sensory input. In other words, given a sensory input, the cognition system of the human brain predicts the environmental cause and proposes a hypothesis. The final perceptual content is determined by the hypothesis that generates the best prediction .
Numerous predictive coding models have been proposed; the simplest is linear predictive coding (LPC) in digital signal processing . The predictive coding theory has also been applied to efficient encoding in the retina. Rao et al.  proposed a hierarchical model to represent this principle of the human brain, as shown in Fig. 2. The feedback pathways carry predictions from the higher level, while the feedforward pathways return the prediction errors between the predictions and the sensory input to update the predictions and obtain the best hypothesis. The model described here is for monocular vision, and it is the basis of the binocular rivalry model. In Rao's model, an image $I$ is represented by a vector $\mathbf{r}$ (lower-level causes) using the activation function $f(\cdot)$ and the dictionary $U$:

$$I = f(U\mathbf{r}) + \mathbf{n},$$

where $\mathbf{n}$ is the stochastic noise and each column $U_j$ of $U$ denotes a basis vector. The vector $\mathbf{r}$ can be further represented as:

$$\mathbf{r} = f(U_h \mathbf{r}_h) + \mathbf{n}_h,$$

where $U_h$ and $\mathbf{r}_h$ are the higher-level dictionary and causes, and $f(U_h \mathbf{r}_h)$ is the 'top-down' prediction of $\mathbf{r}$. The summed prediction error

$$E = \left\| I - f(U\mathbf{r}) \right\|^2 + \left\| \mathbf{r} - f(U_h \mathbf{r}_h) \right\|^2$$

can then be applied as the optimization function, since the goal of the predictive coding model is to estimate suitable causes that generate the best prediction.
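The hierarchical update described above can be sketched numerically. The following toy example is only a sketch under simplifying assumptions (a linear activation $f$, random dictionaries, and hand-picked dimensions, none of which come from the paper): it performs gradient descent on the summed prediction error $E$ with respect to the lower- and higher-level causes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a 16-pixel "image", 8 lower-level causes, 4 higher-level causes.
U = rng.normal(scale=0.1, size=(16, 8))    # lower-level dictionary
U_h = rng.normal(scale=0.1, size=(8, 4))   # higher-level dictionary
I = rng.normal(size=16)                    # sensory input (image vector)

r = np.zeros(8)     # lower-level causes
r_h = np.zeros(4)   # higher-level causes
lr = 0.05

def total_error(r, r_h):
    # E = ||I - U r||^2 + ||r - U_h r_h||^2  (linear activation for simplicity)
    return np.sum((I - U @ r) ** 2) + np.sum((r - U_h @ r_h) ** 2)

e0 = total_error(r, r_h)
for _ in range(500):
    e_low = I - U @ r        # feedforward prediction error at the lower level
    e_high = r - U_h @ r_h   # prediction error at the higher level
    r += lr * (U.T @ e_low - e_high)   # gradient step on E w.r.t. r
    r_h += lr * (U_h.T @ e_high)       # gradient step on E w.r.t. r_h
e1 = total_error(r, r_h)
```

The prediction error decreases monotonically for a small enough learning rate, mirroring the convergence behavior described for the feedforward/feedback loop in Fig. 2.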
II-B. Binocular Rivalry Mechanism Based on Predictive Coding Theory
According to , binocular rivalry appears when the left and right views are given different images and perception alternates between the two views. It is an important phenomenon when evaluating the quality of stereoscopic images due to the existence of asymmetric distortion, in which the left and right views may suffer from different levels of distortion. Thus, modeling the binocular rivalry mechanism is beneficial to stereoscopic image quality assessment.
Introducing the predictive coding theory to explain the binocular rivalry phenomenon has attracted growing attention from the community in recent years . Compared with the conventional perspective that binocular rivalry is low-level inter-ocular competition in the early visual cortex, the BRM-PC stresses high-level competition . In this paper, we adopt the general theoretical framework in [21, 11] for SIQA. On the basis of the BRM-PC, we perceive the content whose corresponding hypothesis has the highest posterior probability from the Bayesian perspective [15, 25]. As shown in Fig. 3, given a stereoscopic image, our brain first determines a hypothesis that can best explain the corresponding stimulus. The perceptual inference depends on the likelihood as well as the prior probability of the hypotheses. The likelihood describes how well the hypothesis predicts the input. The prior is related to empirical knowledge (shape, material, lighting) that comes from hypotheses at higher levels [25, 59] and describes how probable the hypothesis is. Then, the hypotheses for the left and right views compete with each other according to the posterior probability computed from the prior and likelihood.
According to the above analysis, we can conclude that the likelihood and prior are important for obtaining perceptual inference during binocular rivalry. They can be designed based on the high-level causes and the error map in the predictive coding model introduced in Section II-A.
III. Proposed Method
Inspired by the BRM-PC , we design the no-reference PAD-Net to automatically predict the perceptual quality of stereoscopic images. The proposed PAD-Net is a Siamese-based end-to-end network, including auto predictive coding and quality regression modules. A distorted 3D image pair is first divided into 256×256 sub-images; the quality scores of these sub-images are then estimated through PAD-Net and aggregated into a final score as done in . We pre-train the proposed 'auto-encoder' and the quality regression module on the Waterloo Exploration Database  and the LIVE 2D Database , respectively. After that, the entire network is jointly optimized using 3D IQA databases [39, 10] to generate more accurate predictions.
In this section, the architecture of our proposed PAD-Net is described first as shown in Fig. 1. Then, we introduce the Siamese encoder-decoder module and quality regression module in details. Finally, the training and testing methods of our PAD-Net are presented.
The architecture of the PAD-Net is depicted in Fig. 1, which contains a Siamese encoder-decoder module and a quality regression module. The Siamese encoder-decoder module represents the processing procedure of left and right view images. It is inspired by the predictive coding theory that the human brain tries to match bottom-up visual stimuli with top-down predictions . In Section II, an empirical Bayesian framework based on BRM-PC is introduced and the rivalry dominance is calculated from the likelihood and prior probability . Correspondingly, the likelihood and prior are inferred through the Siamese encoder-decoder in our proposed PAD-Net. Note that we are not going to calculate real probabilities in the proposed method, but just try to model the likelihood and prior with a quantity that has similar physical meaning in the BRM-PC described in .
For each branch of the Siamese encoder-decoder, four down-convolutions and four up-convolutions with a stride of 2 pixels are performed to reconstruct the distorted images. Non-linear activation layers follow the first three down-convolution and up-convolution layers to enhance the representation ability of the network. We square the error between the input and reconstructed image as the residual map and utilize this map for likelihood calculation. Moreover, the high-level features are convolved with a stride of one pixel to change the channel size and upscaled to the same resolution as the input image. In the PAD-Net, we use the reconstruction error to obtain the likelihood probability map and the high-level representation after the four down-convolutions to generate the prior probability map. The reasons will be given in Section III-B.
Once we get the left and right view images as well as their likelihood and prior probability maps, we can feed them into the quality regression module and compute the final quality score. It can be seen from the bottom of Fig. 1 that the quality regression module is composed of 1) fusion of the input images and probability maps, 2) ResNet-18 without its last two layers as the feature extractor, and 3) a final max pooling and fully connected layer. Note that the fusion in Fig. 1 includes a one-stride convolution layer and an activation layer to match the input size of ResNet-18. The output of the proposed PAD-Net is the predicted quality score of the input stereoscopic image.
III-B. Siamese Encoder-Decoder Module
The Siamese encoder-decoder module is inspired by the BRM-PC, which is HVS-guided and able to generate more reliable and interpretable results. To simulate the predictive coding process in the BRM-PC, we adopt the hierarchical encoder-decoder network structure for image compression from [3, 4]. In Fig. 2, the prediction is adjusted with residual errors to obtain a better hypothesis; thus, we employ the squared error between the predicted (decoded) image $\hat{I}_i$ and the input image $I_i$ as our loss function to pre-train the encoder-decoder network:

$$L_{ED}(\theta_{ED}) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{W \times H \times C} \left\| I_i - \hat{I}_i \right\|_2^2,$$

where $I_i$ and $\hat{I}_i$ represent the $i$-th input and predicted image, $W$, $H$ and $C$ are the width, height and channel of $I_i$ and $\hat{I}_i$, and $N$ is the batch size of a mini-batch of training data. $\hat{I}_i$ is estimated via the encoder-decoder network with weight $\theta_{ED}$; $\hat{\theta}_{ED}$ denotes the encoder-decoder weight after loss minimization. During the training stage, the loss is defined as the prediction error between the distorted and reconstructed images, which can be regarded as the feedforward error signal in Fig. 2. We then use the gradient descent algorithm to update $\theta_{ED}$ and generate better predictions. Finally, the prediction error converges to a stable value and the decoded image no longer changes greatly. Thus, the updating policy of the encoder-decoder network is similar to the predictive coding theory of the human brain. Note that the encoder-decoder network is pre-trained on the Waterloo Exploration Database , so the training data include both reference and distorted images.
|Module||Layer||Input (C, H, W)||Output (C, H, W)||Kernel/Stride/Padding|
|Encoder||Conv1+GDN1||3, 256, 256||128, 128, 128||5x5/2/2|
| ||Conv2+GDN2||128, 128, 128||128, 64, 64||5x5/2/2|
| ||Conv3+GDN3||128, 64, 64||128, 32, 32||5x5/2/2|
| ||Conv4||128, 32, 32||192, 16, 16||5x5/2/2|
|Decoder||Unconv1+IGDN1||192, 16, 16||128, 32, 32||5x5/2/2|
| ||Unconv2+IGDN2||128, 32, 32||128, 64, 64||5x5/2/2|
| ||Unconv3+IGDN3||128, 64, 64||128, 128, 128||5x5/2/2|
| ||Unconv4||128, 128, 128||3, 256, 256||5x5/2/2|
Unconv: fractionally-strided convolution
GDN: generalized divisive normalization
IGDN: inverse generalized divisive normalization
(C, H, W): Channel, Height, Width
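As a rough illustration of the layer configuration in Table I, one branch of the Siamese encoder-decoder can be sketched in PyTorch. This is a sketch, not the paper's implementation: GDN/IGDN are not in the PyTorch standard library, so PReLU is used here as a stand-in activation; the convolution shapes follow the table.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """One branch of the Siamese encoder-decoder (after Table I).
    GDN/IGDN layers are replaced by PReLU as a stand-in activation."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 128, 5, stride=2, padding=2), nn.PReLU(),
            nn.Conv2d(128, 128, 5, stride=2, padding=2), nn.PReLU(),
            nn.Conv2d(128, 128, 5, stride=2, padding=2), nn.PReLU(),
            nn.Conv2d(128, 192, 5, stride=2, padding=2),  # high-level causes
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(192, 128, 5, stride=2, padding=2, output_padding=1), nn.PReLU(),
            nn.ConvTranspose2d(128, 128, 5, stride=2, padding=2, output_padding=1), nn.PReLU(),
            nn.ConvTranspose2d(128, 128, 5, stride=2, padding=2, output_padding=1), nn.PReLU(),
            nn.ConvTranspose2d(128, 3, 5, stride=2, padding=2, output_padding=1),  # reconstruction
        )

    def forward(self, x):
        z = self.encoder(x)          # (N, 192, 16, 16) for a 256x256 input
        return self.decoder(z), z

model = EncoderDecoder()
x = torch.randn(1, 3, 256, 256)
recon, causes = model(x)
```

The squared difference between `recon` and `x` then serves as the residual map used for likelihood calculation.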
As shown in Table I, the convolution layers in the encoder and decoder are followed by generalized divisive normalization (GDN) transforms and inverse generalized divisive normalization (IGDN) transforms, respectively. GDN is inspired by neuron models in the biological visual system and has proved effective in density estimation, image compression  and image quality assessment . The GDN and IGDN operations are given by:

$$y_i = \frac{x_i}{\left( \beta_i + \sum_j \gamma_{ij} x_j^2 \right)^{1/2}}, \qquad \hat{x}_i = \hat{y}_i \cdot \left( \hat{\beta}_i + \sum_j \hat{\gamma}_{ij} \hat{y}_j^2 \right)^{1/2},$$

where $x_i$ denotes the $i$-th channel of the input to the GDN transform and $y_i$ is the $i$-th channel of the normalized activation feature map output by the GDN operation. $\beta_i$ and $\gamma_{ij}$ are the parameters to be updated in the GDN function. Likewise, $\hat{y}_i$, $\hat{x}_i$, $\hat{\beta}_i$ and $\hat{\gamma}_{ij}$ play the same roles for the IGDN transform. The goal of our encoder-decoder module is to reconstruct the input sensory signal instead of compressing it; therefore, we remove the quantization step in  for better reconstruction.
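The GDN/IGDN formulas above can be sketched in NumPy. This is a minimal sketch with illustrative assumptions (a toy channel count and a diagonal coupling matrix `gamma`); in practice these parameters are learned.

```python
import numpy as np

def gdn(x, beta, gamma):
    """Generalized divisive normalization over channels.
    x: (C, H, W); beta: (C,); gamma: (C, C).
    y_i = x_i / sqrt(beta_i + sum_j gamma_ij * x_j^2)"""
    denom = np.sqrt(beta[:, None, None] +
                    np.tensordot(gamma, x ** 2, axes=([1], [0])))
    return x / denom

def igdn(y, beta, gamma):
    """Inverse GDN (approximate inverse of GDN, as used in the decoder):
    x_i = y_i * sqrt(beta_i + sum_j gamma_ij * y_j^2)"""
    return y * np.sqrt(beta[:, None, None] +
                       np.tensordot(gamma, y ** 2, axes=([1], [0])))

C, H, W = 4, 8, 8
rng = np.random.default_rng(1)
x = rng.normal(size=(C, H, W))
beta = np.ones(C)
gamma = 0.1 * np.eye(C)   # diagonal coupling, for the toy example only
y = gdn(x, beta, gamma)
x_rec = igdn(y, beta, gamma)
```

With `beta = 1`, the divisor is always at least one, so GDN shrinks activations; IGDN expands them again, though it is not an exact pointwise inverse.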
By now, we can obtain the high-level causes (high-level features) and the reconstructed image from the input sensory signal in the proposed Siamese encoder-decoder module. Then, according to the BRM-PC , the left and right views compete with each other to obtain the best hypothesis, which is related to the prior and likelihood. To be specific, the likelihood describes how well the hypothesis predicts the input, while the prior describes how probable the hypothesis is and is concerned with empirical knowledge . Corresponding to the physical meaning of the likelihood, we first compute the $i$-th squared residual error map in a mini-batch of size $N$:

$$E_i = \frac{1}{C} \sum_{c=1}^{C} \left( I_i^c - \hat{I}_i^c \right)^2,$$

where $I_i^c$ and $\hat{I}_i^c$ denote the $c$-th channel of the input and predicted image, and $C$ equals 3. The likelihood is used to measure the similarity between $I_i$ and $\hat{I}_i$ and is inversely proportional to the errors. The training stage of the encoder-decoder network can be regarded as a procedure of prediction error minimization.
In the Siamese encoder-decoder module, the prior is modeled with the high-level representation of the sensory input, since the prior comes from higher levels in the cognition system, as assumed in empirical Bayes [21, 56]. Thus, the high-level features are utilized to generate the $i$-th prior map $P_i$. Before being fed into the quality regression module, the squared error maps and the prior maps are normalized between the left and right views:

$$\tilde{P}_i^l = \frac{P_i^l}{P_i^l + P_i^r}, \quad \tilde{P}_i^r = \frac{P_i^r}{P_i^l + P_i^r}, \quad \tilde{L}_i^l = \frac{E_i^r}{E_i^l + E_i^r}, \quad \tilde{L}_i^r = \frac{E_i^l}{E_i^l + E_i^r},$$

where $P_i^l$, $P_i^r$, $E_i^l$ and $E_i^r$ are the prior and error maps for the $i$-th left-view and right-view images in a mini-batch, and $\tilde{P}_i^l$, $\tilde{P}_i^r$, $\tilde{L}_i^l$ and $\tilde{L}_i^r$ indicate the normalized prior and likelihood probability maps. Note that the error is opposite to the likelihood: if the error is large, the likelihood is small. Hence, when computing the likelihood map for the left view, the error map of the right view is adopted, and vice versa.
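The cross-view normalization and the error-to-likelihood swap can be sketched as follows. This is an illustrative sketch: `normalize_between_views` is a hypothetical helper, and the error maps here are random stand-ins for actual reconstruction errors, with the right view given heavier simulated distortion.

```python
import numpy as np

def normalize_between_views(a_left, a_right, eps=1e-8):
    """Normalize two maps so they sum to one per pixel (hypothetical helper)."""
    total = a_left + a_right + eps
    return a_left / total, a_right / total

rng = np.random.default_rng(2)
# Squared reconstruction-error maps (channel-averaged) for both views.
err_left = rng.uniform(size=(256, 256))
err_right = 2.0 * rng.uniform(size=(256, 256))   # right view: heavier distortion

# Likelihood is inversely related to error: each view's likelihood uses the
# OTHER view's share of the error, as in the normalization above.
e_l, e_r = normalize_between_views(err_left, err_right)
like_left, like_right = e_r, e_l   # large error on the right -> high likelihood for the left
```

With heavier distortion in the right view, the left view receives the larger likelihood on average, matching the intended behavior of the normalization.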
|Branch||Layer||Input (C, H, W)||Output (C, H, W)||Kernel/Stride/Padding|
|Prior||Softplus4||192, 16, 16||192, 16, 16||-|
| ||Conv5+Softplus5||192, 16, 16||1, 16, 16||1x1/1/0|
| ||Up-sample6||1, 16, 16||1, 256, 256||-|
| ||Square7a||1, 256, 256||1, 256, 256||-|
| ||Normalization8a||1, 256, 256||1, 256, 256||-|
|Likelihood||Square7b||1, 256, 256||1, 256, 256||-|
| ||Normalization8b||1, 256, 256||1, 256, 256||-|
Normalization: normalization between the left and right views
(C, H, W): Channel, Height, Width
The detailed structure of the prior and likelihood generation is given in Table II. We employ the Softplus activation function  in the prior generation to keep values non-negative before the subsequent square operation, which would otherwise map negative values to positive ones. It is defined as:

$$\mathrm{Softplus}(x) = \ln\left( 1 + e^{x} \right).$$

The Softplus function can be regarded as a smooth version of the ReLU function, which is similar to the way cerebral neurons are activated.
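A numerically stable sketch of this definition, illustrating that Softplus stays strictly positive and smoothly approaches ReLU for large inputs:

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus:
    # log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|))
    return np.maximum(x, 0) + np.log1p(np.exp(-np.abs(x)))

x = np.linspace(-6, 6, 7)
y = softplus(x)   # strictly positive; ~x for large x, ~0 for very negative x
```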
III-C. Quality Regression Module
Based on the distorted stereoscopic images as well as the prior and likelihood probability maps obtained from the Siamese encoder-decoder module, we fuse them into a 3-channel feature map and feed it into the ResNet-18 quality regression network to extract discriminative features for quality estimation. ResNet-18 is chosen for its excellent feature extraction ability [19, 20]. Its last two layers, the average pooling and fully connected layer, are removed so that the feature map can be regressed into a quality score. Table III illustrates the architecture of the quality regression network. The input stem and basic block of the ResNet-18 structure are shown in Fig. 1.
|Module||Layer||Input (C, H, W)||Output (C, H, W)||Kernel/Stride/Padding|
|Fusion||Conv9+GDN9||10, 256, 256||3, 256, 256||1x1/1/0|
|ResNet-18 Regression||ResNet-18||3, 256, 256||512, 8, 8||-|
GDN: generalized divisive normalization
Fc: fully connected layer
(C, H, W): Channel, Height, Width
III-D. Training and Testing
Owing to the limited size of available 3D image quality assessment databases, we train PAD-Net on sub-image pairs with a resolution of 256×256; the MOS value of the entire image is assumed to be the quality score of its sub-images, as done in . Thus, sub-images coming from the same test image share the same label. Moreover, transfer learning is adopted to alleviate the lack of labeled data and enhance the prediction accuracy of the network [43, 68].
In blind stereoscopic image quality assessment, it is difficult to predict the MOS value precisely . Therefore, we divide the training stage into three steps: 1) pre-training the encoder-decoder on pristine and distorted 2D images; 2) pre-training the ResNet-18 regression on pristine and distorted 2D images; 3) joint optimization on the 3D IQA databases.
Firstly, the encoder-decoder is trained to minimize the difference between the predicted and input images, as described in Section III-B. Secondly, we utilize the original and distorted 2D images along with the associated MOS scores to pre-train the ResNet-18 regression network, which aims to map a 2D image to a quality score. In addition, ResNet-18 weights pre-trained on ImageNet are adopted for better initialization.
Finally, the Siamese encoder-decoder and quality regression modules are jointly optimized using stereo image pairs. Since the ultimate purpose of PAD-Net is to estimate the perceptual quality of 3D images, we adopt the $\ell_1$-norm between the subjective MOS value $q_i$ and the predicted score as the loss function:

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| Q\left( I_i^l, I_i^r; \theta \right) - q_i \right\|_1,$$

where $I_i^l$ and $I_i^r$ represent the input 3D sub-image pair, and $Q(\cdot)$ indicates the PAD-Net with encoder-decoder weight $\theta_{ED}$, ResNet-18 regression weight $\theta_R$, and weight $\theta_S$ trained from scratch. At the joint optimization step, $\theta_{ED}$ and $\theta_R$ are initialized with the pre-trained weights and updated together with $\theta_S$ through final loss minimization:

$$\hat{\theta} = \arg\min_{\theta} L(\theta).$$
In the testing stage, the stereo image is divided into sub-image pairs with a stride of $U$ to cover the whole content. The predicted qualities of all sub-image pairs are averaged to compute the final perceptual quality score.
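The test-time cropping and score aggregation can be sketched as below, using the 192/104 strides reported in Section IV-B and a dummy patch-level predictor standing in for PAD-Net (the helper names are ours, not the paper's).

```python
import numpy as np

def crop_positions(length, patch, stride):
    """Top-left coordinates covering the full extent, with a final
    overlapping crop flush against the border if needed."""
    pos = list(range(0, length - patch + 1, stride))
    if pos[-1] != length - patch:
        pos.append(length - patch)
    return pos

def predict_image_score(img_l, img_r, predict_patch, patch=256, sw=192, sh=104):
    """Average the patch-level predictions over the cropping grid."""
    h, w = img_l.shape[:2]
    scores = [predict_patch(img_l[y:y+patch, x:x+patch],
                            img_r[y:y+patch, x:x+patch])
              for y in crop_positions(h, patch, sh)
              for x in crop_positions(w, patch, sw)]
    return float(np.mean(scores))

# Toy usage with a dummy patch predictor (stands in for the trained PAD-Net).
rng = np.random.default_rng(3)
L = rng.uniform(size=(360, 640, 3))
R = rng.uniform(size=(360, 640, 3))
score = predict_image_score(L, R, lambda a, b: a.mean() + b.mean())
```

For a 640×360 image this grid yields 3 horizontal and 2 vertical positions, i.e. 6 slightly overlapping 256×256 sub-image pairs.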
IV. Experimental Results and Analysis
In this section, we first introduce the databases and performance measures used in our experiments. Then, the experimental results of the proposed PAD-Net on the entire LIVE databases and on individual distortion types are illustrated. Meanwhile, visualization results are provided for better explanation. Finally, we conduct an ablation study to verify the effectiveness of each component in our model and measure the computational complexity.
IV-A. Databases and Performance Measures
LIVE Phase I : This database contains 20 original and 365 symmetrically distorted stereo image pairs. Five distortion types are included: JPEG2000 compression (JP2K), JPEG compression (JPEG), additive white noise (WN), Gaussian blur (BLUR), and Rayleigh fast fading channel distortion (FF). A subjective differential mean opinion score (DMOS) is provided for each degraded stereo image; a higher DMOS value means lower visual quality.
LIVE Phase II [10, 9]: It includes 120 symmetrically and 240 asymmetrically distorted stereopairs derived from 8 reference images and contains the same distortion types as LIVE Phase I. For each distortion type, the pristine image pair is degraded into 3 symmetrically and 6 asymmetrically distorted image pairs. Subjective scores are also recorded in the form of DMOS.
Waterloo IVC Phase I : This database originates from 6 pristine stereoscopic image pairs. The reference images are altered by three types of distortion: WN, BLUR, and JPEG. Altogether, there are 78 symmetrically and 252 asymmetrically distorted stereopairs. A subjective mean opinion score (MOS) and individual scores are provided for each stereoscopic image in this database; a higher MOS value means better visual quality.
Performance Measure: Three commonly used criteria  are utilized in our experiment for performance evaluation, including Spearman’s rank order correlation coefficient (SROCC), Pearson’s linear correlation coefficient (PLCC) and root mean squared error (RMSE). SROCC is a non-parametric measure and independent of monotonic mapping. PLCC and RMSE evaluate the prediction accuracy. Higher SROCC, PLCC and lower RMSE indicate better correlation with human judgements. Before calculating PLCC and RMSE, a five-parameter logistic function  is applied to maximize the correlation between subjective ratings and objective metrics.
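The three criteria, including the logistic mapping applied before PLCC and RMSE, can be computed with SciPy as sketched below. The scores are synthetic, and the initial parameters `p0` are a common heuristic rather than values from the paper; a five-parameter logistic with a linear term is used, one standard form of this mapping.

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr
from scipy.optimize import curve_fit

def logistic5(x, b1, b2, b3, b4, b5):
    # Five-parameter logistic mapping applied before computing PLCC/RMSE.
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (x - b3)))) + b4 * x + b5

def evaluate(objective, mos):
    srocc = spearmanr(objective, mos)[0]   # rank correlation, mapping-free
    p0 = [np.max(mos), 1.0, np.mean(objective), 1.0, np.mean(mos)]
    params, _ = curve_fit(logistic5, objective, mos, p0=p0, maxfev=10000)
    mapped = logistic5(objective, *params)
    plcc = pearsonr(mapped, mos)[0]
    rmse = float(np.sqrt(np.mean((mapped - mos) ** 2)))
    return srocc, plcc, rmse

# Synthetic scores that are monotonically related with mild noise.
rng = np.random.default_rng(4)
obj = rng.uniform(0, 1, 100)
mos = 20 + 60 * obj + rng.normal(0, 2, 100)
srocc, plcc, rmse = evaluate(obj, mos)
```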
One of the main issues of PLCC and SROCC is that they neglect the uncertainty of the subjective scores . Thus, we also employ the Krasula methodology , which better assesses the capabilities of objective metrics by considering the statistical significance of the subjective scores and dispensing with mapping functions.
The basic idea of this methodology is to determine the reliability of objective models by checking whether they are capable of 1) distinguishing significantly different stimuli from similar ones and 2) indicating whether one stimulus is of better or worse quality than another. To this end, in the Krasula framework, pairs of stimuli are selected from the database to compute the area under the ROC curve of the 'Different vs. Similar' category (AUC-DS), the area under the ROC curve of the 'Better vs. Worse' category (AUC-BW), and the percentage of correct classification (CC). Higher AUC-DS and AUC-BW mean more capability to distinguish different/similar and better/worse pairs; higher CC represents better prediction accuracy. Please refer to  for more details.
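The 'Better vs. Worse' analysis reduces to a ranking AUC over objective score differences for pairs judged significantly different. The following is a simplified sketch with synthetic pair data; in the real framework the pair labels come from statistical testing of the subjective scores.

```python
import numpy as np

def auc_better_worse(delta_obj, better_labels):
    """AUC for the 'Better vs. Worse' task (simplified Krasula sketch).
    delta_obj: objective score differences for significantly different pairs.
    better_labels: 1 when the first stimulus is subjectively better."""
    pos = delta_obj[better_labels == 1]
    neg = delta_obj[better_labels == 0]
    # Probability that a 'better' pair receives a larger score difference
    # than a 'worse' pair (ties counted as half).
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Synthetic pair data: well-separated score differences for the two classes.
rng = np.random.default_rng(5)
d_pos = rng.normal(1.0, 0.5, 200)    # pairs where stimulus A is better
d_neg = rng.normal(-1.0, 0.5, 200)   # pairs where stimulus A is worse
delta = np.concatenate([d_pos, d_neg])
labels = np.concatenate([np.ones(200), np.zeros(200)]).astype(int)
auc = auc_better_worse(delta, labels)
```

A metric whose score differences consistently agree with the subjective better/worse judgments approaches an AUC-BW of 1.0.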
IV-B. Performance Evaluation
In the experiment, the distorted stereo pairs are randomly split into an 80% training set and a 20% testing set according to . We adopt the Adam algorithm in the pre-training and joint optimization steps. During pre-training, the learning rate is lowered by a factor of 10 every 50 epochs, and the pre-trained weights are obtained after 100 epochs. Since the encoder-decoder should retain its function, its learning rate is kept small to avoid drastic changes when conducting joint optimization. Moreover, the learning rate for the ResNet-18 regression is set as half of that for the weights trained from scratch, which is scaled by 0.25 every 50 epochs and remains unchanged after 200 epochs. We apply data augmentation by random cropping and horizontal and vertical flipping in the training stage . The results are obtained after 300 epochs. During testing, the stride $U$ is set as 192 for width and 104 for height in a slightly overlapping manner to cover the whole resolution of the LIVE databases, as shown in Fig. 4.
|Metric||LIVE Phase I||LIVE Phase II|
| ||SROCC||PLCC||RMSE||SROCC||PLCC||RMSE|
|Cyclopean MS-SSIM ||0.916||0.917||6.533||0.889||0.900||4.987|
We compare the proposed PAD-Net with several classic FR, RR and NR SIQA metrics on the LIVE Phase I and II databases. The competing FR and RR models include Gorley's method , You's method , Benoit's method , Lin's method , Cyclopean MS-SSIM , RR-BPI , RR-RDCT  and Ma's method . For NR metrics, hand-crafted feature based algorithms including Akhter's method , Sazzad's method , Chen's method , S3D-BLINQ  and DECOSINE , as well as deep neural network based models including Shao's method , CNN , DNR-S3DIQE , DBN  and StereoQA-Net , are considered in the performance comparison. Note that CNN  is computed for the left and right view images separately, and the scores of the two views are then averaged. The SROCC, PLCC and RMSE results for the above metrics and the proposed PAD-Net are listed in Table IV, where the best results are highlighted in bold. It can be observed that the proposed method outperforms state-of-the-art SIQA metrics, especially on the LIVE Phase II database. Since there are more asymmetrically distorted images in LIVE Phase II, this indicates that the proposed PAD-Net is more effective for the challenging asymmetric distortions, which will be further examined in Section IV-C.
To employ the Krasula methodology , significance analysis of the subjective scores is required. Among the considered 3D IQA databases [39, 10, 9, 60], only the Waterloo IVC Phase I  provides individual scores. Furthermore, the executable/source code of most state-of-the-art NR 3D metrics has not been released. Therefore, we conduct the Krasula analysis only on the Waterloo IVC Phase I database and compare the proposed PAD-Net with StereoQA-Net , which obtains the best performance among the competitors on both the LIVE Phase I  and LIVE Phase II [10, 9] databases. Table V lists the results of the SROCC, PLCC, RMSE and Krasula performance evaluations. It can be observed that the proposed PAD-Net achieves the best performance in terms of SROCC, PLCC, RMSE, AUC-DS, AUC-BW and CC. The results of the Krasula criteria demonstrate that PAD-Net is the most promising metric for distinguishing stereo images of different quality.
|LIVE Phase I / II||Cyclopean MS-SSIM ||CNN ||StereoQA-Net ||Proposed PAD-Net|
|Cyclopean MS-SSIM ||0 / 0||-1 / 1||-1 / -1||-1 / -1|
|CNN ||1 / -1||0 / 0||-1 / -1||-1 / -1|
|StereoQA-Net ||1 / 1||1 / 1||0 / 0||-1 / -1|
|Proposed PAD-Net||1 / 1||1 / 1||1 / 1||0 / 0|
Results of the T-Test on the LIVE Phase I and Phase II Database.
Moreover, we conduct significance t-tests using the PLCC values of 10 runs  to verify whether our proposed model is statistically better than the other metrics. Table VI lists the results of the t-tests on LIVE Phase I and II, where '1' or '-1' indicates that the metric in the row is statistically superior or inferior to the metric in the column, and '0' means that the two metrics are statistically indistinguishable. From Table VI, we can see that our proposed metric is statistically better than the other metrics on both LIVE Phase I and II.
IV-C. Performance Evaluation for Symmetric/Asymmetric Distortion
|Metric||LIVE Phase II||Waterloo IVC Phase I|
| ||Symmetric||Asymmetric||Symmetric||Asymmetric|
|Cyclopean MS-SSIM ||0.923||0.842||0.924||0.643|
Our proposed PAD-Net is based on predictive coding theory and applies deep neural networks to model the binocular rivalry mechanism for better prediction of stereo image quality. Binocular rivalry seldom occurs under symmetric distortion but plays an important role in the quality assessment of asymmetrically distorted images. Table VII presents the SROCC performance for symmetrically and asymmetrically distorted images in the LIVE Phase II and Waterloo IVC Phase I databases. PAD-Net demonstrates an extraordinary ability to predict the perceived quality of asymmetrically distorted stereo pairs by closely mimicking the visual mechanism of binocular rivalry.
We provide some visualization results of PAD-Net for better explanation. In Fig. 5, the left view is undamaged while the right view is degraded with white noise. The reconstructed stereo pairs are similar to the inputs. In the normalized prior and likelihood maps, lighter color indicates higher probability. For an asymmetrically noised image, the view with heavier noise is harder to reconstruct and thus has globally lower likelihood probability. However, at strong edges, which carry structural information, the undamaged view may be allocated a smaller probability, as shown in the magnified likelihood map of Fig. 5. Moreover, the discrepancy between the normalized left and right view prior maps is not significant, since the scene remains recognizable in both views. According to prior findings, 3D image quality is more affected by the poor-quality view under noise contamination. As reflected by the magnified likelihood map of Fig. 5, this tendency appears around regions containing obvious edges, which means the subjective rating is more affected by structural information [62, 10].
Fig. 6(a) presents an asymmetrically blurred image: the left view is undamaged while the right view is blurred. Contrary to white noise, the view with the heavier blurring effect is easier to reconstruct and thus has larger likelihood probability. In addition, strong edges in the undistorted view tend to gain more probability in binocular rivalry, as indicated in the magnified likelihood map of Fig. 6(a). For the same reason that both views remain comprehensible, the normalized prior maps do not show significant differences. For image blur, it is reported in [60, 37] that 3D image quality is more affected by the high-quality view. In the magnified likelihood map of Fig. 6(a), strong edges in the high-quality view have higher probability, which again demonstrates that subjective judgments are more affected by structural information [62, 10]. In Fig. 6(b), the stereo pairs are symmetrically distorted; as a result, the normalized prior and likelihood maps of the two views share similar probability.
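The normalized likelihood maps discussed above behave like a two-view softmax over reconstruction errors: the view that is harder to reconstruct receives lower probability at each pixel. A minimal NumPy sketch under a Gaussian error assumption (the actual likelihood model learned by PAD-Net may differ):

```python
import numpy as np

def normalized_likelihood(err_left, err_right, sigma=1.0):
    """Per-pixel likelihood of each view given its reconstruction error,
    normalized so the two views sum to one at every pixel."""
    ll_l = np.exp(-np.asarray(err_left) ** 2 / (2 * sigma ** 2))
    ll_r = np.exp(-np.asarray(err_right) ** 2 / (2 * sigma ** 2))
    total = ll_l + ll_r
    return ll_l / total, ll_r / total

# the noisier (harder to reconstruct) right view gets lower probability
err_l = np.full((4, 4), 0.1)   # near-perfect reconstruction
err_r = np.full((4, 4), 1.0)   # heavy white noise
p_l, p_r = normalized_likelihood(err_l, err_r)
```

Under this sketch, symmetric distortion (equal error maps) yields probabilities near 0.5 for both views, matching the similar maps observed in Fig. 6(b).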
Table VIII. SROCC performance on individual distortion types.

| Metric | LIVE I JP2K | LIVE I JPEG | LIVE I WN | LIVE I Blur | LIVE I FF | LIVE II JP2K | LIVE II JPEG | LIVE II WN | LIVE II Blur | LIVE II FF |
|---|---|---|---|---|---|---|---|---|---|---|
| Cyclopean MS-SSIM | 0.888 | 0.530 | 0.948 | 0.925 | 0.707 | 0.814 | 0.843 | 0.940 | 0.908 | 0.884 |
Besides, the fusion maps of the distorted images and the normalized prior and likelihood maps from the left and right views are depicted in Fig. 7. The fusion maps of symmetrically and asymmetrically distorted images are easy to distinguish: the fusion map of a symmetrically distorted image has a gray tone, while those of asymmetrically distorted images appear green or pink. This is caused by the different probabilities assigned to the left and right view images during fusion. Moreover, for asymmetrically distorted images, white noise differs from the other four distortion types, since noise tends to introduce high-frequency information while the other four distortion types tend to remove details, which correspond to high-frequency information.
Table IX. PLCC performance on individual distortion types.

| Metric | LIVE I JP2K | LIVE I JPEG | LIVE I WN | LIVE I Blur | LIVE I FF | LIVE II JP2K | LIVE II JPEG | LIVE II WN | LIVE II Blur | LIVE II FF |
|---|---|---|---|---|---|---|---|---|---|---|
| Cyclopean MS-SSIM | 0.912 | 0.603 | 0.942 | 0.942 | 0.776 | 0.834 | 0.862 | 0.957 | 0.963 | 0.901 |
IV-D Performance Evaluation on Individual Distortion Types
We further investigate the capability of our proposed PAD-Net on each distortion type; the SROCC and PLCC performance are reported in Tables VIII and IX, with the best results across the listed metrics highlighted in boldface. As shown in Tables VIII and IX, our proposed model achieves competitive performance for most distortion types. In addition, the scatter plots of DMOS values versus the objective scores predicted by PAD-Net for each distortion type on LIVE Phase I and II are presented in Figs. 8(a) and 8(b). The linear correlation between DMOS values and predicted scores demonstrates the strong monotonicity and accuracy of PAD-Net. The DMOS range of JPEG-compressed images is narrower than those of the other distortion types, making their perceptual quality more difficult to estimate; thus, the PLCC and SROCC performance for JPEG distortion is generally lower than for the other four.
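The per-distortion-type results in Tables VIII and IX can be obtained by grouping the test images by distortion label before computing the correlation. A small sketch (the labels "wn" and "jpeg" and the toy scores are hypothetical):

```python
import numpy as np
from scipy import stats

def srocc_per_distortion(pred, dmos, labels):
    """SROCC computed separately for each distortion type."""
    out = {}
    for d in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == d]
        out[d] = stats.spearmanr(np.take(pred, idx),
                                 np.take(dmos, idx)).correlation
    return out

# toy usage: two distortion types, monotone predictions within each
pred   = [1, 2, 3, 10, 20, 30]
dmos   = [5, 6, 7, 50, 60, 70]
labels = ["wn", "wn", "wn", "jpeg", "jpeg", "jpeg"]
res = srocc_per_distortion(pred, dmos, labels)
```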
IV-E Cross-Database Tests
We conduct cross-database tests to verify the generalization ability of our proposed PAD-Net. The compared models are trained on one database and tested on another. Table X presents the PLCC performance for cross-database validation. Although PAD-Net does not show the best performance when trained on LIVE Phase II and tested on LIVE Phase I, it outperforms the other metrics in the second round, which is the more challenging task: LIVE Phase I consists only of symmetric distortions, while more than half of the 3D pairs in LIVE Phase II are asymmetrically distorted, so PAD-Net trained on LIVE Phase I must handle asymmetric distortions it has never seen before. The PLCC performance on LIVE Phase II not only proves the generalization ability and robustness of PAD-Net but also demonstrates the effectiveness of the predictive-coding-based binocular rivalry mechanism for asymmetric distortion in the proposed method.
Table X. PLCC performance of cross-database tests (trained on LIVE II / tested on LIVE I, and trained on LIVE I / tested on LIVE II).
IV-F Effects of Network Structure
To explore the influence of different network structures on the quality regression network, VGG-16, ResNet-18, ResNet-34 and ResNet-50 are adopted for comparison. The SROCC, PLCC and RMSE performance on LIVE Phase I and II are reported in Table XI. First, ResNet extracts more discriminative features for quality prediction than the VGG structure. Moreover, the performance does not improve with increasing ResNet depth. A possible explanation is that the limited training data favor a shallower architecture: very deep networks generally need a huge amount of training data to achieve high performance, but LIVE Phase I and II contain only hundreds of distorted images, which, even with data augmentation, is far from enough for deeper networks and may cause over-fitting. As a result, ResNet-18 is chosen in this paper as the better tradeoff between accuracy and model capacity.
Table XI. SROCC, PLCC and RMSE performance of different network structures on LIVE Phase I and II.
IV-G Ablation Study
Furthermore, an ablation study is conducted to verify the effectiveness of each component in PAD-Net. We first feed the distorted left and right view images directly into the quality regression module as the baseline. Then, the normalized likelihood and prior maps are introduced to provide additional information for computing the rivalry dominance of the two views. Moreover, we compare different fusion methods.
As shown in Fig. 9, simply fusing the left and right views achieves promising performance on LIVE Phase I, which consists only of symmetrically distorted pairs. However, the performance degrades severely on LIVE Phase II owing to the existence of asymmetric distortion. According to the BRM-PC, prior and likelihood probability maps are necessary for 3D image quality estimation. The performance improvement on LIVE Phase II verifies the effectiveness of the prior and likelihood probability maps obtained through the Siamese encoder-decoder network and further demonstrates the superiority of the HVS-guided binocular rivalry mechanism based on predictive coding theory. In addition, we compare the Conv+GDN fusion method with the intuitive addition+multiplication method, in which the posterior probability is obtained by multiplying the prior and likelihood probabilities. Fig. 9 shows that our proposed method benefits considerably from the Conv+GDN fusion, since the parameters of the fusion operation are updated during training to generate the most discriminative feature maps for quality prediction. Therefore, the HVS-guided Siamese encoder-decoder module that generates the prior and likelihood maps and the Conv+GDN fusion method are the keys to the success of PAD-Net.
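The GDN transform used in the Conv+GDN fusion (Ballé et al., cited below) divides each channel by a learned combination of the squared responses of all channels. A minimal NumPy sketch with placeholder parameters β and γ (the learned values in PAD-Net are of course different):

```python
import numpy as np

def gdn(x, beta, gamma):
    """Generalized divisive normalization over channels.
    x: (C, H, W) feature map; beta: (C,); gamma: (C, C).
    y_i = x_i / sqrt(beta_i + sum_j gamma_ij * x_j^2)"""
    sq = x ** 2                                                 # (C, H, W)
    denom = beta[:, None, None] + np.tensordot(gamma, sq,
                                               axes=([1], [0]))  # (C, H, W)
    return x / np.sqrt(denom)

# identity-like setup: beta = 1, gamma = 0 leaves the input unchanged
x = np.random.default_rng(1).standard_normal((3, 2, 2))
y = gdn(x, beta=np.ones(3), gamma=np.zeros((3, 3)))
```

In the network, β and γ are trainable, which is precisely why the learned Conv+GDN fusion can outperform the fixed addition+multiplication rule.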
IV-H Computational Complexity
A good metric for blind SIQA should have high prediction accuracy as well as low computational cost. In the experiment, the models are tested on an NVIDIA GTX 1080 Ti GPU with 11 GB of memory. The running times of our proposed PAD-Net and the other metrics are listed in Table XII. Note that we record the time for predicting the quality scores of 50 stereo images of the same resolution and then average to obtain the time per 3D image. The results in Table XII show that PAD-Net needs only around 0.906 seconds per image, which is significantly less than the other metrics.
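The timing protocol of Table XII can be reproduced with a simple wall-clock loop. This is a sketch; `predict` stands in for any SIQA model, and the warm-up pass (to exclude one-off startup cost) is an assumption about good measurement practice rather than part of the reported protocol:

```python
import time

def average_inference_time(predict, images, warmup=5):
    """Average per-image runtime: run the predictor over all images
    and divide the total elapsed time by the image count."""
    for img in images[:warmup]:      # warm-up, not timed
        predict(img)
    start = time.perf_counter()
    for img in images:
        predict(img)
    return (time.perf_counter() - start) / len(images)

# toy usage with a stand-in predictor over 50 "images"
fake_images = list(range(50))
t = average_inference_time(lambda img: sum(range(100)), fake_images)
```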
In this paper, we have explored a novel deep learning approach for blind stereoscopic image quality assessment based on the binocular rivalry mechanism derived from predictive coding theory. Our proposed predictive auto-encoding network is an end-to-end architecture inspired by the human brain's cognition process. Specifically, we adopt a Siamese encoder-decoder module to reconstruct the binocular counterparts and generate the corresponding likelihood and prior maps, and we incorporate a quality regression module to obtain the final estimated perceptual quality score. The experimental results demonstrate that the proposed PAD-Net correlates well with subjective ratings. In addition, the proposed method outperforms state-of-the-art algorithms on stereoscopic images across a variety of distortion types, especially those with asymmetric distortions. Furthermore, the proposed PAD-Net shows promising generalization ability and low time complexity. In future work, we intend to extend the method to blind stereoscopic video quality assessment. Beyond image visual quality, we also plan to investigate other 3D quality dimensions such as depth perception and visual comfort.
-  (2010) No-reference stereoscopic image quality assessment. In Stereoscopic Displays and Applications XXI, Vol. 7524, pp. 75240T. Cited by: §I, §I, §IV-B, TABLE IV, TABLE VII, TABLE VIII, TABLE IX.
-  (2015) Density modeling of images using a generalized normalization transformation. arXiv preprint arXiv:1511.06281. Cited by: §III-B.
-  (2016) End-to-end optimized image compression. arXiv preprint arXiv:1611.01704. Cited by: §III-B.
-  (2018) Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436. Cited by: §III-B, §III-B.
-  (2009) Quality assessment of stereoscopic images. EURASIP journal on image and video processing 2008 (1), pp. 659024. Cited by: §I, §I, §IV-B, TABLE IV, TABLE VII, TABLE VIII, TABLE IX.
-  (2007) Stereoscopic images quality assessment. In 2007 15th European Signal Processing Conference, pp. 2110–2114. Cited by: §I, §I.
-  (2003) An image quality assessment method based on perception of structural information. In Proceedings 2003 International Conference on Image Processing (Cat. No. 03CH37429), Vol. 3, pp. III–185. Cited by: §I, §I.
-  (2017) Visual discomfort prediction on stereoscopic 3d images without explicit disparities. Signal Processing: Image Communication 51, pp. 50–60. Cited by: §I.
-  (2013) No-reference quality assessment of natural stereopairs. IEEE Transactions on Image Processing 22 (9), pp. 3379–3391. Cited by: §I, §I, §IV-A, §IV-A, §IV-B, §IV-B, TABLE X, TABLE IV, TABLE VII, TABLE VIII, TABLE IX.
-  (2013) Full-reference quality assessment of stereopairs accounting for rivalry. Signal Processing: Image Communication 28 (9), pp. 1143–1155. Cited by: §I, §I, §III, §IV-A, §IV-A, §IV-B, §IV-B, §IV-C, §IV-C, TABLE IV, TABLE VI, TABLE VII, TABLE VIII, TABLE IX.
-  (2020) Stereoscopic omnidirectional image quality assessment based on predictive coding theory. IEEE Journal of Selected Topics in Signal Processing. Cited by: §II-B.
-  (2017) Blind stereoscopic video quality assessment: from depth perception to overall experience. IEEE Transactions on Image Processing 27 (2), pp. 721–734. Cited by: §I, §I.
-  (1998) A hierarchical model of binocular rivalry. Neural Computation 10 (5), pp. 1119–1135. Cited by: §I.
-  (2006) A gain-control theory of binocular combination. Proceedings of the National Academy of Sciences 103 (4), pp. 1141–1146. Cited by: §I.
-  (2002) Functional integration and inference in the brain. Progress in neurobiology 68 (2), pp. 113–143. Cited by: §II-B.
-  (2011) Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323. Cited by: §III-B.
-  (2008) Stereoscopic image quality metrics and compression. In Stereoscopic Displays and Applications XIX, Vol. 6803, pp. 680305. Cited by: §I, §I, §IV-B, TABLE IV, TABLE VII, TABLE VIII, TABLE IX.
-  (2003) Final report from the video quality experts group on the validation of objective models of video quality assessment, phase ii. 2003 VQEG. Cited by: §IV-A.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §I, §III-C, §IV-F.
-  (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §III-C.
-  (2008) Predictive coding explains binocular rivalry: an epistemological review. Cognition 108 (3), pp. 687–701. Cited by: 1st item, §I, Fig. 3, §II-A, §II-B, §III-A, §III-B, §III-B, §III.
-  (1995) Binocular vision and stereopsis. Oxford University Press, USA. Cited by: §I.
-  (2003) A treatise of human nature. Courier Corporation. Cited by: §II-A.
-  (2014) Convolutional neural networks for no-reference image quality assessment. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1733–1740. Cited by: §IV-B, TABLE X, TABLE XII, TABLE IV, TABLE VI, TABLE VII, TABLE VIII, TABLE IX.
-  (2004) Object perception as Bayesian inference. Annu. Rev. Psychol. 55, pp. 271–304. Cited by: §II-B, §III-B.
-  (2016) On the accuracy of objective image and video quality models: new methodology for performance evaluation. In 2016 Eighth International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–6. Cited by: §IV-A, §IV-A, §IV-B.
-  (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §I.
-  (1996) Activity changes in early visual cortex reflect monkeys’ percepts during binocular rivalry. Nature 379 (6565), pp. 549–553. Cited by: §II-B.
-  (1965) On binocular rivalry. Ph.D. Thesis, Van Gorcum Assen. Cited by: §I.
-  (2019) Adaptive cyclopean image-based stereoscopic image-quality assessment using ensemble learning. IEEE Transactions on Multimedia 21 (10), pp. 2616–2624. Cited by: §I.
-  (2014) Quality assessment of stereoscopic 3d image compression by binocular integration behaviors. IEEE transactions on Image Processing 23 (4), pp. 1527–1542. Cited by: §I, §I, §IV-B, TABLE IV, TABLE VII, TABLE VIII, TABLE IX.
-  (2017) Reduced-reference stereoscopic image quality assessment using natural scene statistics and structural degradation. IEEE Access 6, pp. 2768–2780. Cited by: §I, §I, §IV-B, TABLE IV, TABLE VIII, TABLE IX.
-  (2016) Waterloo exploration database: new challenges for image quality assessment models. IEEE Transactions on Image Processing 26 (2), pp. 1004–1016. Cited by: §III-B, §III.
-  (2017) End-to-end blind image quality assessment using deep neural networks. IEEE Transactions on Image Processing 27 (3), pp. 1202–1213. Cited by: §III-B, §III-D, §III-D, §III.
-  (2016) Reorganized dct-based image representation for reduced reference stereoscopic image quality assessment. Neurocomputing 215, pp. 21–31. Cited by: §I, §I, §IV-B, TABLE IV, TABLE VIII, TABLE IX.
-  (1975) Linear prediction: a tutorial review. Proceedings of the IEEE 63 (4), pp. 561–580. Cited by: §II-A.
-  (2001) Unequal weighting of monocular inputs in binocular combination: implications for the compression of stereoscopic imagery.. Journal of Experimental Psychology: Applied 7 (2), pp. 143. Cited by: §IV-C.
-  (2018) Data augmentation for improving deep learning in image classification problem. In 2018 international interdisciplinary PhD workshop (IIPhDW), pp. 117–122. Cited by: §IV-B.
-  (2013) Subjective evaluation of stereoscopic image quality. Signal Processing: Image Communication 28 (8), pp. 870–883. Cited by: §III, §IV-A, §IV-A, §IV-B.
-  (2010) Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814. Cited by: §III-B.
-  (2017) Blind deep s3d image quality evaluation via local to global feature aggregation. IEEE Transactions on Image Processing 26 (10), pp. 4923–4936. Cited by: §I, §I, §IV-B, §IV-B, TABLE IV, TABLE VIII, TABLE IX.
-  (1998) Mechanisms of stereoscopic vision: the disparity energy model. Current opinion in neurobiology 8 (4), pp. 509–515. Cited by: §I.
-  (2009) A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22 (10), pp. 1345–1359. Cited by: §III-D.
-  (2015) Reduced reference stereoscopic image quality assessment based on binocular perceptual information. IEEE Transactions on multimedia 17 (12), pp. 2338–2344. Cited by: §I, §I, §IV-B, TABLE IV, TABLE VIII, TABLE IX.
-  (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §III-B.
-  (1999) Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience 2 (1), pp. 79. Cited by: Fig. 2, §II-A, §III-B.
-  (2012) Objective no-reference stereoscopic image quality prediction based on 2d image features and relative disparity. Advances in Multimedia 2012, pp. 8. Cited by: §I, §I, §IV-B, TABLE IV, TABLE VIII, TABLE IX.
-  (2012) Subjective methods for the assessment of stereoscopic 3DTV systems. Cited by: §I.
-  (2018) Multistage pooling for blind quality prediction of asymmetric multiply-distorted stereoscopic images. IEEE Transactions on Multimedia 20 (10), pp. 2605–2619. Cited by: §I.
-  (2016) Toward a blind deep quality evaluator for stereoscopic images based on monocular and binocular interactions. IEEE Transactions on Image Processing 25 (5), pp. 2059–2074. Cited by: §I, §I, §IV-B, TABLE X, TABLE IV.
-  (2006) A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on image processing 15 (11), pp. 3440–3451. Cited by: §III, §IV-A.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §I, §IV-F.
-  (2017) A review of predictive coding algorithms. Brain and cognition 112, pp. 92–97. Cited by: §I, §III-A.
-  (2015) Oriented correlation models of distorted natural images with application to natural stereopair quality evaluation. IEEE Transactions on image processing 24 (5), pp. 1685–1699. Cited by: §I, §I, §IV-B, TABLE IV, TABLE VII, TABLE VIII, TABLE IX.
-  (2015) Visual quality assessment of stereoscopic image and video: challenges, advances, and future trends. In Visual Signal Quality Assessment, pp. 185–212. Cited by: §I.
-  (2005) Mistaking a house for a face: neural correlates of misperception in healthy humans. Cerebral Cortex 16 (4), pp. 500–508. Cited by: §III-B.
-  (2018) Nima: neural image assessment. IEEE Transactions on Image Processing 27 (8), pp. 3998–4011. Cited by: §I.
-  (2006) Neural bases of binocular rivalry. Trends in cognitive sciences 10 (11), pp. 502–511. Cited by: §III-A.
-  (2018) Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454. Cited by: §II-B, §III-B.
-  (2015) Quality prediction of asymmetrically distorted stereoscopic 3d images. IEEE Transactions on Image Processing 24 (11), pp. 3400–3414. Cited by: §I, §II-B, §IV-A, §IV-A, §IV-B, §IV-C, §IV-C.
-  (2016) Perceptual depth quality in distorted stereoscopic images. IEEE Transactions on Image Processing 26 (3), pp. 1202–1215. Cited by: §I.
-  (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §I, §I, §IV-C, §IV-C.
-  (2002) A universal image quality index. IEEE signal processing letters 9 (3), pp. 81–84. Cited by: §I, §I.
-  (2003) Multiscale structural similarity for image quality assessment. In The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, Vol. 2, pp. 1398–1402. Cited by: §I.
-  (2005) Reduced-reference image quality assessment using a wavelet-domain natural image statistic model. In Human Vision and Electronic Imaging X, Vol. 5666, pp. 149–159. Cited by: §I, §I.
-  (2019) A blind stereoscopic image quality evaluator with segmented stacked autoencoders considering the whole visual perception route. IEEE Transactions on Image Processing 28 (3), pp. 1314–1328. Cited by: §I, §IV-B, TABLE X, TABLE IV.
-  (2019) Blind assessment for stereo images considering binocular characteristics and deep perception map based on deep belief network. Information Sciences 474, pp. 1–17. Cited by: §I, §I, §IV-B, TABLE X, TABLE IV, TABLE VIII, TABLE IX.
-  (2014) How transferable are features in deep neural networks?. In Advances in neural information processing systems, pp. 3320–3328. Cited by: §III-D.
-  (2010) Perceptual quality assessment for stereoscopic images based on 2d image quality metrics and disparity analysis. In Proc. Int. Workshop Video Process. Quality Metrics Consum. Electron, Vol. 9, pp. 1–6. Cited by: §I, §I, §IV-B, TABLE IV, TABLE VII, TABLE VIII, TABLE IX.
-  (2011) FSIM: a feature similarity index for image quality assessment. IEEE transactions on Image Processing 20 (8), pp. 2378–2386. Cited by: §I.
-  (2019) Dual-stream interactive networks for no-reference stereoscopic image quality assessment. IEEE Transactions on Image Processing. Cited by: §I, §I, §IV-B, §IV-B, §IV-B, TABLE X, TABLE XII, TABLE IV, TABLE VI, TABLE VII, TABLE VIII, TABLE IX.