I. Introduction
I-A. Motivation
In recent years, 3D video technology has become popular due to the fresh viewing experiences it offers, e.g., strong immersion, high interactivity, and a large degree of freedom. In the 3D video system, the multiview video plus depth (MVD)
[1] representation is the main data format. The MVD records the color and depth information of the same physical scene from different views. With the MVD format data, arbitrary virtual views can be synthesized via the depth-image-based rendering (DIBR) technique [2, 19]. Commonly, the performance of a 3D video system is mainly measured by the distortion/quality [30, 31, 32] of the synthesized virtual view. Hence, view synthesis distortion (VSD) estimation is crucial, especially for 3D video applications. For instance, the estimated VSD [10] is generally used for rate-distortion optimization [3], rate allocation [4], the design of error resilience techniques [6], etc. The main reason for the VSD is the changes/errors in the reference texture and depth videos due to lossy compression or transmission errors. During the view synthesis process, texture changes may cause VSD at the luminance/chrominance level, namely texture degradation. Depth changes, however, may cause complex geometric VSD, and different levels of depth changes lead to different levels of geometric VSD. In the 1-D parallel model, the geometric VSD commonly refers to position shifting [7]. Besides, the texture degradation propagating from the decoded reference views to their corresponding virtual views is also directed by the changes of depth. After integrating the texture degradation with the position shifting, the texture degradation with different levels of position shifting can be regarded as different kinds of sub-VSD (SVSD), which together form the final VSD.
As shown in Fig. 1, (a) and (b) are synthesized by the original reference views (uncompressed texture and depth) and the decoded reference views (texture and depth compressed by H.264 with QP pair (45, 48)), respectively. Some magnified patches of local distortion are exhibited in (a1)-(a4) and (b1)-(b4). The VSD in the green/red patches mainly belongs to texture degradation with small/large position shifting, as shown in (b1)-(b2)/(b3)-(b4), where the shifting is mainly due to the depth changes caused by lossy compression. As the green patches are located on the bodies of the person and the cylinder, their original depth is smooth. Even after compression, the level of depth changes in the green patches is low, which only causes non-obvious position shifting. After integration with the texture degradation, only texture degradation is obviously observed in the green patches. However, the VSD in the red patches mainly belongs to texture degradation with large position shifting, such as the misaligned fingers in (b3) and the erosion around the boundary of the cylinder in (b4). Since obvious depth boundaries exist around the hand and the cylinder, lossy compression smooths these boundaries and brings a large level of depth changes, which leads to the significant position-shifting effects in the red patches. After the texture degradation propagation, obvious texture degradation together with position shifting can be observed in the red patches. All these distortions, including those in (b1)-(b4), form the final VSD in (b).
Inspired by the above, after taking the texture distortion into account, different levels of depth changes can be used to represent different kinds of SVSD, which can be further used to predict the VSD. On the one hand, this can benefit the optimization of 3D video coding [29] by figuring out the exact contribution of each kind of SVSD to the VSD. On the other hand, it can also help us design an optimal depth codec [28] by increasing or decreasing different levels of depth changes to bring in the smallest VSD. To the best of our knowledge, existing methods, such as those reviewed in subsection I-C2, cannot represent the relationship between the SVSDs and the VSD accurately, which is the key challenge in this work.
I-B. Our Contributions
In this paper, we propose an auto-weighted layer-representation-based view synthesis distortion estimation model. This is the first work utilizing a learning-based approach to mine the accurate relationships among the degeneration of texture, the changes of depth, and the VSD, especially the relationship between the VSD and its associated SVSDs. This provides us with a methodology to predict the VSD from its associated SVSDs, which can be used to optimize the design of 3D video coding, especially depth coding. The main contributions are summarized as follows.

This is the first work to relate different levels of depth changes, together with their texture degeneration, to the view synthesis distortion (VSD), which is crucial for various 3D video applications, such as the aforementioned 3D video coding, depth coding, etc.

The sub view synthesis distortion (SVSD) is first defined in this paper according to the level of depth changes and its associated texture degeneration. Besides, an elaborate derivation is given to demonstrate that the VSD can be approximately decomposed into different kinds of SVSD.

To accurately represent the relationship between the VSD and its associated SVSDs, a nonlinear mapping function between the VSD and the SVSDs is learnt on our newly built dataset, which is the first dataset for mining the relationship between the VSD and SVSDs.

To calculate the SVSDs efficiently, a layer-based representation method is proposed and further optimized, where all the pixels with the same level of depth changes (i.e., contributing to the same SVSD) are represented with one layer. This enables the SVSD calculation to be performed at the layer level.
Compared with existing VSD estimation methods, the well-learnt nonlinear mapping function is able to accurately represent the relationship between the VSD and the SVSDs. Meanwhile, the proposed layer-based representation enables the VSD estimation to be performed at the layer level, without spending additional computation on partly performing the view synthesis process at the pixel level, which makes the proposed method more efficient.
I-C. Related Work
I-C1. View Synthesis
In this paper, view synthesis mainly refers to DIBR-based view synthesis, which commonly contains two steps, namely warping and blending.
During the warping step, forward warping, warping competition, and a rounding operation are performed accordingly. The goal of the warping step is to warp the pixels in the reference views to the warped views. Assume that a pixel at location (x_k, y) in the original reference view is warped to a new location (x'_k, y) in the warped view, where the subscript k is used to index the left (k = L) or right (k = R) view. This process can be formulated as

x'_k = x_k - round(d(v_k)), with d(v_k) = f·B·((v_k/255)·(1/Z_near - 1/Z_far) + 1/Z_far),   (1)

where d(v_k) denotes the disparity of the pixel with depth value v_k, and round(·) is the rounding operation. B denotes the baseline between the cameras and f is the focal length of the cameras. [Z_near, Z_far] is the depth range of the physical scene. In Eq. (1), the disparity can be regarded as a function of the depth value v_k, which is denoted as d(v_k) for simplification.
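For illustration, the warping of Eq. (1) can be sketched in a few lines of Python. This is a simplified 1-D sketch: the function names are ours, and the common 8-bit depth-to-disparity conversion is assumed.

```python
def depth_to_disparity(v, f, B, z_near, z_far):
    """Convert an 8-bit depth value v into a disparity (in pixels),
    assuming the common linear depth-to-disparity conversion of Eq. (1)."""
    return f * B * ((v / 255.0) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far)


def warp_position(x, v, f, B, z_near, z_far):
    """Warp a reference-view column x to the warped view (1-D parallel
    setup), applying the rounding operation of Eq. (1)."""
    return x - round(depth_to_disparity(v, f, B, z_near, z_far))
```

For instance, with f·B = 50 and depth range [1, 100], the nearest depth value v = 255 yields a disparity of 50 pixels, so a pixel at column 100 lands at column 50 in the warped view.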
After the warping step, there are lots of disocclusions in the warped views, since the occluded parts in the reference views become visible. To fill the disocclusions, the blending step is carried out by merging the two warped views into a virtual one. Besides, three blending strategies are followed according to three different cases during the blending step: i) if the current pixel in the virtual view is visible in both warped views, a weighted average of the two warped values is used; ii) if it is visible in only one of the warped views, that value is directly used; iii) otherwise, an inpainting value is used.
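The three blending strategies can be sketched as follows. This is a minimal sketch under our own conventions: `None` marks a disoccluded pixel, and the constant `inpaint` value stands in for a real inpainting step.

```python
def blend(left, right, w_l=0.5, w_r=0.5, inpaint=0.0):
    """Blend two warped rows into a virtual row following the three
    strategies: weighted average, single-view copy, or inpainting.
    A pixel value of None marks a disocclusion (hole) in that view."""
    out = []
    for l, r in zip(left, right):
        if l is not None and r is not None:
            out.append(w_l * l + w_r * r)  # case i): visible in both views
        elif l is not None:
            out.append(l)                  # case ii): left view only
        elif r is not None:
            out.append(r)                  # case ii): right view only
        else:
            out.append(float(inpaint))     # case iii): inpainting
    return out
```

With equal weights, a pixel visible in both views receives the average of the two warped values, while a pixel visible in neither receives the inpainting value.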
I-C2. View Synthesis Distortion Estimation
Commonly, there are two typical classes of VSD estimation methods, which estimate the view synthesis distortion caused by lossy compression and by transmission errors (packet loss), respectively. Kim et al. [8] first developed a quality metric based on camera and video parameters to quantify the effect of lossy depth-map coding on the synthesized view quality. After that, Yuan et al. [3] proposed a concise distortion model by analyzing the impacts of the compression distortion of texture images and depth maps on the quality of the virtual views. Meanwhile, Zhang et al. [10] proposed a view synthesis distortion model for depth video coding that takes region characteristics into account. Based on this, Fang et al. [17] related errors in the depth images to the synthesis quality by taking texture image characteristics and the warping step of view synthesis into account. However, the warping step is used to relate the depth distortion to the synthesized view at the frame level, which limits the accuracy of the VSD estimation. To predict the VSD more accurately, Yuan et al. [5] utilized the warping step of view synthesis to simulate the error propagation from the distorted depth to the virtual synthesized view at the pixel level, which directly measures the quality of the virtual view by partly carrying out view synthesis. However, the blending and inpainting steps of view synthesis are still not considered due to their complicated operations. Jin et al. [11] proposed a pixel-level VSD estimation, where the warping and blending steps are partly taken into account to build a more accurate relation between the distorted depth together with the texture and the VSD, achieving the state-of-the-art result. However, compared with the pixel-level VSD estimation methods in [5] and [11], the frame-level one in [17] is more efficient when pixel-level parallel processing is not considered. Meanwhile, Pan et al. [12] derived a depth distortion range within which depth changes bring no geometric distortion.
To model the distortion caused by transmission errors, Zhou et al. [13] first derived a channel distortion model for multiview video transmission over lossy packet-switched networks, which can estimate the channel-caused distortion at the frame level. Then, a quadratic model was proposed by Cheung et al. [14], which first relates the disparity errors caused by packet loss in the depth maps to the distortion contribution in the synthesized view. After that, Gao et al. [6] developed an end-to-end 3D video transmission oriented VSD estimation model for 3D video coding to improve error resilience. To accurately model the error propagation process during view synthesis, Zhang et al. [15] proposed a depth-value-based graphical model (DVGM). By taking the transmission error into account, it can accurately estimate the transmission-caused view synthesis distortion. To further speed up the DVGM, Jin et al. [16] proposed a depth-bin-based graphical model for VSD estimation, which is more efficient without sacrificing accuracy.
As reviewed above, all these methods try to predict the VSD by modeling frame-level or pixel-level depth distortion, without considering the exact contribution of different levels of depth changes to the VSD. Besides, to build the relation between the distorted reference views and the virtual synthesized view, all these VSD methods partly integrate the view synthesis process, e.g., the warping step, into their approaches. This leads to two drawbacks. 1) Accurate VSD prediction cannot be achieved by only partly using view synthesis (e.g., the warping step) to build the relationship between the distorted texture together with depth and the VSD, since the blending step of view synthesis also affects the view synthesis results, while the nonlinear operations of the blending step (warping competition, inpainting operation, etc.) are hard to formulate in such VSD estimation methods. 2) Even though part of view synthesis is performed in such methods, the calculation is pixel-based, namely each pixel partly performs the view synthesis process, which reduces their efficiency to some degree. To overcome these drawbacks, we first learn a nonlinear mapping function on our dataset to exploit the exact relationship between different levels of depth changes together with their associated distorted texture (the SVSDs) and the VSD. Then, we propose an efficient layer-based representation method, which enables the VSD estimation to be performed at the layer level.
II. The Proposed Model
In this section, we first define the total view synthesis distortion (VSD) and the sub view synthesis distortion (SVSD) abstractly. To better understand the SVSD, a detailed analysis of the SVSD is given. After that, a set of theoretical derivations is given according to the view synthesis process, from which we demonstrate that the VSD can be approximately decomposed into the SVSDs with their associated weights. Finally, a nonlinear mapping function is learnt to capture the exact relation between the VSD and its associated SVSDs.
II-A. Definition of the VSD and SVSD
In this paper, the view synthesis distortion of the virtual view (i.e., the VSD) is formulated as the mean squared error (MSE) over the entire frame of the synthesized view, as in [17]:

D_VSD = (1/(W·H)) · Σ_{x=1}^{W} Σ_{y=1}^{H} (V(x, y) - Ṽ(x, y))²,   (2)

where W and H denote the width and height of the virtual view. V(x, y) and Ṽ(x, y) are the color values of two pixels at the same location in two different virtual views, which are synthesized with the original reference views and the decoded reference views, respectively.
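As a minimal illustration, the frame-level MSE of Eq. (2), and its conversion to the PSNR reported in the experiments, can be written as follows (the function names are ours; 8-bit content with peak 255 is assumed for the PSNR):

```python
import math


def vsd_mse(ref, dist):
    """MSE between two virtual views (nested lists of pixel values):
    one synthesized from original references, one from decoded ones."""
    diffs = [(a - b) ** 2
             for row_r, row_d in zip(ref, dist)
             for a, b in zip(row_r, row_d)]
    return sum(diffs) / len(diffs)


def mse_to_psnr(mse, peak=255.0):
    """PSNR (dB) of an 8-bit view, as used in the result tables."""
    return 10.0 * math.log10(peak ** 2 / mse)
```

A uniform pixel error of 2 gives an MSE of 4, and an MSE equal to 255² corresponds to 0 dB.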
As mentioned in Eq. (1), depth changes may lead to disparity changes. Assume that the pixel at location (x_k, y) in the decoded reference view is warped to (x̃'_k, y) in the warped view. Then, we have

x̃'_k = x_k - round(d(ṽ_k)),   (3)

where ṽ_k is the decoded (distorted) depth value of the pixel.
Besides, depth changes of different levels may cause disparity changes of different levels, namely warped position shifting of different levels. Here, the warped-position-shifting distortion caused by the depth change from v_k to ṽ_k is measured with Δ_k, and

Δ_k = x̃'_k - x'_k = round(d(v_k)) - round(d(ṽ_k)).   (4)
Therefore, the warped-position-shifting distortion of a pixel can be expressed as a function of the level of the depth change from v_k to ṽ_k. After integrating it with the texture information distortion, the sub view synthesis distortion (SVSD) in this paper is defined as

D_{k,Δ} = (1/|Ω_{k,Δ}|) · Σ_{((x,y), (x̃,y)) ∈ (Ψ_{k,Δ}, Ψ̃_{k,Δ})} (T_k(x, y) - T̃_k(x̃, y))²,   (5)

where D_{k,Δ} denotes a certain SVSD of all the pixels with the same level of depth changes (measured with Δ) in the left (k = L) or right (k = R) warped view, whose locations are collected with set Ω_{k,Δ}. |Ω_{k,Δ}| denotes the cardinality of Ω_{k,Δ}. The associated pixels in the original and decoded reference views are (x, y) and (x̃, y), whose locations are collected with sets Ψ_{k,Δ} and Ψ̃_{k,Δ}, respectively. T_k(x, y) and T̃_k(x̃, y) are the texture values of these pixels. It should be noticed that the whole frame, collected with set Ω_k, can be divided into several Ω_{k,Δ} according to Δ, so that Ω_k = ∪_Δ Ω_{k,Δ} and |Ω_k| = Σ_Δ |Ω_{k,Δ}|. Besides, the intersection of any two sets Ω_{k,Δ} is empty, i.e., Ω_{k,Δ_i} ∩ Ω_{k,Δ_j} = ∅ for Δ_i ≠ Δ_j.
As the depth distortion caused by lossy compression (source coding) can be approximately regarded as zero-mean white noise, the three-sigma rule is used to confirm the available number of depth-change levels N, namely N = ⌈3σ⌉, where σ is the standard deviation of the disparity differences, so that Δ ∈ {-N, …, 0, …, N}.
II-B. A Detailed Analysis of the SVSD
To better understand the SVSD, an analysis based on an example is given in this subsection. Assume that there are three different levels of depth changes caused by the lossy compression in the left and right depth reference views, i.e., Δ ∈ {-1, 0, 1}. Therefore, six SVSDs need to be calculated for the VSD. Here, only the three SVSDs of the left view are involved for simplicity, namely D_{L,-1}, D_{L,0}, and D_{L,1}. The other three SVSDs (D_{R,-1}, D_{R,0}, and D_{R,1}) of the right view are similar.
As shown in Fig. 2, pixels with the same warped-position-shifting distortion are masked with points of the same color in the original and decoded reference views. Pixels with the two nonzero levels of Δ are highlighted with blue and red points, and the rest are pixels with Δ = 0, which means no warped-position-shifting distortion exists in these pixels. Taking the blue points with level Δ as an example, a set is used to collect their locations in the original reference view, which occupy a horizontal interval. These pixels are then warped to an interval in the original warped view, where their locations are collected with another set. As lossy compression changes the depth values of these pixels, the corresponding pixels in the decoded reference view are mistakenly warped to a new interval in the decoded warped view, whose locations are collected with a third set. Assume that all the pixels of the two warped sets are eventually exhibited in the virtual views. By fusing the two warped sets with a union operation, we obtain the set Ω_{L,Δ}. The complementary parts are highlighted with the dark points. Finally, we can easily locate the associated sets Ψ_{L,Δ} and Ψ̃_{L,Δ} in the original and decoded reference views by backward warping [18] the pixels in Ω_{L,Δ}. Therefore, the SVSD D_{L,Δ} can be physically regarded as the mean squared error over the pixels of these two sets. Likewise, the definitions of the other SVSDs of the left view are similar.
It should be noticed that the purpose of the union operation is to ensure that the distorted pixels are fully counted during the SVSD calculation. Besides, following the convention in backward warping that the missing depth value of a pixel is assigned its adjacent pixel's depth value, we assign the depth values of the dark points from their adjacent pixels in the sets during the inverse warping process.
II-C. A Theoretical Derivation: From VSD to SVSDs
To relate a pixel in the virtual view to its associated pixels in the original and decoded reference views, we also take the inverse process into account, as in [11]. First, we relate the pixel in the virtual view to its associated pixels in the warped views by taking the blending strategies into account. V(x, y) in Eq. (2) can be rewritten as

V(x, y) = { w_L·V_L(x, y) + w_R·V_R(x, y),  if visible in both warped views;
            V_L(x, y),                       if visible only in the left warped view;
            V_R(x, y),                       if visible only in the right warped view;
            V_inp(x, y),                     otherwise },   (6)

where V_L(x, y) and V_R(x, y) denote the color values of the pixels in the original left and right warped views. w_L and w_R (w_L + w_R = 1, w_L, w_R ∈ [0, 1]) are two weights for the left and right warped views, which are determined by the locations of the camera array. The first item represents the case in which the pixel in the virtual view is visible in both warped views. The second and third items represent the cases in which it is visible in only one of the warped views. The last item formulates the inpainting case, where V_inp(x, y) is the inpainted value.
Similarly, Ṽ(x, y) can be rewritten as

Ṽ(x, y) = { w̃_L·Ṽ_L(x, y) + w̃_R·Ṽ_R(x, y),  if visible in both warped views;
            Ṽ_L(x, y),                         if visible only in the left warped view;
            Ṽ_R(x, y),                         if visible only in the right warped view;
            Ṽ_inp(x, y),                       otherwise },   (7)

where Ṽ_L(x, y) and Ṽ_R(x, y) denote the color values of the pixels in the decoded left and right warped views.
Then, we make an approximate computation. The last (inpainting) case is infrequent, occupying only around 1% of all four cases, as mentioned in [19]. Therefore, V(x, y) can be approximately rewritten as

V(x, y) ≈ w_L·V_L(x, y) + w_R·V_R(x, y),   (8)

where w_L and w_R (w_L + w_R = 1, w_L, w_R ∈ [0, 1]) are the two weights for the left and right warped views, which also absorb the two single-view cases by taking the values (1, 0) or (0, 1). Similarly, we have

Ṽ(x, y) ≈ w̃_L·Ṽ_L(x, y) + w̃_R·Ṽ_R(x, y).   (9)
D_VSD ≈ (1/(W·H)) · Σ_{x=1}^{W} Σ_{y=1}^{H} [ w_L·T_L(x + round(d(v_L)), y) + w_R·T_R(x + round(d(v_R)), y) - w̃_L·T̃_L(x + round(d(ṽ_L)), y) - w̃_R·T̃_R(x + round(d(ṽ_R)), y) ]²   (16)

D_VSD ≈ Σ_{k ∈ {L, R}} Σ_{Δ=-N}^{N} ω_{k,Δ} · D_{k,Δ}   (20)

D_VSD = F( D_{L,-N}, …, D_{L,N}, D_{R,-N}, …, D_{R,N} )   (21)
After that, we relate the pixels in the warped views to their associated pixels in the reference views by considering the warping step of view synthesis. Likewise, we assume the pixel at position (x'_k, y) in the original left or right warped view is warped from the pixel at position (x_k, y) in the original left (k = L) or right (k = R) reference view. We have

V_k(x'_k, y) = T_k(x_k, y),   (10)

where T_k(x_k, y) denotes the color value of the pixel in the original left (k = L) or right (k = R) reference view. Its associated disparity d(v_k) can be represented by Eq. (1).
Similarly, we assume the pixel at position (x̃'_k, y) in the decoded warped view is warped from the pixel at position (x_k, y) in the decoded reference view according to its distorted depth value ṽ_k. We have

Ṽ_k(x̃'_k, y) = T̃_k(x_k, y),   (11)

where T̃_k(x_k, y) denotes the color value of the pixel in the decoded left (k = L) or right (k = R) reference view. Its associated disparity d(ṽ_k) can be represented by Eq. (3).
Therefore, V_k(x, y) and Ṽ_k(x, y) can be rewritten as

V_k(x, y) = T_k(x + round(d(v_k)), y)   (12)

and

Ṽ_k(x, y) = T̃_k(x + round(d(ṽ_k)), y).   (13)
By substituting Eq. (12) and Eq. (13) into Eq. (8) and Eq. (9), respectively, we obtain

V(x, y) ≈ w_L·T_L(x + round(d(v_L)), y) + w_R·T_R(x + round(d(v_R)), y)   (14)

and

Ṽ(x, y) ≈ w̃_L·T̃_L(x + round(d(ṽ_L)), y) + w̃_R·T̃_R(x + round(d(ṽ_R)), y).   (15)
Eq. (14) and Eq. (15) relate the pixel in the virtual view to its associated pixels in the reference views. Finally, Eq. (2) is rewritten as Eq. (16).
For Eq. (5), it can be rewritten as

Σ_{((x,y), (x̃,y)) ∈ (Ψ_{k,Δ}, Ψ̃_{k,Δ})} (T_k(x, y) - T̃_k(x̃, y))² = |Ω_{k,Δ}| · D_{k,Δ}.   (17)
As aforementioned, the intersection of any two sets Ω_{k,Δ} is empty and Ω_k = ∪_Δ Ω_{k,Δ}, so we have

Σ_{(x,y) ∈ Ω_k} (T_k(x, y) - T̃_k(x̃, y))² = Σ_{Δ=-N}^{N} |Ω_{k,Δ}| · D_{k,Δ}.   (18)
As |Ω_k| = Σ_Δ |Ω_{k,Δ}|, we then obtain

W · H = |Ω_k| = Σ_{Δ=-N}^{N} |Ω_{k,Δ}|.   (19)
By plugging Eq. (18) and Eq. (19) into Eq. (16), we have Eq. (20). Based on Eq. (20), the VSD can be approximately decomposed into the SVSDs (D_{k,Δ}) with their associated weights. To obtain the exact relation between the VSD and its associated SVSDs, a nonlinear mapping function is learnt instead of a linear one, due to the nonlinear operations existing in view synthesis, e.g., hole filling, inpainting, warping competition, and so on. Then, we have Eq. (21), where the VSD is regarded as a nonlinear mapping of the SVSDs via a functional relation F. This also gives us a theoretical justification that the VSD can be predicted from its associated SVSDs once the functional relation F is obtained.
II-D. Learning a Nonlinear Mapping Function F
To make the prediction from the SVSDs closely approximate the actual VSD, we formulate an optimization problem as follows:

F* = argmin_F Σ_{i=1}^{M} ℓ( D_VSD^(i), F(D_{L,-N}^(i), …, D_{L,N}^(i), D_{R,-N}^(i), …, D_{R,N}^(i)) ),   (22)

where i is the index of a sample and M training samples are involved. D_{k,Δ}^(i) denotes the series of SVSDs of sample i, which can be obtained by our layer-based representation method introduced in Section III. The associated actual VSD (i.e., the MSE) D_VSD^(i) can be directly obtained by calculating the mean squared error between the pair of virtual views synthesized with the corresponding original reference views and decoded reference views. ℓ(·, ·) is used to measure the approximation between the two terms. With such a learnt nonlinear mapping function F, once another series of SVSDs is given, the associated VSD can be accurately predicted.
In particular, to obtain such a nonlinear function F, we employ XGBoost [20], a scalable end-to-end tree boosting system. Specifically, by introducing trees into our model, the optimization problem in Eq. (22) can be rewritten as

Obj = Σ_{i=1}^{M} ℓ( D_VSD^(i), Σ_{t=1}^{T} f_t(D^(i)) ) + Σ_{t=1}^{T} Ω(f_t),   (23)

where ℓ(·, ·) represents the training error of the i-th sample, for which we adopt the mean squared error, and D^(i) collects the SVSDs of sample i. The second term measures the complexity of our trees, where Ω(f_t) is the complexity of the t-th tree f_t.
Of note, minimizing the tree complexity facilitates the generalization ability of our algorithm, while minimizing the prediction error guarantees the accuracy of our model. The parameter settings of XGBoost will be given in subsection IV-A.
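To make the tree-boosting idea behind Eq. (23) concrete, the following toy regressor fits depth-1 trees (stumps) to squared-loss residuals with a shrinkage factor. It is only a stand-in sketch for the actual XGBoost system [20], which additionally optimizes the regularization term Ω(f_t) and uses subsampling; all class and variable names here are ours.

```python
class TinyBoost:
    """Toy gradient boosting with stump (depth-1) trees and squared loss.
    Illustrates the ensemble sum over trees in Eq. (23); not XGBoost itself."""

    def __init__(self, n_trees=50, eta=0.1):
        self.n_trees, self.eta = n_trees, eta
        self.stumps, self.base = [], 0.0

    def _fit_stump(self, X, r):
        # Exhaustively pick the (feature, threshold) split minimizing the
        # squared error of the residuals r. Assumes at least one feature
        # with two distinct values.
        best = None
        for j in range(len(X[0])):
            for t in sorted(set(row[j] for row in X))[:-1]:
                left = [ri for row, ri in zip(X, r) if row[j] <= t]
                right = [ri for row, ri in zip(X, r) if row[j] > t]
                lv, rv = sum(left) / len(left), sum(right) / len(right)
                sse = (sum((ri - lv) ** 2 for ri in left)
                       + sum((ri - rv) ** 2 for ri in right))
                if best is None or sse < best[0]:
                    best = (sse, j, t, lv, rv)
        return best[1:]

    def fit(self, X, y):
        self.base = sum(y) / len(y)
        pred = [self.base] * len(y)
        for _ in range(self.n_trees):
            resid = [yi - pi for yi, pi in zip(y, pred)]  # squared-loss residuals
            j, t, lv, rv = self._fit_stump(X, resid)
            self.stumps.append((j, t, lv, rv))
            pred = [pi + self.eta * (lv if row[j] <= t else rv)
                    for row, pi in zip(X, pred)]
        return self

    def predict(self, X):
        pred = [self.base] * len(X)
        for j, t, lv, rv in self.stumps:
            pred = [pi + self.eta * (lv if row[j] <= t else rv)
                    for row, pi in zip(X, pred)]
        return pred
```

In practice, the XGBoost library is used with the settings listed in TABLE III; the sketch above only shows why shrinking each tree's contribution (eta) trades training error against complexity.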
III. Layer-Based Representation
After the analysis of the SVSD, a layer-based representation method is first developed in this section to generate the SVSDs. Then, the layer-based representation is optimized to reduce its complexity and speed up the SVSD generation. The details are presented as follows.
III-A. Methodology
The layer-based representation method is performed similarly on the left and right views. For simplicity, only its application to the left view is elaborated in the following parts. The framework of the layer-based representation is shown in Fig. 3 (a).
III-A1. Disparity Conversion
The depth images are input first, containing the left original and decoded depth images v_L and ṽ_L. Then, the disparity images d_L = d(v_L) and d̃_L = d(ṽ_L) are obtained by plugging v_L and ṽ_L into the disparity function of Eq. (1).
III-A2. Disparity Difference
A pixel-wise subtraction is carried out to obtain the difference between the rounded disparity images round(d_L) and round(d̃_L). Then, we have the disparity difference image Δ_L, as in Eq. (4).
III-A3. Layered Representation
Pixels with the same value of Δ in Δ_L are masked with the same color and collected with a pair of layers in the original and decoded reference views, respectively. We use different colors to mask pixels with different values, which are further represented with different layers. Then, the per-level location sets in the original and decoded reference views are easily obtained by visiting their corresponding layers. An example under the assumption N = 1 is shown in Fig. 4: pixels with the three values of Δ are masked with blue, gray, and red colors and represented with three pairs of layers in the original and decoded views, respectively. It should be noticed that all the following operations are performed at the layer level.
III-A4. Forward Warping
The layered pixels in the original and decoded reference views are forward warped to the original and decoded warped views according to the disparity images d_L and d̃_L, respectively. The warped location sets are obtained and represented with different pairs of layers.
III-A5. Fusion
Each pair of warped layers with the same level Δ is merged by a union operation, yielding the fused sets Ω_{L,-N}, …, Ω_{L,N}.
III-A6. Inverse Warping
The sets Ω_{L,Δ} are inversely warped back to the original and decoded left reference views, generating their associated sets Ψ_{L,Δ} and Ψ̃_{L,Δ}, which are represented with different pairs of layers.
III-A7. MSE
The pixels with locations Ψ_{L,Δ} in the original reference view and Ψ̃_{L,Δ} in the decoded reference view are used to calculate D_{L,Δ} via a layer-level MSE calculation, similar to that used in Eq. (5).
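The seven steps above can be sketched for a single image row as follows. This is a deliberately simplified 1-D sketch under our own assumptions: disparities are already rounded integers, the inverse warping is approximated by adding back the disparity at the clipped warped position, and all names are ours.

```python
def layered_svsd_1d(t_org, t_dec, d_org, d_dec):
    """Per-level SVSDs for one row: texture rows t_org/t_dec and integer
    (rounded) disparity rows d_org/d_dec of the original/decoded views.
    Returns a dict mapping each level of disparity change to its MSE."""
    n = len(t_org)
    delta = [a - b for a, b in zip(d_org, d_dec)]      # disparity difference
    svsd = {}
    for lvl in sorted(set(delta)):                     # one layer per level
        layer = [i for i in range(n) if delta[i] == lvl]
        w_org = {i - d_org[i] for i in layer}          # forward warping (orig.)
        w_dec = {i - d_dec[i] for i in layer}          # forward warping (dec.)
        fused = sorted(w_org | w_dec)                  # fusion (union)
        errs = []
        for p in fused:
            q = min(max(p, 0), n - 1)                  # clamp to the row
            bo = min(max(p + d_org[q], 0), n - 1)      # inverse warping (approx.)
            bd = min(max(p + d_dec[q], 0), n - 1)
            errs.append((t_org[bo] - t_dec[bd]) ** 2)  # layer-level MSE terms
        svsd[lvl] = sum(errs) / len(errs)
    return svsd
```

For identical disparity maps, only the level Δ = 0 appears, and its SVSD reduces to the plain MSE between the two texture rows.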


TABLE I
Sequences | Resolutions | Views (virtual (left, right)) | Frames | QP pairs (texture, depth)
BookArrival [22] | 1024×768 | 9 (8, 10) | 1 to 25 | (15, 24), (20, 29), (25, 34), (30, 39), (35, 42), (40, 45), (45, 48)
Kendo [23] | 1024×768 | 2 (1, 3) | 1 to 25 | (15, 24), (20, 29), (25, 34), (30, 39), (35, 42), (40, 45), (45, 48)
Balloons [23] | 1024×768 | 2 (1, 3) | 1 to 25 | (15, 24), (20, 29), (25, 34), (30, 39), (35, 42), (40, 45), (45, 48)
NewsPaper [24] | 1024×768 | 3 (2, 4) | 1 to 25 | (15, 24), (20, 29), (25, 34), (30, 39), (35, 42), (40, 45), (45, 48)
PoznanStreet [25] | 1920×1080 | 4.5 (4, 5) | 1 to 25 | (15, 24), (20, 29), (25, 34), (30, 39), (35, 42), (40, 45), (45, 48)
PoznanHall2 [25] | 1920×1080 | 6.5 (6, 7) | 1 to 25 | (15, 24), (20, 29), (25, 34), (30, 39), (35, 42), (40, 45), (45, 48)
UndoDancer [26] | 1920×1080 | 3 (1, 5) | 1 to 25 | (15, 24), (20, 29), (25, 34), (30, 39), (35, 42), (40, 45), (45, 48)
GTFly [27] | 1920×1080 | 7 (5, 9) | 1 to 25 | (15, 24), (20, 29), (25, 34), (30, 39), (35, 42), (40, 45), (45, 48)

III-B. Optimization
As analyzed in Fig. 2, the pixels of one layer in (a) are first warped to (b). Due to the lossy compression of the depth image, the pixels at the same locations in (e) are mistakenly warped to (f). From (b) to (f), this process can be regarded as a left-shift operation with a shift interval of one full-pixel precision. To make sure that all the changed pixels are counted during the SVSD calculation, the dark points are complemented. Besides, these complemented pixels share the depth values of their neighboring pixels. Thus, after inverse warping, the pixels in (d) can be regarded as a right extension of the pixels in (a), with an extended interval of one full-pixel precision. Similarly, the pixels in (h) can be treated as a left extension of the pixels in (e), with an extended interval of one full-pixel precision as well. According to the observation above, the complicated forward warping, union operation, and inverse warping processes can be replaced by an extension process, which is a layer-level fusion operation. The optimized layer-based representation framework is shown in Fig. 3 (b).
III-B1. For the left view, the extension process can be performed as follows
If Δ > 0, the set Ψ_{L,Δ} can be generated by a right extension operation applied on the corresponding layer of the original reference view, with an extended interval of |Δ| full-pixel precision, and the set Ψ̃_{L,Δ} can be generated by a right extension operation applied on the corresponding layer of the decoded reference view, with the same extended interval. If Δ < 0, Ψ_{L,Δ} and Ψ̃_{L,Δ} can be generated by left extension operations applied on the corresponding layers of the original and decoded reference views, respectively, with an extended interval of |Δ| full-pixel precision. Otherwise (Δ = 0), the remaining pixels in the original and decoded reference views directly form Ψ_{L,0} and Ψ̃_{L,0}.
III-B2. For the right view, the extension process is opposite
If Δ > 0, the set Ψ_{R,Δ} can be generated by a left extension operation applied on the corresponding layer of the original reference view, with an extended interval of |Δ| full-pixel precision, and the set Ψ̃_{R,Δ} can be generated by a left extension operation applied on the corresponding layer of the decoded reference view, with the same extended interval. If Δ < 0, Ψ_{R,Δ} and Ψ̃_{R,Δ} can be generated by right extension operations applied on the corresponding layers of the original and decoded reference views, respectively, with an extended interval of |Δ| full-pixel precision. Otherwise (Δ = 0), the remaining pixels in the original and decoded reference views directly form Ψ_{R,0} and Ψ̃_{R,0}.
Therefore, instead of performing the complicated forward warping, union operation, and inverse warping processes, a fusion-like operation is performed at the layer level to generate the SVSDs, which makes the method more efficient.
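The extension operation for one layer can be sketched as follows. This is a sketch under our own naming: `mask` is a boolean row marking the layer's pixels, and, following the left-view rules above, a positive level extends the mask to the right while a negative level extends it to the left.

```python
def extend_layer(mask, delta):
    """Extend a layer's boolean pixel mask by |delta| full-pixel positions:
    to the right when delta > 0, to the left when delta < 0, and unchanged
    when delta == 0 (left-view convention of Section III-B1)."""
    n = len(mask)
    out = list(mask)
    for i in range(n):
        if mask[i]:
            for s in range(1, abs(delta) + 1):
                j = i + s if delta > 0 else i - s  # extension direction
                if 0 <= j < n:                     # stay inside the row
                    out[j] = True
    return out
```

This single pass over the mask replaces the forward-warp/union/inverse-warp chain for that layer, which is why the optimized framework is cheaper.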
IV. Experimental Results
In this paper, three state-of-the-art methods, namely Yuan et al. [5], Fang et al. [17], and Jin et al. [11], are chosen as the anchors in our comparisons. The training and testing data are first generated by calculating the VSD with the traditional MSE calculation and its associated SVSDs with the proposed layer-based representation, respectively. After that, two main experiments are conducted: i) accuracy comparisons and ii) efficiency comparisons.
IV-A. Training and Testing Data Generation
Here, 8 test sequences from the Common Test Conditions (CTC) of JCT-3V [21] are used. There are left and right reference views for each sequence, and each reference view has a texture video and its associated depth video. The first 25 frames of the reference views are compressed with 7 QP pairs and are further used to synthesize the virtual views. Therefore, there are 32 (8 × 4) original videos (original left texture, original left depth, original right texture, and original right depth) and 224 (8 × 4 × 7) compressed videos. The details of these test sequences are exhibited in TABLE I, including the resolutions, the view positions (where a (b, c) denotes that virtual view a is synthesized with views b and c), the indices of the used frames, and the recommended QP pairs for texture and depth videos.
To obtain the training and testing data, the VSD (ground truth) is first obtained by computing the MSE in Eq. (2) between the original synthesized views and the compressed synthesized views, which are synthesized with the original reference views and the compressed reference views, respectively. Then, 1400 (8 × 7 × 25) VSD results are obtained. After that, the proposed layer-based representation is conducted on the original and compressed reference views to obtain the SVSDs. As aforementioned, the three-sigma rule is used to confirm the available number of levels N according to the depth distortion in different frames; we obtain 937 samples with N = 1 and 463 samples with N = 3. Then, the VSDs and their associated SVSDs are divided into two parts, i.e., training and testing data, with a ratio of 2:1. To be fair, the division is randomly performed three times, and we get three groups of training and testing data.
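The three-sigma rule used here to choose the number of levels N can be sketched as follows, under the assumption stated in Section II-A that N = ⌈3σ⌉, with σ the (population) standard deviation of the approximately zero-mean disparity-difference map; the function name is ours.

```python
import math
import statistics


def available_levels(delta_map):
    """Number N of depth-change levels to keep for one frame, via the
    three-sigma rule: [-3*sigma, 3*sigma] covers ~99.7% of a zero-mean
    Gaussian, so levels beyond ceil(3*sigma) are discarded."""
    sigma = statistics.pstdev(delta_map)  # population standard deviation
    return math.ceil(3.0 * sigma)
```

A frame whose disparity differences have σ = 1 thus keeps the levels Δ ∈ {-3, …, 3}, matching the N = 3 samples above, while an undistorted frame yields N = 0.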


TABLE II
Test | N (frames) | Method | MSE | ΔMSE | PSNR (dB) | ΔPSNR (dB) | Time (s)
Test 1 | N = 1 (313) | GT | 6.4743 | / | 41.4288 | / | /
 | | Fang | 7.6905 | 1.2162 | 40.3488 | 1.0800 | 0.6682
 | | Yuan | 10.2094 | 3.7351 | 39.4684 | 1.9603 | 32.7745
 | | Jin | 7.2745 | 0.8002 | 40.5898 | 0.8390 | 3.3692
 | | Ours | 6.3341 | 0.1402 | 41.4292 | 0.0004 | 1.3349
 | N = 3 (155) | GT | 29.2056 | / | 34.2465 | / | /
 | | Fang | 30.4578 | 1.2522 | 33.9720 | 0.2745 | 0.5815
 | | Yuan | 48.6340 | 19.4284 | 31.9915 | 2.2550 | 27.6567
 | | Jin | 29.6714 | 0.4657 | 34.0862 | 0.1603 | 3.0297
 | | Ours | 28.9729 | 0.2327 | 34.2446 | 0.0019 | 1.7949
 | Average (468) | GT | 14.0029 | / | 39.0500 | / | /
 | | Fang | 15.2310 | 1.2281 | 38.2368 | 0.8132 | 0.6395
 | | Yuan | 22.9355 | 8.9326 | 36.9921 | 2.0579 | 31.0795
 | | Jin | 14.6923 | 0.6894 | 38.4358 | 0.6142 | 3.2568
 | | Ours | 13.8320 | 0.1708 | 39.0497 | 0.0003 | 1.4873
Test 2 | N = 1 (313) | GT | 6.6590 | / | 41.3541 | / | /
 | | Fang | 7.9168 | 1.2577 | 40.2716 | 1.0825 | 0.6687
 | | Yuan | 10.5356 | 3.8766 | 39.3893 | 1.9649 | 32.9871
 | | Jin | 7.4636 | 0.8045 | 40.4954 | 0.8587 | 3.3976
 | | Ours | 6.5804 | 0.0786 | 41.3442 | 0.0100 | 1.3389
 | N = 3 (155) | GT | 29.5211 | / | 34.2191 | / | /
 | | Fang | 30.7591 | 1.2379 | 33.9340 | 0.2852 | 0.5794
 | | Yuan | 49.7187 | 20.1976 | 31.9193 | 2.2999 | 26.8840
 | | Jin | 30.0273 | 0.5062 | 34.0644 | 0.1548 | 3.0102
 | | Ours | 29.3131 | 0.2081 | 34.2078 | 0.0113 | 1.7700
 | Average (468) | GT | 14.2309 | / | 38.9911 | / | /
 | | Fang | 15.4821 | 1.2512 | 38.1726 | 0.8184 | 0.6391
 | | Yuan | 23.5129 | 9.2820 | 36.9152 | 2.0758 | 30.9658
 | | Jin | 14.9366 | 0.7057 | 38.3655 | 0.6256 | 3.2693
 | | Ours | 14.1094 | 0.1215 | 38.9806 | 0.0104 | 1.4817
Test 3 | N = 1 (313) | GT | 7.0173 | / | 41.1795 | / | /
 | | Fang | 8.2793 | 1.2619 | 40.1180 | 1.0614 | 0.6528
 | | Yuan | 11.1017 | 4.0844 | 39.2342 | 1.9453 | 31.7427
 | | Jin | 7.8104 | 0.7930 | 40.3350 | 0.8445 | 3.2600
 | | Ours | 6.9255 | 0.0918 | 41.1620 | 0.0175 | 1.3047
 | N = 3 (155) | GT | 30.7228 | / | 34.1629 | / | /
 | | Fang | 31.9127 | 1.1899 | 33.8774 | 0.2856 | 0.5771
 | | Yuan | 51.3203 | 20.5976 | 31.9013 | 2.2616 | 27.3320
 | | Jin | 31.1067 | 0.3839 | 34.0060 | 0.1570 | 2.9837
 | | Ours | 30.4047 | 0.3181 | 34.1480 | 0.0149 | 1.7749
 | Average (468) | GT | 14.8685 | / | 38.8556 | / | /
 | | Fang | 16.1066 | 1.2381 | 38.0511 | 0.8045 | 0.6277
 | | Yuan | 24.4220 | 9.5535 | 36.8056 | 2.0500 | 30.2819
 | | Jin | 15.5260 | 0.6575 | 38.2388 | 0.6168 | 3.1685
 | | Ours | 14.7017 | 0.1668 | 38.8390 | 0.0166 | 1.4604



TABLE III: Settings and hyperparameters of XGBoost.

| Entries | Settings |
|---|---|
| booster | gbtree |
| objective | reg:gamma |
| gamma | 0.1 |
| max_depth | 16 |
| lambda | 3 |
| subsample | 0.7 |
| colsample_bytree | 0.7 |
| min_child_weight | 3 |
| silent | 1 |
| eta | 0.1 |
| seed | 1000 |
| nthread | 4 |
For each group of such data, we have 932 training data (624 with one level and 308 with three levels) and 468 testing data (313 with one level and 155 with three levels). The detailed settings and hyperparameters of XGBoost are shown in TABLE III. The training data are fed into the XGBoost system to train the nonlinear mapping function. With the well-learnt nonlinear function and the testing SVSDs, the VSD can be accurately predicted. In our experiments, three testing results are generated by training and testing on the three groups of data.
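As a concrete illustration, the TABLE III entries map directly onto an XGBoost parameter dictionary. The commented-out training call at the bottom is a sketch only: the feature matrix, labels, and number of boosting rounds are assumptions, not taken from this paper.

```python
# XGBoost configuration corresponding to TABLE III.
params = {
    "booster": "gbtree",
    "objective": "reg:gamma",
    "gamma": 0.1,
    "max_depth": 16,
    "lambda": 3,
    "subsample": 0.7,
    "colsample_bytree": 0.7,
    "min_child_weight": 3,
    "silent": 1,
    "eta": 0.1,
    "seed": 1000,
    "nthread": 4,
}

# Indicative training call (hypothetical data names):
# import xgboost as xgb
# dtrain = xgb.DMatrix(svsd_features, label=vsd_ground_truth)
# model = xgb.train(params, dtrain, num_boost_round=100)
```

With `reg:gamma`, the regressor assumes strictly positive targets, which matches the VSD values (MSEs) being predicted here.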
IV-B Accuracy comparison: MSE and PSNR
As shown in TABLE II, the best and the second-best results are highlighted in red and blue, respectively. Compared with the three anchors, the proposed method achieves the best prediction results in both MSE and PSNR, i.e., the smallest ΔMSE and ΔPSNR, where Δ denotes the absolute value of the difference between the ground truth and the predicted result.
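This reading of the Δ columns can be made explicit with a small sketch, using the Test 1, first-group numbers from TABLE II:

```python
def delta(ground_truth: float, predicted: float) -> float:
    """Absolute prediction error against the ground-truth metric."""
    return abs(ground_truth - predicted)

gt_mse = 6.4743  # ground-truth MSE, Test 1, first group
print(round(delta(gt_mse, 6.3341), 4))  # Ours: 0.1402
print(round(delta(gt_mse, 7.6905), 4))  # Fang: 1.2162
```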
Besides, we also compare the ground-truth MSE and PSNR with the predicted MSEs and PSNRs provided by the four different methods on all testing frames. As shown in Fig. 5, the experiments are conducted on the three groups of training and testing data, with the results shown in Fig. 5 (a), (b), and (c), respectively. The proposed method achieves the closest results to the ground truth in both MSE and PSNR.
All these experimental results demonstrate that the well-learnt nonlinear mapping function can accurately represent the relationship between the VSD and its associated SVSDs, which plays a critical role in view synthesis distortion estimation/prediction. With such a well-learnt nonlinear mapping function, once the SVSDs are given, their associated VSD can be accurately predicted. On the one hand, this can facilitate the optimization of 3D video coding by figuring out the exact contribution of each kind of SVSD to the VSD. On the other hand, as the SVSDs are represented by different levels of depth changes, it can also help us design an optimal depth codec by increasing or decreasing different levels of depth changes to obtain the smallest VSD. To the best of our knowledge, the existing methods discussed in subsection I-C can hardly achieve this.
IV-C Efficiency comparison: running time
In this subsection, the complexity of the four methods is compared, where the VSD of an entire frame is predicted. The average running time over all 468 testing frames is shown in TABLE II, in seconds (s). The running time of the proposed method listed in TABLE II involves two parts: the generation of the SVSDs, and the training and testing of the nonlinear mapping function. The ratio between these two parts is about 1000:1 in our tests. According to the experimental results, the proposed method is competitive with the state-of-the-art method (e.g., Fang's [17] method) in terms of efficiency.
Of note, the proposed method is well-suited to parallel processing: during the SVSD calculation, each layer can be processed independently, e.g., by a separate CPU or GPU thread. Besides, all the anchors except Fang's [17] method are also amenable to parallel processing. When the advantages of a parallel design are taken into account, our method outperforms the state of the art in terms of efficiency, owing to the layer-level operations in the SVSD calculation.
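The parallel-friendliness claim can be illustrated with a minimal sketch. Since each layer groups the pixels sharing one level of depth change, the per-layer SVSD computations have no mutual dependencies and can be mapped onto a thread pool; `compute_svsd` below is a hypothetical stand-in for the real per-layer calculation.

```python
from concurrent.futures import ThreadPoolExecutor

def compute_svsd(layer):
    # Hypothetical stand-in: the real per-layer SVSD calculation would use
    # the layer's texture degradation and position shifting; here we just
    # return a mean-square value over the layer's pixels.
    return sum(p * p for p in layer) / len(layer)

# Toy layers, one per level of depth change.
layers = [[1, 2], [3, 4], [5, 6]]

# Each layer is processed independently, so the map can run concurrently.
with ThreadPoolExecutor() as pool:
    svsds = list(pool.map(compute_svsd, layers))
print(svsds)  # [2.5, 12.5, 30.5]
```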
V Conclusion
In this paper, we have proposed an auto-weighted, layer-representation-based view synthesis distortion estimation method for 3D video coding. To achieve this, the levels of depth changes and their associated texture degradation have been used to define the sub view synthesis distortion (SVSD). After that, a set of theoretical derivations has demonstrated that the VSD can be approximately decomposed into the SVSDs multiplied by their associated weights. We have also developed a layer-based representation of the SVSD, where all the pixels with the same level of depth changes are represented by one layer to enable efficient SVSD calculation. Meanwhile, we have learnt a nonlinear mapping function to better represent the relationship between the VSD and the SVSDs based on our newly built dataset. Experimental results have demonstrated that the proposed method outperforms relevant state-of-the-art VSD estimation methods in both accuracy and efficiency. Besides, unlike existing VSD estimation methods, this is the first work to relate different levels of depth changes to the VSD. This allows many new applications to be developed for 3D video coding in our future work, such as optimizing 3D coding by figuring out the exact contribution of the SVSDs to the VSD, and building a more efficient depth codec by increasing or decreasing different levels of depth changes to obtain the smallest VSD.
References
 [1] P. Merkle, A. Smolic, K. Muller, and T. Wiegand, “Multiview video plus depth representation and coding,” in Proc. IEEE International Conference on Image Processing, San Antonio, TX, USA, Sep. 2007, pp. I.201–I.204.
 [2] C. Fehn, “Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV,” Proc. SPIE, vol. 5291, pp. 93–105, May 2004.
 [3] H. Yuan, Y. Chang, J. Huo, F. Yang, and Z. Lu, “Modelbased joint bit allocation between texture videos and depth maps for 3d video coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 4, pp. 485–497, 2011.
 [4] F. Shao, G. Jiang, M. Yu, K. Chen, and Y.S. Ho, “Asymmetric coding of multiview video plus depth based 3D video for view rendering,” IEEE Transactions on Multimedia, vol. 14, no. 1, pp. 157–167, 2011.
 [5] H. Yuan, S. Kwong, X. Wang, Y. Zhang, and F. Li, “A virtual view PSNR estimation method for 3D videos,” IEEE Transactions on Broadcasting, vol. 62, no. 1, pp. 134–140, 2016.
 [6] P. Gao and W. Xiang, “Ratedistortion optimized mode switching for errorresilient multiview video plus depth based 3d video coding,” IEEE Transactions on Multimedia, vol. 16, no. 7, pp. 1797–1808, 2014.
 [7] Z. Zheng, J. Huo, B. Li, and H. Yuan, “Fine virtual view distortion estimation method for depth map coding,” IEEE Signal Processing Letters, vol. 25, no. 3, pp. 417–421, 2017.
 [8] W.S. Kim, A. Ortega, P. Lai, D. Tian, and C. Gomila, “Depth map distortion analysis for view rendering and depth coding,” in 2009 16th IEEE International Conference on Image Processing (ICIP). IEEE, 2009, pp. 721–724.
 [9] J. Jin, A. Wang, Y. Zhao, C. Lin, and B. Zeng, “Region-aware 3D warping for DIBR,” IEEE Transactions on Multimedia, vol. 18, no. 6, pp. 953–966, 2016.
 [10] Y. Zhang, S. Kwong, L. Xu, S. Hu, G. Jiang, and C.C. J. Kuo, “Regional bit allocation and rate distortion optimization for multiview depth video coding with view synthesis distortion model,” IEEE Transactions on Image Processing, vol. 22, no. 9, pp. 3497–3512, 2013.
 [11] J. Jin, J. Liang, Y. Zhao, C. Lin, C. Yao, and L. Meng, “Pixellevel view synthesis distortion estimation for 3d video coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 7, pp. 2229–2239, 2019.
 [12] P. Gao and A. Smolic, “Occlusionaware depth map coding optimization using allowable depth map distortions,” IEEE Transactions on Image Processing, vol. 28, no. 11, pp. 5266–5280, 2019.
 [13] Y. Zhou, C. Hou, W. Xiang, and F. Wu, “Channel distortion modeling for multiview video transmission over packetswitched networks,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 11, pp. 1679–1692, 2011.
 [14] G. Cheung, J. Ishida, A. Kubota, and A. Ortega, “Transform domain sparsification of depth maps using iterative quadratic programming,” in 2011 18th IEEE International Conference on Image Processing. IEEE, 2011, pp. 129–132.

 [15] D. Zhang and J. Liang, “View synthesis distortion estimation with a graphical model and recursive calculation of probability distribution,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 5, pp. 827–840, 2014.
 [16] J. Jin, J. Liang, Y. Zhao, C. Lin, C. Yao, and A. Wang, “A depth-bin-based graphical model for fast view synthesis distortion estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 6, pp. 1754–1766, 2018.
 [17] L. Fang, N.M. Cheung, D. Tian, A. Vetro, H. Sun, and O. C. Au, “An analytical model for synthesis distortion estimation in 3d video,” IEEE Transactions on Image Processing, vol. 23, no. 1, pp. 185–199, 2013.
 [18] S. Li, C. Zhu, and M.T. Sun, “Hole filling with multiple reference views in dibr view synthesis,” IEEE Transactions on Multimedia, vol. 20, no. 8, pp. 1948–1959, 2018.
 [19] J. Jin, A. Wang, Y. Zhao, C. Lin, and B. Zeng, “Regionaware 3d warping for dibr,” IEEE Transactions on Multimedia, vol. 18, no. 6, pp. 953–966, 2016.
 [20] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 2016, pp. 785–794.
 [21] Joint Collaborative Team for 3DV (Mar. 2013), 3D-HTM Software Platform [Online]. Available: https://hevc.hhi.fraunhofer.de/svn/svn3DVCSoftware/tags/
 [22] Fraunhofer Heinrich Hertz Institute (Sep. 2013), 3DV Sequences of HHI [Online]. Available: ftp://ftp.hhi.de/HHIMPEG3DV
 [23] Nagoya University (March 2008), 3DV Sequences of Nagoya University [Online]. Available: http://www.tanimoto.nuee.nagoyau.ac.jp/mpeg/mpegftv.html
 [24] “3DV sequences of ETRI and GIST,” Electron. Telecommun. Res. Institute Gwangju Inst. Sci. Technol., Daejon, Korea, Apr. 2008. [Online]. Available: ftp://203.253.130.48
 [25] M. Domański, T. Grajek, K. Klimaszewski, M. Kurc, O. Stankiewicz, J. Stankowski, and K. Wegner, “Poznan multiview video test sequences and camera parameters,” document MPEG 2009/M17050, ISO/IEC JTC1/SC29/WG11, Xi'an, China, Oct. 2009.
 [26] D. Rusanovskyy, P. Aflaki, and M.M. Hannuksela, “Undo Dancer 3DV sequence for purposes of 3DV standardization,” ISO/IEC JTC1/SC29/WG11, Doc. M20028, Geneva, CH, March 2011.
 [27] J. Zhang, R. Li, H. Li, D. Rusanovskyy and M.M. Hannuksela, “Ghost Town Fly 3DV sequence for purposes of 3DV standardization,” ISO/IEC JTC1/SC29/WG11, Doc. M20027, Geneva, CH, March 2011.
 [28] J. Jin, Y. Zhao, C. Lin, and A. Wang, “An accurate and efficient nonlinear depth quantization scheme,” Pacific Rim Conference on Multimedia, Springer, pp. 390–399, 2015.
 [29] L. Zhao, A. Wang, B. Zeng, and J. Jin, “Scalable coding of depth images with synthesis-guided edge detection,” KSII Transactions on Internet and Information Systems, vol. 9, no. 10, pp. 4108–4125, 2015.
 [30] S. Tian, L. Zhang, L. Morin and O. Déforges, “NIQSV+: A noreference synthesized view quality assessment metric,” IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 1652–1664, 2017.
 [31] S. Tian, L. Zhang, L. Morin, and O. Déforges, “A benchmark of DIBR synthesized view quality assessment metrics on a new database for immersive media applications,” IEEE Transactions on Multimedia, vol. 21, no. 5, pp. 1235–1247, May 2019.
 [32] S. Ling, J. Li, Z. Che, W. Zhou, J. Wang, Junle and P. Le Callet, “Revisiting discriminator for blind freeviewpoint image quality assessment,” IEEE Transactions on Multimedia, vol. 23, pp. 4245–4258, 2020.