Auto-Weighted Layer Representation Based View Synthesis Distortion Estimation for 3-D Video Coding

Recently, various view synthesis distortion estimation models have been studied to better serve 3-D video coding. However, they can hardly model the relationship among different levels of depth changes, texture degeneration, and the view synthesis distortion (VSD) quantitatively, which is crucial for rate-distortion optimization and rate allocation. In this paper, an auto-weighted layer representation based view synthesis distortion estimation model is developed. Firstly, the sub-VSD (S-VSD) is defined according to the level of depth changes and the associated texture degeneration. After that, a set of theoretical derivations demonstrates that the VSD can be approximately decomposed into the S-VSDs multiplied by their associated weights. To obtain the S-VSDs, a layer-based representation of the S-VSD is developed, where all the pixels with the same level of depth changes are represented with a layer to enable efficient S-VSD calculation at the layer level. Meanwhile, a nonlinear mapping function is learnt to accurately represent the relationship between the VSD and the S-VSDs, automatically providing weights for the S-VSDs during the VSD estimation. To learn such a function, a dataset of VSDs and their associated S-VSDs is built. Experimental results show that the VSD can be accurately estimated with the weights learnt by the nonlinear mapping function once its associated S-VSDs are available. The proposed method outperforms the relevant state-of-the-art methods in both accuracy and efficiency. The dataset and source code of the proposed method will be available at https://github.com/jianjin008/.


I Introduction

Fig. 1: Illustration of lossy-compression-caused texture degradation with different levels of position shifting distortion. (a) and (b) are synthesized by the original reference views and the decoded reference views, respectively. The green patches, located on the main bodies of the person and the cylinder, hardly contain depth boundaries, where only small depth changes caused by compression exist and lead to small position shifting in (b1) and (b2). The red patches contain many depth boundaries around the fingers and the cylinder. Compression causes large depth changes around depth boundaries, leading to large position shifting, e.g., finger misalignment and cylinder boundary erosion in (b3) and (b4). Besides, texture changes in the decoded texture reference views are propagated to the virtual view under the direction of the changed depth, causing the texture degradation in (b1) to (b4).

I-A Motivation

In recent years, 3-D video technology has become popular due to the fresh viewing experiences it offers, e.g., immersion, high interactivity, and a large degree of freedom. In a 3-D video system, multiview video plus depth (MVD) [1] is the main data format. The MVD format records the color and depth information of the same physical scene from different views. With MVD data, arbitrary virtual views can be synthesized via a depth-image-based rendering (DIBR) technique [2, 19]. Commonly, the performance of a 3-D video system is mainly measured by the distortion/quality [30, 31, 32] of the synthesized virtual view. Hence, view synthesis distortion (VSD) estimation is crucial, especially for 3-D video applications. For instance, the estimated VSD [10] is generally used for rate-distortion optimization [3], rate allocation [4], the design of error resilience techniques [6], etc.

The main cause of the VSD is the changes/errors in the reference texture and depth videos due to lossy compression or transmission errors. During the view synthesis process, texture changes may cause VSD at the luminance/chrominance level, namely texture degradation, whereas depth changes may cause complex geometric VSD. Moreover, different levels of depth changes lead to different levels of geometric VSD. In the 1-D parallel model, the geometric VSD commonly refers to position shifting [7]. Besides, the propagation of texture degradation from the decoded reference views to the corresponding virtual views is also directed by the depth changes. After integrating the texture degradation with the position shifting, the texture degradation with different levels of position shifting can be regarded as different kinds of sub-VSD (S-VSD), which together form the final VSD.

As shown in Fig. 1, (a) and (b) are synthesized by the original reference views (uncompressed texture and depth reference views) and the decoded reference views (texture and depth reference views compressed by H.264 with QP pair (45, 48)), respectively. Some magnified patches of local distortion are exhibited in (a1)-(a4) and (b1)-(b4). The VSD in the green patches mainly consists of texture degradation with small position shifting, as shown in (b1)-(b2), where the shifting is mainly due to the depth changes caused by lossy compression. As the green patches are located on the bodies of the person and the cylinder, their original depth is smooth. Even after compression, the level of depth changes in the green patches is low, which only causes non-obvious position shifting; after integration with the texture degradation, only texture degradation is obviously observed in the green patches. In contrast, the VSD in the red patches mainly consists of texture degradation with large position shifting, such as the misaligned fingers in (b3) and the erosion around the boundary of the cylinder in (b4). Since obvious depth boundaries exist around the hand and the cylinder, lossy compression smooths these boundaries and brings large depth changes, which leads to significant position shifting effects in the red patches. After the texture degradation propagation, obvious texture degradation together with position shifting can be observed in the red patches. All these distortions, including those in (b1)-(b4), form the final VSD in (b).

Inspired by the above, after taking the texture distortion into account, different levels of depth changes can be used to represent different kinds of S-VSDs, which can be further used to predict the VSD. On the one hand, this can benefit the optimization of 3-D video coding [29] by figuring out the exact contribution of each kind of S-VSD to the VSD. On the other hand, it can also help us design an optimal depth codec [28] by increasing or decreasing different levels of depth changes to bring in the smallest VSD. To the best of our knowledge, existing methods, such as those reviewed in subsection I-C2, cannot represent the relationship between the S-VSDs and the VSD accurately, which is the key challenge of this work.

I-B Our contributions

In this paper, we propose an auto-weighted layer representation based view synthesis distortion estimation model. This is the first work that utilizes a learning-based approach to mine the accurate relationships among the degeneration of texture, the changes of depth, and the VSD, especially the relationship between the VSD and its associated S-VSDs. It provides a methodology to predict the VSD from its associated S-VSDs, which can be used to optimize the design of 3-D video coding, especially depth coding. The main contributions are summarized as follows.

  • This is the first work to relate different levels of depth changes together with their texture degeneration to the view synthesis distortion (VSD), which is crucial for various 3-D video applications, such as the aforementioned 3-D video coding, depth coding, etc.

  • The sub view synthesis distortion (S-VSD) is first defined in this paper according to the level of depth changes and the associated texture degeneration. Besides, an elaborate derivation is given to demonstrate that the VSD can be approximately decomposed into different kinds of S-VSDs.

  • To accurately represent the relationship between the VSD and its associated S-VSDs, a nonlinear mapping function is learnt on our newly built dataset, which is the first dataset for mining the relationship between the VSD and the S-VSDs.

  • To calculate the S-VSDs efficiently, a layer-based representation method is proposed and further optimized, where all the pixels with the same level of depth changes (i.e., the same S-VSD) are represented with a layer. This enables the S-VSD calculation to be performed at the layer level.

Compared with existing VSD estimation methods, the well-learnt nonlinear mapping function is able to accurately represent the relationship between the VSD and the S-VSDs. Meanwhile, the proposed layer-based representation enables the VSD estimation to be performed at the layer level, without spending additional calculation on partly performing the view synthesis process at the pixel level, which makes the proposed method more efficient.

I-C Related work

I-C1 View synthesis

In this paper, view synthesis mainly refers to DIBR-based view synthesis, which commonly contains two steps, namely warping and blending.

During the warping step, forward warping, warping competition, and a rounding operation are performed. The goal of the warping step is to warp the pixels of the reference views to the warped views. Assume that a pixel at a given location in the original reference view is warped to a new location in the warped view, where a subscript is used to index the left or right view. This process can be formulated as

(1)

where the disparity of the pixel is determined by its depth value and the rounding operation maps the warped position to full-pixel precision. The remaining parameters are the baseline between the cameras, the focal length of the cameras, and the depth range of the physical scene. In Eq. (1), the disparity can therefore be regarded as a function of the depth value, which is written in this form for simplification.
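To make this step concrete, a minimal sketch is given below; it assumes the standard 1-D parallel depth-to-disparity relation and full-pixel warping, and all function names, parameters (f, b, z_near, z_far), and the sign convention are our own illustrative choices rather than the exact form of Eq. (1).

import numpy as np

def depth_to_disparity(depth, f, b, z_near, z_far):
    # Convert an 8-bit depth map to a disparity map under the standard
    # 1-D parallel DIBR relation (our notation: f = focal length,
    # b = camera baseline, [z_near, z_far] = depth range of the scene).
    inv_z = depth.astype(np.float64) / 255.0 * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    return f * b * inv_z            # disparity in pixels, before rounding

def forward_warp_row(texture_row, disparity_row, direction=+1):
    # Warp one row of a reference view with the rounding operation of Eq. (1).
    # `direction` encodes the left/right warping convention (an assumption);
    # warping competition keeps the pixel closest to the camera, i.e., the
    # one with the largest disparity.
    width = texture_row.shape[0]
    warped = np.full(width, -1, dtype=np.int32)     # -1 marks dis-occlusions
    best = np.full(width, -np.inf)
    for x in range(width):
        x_new = x + direction * int(np.round(disparity_row[x]))
        if 0 <= x_new < width and disparity_row[x] > best[x_new]:
            best[x_new] = disparity_row[x]
            warped[x_new] = int(texture_row[x])
    return warped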

After the warping step, there will be many dis-occlusions in the warped views, since the occluded parts of the reference views become visible. To fill the dis-occlusions, the blending step is carried out by merging the two warped views into a virtual one. Three blending strategies are followed according to three different cases during the blending step: i) if the current pixel of the virtual view is visible in both warped views, a weighted average of the two values is used; ii) if it is visible in only one of the warped views, that value is used directly; iii) otherwise, an inpainted value is used.
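The three blending cases can be sketched as follows; the equal weight alpha = 0.5 is a placeholder only, since the actual weights depend on the camera positions, and the `hole` marker is our own convention for dis-occluded pixels.

import numpy as np

def blend_rows(left_warped, right_warped, alpha=0.5, hole=-1):
    # Blend two warped rows according to the three cases above.
    left_warped = left_warped.astype(np.float64)
    right_warped = right_warped.astype(np.float64)
    out = np.full(left_warped.shape, float(hole))
    both = (left_warped != hole) & (right_warped != hole)
    only_l = (left_warped != hole) & (right_warped == hole)
    only_r = (left_warped == hole) & (right_warped != hole)
    out[both] = alpha * left_warped[both] + (1.0 - alpha) * right_warped[both]
    out[only_l] = left_warped[only_l]       # visible only in the left view
    out[only_r] = right_warped[only_r]      # visible only in the right view
    # pixels still equal to `hole` fall into the third case and would be
    # filled by an inpainting method
    return out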

I-C2 View synthesis distortion estimation

Commonly, there are two typical categories of VSD estimation methods, which estimate the view synthesis distortion caused by lossy compression and by transmission errors (packet loss), respectively. Kim et al. [8] first developed a camera- and video-parameter-based quality metric to quantify the effect of lossy coding of depth maps on the synthesized view quality. After that, Yuan et al. [3] proposed a concise distortion model by analyzing the impact of the compression distortion of texture images and depth maps on the quality of the virtual views. Meanwhile, Zhang et al. [10] proposed a view synthesis distortion model that takes regional characteristics into account for depth video coding. Based on this, Fang et al. [17] related errors in the depth images to the synthesis quality by taking texture image characteristics and the warping step of view synthesis into account. However, the warping step is only used to relate the depth distortion to the synthesized view at the frame level, which limits the accuracy of the VSD estimation. To predict the VSD more accurately, Yuan et al. [5] utilized the warping step of view synthesis to simulate the error propagation from the distorted depth to the virtual synthesized view at the pixel level, which directly measures the quality of the virtual view by partly carrying out view synthesis. However, the blending and inpainting steps of view synthesis are still not considered due to their complicated operations. Jin et al. [11] proposed a pixel-level VSD estimation in which the warping and blending steps are partly taken into account to build a more accurate relation between the distorted depth together with the texture and the VSD, achieving the state-of-the-art result. Nevertheless, compared with the pixel-level VSD estimation methods in [5] and [11], the frame-level one in [17] is more efficient when pixel-level parallel processing is not considered. Meanwhile, Gao and Smolic [12] derived a depth distortion range within which depth changes bring no geometric distortion.

To model the transmission-error-caused distortion, Zhou et al. [13] first derived a channel distortion model for multi-view video transmission over lossy packet-switched networks, which estimates the channel-caused distortion at the frame level. Then, a quadratic model was proposed by Cheung et al. [14], which first relates the disparity errors caused by packet loss in the depth maps to the distortion contribution in the synthesized view. After that, Gao and Xiang [6] developed an end-to-end 3-D video transmission oriented VSD estimation model for 3-D video coding to improve error resilience. To accurately model the error propagation process during view synthesis, Zhang and Liang [15] proposed a depth-value-based graphical model (DVGM). By taking the transmission error into account, it can accurately estimate the transmission-caused view synthesis distortion. To further speed up the DVGM, Jin et al. [16] proposed a depth-bin-based graphical model for VSD estimation, which is more efficient without sacrificing accuracy.

As reviewed above, all these methods try to predict the VSD by modeling frame-level or pixel-level depth distortion, without considering the exact contribution of different levels of depth changes to the VSD. Besides, to build the relation between the distorted reference views and the virtual synthesized view, all these methods partly integrate the view synthesis process into their approaches, e.g., the warping step. This leads to two drawbacks. 1) Accurate prediction of the VSD cannot be achieved by only partly using view synthesis (e.g., the warping step) to relate the distorted texture and depth to the VSD, since the blending step of view synthesis also affects the synthesis result, while its nonlinear operations (warping competition, inpainting, etc.) can hardly be formulated in such VSD estimation methods. 2) Even though part of view synthesis is performed in such methods, the calculation is carried out at the pixel level, namely the partial view synthesis process is performed for each pixel, which reduces their efficiency to some degree. To overcome these drawbacks, we first learn a nonlinear mapping function on our dataset to exploit the exact relationship between different levels of depth changes together with their associated distorted texture (the S-VSDs) and the VSD. Then, we propose an efficient layer-based representation method, which enables the VSD estimation to be performed at the layer level.

The outline of the rest of our paper is as follows. First, the proposed model is presented in Section II. Then, the layer-based representation method is developed in Section III. Section IV presents experimental results and Section V concludes this paper.

II The Proposed Model

In this section, we first define the total view synthesis distortion (VSD) and the sub view synthesis distortion (S-VSD). To better understand the S-VSD, a detailed analysis of the S-VSD is then given. After that, a set of theoretical derivations is presented according to the view synthesis process, from which we demonstrate that the VSD can be approximately decomposed into the S-VSDs with their associated weights. Finally, a nonlinear mapping function is used to learn the weights between the VSD and its associated S-VSDs.

II-A Definition of the VSD and S-VSD

In this paper, the view synthesis distortion of the virtual view (i.e., the VSD) is formulated as the Mean Square Error (MSE) over the entire frame of the synthesized view, as used in [17]:

(2)

where the width and the height of the virtual view normalize the sum, and the two color values belong to two pixels at the same location in two virtual views, which are synthesized with the original reference views and the decoded reference views, respectively.
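In our notation, with W and H the frame size and with the virtual views synthesized from the original and decoded references denoted by S_V and its decoded counterpart, Eq. (2) reads

\[
D_{\mathrm{VSD}} \;=\; \frac{1}{W H}\sum_{x=1}^{W}\sum_{y=1}^{H}\Bigl(S_{V}(x,y)-\tilde{S}_{V}(x,y)\Bigr)^{2}.
\]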

As mentioned for Eq. (1), depth changes may lead to disparity changes. Assume that the pixel at the same location in the decoded reference view is warped to a (possibly different) position in the warped view. Then, we have

(3)
Fig. 2: Illustration of the analysis on S-VSDs during view synthesis.

Besides, different levels of depth changes may cause different levels of disparity changes, namely different levels of warped position shifting. Here, the warped position shifting distortion caused by the depth change of a pixel is measured by the difference between its original and mistaken warped positions, and

(4)

Therefore, the warped position shifting distortion of a pixel can be regarded as a function of the level of its depth change. After integrating the texture distortion, the sub view synthesis distortion (S-VSD) in this paper is defined as

(5)

where each S-VSD is defined over all the pixels with the same level of depth changes (measured by the warped position shifting) in the left or right warped view, whose locations are collected in a set, and the cardinality of this set is used for normalization. The associated pixels in the original and decoded reference views, whose locations are collected in two corresponding sets, provide the texture values used in the calculation. It should be noticed that the whole frame can be divided into several such sets according to the level of depth changes: the union of these sets covers the whole frame, and the intersection of any two of them is empty.
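For concreteness, writing Phi_{t,k} for the location set of the pixels with the k-th level of depth changes in view t in {L, R}, and letting the original and decoded reference textures be sampled at the corresponding (back-warped) pixel pairs, a compact form of Eq. (5) in our own notation is

\[
D^{S}_{t,k} \;=\; \frac{1}{\lvert \Phi_{t,k}\rvert}\sum_{(x,y)\in\Phi_{t,k}}\Bigl(S_{t}(x,y)-\tilde{S}_{t}(x,y)\Bigr)^{2},
\qquad
\bigcup_{k}\Phi_{t,k}=\Phi_{t},\quad
\Phi_{t,k}\cap\Phi_{t,k'}=\varnothing\;\;(k\neq k').
\]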

As the depth distortion caused by lossy compression (source coding) can be approximately regarded as zero-mean white noise, the three-sigma rule is used to determine the available number of levels of depth changes.

II-B A detailed analysis of the S-VSD

To better understand the S-VSD, an analysis based on an example is given in this subsection. Assume that there are three different levels of depth changes caused by lossy compression in the left and right depth reference views. Therefore, six S-VSDs need to be calculated for each VSD. Here, only the three S-VSDs of the left view are discussed for simplicity; the other three S-VSDs of the right view are similar.

As shown in Fig. 2, pixels with the same warped position shifting distortion are masked with the same color points in the original and decoded reference views. Pixels with the two nonzero levels of shifting are highlighted with blue and red points, respectively; the rest are pixels with zero shifting, which means no warped position shifting distortion exists for them. A set is used to collect the locations of the blue points in the original reference view, which occupy a horizontal interval. These pixels are warped to an interval in the original warped view, and their locations are collected in a second set. As lossy compression changes the depth values of these pixels, the corresponding pixels in the decoded reference view are mistakenly warped to a new interval in the decoded warped view, and their locations are collected in a third set. Assume that all the pixels of the two warped sets are eventually exhibited in the virtual views. By fusing these two warped sets with a union operation, we obtain a fused set, whose complementary parts are highlighted with dark points. Finally, we can easily locate the associated sets in the original and decoded reference views by backward warping [18] the pixels of the fused set. Therefore, the S-VSD can physically be regarded as the Mean Square Error over the pixels of these two sets. Likewise, the definitions of the other S-VSDs in the left view are similar.

It should be noticed that the purpose of the union operation is to ensure that the distorted pixels are fully counted during the S-VSD calculation. Besides, following the convention in backward warping that the missing depth value of a pixel is assigned from its adjacent pixel, we assign the depth values of the dark points from their adjacent pixels in the sets during the inverse warping process.

II-C A theoretical derivation: from VSD to S-VSDs

To relate the pixel in the virtual view to its associated pixels in the original and decoded reference views, we also take the inverse process into account as in [11]. First, we relate the pixel in the virtual view to its associated pixels in the warped views by taking the blending strategies into account. The color value synthesized from the original views in Eq. (2) can be rewritten as

(6)

where the two color terms denote the color values of the corresponding pixels in the original left and right warped views, and the two blending weights for the left and right warped views sum to one and are determined by the locations of the camera array. The first term represents the case in which the pixel in the virtual view is visible in both warped views, the second and third terms represent the cases in which it is visible in only one of the warped views, and the last term formulates the inpainting case.
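In our notation, with S_{w,L} and S_{w,R} the original left and right warped views, blending weights alpha + beta = 1, and S_inp the inpainted value, the four cases of Eq. (6) can be written as

\[
S_{V}(x,y) \;=\;
\begin{cases}
\alpha\,S_{w,L}(x,y)+\beta\,S_{w,R}(x,y), & \text{visible in both warped views},\\
S_{w,L}(x,y), & \text{visible only in the left warped view},\\
S_{w,R}(x,y), & \text{visible only in the right warped view},\\
S_{\mathrm{inp}}(x,y), & \text{otherwise (inpainting)}.
\end{cases}
\]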

Similarly, the color value synthesized from the decoded views can be rewritten as

(7)

where the two color terms denote the color values of the corresponding pixels in the decoded left and right warped views.

Then, we make an approximation. As the last (inpainting) case is infrequent, occupying only around 1% of all four cases as mentioned in [19], the color value synthesized from the original views can be approximately rewritten as

(8)

where the two blending weights for the left and right warped views are the same as above. Similarly, we have

(9)

After that, we relate the pixels in the warped views to their associated pixels in the reference views by considering the warping step of view synthesis. Likewise, we assume that the pixels at a given position in the original left and right warped views are warped from pixels in the original left or right reference view. We have

(10)

where the color value is that of the corresponding pixel in the original left or right reference view, and its associated disparity is given by Eq. (1).

Similarly, we assume that the pixels in the decoded warped views are warped from pixels in the decoded reference views according to their distorted depth values. We have

(11)

where the color value is that of the corresponding pixel in the decoded left or right reference view, and its associated disparity is given by Eq. (3).

Therefore, the color values of the pixels in the original and decoded warped views can be rewritten as

(12)

and

(13)

By substituting Eq. (12) and Eq. (13) into Eq. (8) and Eq. (9), respectively, we obtain

(14)

and

(15)

Eq. (14) and Eq. (15) relate the pixel in the virtual view to its associated pixels in the reference views. Finally, Eq. (2) is rewritten as Eq. (16).

Eq. (5) can be rewritten as

(17)

As the intersection of any two of these sets is empty and their union covers the whole frame, as aforementioned, we have

(18)

Accordingly, we obtain

(19)

By plugging Eq. (18) and Eq. (19) into Eq. (16), we obtain Eq. (20). Based on Eq. (20), the VSD can be approximately decomposed into the S-VSDs with their associated weights. To obtain the exact weights between the VSD and its associated S-VSDs, a nonlinear mapping function is employed instead of a linear one, so as to well learn the nonlinear relation between the VSD and the S-VSDs. This is due to the nonlinear operations existing in view synthesis, e.g., hole filling, inpainting, warping competition, and so on. Then, we have Eq. (21), where the VSD is regarded as a nonlinear function of the S-VSDs via a functional relation. This also provides a theoretical basis that the VSD can be predicted from its associated S-VSDs once the functional relation is obtained.
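In our notation, with w_{t,k} the (approximate) weights, K levels per view, and f the nonlinear mapping to be learned, Eq. (20) and Eq. (21) take the form

\[
D_{\mathrm{VSD}} \;\approx\; \sum_{t\in\{L,R\}}\sum_{k=1}^{K} w_{t,k}\,D^{S}_{t,k}
\quad\Longrightarrow\quad
D_{\mathrm{VSD}} \;\approx\; f\bigl(D^{S}_{L,1},\dots,D^{S}_{L,K},\,D^{S}_{R,1},\dots,D^{S}_{R,K}\bigr).
\]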

II-D Learning a nonlinear mapping function

To make the prediction from the S-VSDs closely approximate the actual VSD, we formulate an optimization program as follows

(22)

where the index runs over the training samples. For each sample, the series of S-VSDs can be obtained with our layer-based representation method, which is introduced in Section III. The associated actual VSD (i.e., the MSE) can be directly obtained by calculating the mean square error between the pair of virtual views synthesized with the corresponding original reference views and decoded reference views. A loss term is used to measure the approximation between the prediction and the actual VSD. With such a learnt nonlinear mapping function, once another series of S-VSDs is given, the associated VSD can be accurately predicted.
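In our notation, with N training samples, D^VSD_i the ground-truth VSD of sample i, D^S_i its vector of S-VSDs, and L the approximation measure, Eq. (22) is of the form

\[
f^{*} \;=\; \arg\min_{f}\;\sum_{i=1}^{N}\mathcal{L}\Bigl(D^{\mathrm{VSD}}_{i},\, f\bigl(\mathbf{D}^{S}_{i}\bigr)\Bigr).
\]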

In particular, to obtain such a nonlinear function, we employ XGBoost [20], a scalable end-to-end tree boosting system. Specifically, by introducing trees into our model, the optimization program in Eq. (22) can be rewritten as

(23)
Fig. 3: Illustration of the framework of our proposed layer-based representation method.
Fig. 4: Illustration of layered representation.

where the first term represents the training error of each sample, for which we adopt the mean square error, and the second term measures the complexity of the trees.

Of note, minimizing the tree complexity facilitates the generalization ability of our algorithm, while minimizing the prediction error guarantees the accuracy of our model. The parameter settings of XGBoost are given in subsection IV-A.
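A plausible form of Eq. (23), consistent with the description above and with the regularized objective of XGBoost [20], is

\[
\min_{\{f_{m}\}}\;\sum_{i=1}^{N} l\Bigl(D^{\mathrm{VSD}}_{i},\,\hat{D}^{\mathrm{VSD}}_{i}\Bigr) \;+\; \sum_{m}\Omega(f_{m}),
\qquad
\hat{D}^{\mathrm{VSD}}_{i}=\sum_{m} f_{m}\bigl(\mathbf{D}^{S}_{i}\bigr),
\quad
\Omega(f)=\gamma T+\tfrac{1}{2}\lambda\lVert w\rVert^{2},
\]

where l is the squared training error, each f_m is a regression tree, T is the number of leaves of a tree, w collects its leaf weights, and gamma and lambda are the regularization parameters of [20].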

III Layer-Based Representation

Following the analysis of the S-VSD, a layer-based representation method is first developed in this section to generate the S-VSDs. Then, the layer-based representation is optimized to reduce its complexity and speed up the S-VSD generation. The details are presented as follows.

III-A Methodology

The layer-based representation method is performed in a similar way on the left and right views. For simplicity, only its application to the left view is elaborated in the following parts. The framework of the layer-based representation is shown in Fig. 3 (a).

III-A1 Disparity Conversion

The left original and decoded depth images are taken as input. Their corresponding disparity images are then obtained by plugging them into Eq. (1).

III-A2 Disparity Difference

A pixel-wise subtraction is carried out between the original and decoded disparity images to obtain the disparity difference image.

III-A3 Layered Representation

Pixels with the same value in the disparity difference image are masked with the same color and collected into a pair of layers in the original and decoded views, respectively. Pixels with different values are masked with different colors and represented with different layers. The corresponding location sets are then easily obtained by visiting these layers. An example is shown in Fig. 4, where pixels with three different disparity difference values are masked with blue, gray, and red colors and represented with three pairs of layers, each containing its corresponding location sets. It should be noticed that all the following operations are performed at the layer level.

III-A4 Forward Warping

The layered pixels are forward warped to the original and decoded warped views according to the original and decoded disparity images, respectively. The warped location sets are obtained and represented with different pairs of layers.

III-A5 Fusion

Each pair of warped location sets is merged by a union operation to obtain the fused sets.

III-A6 Inverse Warping

The fused sets are inversely warped back to the original and decoded left reference views to generate their associated location sets, which are represented with different pairs of layers.

III-A7 MSE

Pixels at the corresponding locations in the original and decoded reference views are used to calculate each S-VSD via a layer-level MSE calculation, similar to that used in Eq. (5).
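As a minimal sketch (in our own notation, with the rounded disparity images of step III-A1 as inputs), the layering of steps III-A2 and III-A3 amounts to:

import numpy as np

def build_layers(disp_org, disp_dec):
    # Steps III-A2 and III-A3: pixel-wise disparity difference and one boolean
    # layer per level of depth/disparity change (full-pixel precision assumed).
    diff = np.round(disp_dec).astype(np.int32) - np.round(disp_org).astype(np.int32)
    return {int(level): (diff == level) for level in np.unique(diff)}

Each returned boolean mask is one layer, and all subsequent operations, up to the layer-level MSE of step III-A7, are applied to these masks rather than to individual pixels.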

 

Sequences  Resolution  Views  Frames  QP pairs (texture, depth)

BookArrival [22]  1024x768  9 (8, 10)  1 to 25  (15, 24), (20, 29), (25, 34), (30, 39), (35, 42), (40, 45), (45, 48)
Kendo [23]  1024x768  2 (1, 3)  1 to 25  (15, 24), (20, 29), (25, 34), (30, 39), (35, 42), (40, 45), (45, 48)
Balloons [23]  1024x768  2 (1, 3)  1 to 25  (15, 24), (20, 29), (25, 34), (30, 39), (35, 42), (40, 45), (45, 48)
NewsPaper [24]  1024x768  3 (2, 4)  1 to 25  (15, 24), (20, 29), (25, 34), (30, 39), (35, 42), (40, 45), (45, 48)
PoznanStreet [25]  1920x1080  4.5 (4, 5)  1 to 25  (15, 24), (20, 29), (25, 34), (30, 39), (35, 42), (40, 45), (45, 48)
PoznanHall2 [25]  1920x1080  6.5 (6, 7)  1 to 25  (15, 24), (20, 29), (25, 34), (30, 39), (35, 42), (40, 45), (45, 48)
UndoDancer [26]  1920x1080  3 (1, 5)  1 to 25  (15, 24), (20, 29), (25, 34), (30, 39), (35, 42), (40, 45), (45, 48)
GT-Fly [27]  1920x1080  7 (5, 9)  1 to 25  (15, 24), (20, 29), (25, 34), (30, 39), (35, 42), (40, 45), (45, 48)

TABLE I: Details of Test Sequences

III-B Optimization

As analyzed in Fig. 2, the marked pixels in (a) are first warped to an interval in (b). Due to the lossy compression of the depth image, the pixels at the same locations in (e) are mistakenly warped to a different interval in (f). From (b) to (f), this process can be regarded as a left-shift operation with a shift interval of one full pixel. To make sure that all the changed pixels are counted during the S-VSD calculation, the dark points are complemented, and these complemented pixels take the same depth values as their neighboring pixels. Thus, after inverse warping, the marked pixels in (d) can be regarded as a right extension of the marked pixels in (a), with an extension interval of one full pixel. Similarly, the marked pixels in (h) can be treated as a left extension of the marked pixels in (e), with an extension interval of one full pixel as well. According to this observation, the complicated forward warping, union, and inverse warping processes can be replaced by an extension process, which is a layer-level fusion operation. The optimized layer-based representation framework is shown in Fig. 3 (b).

III-B1 For the left view, the extension process can be performed as follows

If the disparity difference of a layer is of one sign, the location sets in the original and decoded reference views can be generated by right extension operations applied on that layer, with a full-pixel-precision extension interval; if the disparity difference is of the opposite sign, the two location sets can be generated by left extension operations applied on that layer, with a full-pixel-precision extension interval; otherwise (no disparity difference), the remaining pixels directly form the two location sets.

III-B2 For the right view, the extension process is opposite

If the disparity difference of a layer is of the first sign, the location sets in the original and decoded reference views can be generated by left extension operations applied on that layer, with a full-pixel-precision extension interval; if it is of the opposite sign, the two location sets can be generated by right extension operations; otherwise, the remaining pixels directly form the two location sets.

Therefore, instead of performing the complicated forward warping, union, and inverse warping processes, a fusion-like operation is carried out at the layer level to generate the S-VSDs, which makes the method more efficient.
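A minimal sketch of the optimized pipeline is given below. The rule that ties the extension direction to the sign of the disparity difference, the extension by the magnitude of that difference, and the use of one common extended set for the original and decoded textures are our illustrative assumptions, not the exact rules of Section III-B:

import numpy as np

def shift_mask(mask, dx):
    # Shift a boolean mask horizontally by dx pixels without wrap-around.
    out = np.zeros_like(mask)
    if dx > 0:
        out[:, dx:] = mask[:, :-dx]
    elif dx < 0:
        out[:, :dx] = mask[:, -dx:]
    else:
        out = mask.copy()
    return out

def svsd_left_view(tex_org, tex_dec, disp_org, disp_dec):
    # Layer-level S-VSD generation for the left view with the extension
    # operation of Fig. 3 (b); directions below are illustrative assumptions.
    diff = np.round(disp_dec).astype(np.int32) - np.round(disp_org).astype(np.int32)
    svsd = {}
    for level in np.unique(diff):
        if level == 0:
            continue                      # no warped position shifting
        layer = (diff == level)
        ext = layer.copy()
        step = 1 if level > 0 else -1     # assumed extension direction
        for s in range(1, abs(int(level)) + 1):
            ext |= shift_mask(layer, step * s)
        a = tex_org[ext].astype(np.float64)
        b = tex_dec[ext].astype(np.float64)
        svsd[int(level)] = float(np.mean((a - b) ** 2))
    return svsd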

IV Experimental Results

In this paper, three state-of-the-art methods, namely Yuan et al. [5], Fang et al. [17], and Jin et al. [11], are chosen as anchors in our comparisons. The training and testing data are first generated by calculating the VSD and its associated S-VSDs with the traditional MSE calculation and the proposed layer-based representation, respectively. After that, two main experiments are conducted: i) accuracy comparisons and ii) efficiency comparisons.

IV-A Training and testing data generation

Here, 8 test sequences from the Common Test Conditions (CTC) of the JCT-3V [21] are used. Each sequence has left and right reference views, and each reference view consists of a texture video and its associated depth video. The first 25 frames of the reference views are compressed with 7 QP pairs and then used to synthesize the virtual views. Therefore, there are 32 (8x4) original videos (original left texture, original left depth, original right texture, and original right depth) and 224 (8x4x7) compressed videos. The details of these test sequences are exhibited in TABLE I, which includes the sequence resolutions, the view positions (where, e.g., 9 (8, 10) denotes that view 9 is synthesized from views 8 and 10), the indices of the used frames, and the recommended QP pairs for the texture and depth videos.

To obtain the training and testing data, the VSD (ground truth) is first obtained by computing the MSE of Eq. (2) between the original synthesized views and the compressed synthesized views, which are synthesized with the original reference views and the compressed reference views, respectively. Then, 1400 (8x25x7) VSD results are obtained. After that, the proposed layer-based representation is conducted on the original and compressed reference views to obtain the S-VSDs. As aforementioned, the three-sigma rule is used to determine the available number of levels of depth changes according to the depth distortion of each frame; we obtain 937 frames with one available level and 463 frames with three available levels. Then, the VSD results and their associated S-VSDs are divided into training and testing data with a ratio of 2:1. To be fair, the division is randomly performed three times, yielding three groups of training and testing data.

 

(Frames) Method MSE ΔMSE PSNR (dB) ΔPSNR (dB) Time (s)

 

Test 1  (313) GT 6.4743 / 41.4288 / /
Fang 7.6905 1.2162 40.3488 1.0800  0.6682
Yuan 10.2094 3.7351 39.4684 1.9603 32.7745
Jin 7.2745 0.8002 40.5898 0.8390 3.3692
Ours  6.3341  0.1402  41.4292  0.0004 1.3349
 (155) GT 29.2056 / 34.2465 / /
Fang 30.4578 1.2522 33.9720 0.2745  0.5815
Yuan 48.6340 19.4284 31.9915 2.2550 27.6567
Jin 29.6714 0.4657 34.0862 0.1603 3.0297
Ours  28.9729  0.2327  34.2446  0.0019 1.7949
Average (468) GT 14.0029 / 39.0500 / /
Fang 15.2310 1.2281 38.2368 0.8132  0.6395
Yuan 22.9355 8.9326 36.9921 2.0579 31.0795
Jin 14.6923 0.6894 38.4358 0.6142 3.2568
Ours  13.8320  0.1708  39.0497  0.0003 1.4873

 

Test 2  (313) GT 6.6590 / 41.3541 / /
Fang 7.9168 1.2577 40.2716 1.0825  0.6687
Yuan 10.5356 3.8766 39.3893 1.9649 32.9871
Jin 7.4636 0.8045 40.4954 0.8587 3.3976
Ours  6.5804  0.0786  41.3442  0.0100 1.3389
 (155) GT 29.5211 / 34.2191 / /
Fang 30.7591 1.2379 33.9340 0.2852  0.5794
Yuan 49.7187 20.1976 31.9193 2.2999 26.8840
Jin 30.0273 0.5062 34.0644 0.1548 3.0102
Ours  29.3131  0.2081  34.2078  0.0113 1.7700
Average (468) GT 14.2309 / 38.9911 / /
Fang 15.4821 1.2512 38.1726 0.8184  0.6391
Yuan 23.5129 9.2820 36.9152 2.0758 30.9658
Jin 14.9366 0.7057 38.3655 0.6256 3.2693
Ours  14.1094  0.1215  38.9806  0.0104 1.4817

 

Test 3  (313) GT 7.0173 / 41.1795 / /
Fang 8.2793 1.2619 40.1180 1.0614  0.6528
Yuan 11.1017 4.0844 39.2342 1.9453 31.7427
Jin 7.8104 0.7930 40.3350 0.8445 3.2600
Ours  6.9255  0.0918  41.1620  0.0175 1.3047
 (155) GT 30.7228 / 34.1629 / /
Fang 31.9127 1.1899 33.8774 0.2856  0.5771
Yuan 51.3203 20.5976 31.9013 2.2616 27.3320
Jin 31.1067 0.3839 34.0060 0.1570 2.9837
Ours  30.4047  0.3181  34.1480  0.0149 1.7749
Average (468) GT 14.8685 / 38.8556 / /
Fang 16.1066 1.2381 38.0511 0.8045  0.6277
Yuan 24.4220 9.5535 36.8056 2.0500 30.2819
Jin 15.5260 0.6575 38.2388 0.6168 3.1685
Ours  14.7017  0.1668  38.8390  0.0166 1.4604

 

TABLE II: Comparison of the Four Methods in Terms of MSE Prediction, PSNR Prediction, and Time Cost.

 

Entries Settings

 

booster gbtree
objective reg:gamma
gamma 0.1
max_depth 16
lambda 3
subsample 0.7
colsample_bytree 0.7
min_child_weight 3
silent 1
eta 0.1
seed 1000
nthread 4

 

TABLE III: Hyperparameter Settings of XGBoost
Fig. 5: The comparison between the ground truth MSE & PSNR and four predicted MSE & PSNR provided by four different methods. (a), (b), and (c) are the results of three groups of training and testing data. The red patches are the magnification of local parts of the curve.

For each group of such data, we have 932 training samples (624 with one available level of depth changes and 308 with three available levels) and 468 testing samples (313 with one level and 155 with three levels). The detailed settings and hyperparameters of XGBoost are shown in TABLE III. The training data are fed into the XGBoost system to train the nonlinear function. With the well-learnt nonlinear function and the S-VSDs of the testing data, the VSD can be accurately predicted. In our experiments, three testing results are generated by training and testing on the three groups of data.
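The training step can be sketched with the native XGBoost API as follows; the hyperparameters are those of TABLE III, while the number of boosting rounds and the array shapes are our assumptions:

import numpy as np
import xgboost as xgb

# Hyperparameters taken from TABLE III ("silent" may map to "verbosity" in
# newer XGBoost releases); num_boost_round is our assumption.
PARAMS = {
    "booster": "gbtree", "objective": "reg:gamma", "gamma": 0.1,
    "max_depth": 16, "lambda": 3, "subsample": 0.7,
    "colsample_bytree": 0.7, "min_child_weight": 3, "silent": 1,
    "eta": 0.1, "seed": 1000, "nthread": 4,
}

def train_nlm(svsd_train, vsd_train, svsd_test, num_boost_round=200):
    # svsd_*: arrays of shape (num_samples, num_S-VSDs); vsd_train holds the
    # ground-truth frame-level MSE of each training sample.
    dtrain = xgb.DMatrix(np.asarray(svsd_train), label=np.asarray(vsd_train))
    model = xgb.train(PARAMS, dtrain, num_boost_round=num_boost_round)
    return model.predict(xgb.DMatrix(np.asarray(svsd_test)))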

IV-B Accuracy comparison: MSE and PSNR

As shown in TABLE II, the best and the second best results are highlighted in red and blue. Compared with the three anchors, the proposed method achieves the best prediction results in both MSE and PSNR, i.e., the smallest ΔMSE and ΔPSNR, where Δ denotes the absolute difference between the ground truth and the predicted result.

Besides, we also compare the ground-truth MSE and PSNR with the predicted MSEs and PSNRs of the four methods over all testing frames. As shown in Fig. 5, the experiments are conducted on the three groups of training and testing data, and their results are shown in Fig. 5 (a), (b), and (c), respectively. The proposed method achieves the closest results to the ground truth in both MSE and PSNR.

All these experimental results demonstrate that the well-learnt nonlinear mapping function can accurately represent the relationship between the VSD and its associated S-VSDs, which plays a critical role in view synthesis distortion estimation/prediction. With such a well-learnt nonlinear mapping function, once the S-VSDs are given, their associated VSD can be accurately predicted. On the one hand, this can facilitate the optimization of 3-D video coding by figuring out the exact contribution of each kind of S-VSD to the VSD. On the other hand, as the S-VSDs are represented by different levels of depth changes, it can also help us design an optimal depth codec by increasing or decreasing different levels of depth changes to bring in the smallest VSD. To the best of our knowledge, the existing methods reviewed in subsection I-C can hardly achieve this.

IV-C Efficiency comparison: running time

In this subsection, the complexities of the four methods are compared in terms of estimating the VSD of an entire frame. The average running time over all 468 testing frames is shown in TABLE II, where the unit is second (s). The running time of the proposed method consists of two parts: the S-VSD generation, and the nonlinear mapping (NLM) training and testing; the ratio of these two parts is about 1000:1 in our test. According to the experimental results, the proposed method is competitive with the most efficient state-of-the-art method (i.e., Fang's [17] method) in terms of efficiency.

Of note, the proposed method is friendly to parallel processing: each layer can be processed independently during the S-VSD calculation, e.g., by a separate thread of the CPU or GPU. Besides, all the anchors except Fang's [17] method are also friendly to parallel processing. After taking the advantages of a parallel design into account, our method outperforms the state-of-the-art methods in terms of efficiency, owing to the layer-level operations during the S-VSD calculation.
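As an illustration of this point only (the names and the thread-based scheme are our own, and a practical implementation could use GPU kernels instead), each layer's MSE can be dispatched to its own worker:

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def svsd_parallel(tex_org, tex_dec, layer_masks, workers=4):
    # Each (extended) layer mask is independent, so its layer-level MSE can be
    # computed by a separate worker.
    def layer_mse(mask):
        a = tex_org[mask].astype(np.float64)
        b = tex_dec[mask].astype(np.float64)
        return float(np.mean((a - b) ** 2)) if a.size else 0.0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(layer_mse, layer_masks))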

V Conclusion

In this paper, we have proposed an auto-weighted layer representation based view synthesis distortion estimation model for 3-D video coding. To achieve this, the level of depth changes and the associated texture degeneration have been used to define the sub view synthesis distortion (S-VSD). After that, a set of theoretical derivations has demonstrated that the VSD can be approximately decomposed into the S-VSDs multiplied by their associated weights. We have also developed a layer-based representation of the S-VSD, where all the pixels with the same level of depth changes are represented with a layer to enable efficient S-VSD calculation. Meanwhile, we have learnt a nonlinear mapping function to better represent the relationship between the VSD and the S-VSDs based on our newly built dataset. Experimental results have demonstrated that the proposed method outperforms the relevant state-of-the-art VSD estimation methods in both accuracy and efficiency. Besides, unlike existing VSD estimation methods, this is the first work to relate different levels of depth changes to the VSD. This allows many new applications to be developed for 3-D video coding in our future work, such as optimizing 3-D coding by figuring out the exact contribution of the S-VSDs to the VSD, and building a more efficient depth codec by increasing or decreasing different levels of depth changes to bring in the smallest VSD.

References

  • [1] P. Merkle, A. Smolic, K. Muller, and T. Wiegand, “Multi-view video plus depth representation and coding,” in Proc. IEEE International Conference on Image Processing, San Antonio, TX, USA, Sep. 2007, pp. I.201–I.204.
  • [2] C. Fehn, “Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV,” Proc. SPIE, vol. 5291, pp. 93–105, May 2004.
  • [3] H. Yuan, Y. Chang, J. Huo, F. Yang, and Z. Lu, “Model-based joint bit allocation between texture videos and depth maps for 3-d video coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 4, pp. 485–497, 2011.
  • [4] F. Shao, G. Jiang, M. Yu, K. Chen, and Y.-S. Ho, “Asymmetric coding of multi-view video plus depth based 3-D video for view rendering,” IEEE Transactions on Multimedia, vol. 14, no. 1, pp. 157–167, 2011.
  • [5] H. Yuan, S. Kwong, X. Wang, Y. Zhang, and F. Li, “A virtual view psnr estimation method for 3-d videos,” IEEE Transactions on Broadcasting, vol. 62, no. 1, pp. 134–140, 2016.
  • [6] P. Gao and W. Xiang, “Rate-distortion optimized mode switching for error-resilient multi-view video plus depth based 3-d video coding,” IEEE Transactions on Multimedia, vol. 16, no. 7, pp. 1797–1808, 2014.
  • [7] Z. Zheng, J. Huo, B. Li, and H. Yuan, “Fine virtual view distortion estimation method for depth map coding,” IEEE Signal Processing Letters, vol. 25, no. 3, pp. 417–421, 2017.
  • [8] W.-S. Kim, A. Ortega, P. Lai, D. Tian, and C. Gomila, “Depth map distortion analysis for view rendering and depth coding,” in 2009 16th IEEE International Conference on Image Processing (ICIP).   IEEE, 2009, pp. 721–724.
  • [9] J. Jin, A. Wang, Y. Zhao, C. Lin, and B. Zeng, “Region-aware 3-D warping for DIBR,” IEEE Transactions on Multimedia, vol. 18, no. 6, pp. 953–966, 2016.
  • [10] Y. Zhang, S. Kwong, L. Xu, S. Hu, G. Jiang, and C.-C. J. Kuo, “Regional bit allocation and rate distortion optimization for multiview depth video coding with view synthesis distortion model,” IEEE Transactions on Image Processing, vol. 22, no. 9, pp. 3497–3512, 2013.
  • [11] J. Jin, J. Liang, Y. Zhao, C. Lin, C. Yao, and L. Meng, “Pixel-level view synthesis distortion estimation for 3d video coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 7, pp. 2229–2239, 2019.
  • [12] P. Gao and A. Smolic, “Occlusion-aware depth map coding optimization using allowable depth map distortions,” IEEE Transactions on Image Processing, vol. 28, no. 11, pp. 5266–5280, 2019.
  • [13] Y. Zhou, C. Hou, W. Xiang, and F. Wu, “Channel distortion modeling for multi-view video transmission over packet-switched networks,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 11, pp. 1679–1692, 2011.
  • [14] G. Cheung, J. Ishida, A. Kubota, and A. Ortega, “Transform domain sparsification of depth maps using iterative quadratic programming,” in 2011 18th IEEE International Conference on Image Processing.   IEEE, 2011, pp. 129–132.
  • [15] D. Zhang and J. Liang, “View synthesis distortion estimation with a graphical model and recursive calculation of probability distribution,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 5, pp. 827–840, 2014.
  • [16] J. Jin, J. Liang, Y. Zhao, C. Lin, C. Yao, and A. Wang, “A depth-bin-based graphical model for fast view synthesis distortion estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 6, pp. 1754–1766, 2018.
  • [17] L. Fang, N.-M. Cheung, D. Tian, A. Vetro, H. Sun, and O. C. Au, “An analytical model for synthesis distortion estimation in 3d video,” IEEE Transactions on Image Processing, vol. 23, no. 1, pp. 185–199, 2013.
  • [18] S. Li, C. Zhu, and M.-T. Sun, “Hole filling with multiple reference views in dibr view synthesis,” IEEE Transactions on Multimedia, vol. 20, no. 8, pp. 1948–1959, 2018.
  • [19] J. Jin, A. Wang, Y. Zhao, C. Lin, and B. Zeng, “Region-aware 3-d warping for dibr,” IEEE Transactions on Multimedia, vol. 18, no. 6, pp. 953–966, 2016.
  • [20] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 2016, pp. 785–794.
  • [21] Joint Collaborative Team for 3DV (Mar. 2013), 3D-HTM Software Platform [Online]. Available: https://hevc.hhi.fraunhofer.de/svn/svn3DVCSoftware/tags/
  • [22] Fraunhofer Heinrich Hertz Institute (Sep. 2013), 3DV Sequences of HHI [Online]. Available: ftp://ftp.hhi.de/HHIMPEG3DV
  • [23] Nagoya University (March 2008), 3DV Sequences of Nagoya University [Online]. Available: http://www.tanimoto.nuee.nagoya-u.ac.jp/mpeg/mpeg-ftv.html
  • [24] “3DV sequences of ETRI and GIST,” Electron. Telecommun. Res. Institute Gwangju Inst. Sci. Technol., Daejon, Korea, Apr. 2008. [Online]. Available: ftp://203.253.130.48
  • [25] M. Domański, T. Grajek, K. Klimaszewski, M. Kurc, O. Stankiewicz, J. Stankowski, and K. Wegner, “Poznan multiview video test sequences and camera parameters,” document MPEG 2009/M17050, ISO/IEC JTC1/SC29/WG11, Xian, China, Oct. 2009.
  • [26] D. Rusanovskyy, P. Aflaki, and M.M. Hannuksela, “Undo Dancer 3DV sequence for purposes of 3DV standardization,” ISO/IEC JTC1/SC29/WG11, Doc. M20028, Geneva, CH, March 2011.
  • [27] J. Zhang, R. Li, H. Li, D. Rusanovskyy and M.M. Hannuksela, “Ghost Town Fly 3DV sequence for purposes of 3DV standardization,” ISO/IEC JTC1/SC29/WG11, Doc. M20027, Geneva, CH, March 2011.
  • [28] J. Jin, Y. Zhao, C. Lin, and A. Wang, “An accurate and efficient nonlinear depth quantization scheme,” Pacific Rim Conference on Multimedia, Springer, pp. 390–399, 2015.
  • [29] L. Zhao, A. Wang, B. Zeng and J. Jin, “Scalable Coding of Depth Images with Synthesis-Guided Edge Detection,” KSII Transactions on Internet and Information Systems, vol. 9, no. 10, pp. 4108-4125, 2015.
  • [30] S. Tian, L. Zhang, L. Morin and O. Déforges, “NIQSV+: A no-reference synthesized view quality assessment metric,” IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 1652–1664, 2017.
  • [31] S. Tian, L. Zhang, L. Morin, and O. Déforges, “A benchmark of DIBR synthesized view quality assessment metrics on a new database for immersive media applications,” IEEE Transactions on Multimedia, vol. 21, no. 5, pp. 1235–1247, May 2019.
  • [32] S. Ling, J. Li, Z. Che, W. Zhou, J. Wang, and P. Le Callet, “Re-visiting discriminator for blind free-viewpoint image quality assessment,” IEEE Transactions on Multimedia, vol. 23, pp. 4245–4258, 2020.