Quality Assessment of DIBR-synthesized views: An Overview

11/16/2019 ∙ by Shishun Tian, et al. ∙ Shenzhen University

Depth-Image-Based-Rendering (DIBR) is one of the fundamental techniques for generating new views in 3D video applications, such as Multi-View Video (MVV), Free-Viewpoint Video (FVV) and Virtual Reality (VR). However, the quality assessment of DIBR-synthesized views is quite different from that of traditional 2D images/videos. In recent years, several efforts have been made on this topic, but a detailed survey is still lacking in the literature. In this paper, we provide a comprehensive survey of current approaches to the quality assessment of DIBR-synthesized views. The currently accessible datasets of DIBR-synthesized views are reviewed first, followed by a summary and analysis of the representative state-of-the-art objective metrics. Then, the performance of different objective metrics is evaluated and discussed on all available datasets. Finally, we discuss the potential challenges and suggest possible directions for future research.

I Introduction

By providing observers with more immersive experiences and depth perception, 3D applications such as Multi-View Video (MVV) and Free-Viewpoint Video (FVV) have drawn great public attention in recent years. These 3D applications allow users to view the same scene from various angles, which results in huge information redundancy and costs tremendous bandwidth or storage space. To reduce this burden, researchers transmit and store only a subset of the views and synthesize the others at the receiver by using the Multiview-Video-Plus-Depth (MVD) data format and Depth-Image-Based-Rendering (DIBR) techniques [1]. The MVD data format includes only a limited number of viewpoints (both texture images and depth maps); the other views are synthesized through DIBR.

This MVD plus DIBR scenario greatly reduces the burden on the storage and transmission of 3D video content. However, the DIBR view synthesis technique also raises new challenges in the quality assessment of virtual synthesized views. During the DIBR process, the pixels in the texture image at the original viewpoint are back-projected to the real 3D space and then re-projected to the target virtual viewpoint using the depth map, which is called 3D image warping in the literature. As shown in Fig. 1, DIBR view synthesis can be divided into two parts: 3D image warping and hole filling. During the 3D image warping procedure, the pixels in the original view are warped to the corresponding positions in the target view. Owing to the change of viewpoint, some objects which are invisible in the original view may become visible in the target one. This is called dis-occlusion, and it causes black holes in the synthesized view. The second step is to fill these black holes, which can be done with typical image in-painting algorithms. Most image in-painting algorithms use the pixels around a “black hole” to search for similar regions in the same image, and then use these similar regions to fill the hole. Due to imprecise depth maps and imperfect image in-painting methods, various distortions may be introduced which are quite different from the traditional ones in 2D images/videos. Most of the 2D objective quality metrics [2, 3, 4, 5, 6, 7, 8, 9], which focus on the traditional distortions, fail to evaluate the quality of DIBR-synthesized views. Subjective testing is the most accurate and reliable way to assess the quality of media content, since human observers are the ultimate users in most applications. Subjective tests provide datasets along with subjective quality scores, and objective metrics are designed to mathematically model and predict these scores. In other words, an ideal objective model is expected to be consistent with the subjective results. Since subjective testing is time consuming and not suitable for real-time applications, effective objective metrics are highly desired.

Fig. 1: Procedure of DIBR
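The two DIBR steps described above can be illustrated with a minimal one-dimensional sketch. It is not the procedure of any specific algorithm cited in this paper: it assumes rectified cameras, so that warping reduces to a horizontal shift by an integer disparity, and it uses a deliberately naive hole filling that copies the nearest valid pixel. The function names and the sign convention of the shift are hypothetical.

```python
def warp_row(texture_row, disparity_row, hole=None):
    """Forward-warp one image row by a per-pixel integer disparity.

    Assumed convention: a pixel at x moves to x - disparity. When two pixels
    land on the same target position, the one with the larger disparity
    (closer to the camera) wins. Unreached positions stay `hole`.
    """
    width = len(texture_row)
    target = [hole] * width
    best_disparity = [-1] * width
    for x, (value, d) in enumerate(zip(texture_row, disparity_row)):
        xt = x - d
        if 0 <= xt < width and d > best_disparity[xt]:
            target[xt] = value
            best_disparity[xt] = d
    return target


def fill_holes(row, hole=None):
    """Naive hole filling: copy the nearest valid pixel on the left, else on the right."""
    filled = list(row)
    for x in range(len(filled)):
        if filled[x] is hole:
            left = next((filled[i] for i in range(x - 1, -1, -1)
                         if filled[i] is not hole), None)
            right = next((row[i] for i in range(x + 1, len(row))
                          if row[i] is not hole), None)
            filled[x] = left if left is not None else right
    return filled


# A bright foreground object (disparity 2) in front of a dark background (disparity 0).
texture = [10, 10, 10, 200, 200, 10, 10, 10]
disparity = [0, 0, 0, 2, 2, 0, 0, 0]
warped = warp_row(texture, disparity)   # dis-occluded positions remain None
filled = fill_holes(warped)             # the object is stretched over the hole
```

In this toy row the foreground object moves by two pixels and leaves a two-pixel dis-occlusion behind it; the naive filling then stretches the object, a simple instance of the local geometric distortion that 2D metrics are not designed for.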

Although several efforts have been made towards the objective quality assessment of DIBR-synthesized views in recent years, to the best of our knowledge, there is no detailed survey of these works in the current literature. In this paper, we provide a comprehensive survey of the quality assessment approaches for DIBR-synthesized views, ranging from subjective to objective methods. The main contributions can be summarized as follows: (1) the state-of-the-art metrics are introduced and classified based on the approaches they use; (2) the contributions, advantages and disadvantages of the metrics are analysed in depth; (3) the performance of these metrics is evaluated on different datasets, and the essential reasons for their performance on different types of distortions are analysed; (4) the limitations of current works are discussed and possible directions for future research are given.

The rest of this paper is organized as follows. The subjective methods are first surveyed in Section II. Section III introduces the state-of-the-art objective quality metrics in detail. The experimental results are presented and discussed in Section IV. Finally, the conclusions are given in Section V.

II Subjective image/video quality assessment of DIBR-synthesized views

Subjective testing is the most direct method for image/video quality assessment. During the test, a group of human observers is asked to rate the quality of each tested image or video. The results obtained from the subjective ratings are taken as the quality of the tested images/videos. Different subjective test methodologies acquire the subjective scores in different ways.

The Absolute Category Rating (ACR) method used in the IVC image/video datasets [10, 11] presents the test sequences to the observers in random order and asks them to rate each one on a five-level quality scale (excellent, good, fair, poor, bad). The subjective quality scores are calculated by simply averaging the ratings. The Single Stimulus Continuous Quality Evaluation (SSCQE) used in the SIAT [12] dataset allows the observer to rate on a continuous scale instead of a discrete five-level scale. The IVY [13] image dataset uses the Double Stimulus Continuous Quality Scale (DSCQS), in which the test image and its associated reference image are presented in succession. It is usually used when the test and reference images are similar. The Pairwise Comparison (PC) method directly performs a one-to-one comparison of every image pair in the dataset. It is the most accurate and reliable way to obtain subjective quality scores, but it is very time consuming since all the image pairs need to be tested. The Subjective Assessment Methodology for VIdeo Quality (SAMVIQ) method used in the IETR dataset achieves much higher accuracy than the ACR method for the same number of observers and costs less time than PC, since it allows the observer to freely view several images multiple times and uses a continuous rating scale. Besides, the IVY [13], IETR [14] and SIAT datasets normalize the obtained scores to z-scores to make the results more intuitive, whereas the IVC and MCL-3D [15] datasets directly use the average scores. In addition to the subjective test methodology, as shown in Table I, the datasets differ in their sequences, DIBR algorithms, etc. In the following, we introduce each of them in detail.
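The two score-processing options mentioned above, plain averaging and z-score normalization, can be sketched as follows. This is only an illustration under stated assumptions: the ratings are stored as one list per observer, and the z-scores are computed per observer (each observer's mean and standard deviation), which is a common convention but is not specified by the datasets' descriptions here. The function names are hypothetical.

```python
from statistics import mean, pstdev


def mean_opinion_scores(ratings):
    """ratings[observer][stimulus] -> average rating per stimulus."""
    n_stimuli = len(ratings[0])
    return [mean(observer[s] for observer in ratings) for s in range(n_stimuli)]


def z_scores(ratings):
    """Normalize every observer's ratings by that observer's own mean and std.

    Assumes each observer used at least two distinct ratings (non-zero std).
    """
    normalized = []
    for observer in ratings:
        mu, sigma = mean(observer), pstdev(observer)
        normalized.append([(r - mu) / sigma for r in observer])
    return normalized


# Two observers rating three stimuli on a five-level scale.
ratings = [[5, 3, 1],
           [4, 4, 1]]
mos = mean_opinion_scores(ratings)
z = z_scores(ratings)
z_mos = mean_opinion_scores(z)   # average of the normalized scores
```

Normalizing before averaging removes differences in how individual observers use the rating scale, so a strict and a lenient observer contribute comparably to the final score.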

Name | Sequences (resolution) | Methodology | DIBR algorithms (year) | Other distortions | No. PVS (1) | Reference | Display
IVC DIBR-image | BookArrival, Lovebird1, Newspaper (1024×768) | ACR (2), PC (3) | Fehn’s (2004), Telea’s (2003), VSRS (2009), Müller (2008), Ndjiki-Nya (2010), Köppel (2010), Black hole | None | 84 | Original | 2D
IVC DIBR-video | idem | ACR (2) | idem | H.264 (QP: 26, 34, 44) | 93 | Original | 2D
IETR-image | BookArrival, Lovebird1, Newspaper, Balloons, Kendo (1024×768); Dancer, Shark, Poznan_Street, PoznanHall2, GT_fly (1920×1088) | SAMVIQ (4) | Criminisi (2004), VSRS (2009), LDI (2011), HHF (2012), Ahn’s (2013), Luo’s (2016), Zhu’s (2016) | None | 140 | Original | 2D
IVY image | Aloe (1280×1100); Dolls, Reindeer, Laundry (1300×1100); Lovebird1, Newspaper, BookArrival (1024×768) | DSCQS (5) | Criminisi (2004), Ahn’s (2013), VSRS (2009), Yoon (2014) | None | 84 | Original | Stereo.
MCL-3D image | Kendo, Lovebird1, Balloons (1024×768); Dancer, Shark, Poznan_Street, PoznanHall2, GT_fly, Microworld (1920×1088) | PC (3) | Fehn’s (2004), Telea’s (2003), HHF (2012), Black hole | Additive white noise, Blur, Down sampling, JPEG, JPEG2k, Transmission loss | 684 | Synthesized | Stereo.
SIAT video | BookArrival, Balloons, Kendo, Lovebird1, Newspaper (1024×768); Dancer, PoznanHall2, Poznan_Street, GT_fly, Shark (1920×1088) | SSCQE (6) | VSRS (2009) | 3DV-ATM (QP: 24-50) | 140 | Original | 2D

  • (1) PVS: Processed Video Sequences.

  • (2) ACR: Absolute Category Rating.

  • (3) PC: Pairwise Comparison.

  • (4) SAMVIQ: Subjective Assessment Methodology for VIdeo Quality.

  • (5) DSCQS: Double Stimulus Continuous Quality Scale.

  • (6) SSCQE: Single Stimulus Continuous Quality Evaluation.

TABLE I: Summary of existing DIBR related datasets

II-A IVC DIBR datasets

The IVC DIBR-image dataset [10] was proposed by Bosc et al. in 2011. It contains 84 DIBR-synthesized view images generated by 7 DIBR algorithms [1, 16, 17, 18, 19, 20, 21]. Three Multi-view plus Depth (MVD) sequences, BookArrival, Lovebird1 and Newspaper, are used as the source content. MVD sequences contain several viewpoints acquired with real cameras in a rectified configuration capturing the same scene, together with a depth image for each acquired view. For each sequence, 4 virtual views are synthesized from the adjacent viewpoint by using the above algorithms. Note that in this dataset, virtual views were only generated by single-view-based synthesis, which means that each virtual view is synthesized from only one image and its associated depth map. The IVC DIBR-video dataset [11] uses almost the same contents and methodologies, except that it adds H.264 compression distortion (with 3 quantization levels) to each test sequence. In other words, this dataset contains 93 distorted videos, 84 of which contain only DIBR view synthesis distortions. As some of the first DIBR related datasets, the IVC datasets played an important role in the first research phase of this topic. However, because of the fast development of DIBR view synthesis algorithms, some of the distortions in these datasets no longer occur with state-of-the-art view synthesis algorithms.

II-B IETR DIBR image dataset

Similar to the IVC datasets, the IETR dataset [14] is also dedicated to investigating DIBR view synthesis distortions. Compared to the IVC datasets, it uses more and newer DIBR view synthesis algorithms [22, 23, 24, 25, 26, 17, 27], including both inter-view synthesis and single-view-based synthesis, to exclude some “old fashioned” distortions, such as “black holes”. The inter-view DIBR algorithms use two neighbouring views to synthesize the virtual viewpoint. In addition, the IETR dataset also uses more MVD sequences, of which 7 sequences (BookArrival, Lovebird1, Newspaper, Balloons, Kendo, Poznan_Street and PoznanHall2) are natural images and 3 sequences (Dancer, Shark and GT_fly) are computer animation images. It contains 140 synthesized view images and their 10 associated reference images, which are images captured by real cameras at the virtual viewpoints.

II-C IVY stereoscopic image dataset

Jung et al. proposed the IVY stereoscopic 3D image dataset for the quality assessment of DIBR-synthesized stereoscopic images [13]. Different from the above two datasets, besides the DIBR view synthesis distortion, the IVY dataset also explores binocular perception by showing the synthesized image pairs on a stereoscopic display. A total of 7 sequences are selected, three of which (Lovebird1, Newspaper and BookArrival) are MVD sequences. In this dataset, 84 stereo images are synthesized by four DIBR algorithms [22], [24], [28], [29]. All the virtual view images in the IVY dataset are generated by single-view-based synthesis.

II-D MCL-3D image dataset

Song et al. proposed the MCL-3D stereoscopic image dataset [15] to evaluate the quality of DIBR-synthesized stereoscopic images. Although 4 DIBR algorithms are included, the number of images synthesized by these algorithms is quite limited (36 pairs). The major part of this dataset focuses on the traditional distortions in the synthesized views. Six types of traditional distortions are considered: additive white noise, Gaussian blur, down sampling blur, JPEG, JPEG2000 and transmission loss. Nine MVD sequences are collected, among which Kendo, Lovebird1, Balloons, Poznan_Street and PoznanHall2 are natural images; Dancer, Shark, GT_fly and Microworld are Computer Graphics images. For each sequence, these traditional distortions are first applied to the base views. Then, the left and right view images are synthesized from these distorted base view images by using the view synthesis reference software (VSRS) [17]. Different from the above IVC, IETR and IVY datasets, the reference images in the MCL-3D dataset are the images synthesized from undistorted base view images instead of the ones captured by real cameras.

II-E SIAT synthesized video dataset

The SIAT synthesized video dataset [12] focuses on the distortions caused by compressed texture and depth images in the synthesized views. It uses the same 10 MVD sequences as the IETR image dataset. For each sequence, 4 different texture and depth image quantization levels and their combinations are applied to the base views. Then, the videos at the virtual viewpoints are synthesized using the VSRS-1D-Fast software [30]. This dataset uses the real images (captured by real cameras at the virtual viewpoint) as references. Only inter-view synthesis is used in this dataset.

In the above datasets, the distortions in the DIBR-synthesized views come not only from the DIBR view synthesis algorithms, but also from the distorted texture and depth images. The IVC [31, 10, 11], IVY [13] and IETR [14] datasets focus on the distortions caused by different DIBR view synthesis algorithms, while the MCL-3D [15] and SIAT [12] datasets explore the influence of traditional 2D distortions of the original texture and depth map on the DIBR-synthesized views. These datasets are commonly used to evaluate and validate quality metrics. In the next section, we introduce the objective approaches for the quality assessment of DIBR-synthesized views.

III Objective image/video quality assessment of DIBR-synthesized views

Several methods have been proposed to evaluate the quality of DIBR-synthesized views in the past decade. Based on the amount of reference information, these methods can be divided into 4 categories: Full-reference (FR), Reduced-reference (RR), Side View based Full-reference (SV-FR) and No-reference (NR), as shown in Fig. 2. The FR methods use the original undistorted image/video at the virtual viewpoint as reference to assess the quality of synthesized views, while the RR methods only use some features extracted from the original reference. In particular, the SV-FR methods use the undistorted image/video at the original viewpoint, from which the virtual view is synthesized, as the reference. The NR methods need no access to the original image/video.

(a) FR metrics (b) RR metrics (c) Side view based FR metrics (d) NR metrics
Fig. 2: Categories of quality assessment metrics for DIBR-synthesized views.

Table II classifies the metrics based on the approaches they use. Most of them (VSQA, MP-PSNR, MW-PSNR, EM-IQA and CT-IQA) evaluate the quality of synthesized views by considering the contour or gradient degradation between the synthesized and the reference images, which is one of the most annoying characteristics of geometric distortions. Meanwhile, some metrics (DSQM, 3DSwIM) calculate the quality score by comparing perceptual features extracted from the synthesized and the reference images. In particular, the APT metric uses a local image description model to reconstruct the synthesized image, and evaluates the quality of the synthesized view based on the reconstruction error. These metrics are introduced as follows.

Metric / Approach HF DF C/G JND MSD LID DE DR SE SC IC ML
FR Bosc et al. 2012 [32] - - - - - - - - - - -
VSQA [33] - - - - - - - - - - -
3DSwIM [34] - - - - - - - - - -
MW-PSNR [35, 36] - - - - - - - - - -
MP-PSNR [37] - - - - - - - - - -
CT-IQA [38] - - - - - - - - - - -
ST-SIAQ [39] - - - - - - - - -
EM-IQA [40] - - - - - - - - -
PSPTNR [41] - - - - - - - - - - -
VQA-SIAT [12] - - - - - - - - - -
SR-3DVQA [42] - - - - - - - - -
3VQM [43] - - - - - - - - - - -
SDRD [44] - - - - - - - - - -
SCDM [45] - - - - - - - - - -
SC-IQA [46] - - - - - - - - - - -
CBA [13] - - - - - - - - - - -
Zhou [47] - - - - - - - -
Ling [48] - - - - - - - - -
Wang [49] - - - - - - - -
RR MP-PSNRr [50] - - - - - - - - - -
MW-PSNRr [50] - - - - - - - - - -
RRLP [51] - - - - - - - - -
SV-FR LOGS [52] - - - - - - - - -
DSQM [53] - - - - - - - - - -
SIQE [54] - - - - - - - - - -
SIQM [55] - - - - - - - - - -
NR APT [56] - - - - - - - - - - -
OUT [57] - - - - - - - - - - -
MNSS [58] - - - - - - - - -
NR_MWT [59] - - - - - - - -
NIQSV [60] - - - - - - - - - -
NIQSV+ [61] - - - - - - - - - -
HEVSQP[62] - - - - - - - - - -
CLGM [63] - - - - - - - - -
GDIC [64] - - - - - - - - -
Wang [65] - - - - - - - -
SET [66] - - - - - - - -
FDI [67] - - - - - - - - - -
CSC-NRM [68] - - - - - - - - - - -
SIQA-CFP [69] - - - - - - - - - -
GANs-NRM [70] - - - - - - - - - -
TABLE II: Overview of the existing metrics. The abbreviations in the header row indicate hand-crafted feature (HF), deep feature (DF), contour/gradient (C/G), just noticeable difference (JND), multi-scale decomposition (MSD), local image description (LID), depth estimation (DE), dis-occlusion region (DR), sharpness evaluation (SE), shift compensation (SC), image complexity (IC) and machine learning (ML).

III-A FR and RR metrics

In this subsection, we review 19 well-known FR metrics and 3 RR metrics.

III-A1 Edge/Contour based FR metrics

The distortions in DIBR-synthesized views are mostly geometrical and structural distortions, which may degrade the object shapes in the synthesized image. This degradation can be measured by the change of object edges. In addition, the sharp edges in the depth map may also induce large dis-occlusions in the synthesized views, which may result in dramatic distortions. Thus, a few edge-based methods have been proposed to evaluate the quality of DIBR-synthesized views.

The FR metric proposed by Bosc et al. in [32] indicates the structural degradations by calculating the contour displacement between the synthesized and the reference images. Firstly, a Canny edge detector is used to extract the image contours; then, the contour displacements between the synthesized and reference images are estimated. Based on the contour displacement map, three parameters are computed: the mean ratio of inconsistent displacement vectors per contour pixel, the ratio of inconsistent vectors, and the ratio of new contours. The final quality score is obtained as a weighted sum of these three parameters.

In [39], Ling et al. proposed a contour-based FR metric, ST-SIAQ, for the quality assessment of DIBR-synthesized views. Instead of directly using the contour information as in [32], ST-SIAQ uses a mid-level contour descriptor called “Sketch Token” [71]. The “Sketch Token” acts as a codebook of image contour representations, in which each dimension can be interpreted as the probability that the current patch belongs to a certain category of contour from the codebook. To reduce the shifting effect in the feature comparison stage, the patches in the reference image are first matched to the synthesized image. The “Sketch Token” is clustered into 151 categories, which means the “Sketch Token” descriptor has 151 dimensions. A Random Forests decision model associated with a set of low-level features (including oriented gradient channels [72], color channels, and self-similarity channels [73]) is used to obtain the “Sketch Token” descriptor. The geometric distortion strength in the synthesized view is calculated as the Kullback-Leibler divergence of the “Sketch Token” descriptors between the synthesized and reference images. In [74], this metric is extended to evaluate the quality of DIBR-synthesized videos by considering the temporal dissimilarity.
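The last step above, comparing two descriptors with the Kullback-Leibler divergence, can be sketched as follows. This is not the ST-SIAQ implementation: the descriptors here are short hypothetical vectors rather than 151-dimensional “Sketch Token” outputs, and the direction of the divergence (reference relative to synthesized) and the small constant `eps` that guards against division by zero are assumptions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) between two non-negative vectors.

    Both vectors are normalized to sum to 1 first. Terms with p_i = 0
    contribute nothing; q_i is floored at `eps` to avoid division by zero.
    """
    p_sum, q_sum = sum(p), sum(q)
    total = 0.0
    for p_i, q_i in zip(p, q):
        p_i, q_i = p_i / p_sum, q_i / q_sum
        if p_i > 0:
            total += p_i * math.log(p_i / max(q_i, eps))
    return total


# Hypothetical 4-category contour descriptors of two matched patches.
reference_descriptor = [0.7, 0.1, 0.1, 0.1]
synthesized_descriptor = [0.4, 0.3, 0.2, 0.1]
distortion = kl_divergence(reference_descriptor, synthesized_descriptor)
```

The divergence is zero when the two descriptors agree and grows as the contour category probabilities of the synthesized patch drift away from those of the reference patch.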

Ling et al. also proposed another contour-based FR metric, EM-IQA, in [40]. Different from the ST-SIAQ metric, EM-IQA uses interest point matching and an elastic metric [75], instead of block matching and the “Sketch Token” descriptor, to compensate for the shifting and to evaluate the contour degradation respectively. After interest point matching, Simple Linear Iterative Clustering (SLIC) is used to extract the contours in the image. SLIC was originally proposed for image segmentation; in the EM-IQA metric, the boundaries of the segmented objects are considered as contours. Then, the elastic metric proposed in [75, 76] is used to measure the degradation between the contours of the synthesized and reference images, which provides the quality score of the DIBR-synthesized view.

In [38], Ling et al. proposed a variable-length context tree based image quality assessment metric, CT-IQA, dedicated to quantifying the overall structure dissimilarity and the dissimilarities in various contour characteristics. Firstly, the contours of the reference and synthesized images are converted to differential chain code (DCC) [77], which represents the direction of object contours. Then, an optimal context tree [78] is learned from the DCC of the reference image. The overall structural dissimilarity is calculated as the difference between the encoding costs of the DCC in the synthesized and reference images. In addition, the overall dissimilarity in contour characteristics is obtained by measuring the differences in the total number of contours, the total contour start information and the total number of symbols between the reference and synthesized images. The final quality score is calculated by combining the overall structure dissimilarity and the contour characteristics dissimilarity.
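To make the contour representation above more concrete, the sketch below converts a pixel contour into a Freeman chain code and then into a differential chain code. It assumes the common 8-direction convention in which each DCC symbol is the change of direction between consecutive steps, taken modulo 8; the exact symbol alphabet used in [77] may differ, and the function names are hypothetical.

```python
# Freeman directions: index -> (dx, dy), starting at +x in steps of 45 degrees.
DIRECTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]


def chain_code(points):
    """Direction index of every step along an 8-connected contour."""
    return [DIRECTIONS.index((x1 - x0, y1 - y0))
            for (x0, y0), (x1, y1) in zip(points, points[1:])]


def differential_chain_code(code):
    """Change of direction between consecutive steps, modulo 8."""
    return [(b - a) % 8 for a, b in zip(code, code[1:])]


# An L-shaped contour: two steps along +x, then two steps along +y.
contour = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
code = chain_code(contour)
dcc = differential_chain_code(code)
```

Straight contour segments produce runs of zeros in the DCC, so a geometric distortion that bends or breaks a contour changes the symbol statistics and hence the encoding cost under a model learned from the reference contours.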

Liu et al. proposed a gradient-based FR video quality assessment metric, VQA-SIAT [12], by considering the “Activity” and the “Flickering”, the latter being the most annoying temporal distortion in DIBR-synthesized views. The main contribution of this metric is the two following structures: the Quality Assessment Group of Pictures (QA-GoP) and the Spatio-Temporal (S-T) tube. The QA-GoP acts as a processing unit on a whole video sequence; it contains a group of 2N+1 frames (N frames before and N frames after the central frame). Besides, a block matching method is used to search, in the forward and backward frames, for the blocks corresponding to the blocks of the central frame. The 2N+1 blocks along the motion trajectory construct an S-T tube. The “Activity” distortion is calculated from the difference of the spatial gradient in the S-T tube and QA-GoP between the synthesized and reference videos. The “Flickering” distortion is measured from the difference of the temporal gradient, which is defined below:

$$\vec{\nabla} I^{temporal}_{x,y,i} = I(x, y, i) - I(x', y', i-1) \quad (1)$$

where $(x, y)$ is the coordinate in frame $i$ corresponding to $(x', y')$ along the motion trajectory in the previous frame $i-1$. The final quality score of the DIBR-synthesized video is obtained by integrating both the “Activity” and “Flickering” distortions.
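The temporal gradient along a motion trajectory can be sketched as follows. This is a simplified illustration rather than the VQA-SIAT implementation: the motion vectors are assumed to be given per pixel (the metric itself obtains them by block matching), positions outside the frame are clamped to the border, and the pooling in `flicker_difference` (mean absolute difference of the two gradients) is an assumption made only to produce a single number.

```python
def temporal_gradient(curr, prev, motion):
    """Difference between each pixel of `curr` and its matched pixel in `prev`.

    curr, prev: 2D lists holding frame i and frame i-1.
    motion[y][x] = (dx, dy): offset from (x, y) to the matched pixel in `prev`.
    """
    h, w = len(curr), len(curr[0])
    grad = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = motion[y][x]
            xp = min(max(x + dx, 0), w - 1)   # clamp to the frame border
            yp = min(max(y + dy, 0), h - 1)
            grad[y][x] = curr[y][x] - prev[yp][xp]
    return grad


def flicker_difference(syn_grad, ref_grad):
    """Mean absolute difference between two temporal gradient maps."""
    diffs = [abs(s - r) for s_row, r_row in zip(syn_grad, ref_grad)
             for s, r in zip(s_row, r_row)]
    return sum(diffs) / len(diffs)


# One-row frames whose content moves one pixel to the right between frames.
motion = [[(-1, 0), (-1, 0), (-1, 0)]]
ref_grad = temporal_gradient([[1, 1, 2]], [[1, 2, 3]], motion)
syn_grad = temporal_gradient([[9, 1, 2]], [[1, 2, 3]], motion)   # flicker at x = 0
score = flicker_difference(syn_grad, ref_grad)
```

Because the gradient is taken along the motion trajectory, pure object motion yields a zero gradient, and only intensity changes that do not follow the motion, such as flicker, remain.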

Furthermore, in [42], Zhang et al. proposed an FR metric, SR-3DVQA, which combines the “Activity” measurement module of VQA-SIAT with a sparse representation-based flicker estimation method. In the SR-3DVQA metric, a DIBR-synthesized video is treated as 3D volume data by stacking the frames sequentially. Then, the volume data is decomposed into a number of spatially neighboring temporal layers, i.e. X-T or Y-T planes, where X and Y are the spatial coordinates and T is the temporal coordinate. In order to effectively evaluate the flicker distortion in the synthesized video, the gradient in the temporal layers and the sharp edges in the associated depth map are extracted as key features for the dictionary learning and sparse representation. The rank-based method in [52] is used to pool the flicker score from the temporal layers. The final quality score is calculated by combining the flicker score and the “Activity” score of the previous VQA-SIAT [12].

Jakhetiya et al. proposed a free-energy-principle-based IQA metric RRLP for Screen Content and DIBR-synthesized view images based on prediction model and distortion categorization [51]. The image quality is measured by calculating the disorder and sharpness similarity between the distorted and reference images. The disorder is obtained from a prediction model. As shown in Eq. 2, an observation-model-based bilateral filter (OBF) [79] is firstly used to divide the predicted and disorder parts.

(2)

where represents the predicted part, and are respectively the pixels and their associated weights in the surrounding window of the th pixel, is a parameter. The disorder part is computed as the difference between the predicted part and the original image:

(3)

Then, the sharpness (edge structures) is calculated by the four filters in [80]. Finally, the disorder and sharpness similarities between the distorted and reference images are estimated by using the similarity function in SSIM [2].

III-A2 Wavelet transform based FR metrics

In the previous part, we introduced the metrics that use the edges/contours in the luminance domain to evaluate the geometric distortions in DIBR-synthesized views. According to previous research, the wavelet transform representation can capture not only the image edges, but also other kinds of texture unnaturalness. In this part, the wavelet transform based FR metrics are reviewed.

Battisti et al. proposed an FR metric (3DSwIM) for DIBR-synthesized views based on the comparison of statistical features of wavelet sub-bands [34, 81]. Like EM-IQA [40] and VQA-SIAT [12], 3DSwIM uses a matching step, here block matching, to ensure “shifting-resilience”. The distortion in each block of the synthesized view is measured by the Kolmogorov–Smirnov [82] distance between the histograms of the two matched blocks. In addition, since the Human Visual System (HVS) pays more attention to the human body, a skin detector is used to weight the skin regions in the matched blocks.
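The Kolmogorov–Smirnov distance between two histograms is the largest gap between their cumulative distributions. The sketch below shows this computation on plain bin counts; it does not reproduce the wavelet decomposition, block matching or skin weighting of 3DSwIM, and the bin counts are hypothetical values chosen for illustration.

```python
def ks_distance(hist_a, hist_b):
    """Maximum absolute difference between two normalized cumulative histograms.

    hist_a, hist_b: bin counts over the same bins (each with a non-zero total).
    """
    total_a, total_b = sum(hist_a), sum(hist_b)
    cdf_a = cdf_b = 0.0
    distance = 0.0
    for count_a, count_b in zip(hist_a, hist_b):
        cdf_a += count_a / total_a
        cdf_b += count_b / total_b
        distance = max(distance, abs(cdf_a - cdf_b))
    return distance


# Hypothetical histograms of sub-band coefficients in two matched blocks.
reference_hist = [2, 10, 30, 10, 2]
synthesized_hist = [8, 14, 18, 10, 4]
block_distortion = ks_distance(reference_hist, synthesized_hist)
```

The result lies between 0 (identical distributions) and 1 (completely separated distributions), which makes it convenient to pool over blocks.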

Sandić-Stanković et al. proposed another multi-scale decomposition based FR metric, MW-PSNR [36, 35]. MW-PSNR uses morphological wavelet filters for the decomposition. Then, a multi-scale wavelet mean square error (MW-MSE) is calculated as the average MSE over all sub-bands, and finally the MW-PSNR is calculated from it.

The wavelet transform based FR metrics can be recognized as a kind of edge/contour based metrics. For example, the higher sub-bands of the wavelet transformed image represent the edge information of the original image. Compared to the pixel level edge/contour used in the previous subsection, the metrics in this subsection use the features in wavelet transformed domain to represent both the image edges and other characteristics.

III-A3 Morphological operation based FR metrics

Morphological operations are widely used in image processing; in particular, a pair of erosion and dilation operations can be used to detect the image edges [83]. In [37], Sandić-Stanković et al. proposed MP-PSNR, based on a multi-scale pyramid decomposition using morphological filters. The basic dilation and erosion operations used in MP-PSNR are calculated as the maximum and minimum, respectively, in the neighbourhood defined by the structuring element, as shown in the following equations:

$$D: \; (f \oplus SE)(x, y) = \max_{(i, j) \in SE} f(x + i, y + j) \quad (4)$$
$$E: \; (f \ominus SE)(x, y) = \min_{(i, j) \in SE} f(x + i, y + j) \quad (5)$$

where $f$ is a gray-scale image and $SE$ is a binary structuring element. Then, they use the Mean Square Error (MSE) between the reference and synthesized images in all the pyramid sub-bands to quantify the distortion. As shown in Fig. 3, during the decomposition, the dilation is used as the expand operation and the erosion is used as the reduce operation; the detail image of each scale is calculated as the difference between the original and the processed (eroded and dilated) images. Finally, the overall quality is calculated by averaging the MSE of the detail images over all the sub-bands and expressing it as PSNR.

Fig. 3: Decomposition scheme of MP-PSNR. represents the image at scale (), represent the detail image at scale [37].
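The decomposition scheme can be sketched in a few lines of plain Python. This is a toy illustration under several assumptions, not the MP-PSNR reference implementation: a 3×3 square structuring element, down-sampling by a factor of 2, pixel replication for up-sampling, and the inclusion of the coarsest approximation as one more band. The function names are hypothetical.

```python
import math


def _morph(img, op):
    """Apply `op` (max = dilation, min = erosion) over a 3x3 neighbourhood."""
    h, w = len(img), len(img[0])
    return [[op(img[j][i]
                for j in range(max(y - 1, 0), min(y + 2, h))
                for i in range(max(x - 1, 0), min(x + 2, w)))
             for x in range(w)] for y in range(h)]


def reduce_level(img):
    """Reduce: erosion followed by down-sampling by 2."""
    return [row[::2] for row in _morph(img, min)[::2]]


def expand_level(img, h, w):
    """Expand: up-sampling by pixel replication followed by dilation."""
    up = [[img[y // 2][x // 2] for x in range(w)] for y in range(h)]
    return _morph(up, max)


def detail_images(img, levels):
    """Detail image per scale (image minus expanded reduced image), then the coarsest image."""
    bands = []
    for _ in range(levels):
        h, w = len(img), len(img[0])
        small = reduce_level(img)
        approx = expand_level(small, h, w)
        bands.append([[img[y][x] - approx[y][x] for x in range(w)] for y in range(h)])
        img = small
    bands.append(img)
    return bands


def mp_psnr_sketch(ref, syn, levels=2, peak=255.0):
    """Average the band-wise MSE between the two pyramids and express it as PSNR."""
    mses = []
    for band_ref, band_syn in zip(detail_images(ref, levels), detail_images(syn, levels)):
        sq = [(a - b) ** 2 for ra, rb in zip(band_ref, band_syn) for a, b in zip(ra, rb)]
        mses.append(sum(sq) / len(sq))
    mean_mse = sum(mses) / len(mses)
    return math.inf if mean_mse == 0 else 10 * math.log10(peak ** 2 / mean_mse)


# A vertical edge in the reference that is shifted by one pixel in the synthesized view.
ref = [[10, 10, 200, 200] for _ in range(4)]
syn = [[10, 200, 200, 200] for _ in range(4)]
quality = mp_psnr_sketch(ref, syn)
```

Since erosion and dilation are non-linear, the detail images respond strongly to displaced edges, which is why a one-pixel shift of the edge in this toy example already lowers the score noticeably.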

In [50], Sandić-Stanković et al. also proposed reduced versions of MP-PSNR and MW-PSNR, denoted MP-PSNRr and MW-PSNRr. Only the detail images from the higher decomposition scales are taken into account to measure the difference between the synthesized image and the reference image. The reduced versions achieved a significant improvement over the original FR metrics with lower computational complexity.

III-A4 Depth estimation based FR metrics

Solh et al. proposed a full reference metric 3VQM [43] to evaluate synthesized view distortions by deriving an “ideal” depth map from the virtual synthesized view and the reference view at a different viewpoint. The “ideal” depth is the depth map that would generate the distortion-free image given the same reference image and DIBR parameters.

Three distortion measurements, spatial outliers, temporal outliers and temporal inconsistency are calculated from the difference between the “ideal” depth map and the distorted depth map:

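$$SO = \mathrm{STD}(\Delta Z) \quad (6)$$
$$TO = \mathrm{STD}(\Delta Z_{t+1} - \Delta Z_{t}) \quad (7)$$
$$TI = \mathrm{STD}(Z_{t+1} - Z_{t}) \quad (8)$$

where $SO$, $TO$ and $TI$ denote the spatial outliers, temporal outliers and temporal inconsistencies respectively, $\mathrm{STD}$ represents the standard deviation, $\Delta Z$ is the difference between the “ideal” and the distorted depth maps and $t$ is the frame number. These three measurements are then integrated into a final quality score.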

Since the calculation of the “ideal” depth map is based on the assumption that the horizontal shift of the synthesized view and the original view is small, this metric would not work well when the baseline distance increases.

III-A5 Dis-occlusion region based FR metrics

Since the DIBR view synthesis distortions mainly occur in the dis-occlusion regions, some FR metrics improve the performance of 2D FR metrics by using dis-occlusion maps [44, 45] instead of weighting maps.

The SDRD metric proposed by Zhou in [44] detects the dis-occlusion regions by simply computing the absolute difference between the synthesized and reference images. Before that, a self-adaptive scale transform model is used to eliminate the effect of viewing distance, and a SIFT flow-based warping is adopted to compensate for the global shift in the synthesized view image. The final quality score is obtained by weighting the dis-occlusion regions by their size, since larger distortions are more annoying to the human visual system.

Tian et al. proposed a full-reference quality assessment model (SCDM) for 3D synthesized views by considering global shift compensation and dis-occlusion regions [45]. This model can be applied to any pixel-based FR metric. SCDM first compensates for the shift by using a SURF + RANSAC approach instead of the SIFT flow used in SDRD. Then, the dis-occlusion regions are directly extracted from the depth map. It is more precise and uses more resources compared to SDRD. The final quality score is obtained as a weighted PSNR or weighted SSIM. It is reported to improve the performance of PSNR and SSIM by 36.85% and 13.33% respectively in terms of the Pearson Linear Correlation Coefficient (PLCC).
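The idea of a dis-occlusion weighted PSNR can be sketched as below. The actual weighting used by SCDM is not given in this section, so the scheme here is an assumption made purely for illustration: pixels inside the dis-occlusion mask receive a larger weight (`mask_weight`, a hypothetical parameter) in the mean square error, and the images are assumed to be already shift-compensated.

```python
import math


def weighted_psnr(ref, syn, mask, mask_weight=4.0, peak=255.0):
    """PSNR computed from a weighted MSE that emphasizes masked pixels.

    ref, syn: 2D lists of pixel values; mask: 2D list of 0/1 flags that
    mark dis-occluded pixels. Masked pixels get `mask_weight`, others 1.
    """
    weighted_error = total_weight = 0.0
    for ref_row, syn_row, mask_row in zip(ref, syn, mask):
        for r, s, m in zip(ref_row, syn_row, mask_row):
            weight = mask_weight if m else 1.0
            weighted_error += weight * (r - s) ** 2
            total_weight += weight
    mse = weighted_error / total_weight
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


ref = [[100, 100], [100, 100]]
syn = [[100, 110], [100, 100]]          # a single distorted pixel
inside = [[0, 1], [0, 0]]               # the error lies in a dis-occluded region
outside = [[1, 0], [0, 0]]              # the error lies outside of it
score_inside = weighted_psnr(ref, syn, inside)
score_outside = weighted_psnr(ref, syn, outside)
```

With such a weighting, the same pixel error lowers the score more when it falls inside a dis-occluded region, which reflects the local nature of DIBR distortions.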

The distortions in DIBR-synthesized views are not restricted to the dis-occlusion regions; they may occur around these regions as well. In [49], Wang et al. therefore proposed a critical region based metric, in which the dis-occlusion region is dilated with a morphological operator. Similar to SDRD, the dis-occlusion region map is extracted by a SIFT-flow based approach. Then, a Discrete Cosine Transform (DCT) decomposition method is used to partition and classify the critical regions into edge blocks, texture blocks and smooth blocks. Based on the perceptual properties of these three types of blocks, their distortions are measured differently. The edge and texture blocks contain more complex edge or texture information, so blur distortions in these regions are much more annoying than in the smooth regions. On the other hand, the smooth regions are sensitive to color degradations. Thus, the texture similarity between the synthesized and reference images is used to measure the local distortions in the edge and texture blocks, and the color contrast similarity is used for the smooth blocks. Finally, a global sharpness detection is combined with the local distortion measurement to obtain the overall quality score.

III-A6 2D related FR metrics

The main reasons for the ineffectiveness of 2D quality assessment metrics on DIBR-synthesized views can be analysed as follows. Firstly, there exists a large object shift in the synthesized views, and this kind of shift is easily penalized by 2D metrics even though the HVS is not sensitive to a global shift in the image. The second reason is the distribution of distortions. The distortions in traditional 2D images are often scattered over the whole image, while the DIBR view synthesis distortions are mostly local, especially in the dis-occluded regions. The 2D related metrics are based on the traditional 2D FR metrics, such as PSNR and SSIM. They try to improve the performance of 2D metrics by considering the HVS and the characteristics of DIBR view synthesis distortions.

The VSQA metric proposed by Conze et al. in [33] tries to improve the performance of SSIM [2] by taking advantage of known characteristics of the human visual system (HVS). It aims to handle areas where disparity estimation may fail, such as thin objects, object borders and transparency, by applying three weighting maps to the SSIM distortion map. The main purpose of these three weighting maps is to characterize the image complexity in terms of textures, diversity of gradient orientations and presence of high contrast, since the sensitivity of the HVS to distortions depends on these characteristics. For example, the distortions in an untextured area are much more annoying than the ones located in a high texture complexity area. It is reported that this method achieves a gain of 17.8% over SSIM in correlation with subjective measurements.

Zhao et al. proposed the PSPTNR metric to measure the perceptual temporal noise of the synthesized sequence [41]. The temporal noise is defined as the difference between the inter-frame change in the processed sequence and that in the reference sequence:

$$TN_{n} = (P_{n} - P_{n-1}) - (R_{n} - R_{n-1})$$ (9)

where $TN_n$ indicates the temporal noise at frame $n$, and $P$ and $R$ represent the distorted and reference sequences respectively. In order to better predict the perceptual quality of synthesized videos, the temporal noise is filtered by a Just Noticeable Distortion (JND) model [84] and a motion mask [85], since humans can observe noise only beyond a certain level and motion may decrease the texture sharpness in the video.
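The temporal noise definition above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the single scalar `visibility_threshold` is a crude stand-in for the JND model [84] and the motion mask [85], whose actual forms are not given here.

```python
import numpy as np

def temporal_noise(distorted, reference):
    """Temporal noise for two videos shaped (frames, height, width)."""
    d = np.asarray(distorted, dtype=np.float64)
    r = np.asarray(reference, dtype=np.float64)
    # Inter-frame change in each sequence (frame n minus frame n-1).
    d_change = np.diff(d, axis=0)
    r_change = np.diff(r, axis=0)
    # Noise is the part of the change that the reference does not explain.
    return d_change - r_change

def masked_noise_energy(noise, visibility_threshold):
    """Mean squared noise after discarding values assumed to be invisible.

    Assumption: a constant threshold replaces the JND model and motion mask.
    """
    visible = np.where(np.abs(noise) > visibility_threshold, noise, 0.0)
    return float(np.mean(visible ** 2))
```

A flicker that appears in one frame and vanishes in the next contributes twice (once when it appears, once when it disappears), which is why a static reference with a one-frame artefact still yields a non-zero energy.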

The shift compensation methods included in SDRD and SCDM only consider the global shift, but according to recent research [86], the Human Visual System (HVS) is more sensitive to local artefacts than to a global object shift. In [46], Tian et al. proposed a shift-compensation based image quality assessment metric (SC-IQA) for DIBR-synthesized views. As in SCDM, a SURF + RANSAC approach is used to roughly compensate for the global shift. In addition, a multi-resolution block matching method is proposed to precisely compensate for the global shift and penalize the local shift at the same time. A saliency map [87] is also used to weight the distortion map of the synthesized view. Furthermore, only the blocks with the worst quality are used to calculate the final quality score, since the HVS tends to perceive poor regions in an image with more severity than good ones [86, 12]. SC-IQA achieves performance comparable to that of SCDM without access to the depth map.

The metrics introduced above consider only the view synthesis and compression artefacts that occur in applications showing the synthesized views on a 2D display; the binocular effect in the synthesized stereoscopic images is not taken into consideration. In [13], Jung et al. proposed an SSIM-based FR metric to measure the critical binocular asymmetry (CBA) in the synthesized stereo images. Firstly, the disparity inconsistency between the two views is computed to detect the critical areas in terms of left-right image mismatches. Then, the SSIM values are computed only on the critical areas of each view to measure the asymmetry in the corresponding view image. The final binocular asymmetry score is obtained by averaging the asymmetry scores of the left and right views.

III-B Side view based FR metrics

The major limitation of the FR metrics is that they always need the reference view, which may be unavailable in some circumstances (e.g. FVV). In other words, there is no ground truth for a full comparison with the distorted synthesized view. In this part, four side view based FR metrics will be reviewed. These metrics use the real image/video at the original viewpoint, from which the virtual view is synthesized, as the reference to evaluate the quality of DIBR-synthesized virtual views. They are named side view based FR metrics in this paper.

Li et al. proposed a side view based FR metric for DIBR-synthesized views by measuring local geometric distortions in dis-occluded regions and global sharpness (LOGS) [52]. This metric consists of three parts. Firstly, the dis-occlusion regions are detected by using SIFT-flow based warping. These dis-occluded regions are extracted from the absolute difference map between the synthesized view and the warped reference view followed by an additional threshold. Then, the distortion size and strength in the local dis-occlusion regions are combined to obtain the overall local geometric distortion. The distortion size is simply measured by the number of pixels in the dis-occluded regions and the distortion strength is defined as the mean value of the dis-occluded regions in the whole difference map. The next part is to measure the global sharpness by using a reblurring-based method. The synthesized image is first blurred by a Gaussian smoothing filter. Both the synthesized image and its reblurred version are divided into blocks. The sharpness of each block is calculated by its textural complexity, which is represented by its variance. Then, the overall sharpness score is computed by averaging the textural distance of all blocks. Finally, the local geometric distortion and the global sharpness are pooled to generate the final quality score.
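The reblurring-based sharpness step can be illustrated as follows. This is a minimal sketch, not the LOGS implementation: the kernel size, `sigma`, the block size, and the choice of the absolute variance gap as the "textural distance" are all assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 1-D Gaussian kernel (size must be odd)."""
    ax = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(image, size=5, sigma=1.0):
    """Separable Gaussian blur with edge replication."""
    k = gaussian_kernel(size, sigma)
    pad = size // 2
    padded = np.pad(np.asarray(image, dtype=np.float64), pad, mode="edge")
    # Filter along rows, then along columns; "valid" removes the padding.
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def reblur_sharpness(image, block=8):
    """Average per-block variance gap between an image and its reblurred copy."""
    img = np.asarray(image, dtype=np.float64)
    blurred = gaussian_blur(img)
    h, w = img.shape
    distances = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            v_orig = img[y:y + block, x:x + block].var()
            v_blur = blurred[y:y + block, x:x + block].var()
            # Assumed textural distance: absolute difference of block variances.
            distances.append(abs(v_orig - v_blur))
    return float(np.mean(distances))
```

The intuition is that a sharp image loses much of its block variance when blurred again, whereas an already blurred image changes little, so a larger score indicates a sharper input.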

Farid et al. proposed a side view based FR metric (DSQM) for the DIBR-synthesized view in [53]. Block matching is first used to estimate the shift between the reference and synthesized images. Then the difference of phase congruency (PC) between the two matched blocks is used to measure the quality of the block in the synthesized image. The phase congruency is defined as follows:

$$PC(x) = \max_{\bar{\phi}(x) \in [0, 2\pi]} \frac{\sum_{n} A_n \cos\left(\phi_n(x) - \bar{\phi}(x)\right)}{\sum_{n} A_n}$$ (10)

where $A_n$ and $\phi_n(x)$ represent the amplitude and the local phase of the $n$-th Fourier component at position $x$ respectively. The implementation of phase congruency is based on a logarithmic Gabor wavelet method proposed in [88]. The quality score of each block is calculated as the absolute difference between the mean values of the phase congruency maps of the matched blocks in the synthesized and reference image:

$$Q_i = \left| \mu(PC_s^i) - \mu(PC_r^i) \right|$$ (11)

where $\mu(\cdot)$ represents the mean value of the corresponding phase congruency map, and $PC_s^i$ and $PC_r^i$ indicate the PC maps of the $i$-th matched blocks in the synthesized and reference images. The final image quality is obtained by averaging the quality scores of all the blocks.
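The block-level pooling described above can be sketched as follows, assuming the phase congruency maps and the block matches have already been computed elsewhere. The function names and the `matches` format (pairs of top-left block coordinates) are hypothetical choices for this example.

```python
def block_mean(pc_map, top, left, size):
    """Mean of a size x size block of a 2-D map given as a list of rows."""
    values = [v for row in pc_map[top:top + size] for v in row[left:left + size]]
    return sum(values) / len(values)

def pooled_pc_score(pc_syn, pc_ref, matches, size):
    """Average absolute difference of mean phase congruency over matched blocks.

    matches: list of ((top_s, left_s), (top_r, left_r)) pairs linking a block
    in the synthesized image to its matched block in the reference image.
    """
    scores = [
        abs(block_mean(pc_syn, ts, ls, size) - block_mean(pc_ref, tr, lr, size))
        for (ts, ls), (tr, lr) in matches
    ]
    return sum(scores) / len(scores)
```

Because each block is compared with its matched counterpart rather than with the co-located one, a pure shift of content between the two images does not by itself change the score.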

Farid et al. proposed a Synthesized Image Quality Evaluator (SIQE) based on the cyclopean eye theory [89] and the divisive normalization (DN) transform [90] in [54]. The DIBR-synthesized view image and its associated left and right side views are first transformed by DN. Then, the statistical characteristics of the cyclopean image are estimated from the DN representations of the left and right side views, while the statistical characteristics of the synthesized image are obtained directly from the image itself. The similarity (Bhattacharyya coefficient [91]) between the distributions of the cyclopean and the synthesized images' DN representations is computed to measure the quality score of the synthesized image.
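The Bhattacharyya coefficient used as the similarity measure can be computed from two normalized histograms as shown below. The histogram helper, its bin count and its value range are assumptions for this sketch; the way SIQE builds its distributions from the DN coefficients is not reproduced here.

```python
import math

def normalized_histogram(values, bins, lo, hi):
    """Histogram of values over [lo, hi], normalized to sum to 1."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp so that out-of-range values fall into the first or last bin.
        idx = max(0, min(int((v - lo) / width), bins - 1))
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]

def bhattacharyya_coefficient(p, q):
    """Overlap of two discrete distributions: 1 if identical, 0 if disjoint."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))
```

A coefficient close to 1 means the two distributions overlap almost completely, which in this setting would indicate a synthesized image whose statistics are close to those estimated for the cyclopean image.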

The SIQE metric only considers the texture information. In [55], Farid et al. proposed an extended version of SIQE by considering both the texture and depth information. The depth distortion estimation is based on the fact that the edge regions in a depth image are more sensitive to noise than the flat homogeneous regions, since a distorted edge in the depth map may cause very annoying structural distortions in the synthesized image. Firstly, the pixels in the depth map with a high gradient value are extracted as noise sensitive pixels (NSP). Then, for each NSP, a local histogram from the distorted depth map is constructed and analysed to estimate the distortion in the depth image. The overall depth distortion is calculated by averaging the distortions in the left and right depth images. The final quality of the synthesized view is pooled from the texture and depth distortions.

III-C NR metrics

In this part, we will review the NR metrics which do not need ground truth images/videos to evaluate the quality of DIBR-synthesized views.

III-C1 Local image description based NR metrics

Due to the distorted depth map and imperfect rendering method, there exists a large number of structural and geometric distortions in the DIBR-synthesized views. As introduced in the RRLP metric [51], the structural distortions may result in local disorder in the image. Similarly, several local image description based NR metrics have been proposed to evaluate the structural distortions by measuring the local inconsistency via different models.

Gu et al. proposed an auto-regression (AR) based model (APT) to capture the geometric distortions in the DIBR-synthesized views. For each pixel, a local AR model (3×3) is first used to construct a relationship between this pixel and its neighbouring pixels:

$$x_i = \Omega(x_i)\, s + d_i$$ (12)

where $\Omega(x_i)$ denotes a vector which is composed of the neighbouring pixels of $x_i$ in the 3×3 patch, $s$ is a vector of AR parameters and $d_i$ represents the error difference between the current pixel value and its corresponding AR prediction. The AR parameters are solved on the assumption that the 7×7 local patch, which consists of the current pixel and its 48 adjacent pixels, shares the same AR model. The error difference map between the synthesized and the reconstructed images is obtained as the distortion map. Then, a Gaussian filter and a saliency map [92] associated with a maximum pooling are used to obtain the final image quality score. This method has a high computational cost.
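The local AR fit can be sketched with an ordinary least-squares solve. This is an illustration under stated assumptions rather than the APT implementation: the 8 neighbours of the 3×3 patch are the regressors, the 49 pixels of the 7×7 window provide the equations, reflective padding handles borders, and the function name is hypothetical.

```python
import numpy as np

# Offsets of the 8 neighbours of a pixel inside its 3x3 patch.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                     (0, 1), (1, -1), (1, 0), (1, 1)]

def ar_prediction_error(image, y, x, half_window=3):
    """Difference between pixel (y, x) and its local AR prediction.

    The AR parameters are fitted by least squares over the
    (2 * half_window + 1)^2 window centred on (y, x); every pixel in that
    window is regressed on its own 8 neighbours.
    """
    img = np.asarray(image, dtype=np.float64)
    pad = half_window + 1  # room for the window plus one ring of neighbours
    p = np.pad(img, pad, mode="reflect")
    cy, cx = y + pad, x + pad
    rows, targets = [], []
    for dy in range(-half_window, half_window + 1):
        for dx in range(-half_window, half_window + 1):
            py, px = cy + dy, cx + dx
            rows.append([p[py + oy, px + ox] for oy, ox in NEIGHBOUR_OFFSETS])
            targets.append(p[py, px])
    params, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    centre = np.array([p[cy + oy, cx + ox] for oy, ox in NEIGHBOUR_OFFSETS])
    return float(p[cy, cx] - centre @ params)
```

Solving one least-squares system per pixel is what makes this kind of model expensive on full-resolution images.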

Different from the APT metric, the OUT (outliers) metric [57] proposed by Jakhetiya et al. uses a median filter to calculate the difference map. Then, two thresholds are used to extract the structural and geometric distortion regions. The quality score is finally obtained from the standard deviation of the structural and geometric distortion regions.
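A median-filter based difference map of this kind can be sketched as below. The 3×3 window, the way the two thresholds are applied (as a lower and an upper bound on the difference) and the function names are assumptions for illustration; the exact rules of the OUT metric are not given in this survey.

```python
import numpy as np

def median_filter_3x3(image):
    """3x3 median filter with edge replication."""
    img = np.asarray(image, dtype=np.float64)
    padded = np.pad(img, 1, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (3, 3))
    return np.median(windows, axis=(-2, -1))

def outlier_score(image, low, high):
    """Standard deviation of the median-filter residuals kept by two thresholds.

    Assumption: residuals strictly between `low` and `high` are treated as
    the distortion region.
    """
    img = np.asarray(image, dtype=np.float64)
    diff = np.abs(img - median_filter_3x3(img))
    selected = diff[(diff > low) & (diff < high)]
    return float(selected.std()) if selected.size else 0.0
```

Compared with a per-pixel AR fit, a median filter needs no parameter estimation, which is consistent with the lower cost of this family of detectors.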

These local image description based metrics can only detect thin distortions or local noise; they do not work well on large distortions.

III-C2 Morphological operation based NR metrics

The morphological operations show their effectiveness in the FR metric MP-PSNR [37]. In [60, 61], Tian et al. proposed two metrics, NIQSV and NIQSV+, to detect the local thin structural distortions through morphological operations. These two metrics assume that the "perfect" image consists of flat areas and sharp edges, so such images are insensitive to the morphological operations while the local thin structural distortions can be easily detected by these morphological operations. The NIQSV metric first uses an opening operation to detect the thin distortions, followed by a closing operation with a larger Structural Element (SE) to fill the black holes. NIQSV+ extends NIQSV by proposing two additional measurements: black hole detection and stretching detection. The black hole distortion is estimated by counting the proportion of black hole pixels in the image, while the stretching distortion is evaluated by calculating the gradient decrease between the stretching region and its adjacent non-stretching region.

Due to the limitation of the assumption and the SE size, these two metrics do not work well on the distortions in complex texture and the distortions with large size.
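The opening-then-closing idea behind these metrics can be illustrated with grey-scale morphology on square structural elements. This is a sketch only: the SE sizes, the edge handling and the use of the absolute difference as the distortion map are assumptions, and the function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def _windows(image, size):
    """All size x size neighbourhoods (size odd), with edge replication."""
    return sliding_window_view(np.pad(image, size // 2, mode="edge"), (size, size))

def erode(image, size):
    return _windows(image, size).min(axis=(-2, -1))

def dilate(image, size):
    return _windows(image, size).max(axis=(-2, -1))

def opening(image, size):
    # Removes bright structures thinner than the structural element.
    return dilate(erode(image, size), size)

def closing(image, size):
    # Fills dark holes smaller than the structural element.
    return erode(dilate(image, size), size)

def thin_distortion_map(image, open_size=3, close_size=5):
    """Absolute change caused by an opening followed by a larger closing."""
    img = np.asarray(image, dtype=np.float64)
    return np.abs(img - closing(opening(img, open_size), close_size))
```

An image made of flat areas and sharp edges passes through both operations unchanged, while a one-pixel-wide structure is removed by the opening and therefore shows up in the map. The same property explains the limitation noted above: fine natural texture is also altered by the operations and is wrongly counted as distortion.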

III-C3 Sharpness detection based NR metric

Sharpness detection has been widely used in 2D image quality assessment [93, 94, 95] and also in the side view based FR metric LOGS [52]. In this part, we will introduce its usage in NR metrics. Sharpness is one of the most important measurements in NR image quality assessment [96, 97, 98]. The DIBR view synthesis may introduce multiple distortions, such as blur and geometric distortions around the object edges, which may significantly degrade sharpness.

Nonlinear morphological wavelet decomposition can extract high-pass image content while preserving the unblurred geometric structures [37, 36]. In the transform domain, geometry distorted areas introduced by DIBR-synthesis are characterized by coefficients of higher value compared to the coefficients of smooth, edge and textural areas. In [59], Sandić-Stanković et al. proposed a wavelet-based NR metric (NR_MWT) for the DIBR-synthesized view videos. The sharpness is measured by quantifying the high frequency components in the image, which are represented by the high-high wavelet sub-band. The final quality is obtained from the sub-band coefficients whose values are higher than a threshold. Similar to MW-PSNR and MP-PSNR [37, 36], NR_MWT also has a very low computational complexity.

Differently, in CLGM [63], the sharpness is measured as the distance of standard deviations between the synthesized image and its down-sampled version. Besides, two additional distortions, dis-occluded regions and stretching, are also taken into consideration in CLGM. The dis-occluded regions are detected through an analysis of local image similarity. Similar to NIQSV+ [61], the stretching distortion is estimated by computing the similarity between the stretching region and its adjacent non-stretching region.
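The standard-deviation based sharpness measure can be sketched as follows. The down-sampling method (2×2 block averaging) and the use of the absolute difference as the "distance" are assumptions for this example; the survey does not specify either.

```python
import numpy as np

def downsample_by_two(image):
    """Halve the resolution by averaging non-overlapping 2x2 blocks."""
    img = np.asarray(image, dtype=np.float64)
    h, w = img.shape[0] // 2, img.shape[1] // 2
    return img[:2 * h, :2 * w].reshape(h, 2, w, 2).mean(axis=(1, 3))

def std_distance_sharpness(image):
    """Absolute gap between the standard deviations at the two scales."""
    img = np.asarray(image, dtype=np.float64)
    return float(abs(img.std() - downsample_by_two(img).std()))
```

Fine detail is averaged away by the down-sampling, so an image rich in high-frequency content shows a large gap, while a blurred or flat image shows almost none.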

In [64], Wang et al. proposed an NR metric (GDIC) to measure the geometric distortions and image complexity. Firstly, different from the wavelet transform based metrics introduced above, the GDIC metric uses the edge map of wavelet sub-bands to obtain the shape of geometric distortions. Then, the geometric distortion is measured by the edge similarity between the wavelet low-level and high-level sub-bands [99]. Besides, the image complexity is also an important factor in human visual perception. In order to evaluate the image complexity of the DIBR-synthesized images, a hybrid filter [100, 101, 102], which combines the autoregressive (AR) and bilateral (BL) filters, is used. The final image quality score is computed by normalizing the geometric distortion with the image complexity. Furthermore, in [65], this metric is extended to achieve higher performance by adding a log-energy based sharpness detection module.

III-C4 Flicker region based video NR metrics

In DIBR-synthesized videos, temporal flicker is one of the most annoying distortions. Extracting the flicker regions may help to evaluate the quality of DIBR-synthesized videos.

In [103], Kim et al. proposed an NR metric (CTI) to measure the temporal inconsistency and flicker regions in the DIBR-synthesized video. First, the flicker regions are detected from the difference between motion-compensated consecutive frames. Then, the structural similarity between consecutive frames is calculated on the flicker regions to measure the structural distortions in each frame. At the same time, the number of pixels in the flicker regions is used to weight the distortion of each frame. The final quality score is obtained as the weighted sum of the quality scores of all the frames in the DIBR-synthesized video.

In [67], Zhou et al. proposed an NR metric (FDI) to measure the temporal flickering distortion in the DIBR-synthesized videos. Firstly, the gradient variations between frames are used to extract the potential flickering regions. These candidates are then refined by calculating the correlation between the candidate flickering regions and their neighbours, in order to obtain the flickering regions precisely. Then, the flickering distortion is estimated in the SVD domain from the difference between the singular vectors of each flickering block and those of its associated block in the previous frame. The final video quality is computed as the average quality of all the frames.

III-C5 Natural Scene Statistics based NR metrics

Natural Scene Statistics (NSS) based approaches, which assume that natural images exhibit certain statistics and that these statistics may be changed by different distortions, have achieved great success in the quality assessment of traditional 2D images [104, 105, 106, 107]. Due to the big difference between the DIBR view synthesis distortions and the traditional 2D ones, these NSS based metrics do not work well on the quality assessment of DIBR-synthesized views. Recently, several efforts have been made to bridge this gap.

As introduced in the previous Edge/Contour based FR metrics part, the edge image is significantly degraded by structural and geometric distortions in DIBR-synthesized images, and the edge based FR metrics have shown their superiority. With this view, Zhou et al. proposed an NR metric (SET) for DIBR-synthesized images via edge statistics and texture naturalness based on Difference-of-Gaussian (DoG) in [66]. The orientation selective statistics (similar to the metric in [106]) are extracted from DoG images at different scales, while the texture naturalness features are obtained based on the Gray level Gradient Co-occurrence Matrix (GGCM) [108], which represents the joint distribution relation of pixel gray level and edge gradient. A Random Forest (RF) regression model is finally trained based on these two groups of features to predict the quality of DIBR-synthesized images.

Gu et al. proposed a self-similarity and main structure consistency based Multiscale Natural Scene Statistics (MNSS) metric in [58]. The multiscale analysis on the DIBR-synthesized image and its associated reference image indicates that the distance (SSIM value [2]) between the synthesized and the reference image decreases significantly when the scale reduces. It is assumed that the synthesized image at a higher scale holds a better quality, which means the higher scale images can be approximately used as reference. Thus, the similarity between the lower scale image (the first scale is used in this metric) and the higher scale images (self similarity) is used to measure the quality of the DIBR-synthesized image. Besides, in the main structure NSS model, the authors use 300 natural images from the Berkeley segmentation dataset [109] to obtain the general statistical regularity of the main structure in natural images. The similarity between the main structure map of the synthesized image and the obtained prior NSS vector is calculated to evaluate the structure degradation of the DIBR-synthesized image. Finally, the statistical regularity of main structure and the structure degradation are combined to get the overall quality score.

Shao et al. proposed an NR metric (HEVSQP) for DIBR-synthesized videos based on color-depth interactions in [62]. Firstly, the video sequence is divided into Groups of Frames (GoF). Through an analysis of color-depth interactions, more than 90 features from both texture and depth videos, including gradient magnitude, asymmetric generalized Gaussian distribution (AGGD) [105] and local binary pattern (LBP) features, are extracted. Then, a principal component analysis (PCA) is applied to reduce the feature dimension. Next, two dictionaries, a color dictionary and a depth dictionary, are learned to establish the relationship between the features and video quality. The final quality score is pooled from the color and depth quality.

In [68], Ling et al. proposed a NR learning based metric for DIBR-synthesized views, which focuses on the non-uniform distortions. Firstly, a set of convolutional kernels are learned by using the improved fast convolutional sparse coding (CSC) algorithms. Then, the convolutional sparse coding (CSC) based features of the DIBR-synthesized images are extracted, from which the final quality score is obtained via support vector regression (SVR).

Although the NSS models have made great progress in NR IQA, the hand-crafted features may not be sufficient to represent complex image textures and artefacts, and there still exists a large gap between objective quality measurement and human perception [110].

III-C6 Deep feature based NR metrics

Deep learning techniques, especially Convolutional Neural Networks (CNN), have shown great advantages in various computer vision tasks. They make it possible to learn representative features directly from images [111]. Unfortunately, owing to the limited size of the DIBR-synthesized view datasets, there is not enough data to train a deep model directly. However, recently published literature shows that deep neural network models trained on large-scale datasets, e.g. ImageNet [112], can be used to extract effective features that are representative of human perception.

In [69], Wang et al. proposed a NR metric SIQA-CFP which uses the ResNet-50 [113] model pre-trained on ImageNet to extract multi-level features of DIBR-synthesized images. Then, a contextual multi-level feature pooling strategy is designed to encode the high-level and low-level features, and finally to get the quality scores.

As introduced in Section I, various distortions may be introduced during the dis-occlusion region filling stage. Meanwhile, in current literature, several Generative Adversarial Network (GAN) [114] based models have been proposed for image in-painting. As the generator is trained to in-paint the missing part, the discriminator is supposed to have the capability to capture the perceptual information which reflects the in-painted image quality. Based on this assumption, Ling et al. proposed a GAN based NR metric (GANs-NRM) for DIBR-synthesized images. In GANs-NRM, a generative adversarial network for image in-painting is first trained on two large-scale datasets (PASCAL [115] and Places [116]). Then, the features extracted from the pre-trained discriminator are used to learn a Bag-of-Distortion-Word (BDW) codebook. A Support Vector Regression (SVR) is trained on the encoded information of each image to predict the final quality of DIBR-synthesized images. Instead of simply using general models trained for other tasks, e.g. object detection, this metric is more targeted, and it also proposes a new way to obtain the semantic features for image quality assessment.

III-D Summary

In this section, 19 FR, 3 RR, 4 SV-FR and 15 NR DIBR quality metrics have been reviewed and categorized based on the approaches they use and on the amount of reference information they require. As shown in Table II, most of the metrics consist of multiple parts. It is thus difficult to assign each of them to a single specific category, so we classify each metric into the most closely related one. Besides, there are also other ways to do the classification. For example, if we focus on the image structural representation used in these metrics, they can be classified into low-level [12], mid-level [39, 40] and high-level [68, 69, 70] metrics. As introduced in [117], the low-level representations indicate the pixel level edges or contours; the mid-level representations mean the shapes and texture information; the high-level representations refer to complex features, e.g. objects and unnatural structures. Besides, there are also some hierarchical metrics which combine the above features, such as the LMS metric proposed in [47], which uses both low-level and mid-level features [39], and the metric in [48], which integrates the features on each level.

| Category | Metric | IVC PLCC | IVC RMSE | IVC SROCC | IETR PLCC | IETR RMSE | IETR SROCC | MCL 3D PLCC | MCL 3D RMSE | MCL 3D SROCC | IVY PLCC | IVY RMSE | IVY SROCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FR 2D | PSNR | 0.4557 | 0.5927 | 0.4417 | 0.6012 | 0.1985 | 0.5356 | 0.7852 | 1.6112 | 0.7915 | 0.6311 | 19.1227 | 0.6668 |
| FR 2D | SSIM | 0.4348 | 0.5996 | 0.4004 | 0.4016 | 0.2275 | 0.2395 | 0.7331 | 1.7693 | 0.7470 | 0.3786 | 22.8172 | 0.3742 |
| NR 2D | BIQI | 0.5150 | 0.5708 | 0.3248 | 0.4427 | 0.2223 | 0.4321 | 0.3347 | 2.4516 | 0.3696 | 0.5686 | 20.2791 | 0.5754 |
| NR 2D | BLIINDS2 | 0.5709 | 0.5467 | 0.4702 | 0.2020 | 0.2428 | 0.1458 | 0.6338 | 2.0124 | 0.5893 | 0.3508 | 23.0855 | 0.2569 |
| FR DIBR | Bosc | 0.5841 | 0.5408 | 0.4903 | - | - | - | 0.4536 | 2.2980 | 0.4330 | - | - | - |
| FR DIBR | 3DSwIM | 0.6864 | 0.4842 | 0.6125 | - | - | - | 0.6519 | 1.9729 | 0.5683 | - | - | - |
| FR DIBR | VSQA | 0.6122 | 0.5265 | 0.6032 | 0.5576 | 0.2062 | 0.4719 | 0.5078 | 2.9175 | 0.5120 | - | - | - |
| FR DIBR | ST-SIAQ | 0.6914 | 0.4812 | 0.6746 | 0.3345 | 0.2336 | 0.4232 | 0.7133 | 1.8233 | 0.7034 | - | - | - |
| FR DIBR | EM-IQA | 0.7430 | 0.4455 | 0.6282 | 0.5627 | 0.2020 | 0.5670 | - | - | - | - | - | - |
| FR DIBR | MP-PSNR | 0.6729 | 0.4925 | 0.6272 | 0.5753 | 0.2032 | 0.5507 | 0.7831 | 1.6179 | 0.7899 | 0.5947 | 19.8182 | 0.5707 |
| FR DIBR | MW-PSNR | 0.6200 | 0.5224 | 0.5739 | 0.5301 | 0.2106 | 0.4845 | 0.7654 | 1.6743 | 0.7721 | 0.5373 | 20.7910 | 0.5051 |
| FR DIBR | SCDM | 0.8242 | 0.3771 | 0.7889 | 0.6685 | 0.1844 | 0.5903 | 0.7166 | 1.8141 | 0.7197 | - | - | - |
| FR DIBR | SC-IQA | 0.8496 | 0.3511 | 0.7640 | 0.6856 | 0.1805 | 0.6423 | 0.8194 | 1.4913 | 0.8247 | 0.4326 | 22.2256 | 0.3135 |
| FR DIBR | Wang [49] | 0.8512 | 0.3146 | 0.8346 | 0.6118 | 0.1961 | 0.6136 | 0.7910 | 1.5917 | 0.7929 | - | - | - |
| FR DIBR | CBA | - | - | - | - | - | - | - | - | - | 0.826 | 8.181 | 0.829 |
| RR DIBR | MP-PSNRr | 0.6954 | 0.4784 | 0.6606 | 0.6061 | 0.1976 | 0.5873 | 0.7740 | 1.6474 | 0.7802 | 0.5384 | 20.7733 | 0.5454 |
| RR DIBR | MW-PSNRr | 0.6625 | 0.4987 | 0.6232 | 0.5403 | 0.2090 | 0.4946 | 0.7579 | 1.7012 | 0.7665 | 0.5304 | 20.8993 | 0.5138 |
| SV-FR DIBR | SIQE | 0.7650 | 0.5382 | 0.4492 | 0.3144 | 0.2353 | 0.3418 | 0.6734 | 1.9233 | 0.6976 | - | - | - |
| SV-FR DIBR | LOGS | 0.8256 | 0.3601 | 0.7812 | 0.6687 | 0.1845 | 0.6683 | 0.7614 | 1.6873 | 0.7579 | 0.6442 | 18.8553 | 0.6385 |
| SV-FR DIBR | DSQM | 0.7430 | 0.4455 | 0.7067 | 0.2977 | 0.2367 | 0.2369 | 0.6995 | 1.8593 | 0.6980 | - | - | - |
| NR DIBR | APT | 0.7307 | 0.4546 | 0.7157 | 0.4225 | 0.2252 | 0.4187 | 0.6433 | 1.9870 | 0.6200 | 0.5156 | 21.1239 | 0.4754 |
| NR DIBR | OUT | 0.7243 | 0.4591 | 0.7010 | 0.2007 | 0.2429 | 0.1924 | 0.4208 | 2.3601 | 0.3171 | 0.2525 | 23.8530 | 0.2409 |
| NR DIBR | MNSS | 0.7700 | 0.4120 | 0.7850 | 0.3387 | 0.2333 | 0.2281 | 0.3766 | 2.4101 | 0.3531 | 0.3834 | 22.7681 | 0.2282 |
| NR DIBR | NR_MWT | 0.7343 | 0.4520 | 0.5169 | 0.4769 | 0.2179 | 0.4567 | 0.1373 | 2.5771 | 0.0110 | 0.4848 | 21.5614 | 0.4558 |
| NR DIBR | NIQSV | 0.6346 | 0.5146 | 0.6167 | 0.1759 | 0.2446 | 0.1473 | 0.6460 | 1.9820 | 0.5792 | 0.4113 | 22.4706 | 0.2717 |
| NR DIBR | NIQSV+ | 0.7114 | 0.4679 | 0.6668 | 0.2095 | 0.2429 | 0.2190 | 0.6138 | 2.0375 | 0.6213 | 0.2823 | 23.6491 | 0.3823 |
| NR DIBR | SET | 0.8586 | 0.3015 | 0.8109 | - | - | - | 0.9117 | 1.0631 | 0.9108 | - | - | - |
| NR DIBR | GANs-NRM | 0.826 | 0.386 | 0.807 | 0.646 | 0.198 | 0.571 | - | - | - | - | - | - |

Note: For some metrics, due to the unavailability of source code or reference resources, e.g. depth map and side view reference image, we use the results reported in their corresponding publications instead; their associated results on other datasets are marked by the symbol "-" in the table.

TABLE III: Performance of the DIBR dedicated metrics on the DIBR-synthesized image datasets.

IV Experimental results and discussions

In this section, the performance of different objective quality assessment metrics is presented and analysed. Besides, some potential challenges and possible directions for future work will be discussed.

IV-A Performance evaluation methodologies

The subjective test results can be regarded as the ground truth visual quality since the human observer is the ultimate receiver of image/video content. The accuracy of an objective quality metric can be evaluated based on its consistency with the subjective quality scores. In this part, we will introduce the correlation based methods recommended by the Video Quality Experts Group (VQEG) [118] and the recently proposed Krasula's model [119] in detail.

IV-A1 Correlation coefficients based methods

The reliability of objective metrics can be evaluated through their correlation with subjective test scores. Three widely used criteria, the Pearson Linear Correlation Coefficient (PLCC), the Root-Mean-Square-Error (RMSE) and the Spearman Rank-Order Correlation Coefficient (SROCC), are recommended by VQEG to evaluate the performance of the objective metrics: PLCC and RMSE reflect the prediction accuracy, while SROCC reflects the prediction monotonicity. They are defined as follows:

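$$PLCC = \frac{\sum_{i=1}^{N}(s_i-\bar{s})(o_i-\bar{o})}{\sqrt{\sum_{i=1}^{N}(s_i-\bar{s})^2}\,\sqrt{\sum_{i=1}^{N}(o_i-\bar{o})^2}}$$ (13)
$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(s_i-o_i)^2}$$ (14)
$$SROCC = 1-\frac{6\sum_{i=1}^{N}d_i^2}{N(N^2-1)}$$ (15)

where $s_i$ and $o_i$ are the subjective score and the objective score of the $i$-th image, $\bar{s}$ and $\bar{o}$ are their mean values, $N$ is the number of images, and $d_i$ indicates the difference of ranking of $s_i$ and $o_i$. Higher PLCC and SROCC values indicate higher accuracy and better monotonicity respectively. On the contrary, a higher RMSE value refers to a lower prediction accuracy.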
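The three criteria can be computed with a few lines of standard-library Python. This is a minimal sketch with hypothetical function names; SROCC is computed here as the Pearson correlation of the ranks, which coincides with the rank-difference formula when there are no ties and also handles tied scores through average ranks.

```python
import math

def plcc(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rmse(x, y):
    """Root-mean-square error between two score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def srocc(x, y):
    """Spearman rank-order correlation coefficient."""
    return plcc(average_ranks(x), average_ranks(y))
```

For example, objective scores that are a monotonic but nonlinear function of the subjective scores give an SROCC of 1 while the PLCC stays below 1, which is one reason a regression mapping is applied before computing PLCC and RMSE.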

(a) Before regression (b) After regression
Fig. 4: Example relationship between DMOS and objective quality scores. This figure is from [120]

Before computing these three criteria, VQEG recommends mapping the objective scores to predicted subjective scores in order to remove the nonlinearities due to the subjective rating process and to facilitate comparison of the metrics in a common analysis space [118]. The nonlinear function for regression mapping is shown as follows:

(16)

Here, the input of the function is the score obtained by the objective metric, and the parameters of the regression function are obtained through regression to minimize the difference between the mapped objective scores and the subjective scores. As shown in Fig. 4, the nonlinearity has been removed after the regression.

Fig. 5: Krasula’s model for performance evaluation of objective quality metrics [119].

IV-A2 Analysis of Krasula's model

The above methods compare the performance of each metric by calculating its correlation with the subjective results. However, they only consider the mean value of the subjective scores; the uncertainty of the subjective scores is ignored. In addition, the quality scores need to be mapped by a regression function (cf. Eq. 16), which is not the way they are actually used in real scenarios. Thus, we further conduct a statistical test proposed by Krasula et al. in [119], which does not suffer from the drawbacks of the above methods. The performances of objective metrics are evaluated by their classification abilities.

As shown in Fig. 5, firstly, the tested image pairs in the dataset are divided into two groups, different and similar, according to their subjective scores. The cumulative distribution function (cdf) of the normal distribution is used to calculate the probability that the two images in a pair differ in quality. Then, we consider the pairs with a probability higher than the selected significance level 0.95 to be significantly different. The others will be recognized as similar.

There are two performance analyses. The first analysis evaluates how well the objective metric succeeds in distinguishing significantly different image pairs from pairs that are not significantly different, in a way consistent with the subjective evaluation of significant difference. The second analysis is conducted on the pairs whose two images are significantly different according to the subjective results: it determines whether the objective metric can correctly identify the image of higher quality in the pair.
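The pair grouping and the second analysis can be sketched as below. This is a simplified illustration, not the procedure of [119]: it assumes that the probability of a difference is obtained from a z-score built on the MOS values, their standard deviations and an equal number of observers per image, that the standard deviations are positive, and that a higher objective score means better quality. The function names are hypothetical, and the full model relies on ROC-style analyses rather than the plain accuracy used here.

```python
import math
from itertools import combinations

def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def split_pairs(mos, std, n_observers, level=0.95):
    """Split all image pairs into significantly different and similar ones."""
    different, similar = [], []
    for i, j in combinations(range(len(mos)), 2):
        # Standard error of the MOS difference (assumes positive std values).
        se = math.sqrt(std[i] ** 2 / n_observers + std[j] ** 2 / n_observers)
        z = abs(mos[i] - mos[j]) / se
        (different if normal_cdf(z) > level else similar).append((i, j))
    return different, similar

def better_worse_accuracy(different, mos, metric):
    """Share of different pairs in which the metric picks the better image."""
    correct = sum(
        1 for i, j in different
        if (metric[i] - metric[j]) * (mos[i] - mos[j]) > 0
    )
    return correct / len(different)
```

Because the grouping depends on the spread of the subjective votes, two images with close MOS values and noisy ratings are not forced into a ranking that the observers themselves did not agree on.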

Compared to simply calculating the correlation coefficients, this model considers not only the mean value of subjective scores, but also their uncertainties. Besides, since no regression is used, this model depends less on the quality ranges of different datasets. Another advantage of Krasula's model is that it can easily combine the data from multiple datasets and evaluate a comprehensive performance on multiple datasets instead of simply averaging the results on different datasets.

IV-B Performance on DIBR image datasets

IV-B1 Results of PLCC, RMSE and SROCC

The obtained PLCC, RMSE and SROCC values of the objective image quality assessment metrics on the DIBR-synthesized image datasets are given in Table III, in which four 2D metrics [121, 104, 2] and 24 DIBR metrics are tested. The best three performances among the blind IQA methods are shown in bold. We can easily observe that the DIBR-synthesized view dedicated metrics significantly outperform the traditional 2D metrics on the IVC and IETR image datasets, which focus on the DIBR view synthesis distortions. In other words, the metrics initially designed for traditional 2D image distortions cannot evaluate the DIBR view synthesis distortions well.

The shift compensation based FR and SV-FR metrics obtain a great improvement compared to the original 2D FR metrics, e.g. SC-IQA compared to PSNR. One main reason is that the global object shift existing in the DIBR-synthesized images may not be perceived by human observers but can be easily detected by the original 2D pixel-based FR metrics. So, these shift distortions are often overestimated by the 2D pixel-based FR metrics.
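The overestimation of shift distortions by pixel-based metrics can be seen on a deliberately extreme synthetic example. The image below (one-pixel-wide vertical stripes) and the one-pixel cyclic shift are constructed purely for illustration and are not taken from any of the datasets.

```python
import numpy as np

def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    ref = np.asarray(reference, dtype=np.float64)
    tst = np.asarray(test, dtype=np.float64)
    mse = np.mean((ref - tst) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))

# 8x8 image made of one-pixel-wide vertical stripes (values 0 and 255).
stripes = np.tile(np.array([0.0, 255.0]), (8, 4))
# The same content shifted horizontally by a single pixel.
shifted = np.roll(stripes, 1, axis=1)
```

Here `psnr(stripes, shifted)` drops to 0 dB because every pixel is compared with its opposite, even though the two images show the same pattern. Real synthesized views are far less extreme, but the mechanism is the same, which is why compensating for the shift before the pixel-wise comparison helps.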

If we focus on the wavelet transform-based metrics (NR_MWT and MW-PSNR), the NR metric (NR_MWT) performs better than the FR metric (MW-PSNR) on the IVC dataset. It is surprising that the FR metric performs worse than the NR metric, since these metrics use similar features and the FR metric has access to the ground truth. On the IETR dataset, in contrast, the NR metric performs worse than the FR metrics. The main reason is probably again the global shift distortion in the IVC image dataset.

To further explore the object shift effect, we conducted an additional experiment on the IVC dataset while excluding the A1 view synthesis algorithm [16]. The A1 algorithm fills the black holes in the dis-occlusion regions by simply stretching the adjacent texture, which may cause a large global object shift in the synthesized views. The results are shown in Table IV. We can observe that the performance of the FR and RR metrics increases significantly when the large global shift artefacts are excluded.

Category                  Metric      PLCC     RMSE     SROCC
FR 2D image metrics       PSNR        0.7519   0.4525   0.6766
                          SSIM        0.5956   0.5513   0.4424
FR DIBR image metrics     MW-PSNR     0.8545   0.3565   0.7750
RR DIBR image metrics     MW-PSNRr    0.8855   0.3188   0.8298

TABLE IV: Performance on the IVC DIBR image dataset excluding A1 algorithm

The edge/contour based metrics also perform much better than the 2D pixel-based FR metrics since the edge/contour features can better represent the geometric degradations in the DIBR-synthesized images compared to simple pixel information.

The NR metrics do not need any reference information to evaluate the image quality, so the global shift has no effect on them. Besides, since the real reference images at virtual viewpoints are not always available in real applications, the NR metrics are more practical and useful. From Table III, we can see that the performance of the metrics dedicated to DIBR-synthesized views decreases greatly on the IETR dataset compared to their performance on the IVC dataset. Among these metrics, the NR ones decrease the most, especially the learning based NR metrics. This is because these NR metrics focus on the distortions in the IVC dataset, whereas many “old fashioned” distortions are excluded from the IETR dataset.

Panels: (a) Different / Similar analysis on the IVC image dataset; (b) Better / Worse analysis on the IVC image dataset; (c) Different / Similar analysis on the IETR image dataset; (d) Better / Worse analysis on the IETR image dataset; (e) Different / Similar analysis combining the two datasets; (f) Better / Worse analysis combining the two datasets.
Fig. 6: Performance on the IVC and IETR image datasets using Krasula’s model. The metrics 1-15 indicate PSNR, SSIM, SCDM, MP-PSNRr, MW-PSNRr, EM-IQA, SC-IQA, LOGS, NIQSV+, APT, MNSS, NR_MWT, OUT, BIQI and BLIINDS2 respectively. In the significance test results, a white block indicates that the metric in the row performs significantly better than the metric in the column, and vice versa for a black block. A gray block means that the two metrics are statistically equivalent.

As introduced in Section II, the MCL-3D dataset does not focus on the DIBR view synthesis distortions, but on the effects of traditional distortions on the synthesized views. Thus, the performance of the tested objective metrics is quite different. Some of the metrics that only consider the DIBR view synthesis distortions (Bosc, VSQA and NR_MWT) do not perform as well as the traditional 2D metrics. Some FR metrics derived from 2D metrics perform even worse than their original versions. For instance, the VSQA and 3DSwIM metrics cannot reach the performance of SSIM, and the SCDM, MP-PSNR and MW-PSNR metrics perform worse than PSNR. Among these metrics, the feature-based FR metrics perform better than the simple edge/contour based metrics. It can be inferred that the frequency domain features represent not only the edge/contour information but also some other texture characteristics. The SET metric contains not only the DoG features for the DIBR view synthesis distortions but also the GGCM based features for the texture naturalness, which may explain its good performance on both the IVC and MCL-3D datasets.

The IVY dataset considers not only the view synthesis distortion but also the binocular asymmetry in synthesized stereoscopic images. The baseline distance between the virtual viewpoint and the original viewpoint is much larger than in the other datasets. Thus, the metrics which do not consider the binocular asymmetry do not perform well on this dataset.

IV-B2 Results of Krasula’s model

Only the IVC and IETR datasets are tested in this part since the MCL-3D and IVY datasets do not provide the standard deviation which represents the subject uncertainty. The obtained Area Under the Curve (AUC) values and significance test results on IVC and IETR are shown in Fig. 6 (a) (b) (c) (d). Fig. 6 (e) and (f) show the results on the combination of the IVC and IETR datasets. A higher AUC value indicates a higher performance. In the significance test results, a white block indicates that the metric in the row performs significantly better than the metric in the column, and vice versa for a black block. A gray block means that the two metrics are statistically equivalent.

In the first analysis (different / similar) on the IVC dataset, cf. Fig. 6 (a), none of these metrics performs well: most AUC values are below 0.7 and some metrics even have AUC values under 0.5. Generally, the DIBR FR metrics perform better than the other metrics.

In the second analysis (better / worse) on the IVC dataset, cf. Fig. 6 (b), the metrics dedicated to DIBR-synthesized views perform significantly better than the 2D metrics (the first two and the last two metrics) since they achieve higher AUC values. Among these metrics, the SCDM and SC-IQA metrics perform the best, with AUC values higher than 0.9.

The results on the IETR dataset, cf. Fig. 6 (c) (d), and on the combination of the two datasets, cf. Fig. 6 (e) (f), show that most of the FR metrics outperform the NR metrics, the SSIM metric being the exception. The 2D NR metrics achieve results similar to their performance on the IVC dataset, while the performance of the DIBR NR metrics decreases greatly compared to their performance on the IVC dataset. The results of Krasula’s model are consistent with the correlation coefficient results in the previous part.

                                          IVC video dataset            SIAT video dataset
Category                     Metric       PLCC     RMSE     SROCC      PLCC     RMSE     SROCC
FR 2D image metrics          PSNR         0.5104   0.5690   0.4647     0.6525   0.0972   0.6366
                             SSIM         0.4081   0.6041   0.3751     0.4528   0.1144   0.4550
FR 2D video metrics          MOVIE        0.4971   0.4903   0.3877     0.646    0.097    0.693
                             ST-RRED      0.2025   0.6480   0.5777     0.7164   0.0895   0.6971
NR 2D video metrics          SpEED        0.3771   0.6128   0.5952     0.7236   0.0885   0.6987
                             VIIDEO       0.5971   0.5308   0.5877     0.2586   0.1239   0.2535
FR DIBR image metrics        Bosc         0.5856   0.4602   0.2654     0.453    0.114    0.431
                             MP-PSNR      0.5026   0.5720   0.5478     0.5681   0.1056   0.5044
                             MW-PSNR      0.4911   0.4638   0.4558     0.5745   0.1050   0.5024
                             3DSwIM       0.4822   0.4974   0.3320     0.5677   0.1057   0.2762
RR DIBR image metrics        MP-PSNRr     0.4617   0.5869   0.5307     0.5640   0.1059   0.5040
                             MW-PSNRr     0.4802   0.5804   0.5038     0.5757   0.1049   0.5853
SV-FR DIBR image metrics     SIQE         0.4084   0.5138   0.0991     0.3627   0.1195   0.2586
                             DSQM         0.5241   0.4857   0.3157     0.4001   0.1071   0.3994
NR DIBR image metrics        OUT          0.6762   0.4874   0.6151     0.0945   0.1277   0.0926
                             NR_MWT       0.7530   0.4354   0.7145     0.5051   0.1107   0.3092
                             NIQSV        0.6505   0.5025   0.5963     0.5144   0.1100   0.4562
                             MNSS         0.5180   0.5660   0.5371     0.1591   0.1266   0.2463
FR DIBR video metrics        CQM          0.4102   0.5101   0.3265     0.4021   0.1070   0.4064
                             PSPTNR       0.4321   0.5002   0.4152     0.4461   0.1069   0.4305
                             VQA-SIAT     0.5943   0.5321   0.5879     0.8527   0.0670   0.8583
NR DIBR video metrics        CTI          0.6821   0.4372   0.6896     0.5736   0.1053   0.5425
                             FDI          0.7576   0.4319   0.7162     0.5952   0.1033   0.5425
TABLE V: Performance on the IVC and SIAT DIBR video datasets
Panels: (a) Different / Similar analysis on the IVC video dataset; (b) Better / Worse analysis on the IVC video dataset.
Fig. 7: Performance on the IVC video dataset using Krasula’s model. The metrics 1-13 represent PSNR, SSIM, SpEED, ST-RRED, VIIDEO, MP-PSNRr, MW-PSNRr, NIQSV, OUT, MNSS, NR_MWT, FDI and VQA-SIAT respectively. In the significance test results, a white block indicates that the metric in the row performs significantly better than the metric in the column, and vice versa for a black block. A gray block means that the two metrics are statistically equivalent.

IV-C Performance on DIBR video datasets

The DIBR-synthesized videos contain some temporal distortions, such as flickering, in addition to the spatial distortions in images. In this experiment, 12 state-of-the-art DIBR image metrics and 5 DIBR video metrics are tested. To compare the performance of the DIBR metrics with that of the traditional 2D metrics, 5 widely used 2D video metrics and 2 2D image metrics are also tested. The quality scores of the image metrics are obtained by averaging the quality over all the frames. The three best-performing metrics among the BIQA methods are marked in bold.
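The frame-averaging step used for the image metrics can be sketched as below. The helper `temporal_variation` is not part of the evaluation described here; it is a hypothetical indicator added only to show that two sequences with the same average frame score can differ strongly in frame-to-frame stability, which plain averaging cannot capture. `image_metric` stands for any per-frame quality function.

```python
def video_score_from_frames(frames, image_metric):
    # Average pooling: the video score is the mean of the per-frame scores.
    scores = [image_metric(frame) for frame in frames]
    if not scores:
        raise ValueError("the video contains no frame")
    return sum(scores) / len(scores)


def temporal_variation(scores):
    # Hypothetical indicator: mean absolute change between consecutive frame scores.
    if len(scores) < 2:
        return 0.0
    return sum(abs(b - a) for a, b in zip(scores, scores[1:])) / (len(scores) - 1)


# Two toy sequences of per-frame scores with the same mean.
steady = [3.0, 3.0, 3.0, 3.0]
flickering = [1.0, 5.0, 1.0, 5.0]
identity = lambda score: score  # the "frames" are already scores in this toy case

print(video_score_from_frames(steady, identity), temporal_variation(steady))
print(video_score_from_frames(flickering, identity), temporal_variation(flickering))
```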

The obtained PLCC, RMSE and SROCC values on the IVC video and SIAT video datasets are given in Table V. The results of Krasula’s model are shown in Fig. 7 for the IVC video dataset only, since the SIAT video dataset does not provide the uncertainty of the subject ratings.

The IVC video dataset focuses on the DIBR view synthesis distortions while the SIAT dataset focuses on the compression effect on the synthesized views. The best three metrics on the IVC dataset are DIBR metrics, whereas on the SIAT dataset they are 2D metrics, with the exception of the VQA-SIAT metric. The VQA-SIAT metric mainly focuses on the compression effect, which may lead to obvious flicker in the DIBR-synthesized views, and the spatial view synthesis distortions considered in this metric are very limited. That may explain why it significantly outperforms the other metrics on the SIAT dataset while it cannot obtain a very good performance on the IVC dataset. On the IVC video dataset, none of the FR metrics achieves a high correlation with the subjective results. Moreover, there is no significant difference between the performance of the DIBR FR metrics and that of the 2D FR metrics. The DIBR NR metrics, however, perform the best among all the metrics. The main reason is the same as on the IVC image dataset: the global shift effect.

IV-D Discussions

The experimental results show that although great progress has been made towards the quality assessment of synthesized views, there is still significant room for improvement.

IV-D1 Synthesized video quality assessment

The DIBR-synthesized videos contain not only the compression distortions but also the distortions induced by DIBR. The VQA-SIAT metric works well at capturing the temporal flicker caused by video compression, but it fails to assess the DIBR view synthesis distortions in the synthesized video frames. In addition, imperfect view synthesis algorithms may also result in a large mismatch between adjacent frames in the synthesized video, which causes very annoying temporal distortions that the 8 by 8 block matching (in VQA-SIAT) may fail to detect. Therefore, future work could further analyse the specific spatial-temporal distortions in the synthesized videos and design a complete metric for the DIBR-synthesized videos.

IV-D2 Quality assessment of synthesized views in real applications

As introduced previously, DIBR can be used in various applications, but the quality assessment for these applications is rarely researched. For example, the free viewpoint videos (FVV) and multi-view videos (MVV) provide images from multiple viewpoints at the same time instant. The temporal distortions in FVV or MVV are mainly introduced by the change of viewpoint instead of the timeline [74, 48]. This type of distortion is different from that in normal DIBR-synthesized videos. Besides, in order to provide an immersive perception for the observer, the AR or VR applications need to generate multiple synthesized images and change the viewpoint with the motion of the observer. The synthesized video then contains both inter-frame and inter-viewpoint temporal distortions, as well as the binocular asymmetric distortions which may happen in stereoscopic applications [13]. Designing metrics for these applications would be interesting since they are currently rarely explored.

IV-D3 Deep learning approaches

The main limitation on the use of deep learning for the quality assessment of DIBR-synthesized views is the limited size of the available datasets. Unlike the homogeneous distortions in traditional 2D images, the distortions in the DIBR-synthesized views mostly occur in the dis-occlusion regions. In other words, the major part of a DIBR-synthesized view has perfect quality. The synthesized image therefore cannot simply be split into several patches that all inherit the quality of the whole image as their label. Creating a very large-scale dataset may significantly help to train a good deep model, but unlike the datasets for other tasks, e.g. object recognition, creating an image quality dataset necessarily requires subjective tests, which are quite expensive and time-consuming. Thus, exploring how to train a comprehensive model on limited data could be more practical, e.g. with one-shot learning and few-shot learning [122, 123]. The fact that the quality score of the whole synthesized image cannot directly be distributed to all the image patches does not mean that the image cannot be processed patch by patch. The main challenge is to find a proper pooling method to obtain the overall quality score. Although pre-trained deep features have been successfully used in metrics [69, 70], more effort could be made to create a more general and effective end-to-end deep model.
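The pooling issue can be made concrete with a toy example. The sketch below is only an illustration under assumptions: the patch scores are invented, higher means better quality, and `worst_fraction_pooling` is a hypothetical alternative to mean pooling rather than a method evaluated in this paper.

```python
def mean_pooling(patch_scores):
    # Plain average of the patch scores.
    return sum(patch_scores) / len(patch_scores)


def worst_fraction_pooling(patch_scores, fraction=0.1):
    # Average of the lowest-scoring fraction of patches (at least one patch).
    k = max(1, round(fraction * len(patch_scores)))
    worst = sorted(patch_scores)[:k]
    return sum(worst) / k


# Toy image: 95 undistorted patches and 5 patches hit by dis-occlusion artefacts.
patch_scores = [1.0] * 95 + [0.2] * 5

print("mean pooling:", mean_pooling(patch_scores))
print("worst 10% pooling:", worst_fraction_pooling(patch_scores))
```

With mean pooling, the few strongly distorted patches are diluted by the large undistorted area, whereas a pooling rule that emphasises the worst regions reacts to them much more strongly.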

V Conclusion

In this paper, we presented an up-to-date overview of the quality assessment methods for DIBR-synthesized views. We first described the existing DIBR-synthesized view datasets. Secondly, we analysed and discussed the recently proposed state-of-the-art objective quality metrics for DIBR-synthesized views, and classified them into different categories based on the approaches they use. Then, we conducted experiments to compare the performance of each metric and analysed their advantages and disadvantages. Furthermore, we discussed the potential challenges and directions for future research. We hope this overview can help to better understand the state of the art of this research topic and provide insights for designing better metrics and experiments for the effective quality evaluation of DIBR-synthesized images and videos.

Acknowledgment

The authors would like to thank Dr. Suiyi Ling and Dr. Yu Zhou for sharing their code. We would also like to thank Prof. Patrick Le Callet and Dr. Lucas Krasula for their kind advice on the experiment.

References

  • [1] C. Fehn, “Depth-Image-Based Rendering (DIBR), compression, and transmission for a new approach on 3D-TV,” in Electronic Imaging 2004.   International Society for Optics and Photonics, 2004, pp. 93–104.
  • [2] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, April 2004.
  • [3] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, 2004., vol. 2.   IEEE, 2003, pp. 1398–1402.
  • [4] Z. Wang and Q. Li, “Information content weighting for perceptual image quality assessment,” IEEE Transactions on Image Processing, vol. 20, no. 5, pp. 1185–1198, 2011.
  • [5] L. Ma, S. Li, F. Zhang, and K. N. Ngan, “Reduced-reference image quality assessment using reorganized dct-based image representation,” IEEE Transactions on Multimedia, vol. 13, no. 4, pp. 824–829, Aug 2011.
  • [6] Y. Liu, K. Gu, S. Wang, D. Zhao, and W. Gao, “Blind quality assessment of camera images based on low-level and high-level statistical features,” IEEE Transactions on Multimedia, vol. 21, no. 1, pp. 135–146, Jan 2019.
  • [7] Y. Liu, G. Zhai, K. Gu, X. Liu, D. Zhao, and W. Gao, “Reduced-reference image quality assessment in free-energy principle and sparse representation,” IEEE Transactions on Multimedia, vol. 20, no. 2, pp. 379–391, Feb 2018.
  • [8] Q. Li, W. Lin, J. Xu, and Y. Fang, “Blind image quality assessment using statistical structural and luminance features,” IEEE Transactions on Multimedia, vol. 18, no. 12, pp. 2457–2469, Dec 2016.
  • [9] K. Gu, G. Zhai, X. Yang, and W. Zhang, “Using free energy principle for blind image quality assessment,” IEEE Transactions on Multimedia, vol. 17, no. 1, pp. 50–63, Jan 2015.
  • [10] E. Bosc, R. Pepion, P. Le Callet, M. Koppel, P. Ndjiki-Nya, M. Pressigout, and L. Morin, “Towards a new quality metric for 3-d synthesized view assessment,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 7, pp. 1332–1343, 2011.
  • [11] E. Bosc, P. Le Callet, L. Morin, and M. Pressigout, “Visual quality assessment of synthesized views in the context of 3d-tv,” in 3D-TV system with depth-image-based rendering.   Springer, 2013, pp. 439–473.
  • [12] X. Liu, Y. Zhang, S. Hu, S. Kwong, C.-C. J. Kuo, and Q. Peng, “Subjective and objective video quality assessment of 3d synthesized views with texture/depth compression distortion,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 4847–4861, 2015.
  • [13] Y. J. Jung, H. G. Kim, and Y. M. Ro, “Critical binocular asymmetry measure for the perceptual quality assessment of synthesized stereo 3d images in view synthesis,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 7, pp. 1201–1214, 2016.
  • [14] S. Tian, L. Zhang, L. Morin, and O. Déforges, “A benchmark of dibr synthesized view quality assessment metrics on a new database for immersive media applications,” IEEE Transactions on Multimedia, vol. 21, no. 5, pp. 1235–1247, May 2019.
  • [15] R. Song, H. Ko, and C. Kuo, “Mcl-3d: A database for stereoscopic image quality assessment using 2d-image-plus-depth source,” Journal of Information Science and Engineering, vol. 31, no. 5, p. 1593–1611, 2015.
  • [16] A. Telea, “An image inpainting technique based on the fast marching method,” Journal of graphics tools, vol. 9, no. 1, pp. 23–34, 2004.
  • [17] Y. Mori, N. Fukushima, T. Yendo, T. Fujii, and M. Tanimoto, “View generation with 3d warping using depth information for ftv,” Signal Processing: Image Communication, vol. 24, no. 1, pp. 65–72, 2009.
  • [18] K. Mueller, A. Smolic, K. Dix, P. Merkle, P. Kauff, and T. Wiegand, “View synthesis for advanced 3d video systems,” EURASIP Journal on Image and Video Processing, vol. 2008, no. 1, pp. 1–11, 2009.
  • [19] P. Ndjiki-Nya, M. Köppel, D. Doshkov, H. Lakshman, P. Merkle, K. Müller, and T. Wiegand, “Depth image based rendering with advanced texture synthesis,” in 2010 IEEE International Conference on Multimedia and Expo (ICME).   IEEE, 2010, pp. 424–429.
  • [20] P. Ndjiki-Nya, M. Koppel, D. Doshkov, H. Lakshman, P. Merkle, K. Muller, and T. Wiegand, “Depth image-based rendering with advanced texture synthesis for 3-d video,” IEEE Transactions on Multimedia, vol. 13, no. 3, pp. 453–465, 2011.
  • [21] M. Köppel, P. Ndjiki-Nya, D. Doshkov, H. Lakshman, P. Merkle, K. Müller, and T. Wiegand, “Temporally consistent handling of disocclusions with texture synthesis for depth-image-based rendering,” in 2010 IEEE International Conference on Image Processing.   IEEE, 2010, pp. 1809–1812.
  • [22] A. Criminisi, P. Pérez, and K. Toyama, “Region filling and object removal by exemplar-based image inpainting,” IEEE Transactions on image processing, vol. 13, no. 9, pp. 1200–1212, 2004.
  • [23] V. Jantet, C. Guillemot, and L. Morin, “Object-based layered depth images for improved virtual view synthesis in rate-constrained context,” in Image Processing (ICIP), 2011 18th IEEE International Conference on.   IEEE, 2011, pp. 125–128.
  • [24] I. Ahn and C. Kim, “A novel depth-based virtual view synthesis method for free viewpoint video,” IEEE Transactions on Broadcasting, vol. 59, no. 4, pp. 614–626, 2013.
  • [25] G. Luo, Y. Zhu, Z. Li, and L. Zhang, “A hole filling approach based on background reconstruction for view synthesis in 3d video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1781–1789.
  • [26] M. Solh and G. AlRegib, “Hierarchical hole-filling for depth-based view synthesis in ftv and 3d video,” IEEE Journal of Selected Topics in Signal Processing, vol. 6, no. 5, pp. 495–504, 2012.
  • [27] C. Zhu and S. Li, “Depth image based view synthesis: New insights and perspectives on hole generation and filling,” IEEE Transactions on Broadcasting, vol. 62, no. 1, pp. 82–93, 2016.
  • [28] M. Tanimoto, T. Fujii, K. Suzuki, N. Fukushima, and Y. Mori, “Reference softwares for depth estimation and view synthesis,” ISO/IEC JTC1/SC29/WG11 MPEG, vol. 20081, p. M15377, 2008.
  • [29] S. S. Yoon, H. Sohn, Y. J. Jung, and Y. M. Ro, “Inter-view consistent hole filling in view extrapolation for multi-view image generation,” in Image Processing (ICIP), 2014 IEEE International Conference on.   IEEE, 2014, pp. 2883–2887.
  • [30] G. J. Sullivan, J. M. Boyce, Y. Chen, J.-R. Ohm, C. A. Segall, and A. Vetro, “Standardized extensions of high efficiency video coding (hevc),” IEEE Journal of selected topics in Signal Processing, vol. 7, no. 6, pp. 1001–1016, 2013.
  • [31] IVC-IRCCyN lab, “IRCCyN/IVC DIBR image database,” http://ivc.univ-nantes.fr/en/databases/DIBR_Images/, last accessed Aug. 30th 2017, [Online].
  • [32] E. Bosc, P. Le Callet, L. Morin, and M. Pressigout, “An edge-based structural distortion indicator for the quality assessment of 3d synthesized views,” in 2012 Picture Coding Symposium.   IEEE, 2012, pp. 249–252.
  • [33] P.-H. Conze, P. Robert, and L. Morin, “Objective view synthesis quality assessment,” in IS&T/SPIE Electronic Imaging.   International Society for Optics and Photonics, 2012, pp. 82 881M–82 881M.
  • [34] F. Battisti, E. Bosc, M. Carli, P. Le Callet, and S. Perugia, “Objective image quality assessment of 3D synthesized views,” Signal Processing: Image Communication, vol. 30, pp. 78–88, 2015.
  • [35] D. Sandić-Stanković, F. Battisti, D. Kukolj, P. L. Callet, and M. Carli, “Free viewpoint video quality assessment based on morphological multiscale metrics,” in 2016 Eighth International Conference on Quality of Multimedia Experience (QoMEX), June 2016, pp. 1–6.
  • [36] D. Sandić-Stanković, D. Kukolj, and P. Le Callet, “DIBR synthesized image quality assessment based on morphological wavelets,” in 2015 Seventh International Workshop on Quality of Multimedia Experience (QoMEX).   IEEE, 2015, pp. 1–6.
  • [37] D. Sandić-Stanković, D. Kukolj, and P. Le Callet, “Multi–scale synthesized view assessment based on morphological pyramids,” Journal of Electrical Engineering, vol. 67, no. 1, pp. 3–11, 2016.
  • [38] S. Ling, P. Le Callet, and G. Cheung, “Quality assessment for synthesized view based on variable-length context tree,” in Multimedia Signal Processing (MMSP), 2017 IEEE 19th International Workshop on.   IEEE, 2017, pp. 1–6.
  • [39] S. Ling and P. Le Callet, “Image quality assessment for free viewpoint video based on mid-level contours feature,” in 2017 IEEE International Conference on Multimedia and Expo (ICME), July 2017, pp. 79–84.
  • [40] S. Ling and P. Le Callet, “Image quality assessment for dibr synthesized views using elastic metric,” in Proceedings of the 2017 ACM on Multimedia Conference.   ACM, 2017, pp. 1157–1163.
  • [41] Y. Zhao and L. Yu, “A perceptual metric for evaluating quality of synthesized sequences in 3dv system,” in Visual Communications and Image Processing 2010.   International Society for Optics and Photonics, 2010, pp. 77 440X–77 440X.
  • [42] Y. Zhang, H. Zhang, M. Yu, S. Kwong, and Y. Ho, “Sparse representation-based video quality assessment for synthesized 3d videos,” IEEE Transactions on Image Processing, vol. 29, pp. 509–524, 2020.
  • [43] M. Solh, G. AlRegib, and J. M. Bauza, “3vqm: A vision-based quality measure for dibr-based 3d videos,” in 2011 IEEE International Conference on Multimedia and Expo, July 2011, pp. 1–6.
  • [44] Y. Zhou, L. Li, K. Gu, Y. Fang, and W. Lin, “Quality assessment of 3d synthesized images via disoccluded region discovery,” in 2016 IEEE International Conference on Image Processing (ICIP).   IEEE, 2016, pp. 1012–1016.
  • [45] S. Tian, L. Zhang, L. Morin, and O. Deforges, “A full-reference image quality assessment metric for 3d synthesized views,” in Image Quality and System Performance Conference, at IS&T Electronic Imaging 2018.   Society for Imaging Science and Technology, 2018.
  • [46] S. Tian, L. Zhang, L. Morin, and O. Déforges, “Sc-iqa: Shift compensation based image quality assessment for dibr-synthesized views,” in IEEE International Conference on Visual Communications and Image Processing, 2018.
  • [47] Y. Zhou, L. Li, S. Ling, and P. Le Callet, “Quality assessment for view synthesis using low-level and mid-level structural representation,” Signal Processing: Image Communication, vol. 74, pp. 309–321, 2019.
  • [48] S. Ling, J. Li, P. Le Callet, and J. Wang, “Perceptual representations of structural information in images: application to quality assessment of synthesized view in ftv scenario,” in 2019 IEEE International Conference on Image Processing (ICIP).   IEEE, 2019, pp. 1735–1739.
  • [49] X. Wang, F. Shao, Q. Jiang, R. Fu, and Y. Ho, “Quality assessment of 3d synthesized images via measuring local feature similarity and global sharpness,” IEEE Access, vol. 7, pp. 10 242–10 253, 2019.
  • [50] D. Sandic-Stankovic, D. Kukolj, and P. Le Callet, “DIBR-synthesized image quality assessment based on morphological multi-scale approach,” EURASIP Journal on Image and Video Processing, vol. 2017, no. 1, p. 4, 2016.
  • [51] V. Jakhetiya, K. Gu, W. Lin, Q. Li, and S. P. Jaiswal, “A prediction backed model for quality assessment of screen content and 3-d synthesized images,” IEEE Transactions on Industrial Informatics, vol. 14, no. 2, pp. 652–660, 2017.
  • [52] L. Li, Y. Zhou, K. Gu, W. Lin, and S. Wang, “Quality assessment of dibr-synthesized images by measuring local geometric distortions and global sharpness,” IEEE Transactions on Multimedia, vol. 20, no. 4, pp. 914–926, 2018.
  • [53] M. S. Farid, M. Lucenteforte, and M. Grangetto, “Perceptual quality assessment of 3d synthesized images,” in Multimedia and Expo (ICME), 2017 IEEE International Conference on.   IEEE, 2017, pp. 505–510.
  • [54] ——, “Objective quality metric for 3d virtual views,” in 2015 IEEE International Conference on Image Processing (ICIP).   IEEE, 2015, pp. 3720–3724.
  • [55] ——, “Evaluating virtual image quality using the side-views information fusion and depth maps,” Information Fusion, vol. 43, pp. 47–56, 2018.
  • [56] K. Gu, V. Jakhetiya, J.-F. Qiao, X. Li, W. Lin, and D. Thalmann, “Model-based referenceless quality metric of 3d synthesized images using local image description,” IEEE Transactions on Image Processing, 2017.
  • [57] V. Jakhetiya, K. Gu, T. Singhal, S. C. Guntuku, Z. Xia, and W. Lin, “A highly efficient blind image quality assessment metric of 3d-synthesized images using outlier detection,” IEEE Transactions on Industrial Informatics, 2018.
  • [58] K. Gu, J. Qiao, S. Lee, H. Liu, W. Lin, and P. Le Callet, “Multiscale natural scene statistical analysis for no-reference quality evaluation of dibr-synthesized views,” IEEE Transactions on Broadcasting, 2019.
  • [59] D. D. Sandić-Stanković, D. D. Kukolj, and P. Le Callet, “Fast blind quality assessment of dibr-synthesized video based on high-high wavelet subband,” IEEE Transactions on Image Processing, 2019.
  • [60] S. Tian, L. Zhang, L. Morin, and O. Déforges, “NIQSV: A no reference image quality assessment metric for 3D synthesized views,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, March 2017.
  • [61] S. Tian, L. Zhang, L. Morin, and O. Déforges, “NIQSV+: A No-Reference Synthesized View Quality Assessment Metric,” IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 1652–1664, 2018.
  • [62] F. Shao, Q. Yuan, W. Lin, and G. Jiang, “No-reference view synthesis quality prediction for 3-d videos based on color–depth interactions,” IEEE Transactions on Multimedia, vol. 20, no. 3, pp. 659–674, 2017.
  • [63] G. Yue, C. Hou, K. Gu, T. Zhou, and G. Zhai, “Combining local and global measures for dibr-synthesized image quality evaluation,” IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 2075–2088, 2018.
  • [64] G. Wang, Z. Wang, K. Gu, and Z. Xia, “Blind quality assessment for 3d-synthesized images by measuring geometric distortions and image complexity,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 4040–4044.
  • [65] G. Wang, Z. Wang, K. Gu, L. Li, Z. Xia, and L. Wu, “Blind quality metric of dibr-synthesized images in the discrete wavelet transform domain.” IEEE transactions on image processing: a publication of the IEEE Signal Processing Society, 2019.
  • [66] Y. Zhou, L. Li, S. Wang, J. Wu, Y. Fang, and X. Gao, “No-reference quality assessment for view synthesis using dog-based edge statistics and texture naturalness,” IEEE Transactions on Image Processing, 2019.
  • [67] Y. Zhou, L. Li, S. Wang, J. Wu, and Y. Zhang, “No-reference quality assessment of dibr-synthesized videos by measuring temporal flickering,” Journal of Visual Communication and Image Representation, vol. 55, pp. 30–39, 2018.
  • [68] S. Ling and P. Le Callet, “How to learn the effect of non-uniform distortion on perceived visual quality? case study using convolutional sparse coding for quality assessment of synthesized views,” in 2018 25th IEEE International Conference on Image Processing (ICIP).   IEEE, 2018, pp. 286–290.
  • [69] X. Wang, K. Wang, B. Yang, F. W. B. Li, and X. Liang, “Deep blind synthesized image quality assessment with contextual multi-level feature pooling,” in 2019 IEEE International Conference on Image Processing (ICIP), Sep. 2019, pp. 435–439.
  • [70] S. Ling, J. Li, J. Wang, and P. L. Callet, “Gans-nqm: A generative adversarial networks based no reference quality assessment metric for RGB-D synthesized views,” CoRR, vol. abs/1903.12088, 2019. [Online]. Available: http://arxiv.org/abs/1903.12088
  • [71] J. J. Lim, C. L. Zitnick, and P. Dollár, “Sketch tokens: A learned mid-level representation for contour and object detection,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition, June 2013, pp. 3158–3165.
  • [72] P. Dollár, Z. Tu, P. Perona, and S. Belongie, “Integral channel features,” 2009.
  • [73] E. Shechtman and M. Irani, “Matching local self-similarities across images and videos.” in CVPR, vol. 2.   Minneapolis, MN, 2007, p. 3.
  • [74] S. Ling, J. Gutiérrez, K. Gu, and P. Le Callet, “Prediction of the influence of navigation scan-path on perceived quality of free-viewpoint videos,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 1, pp. 204–216, 2019.
  • [75] W. Mio, A. Srivastava, and S. Joshi, “On shape of plane elastic curves,” International Journal of Computer Vision, vol. 73, no. 3, pp. 307–324, 2007.
  • [76] A. Srivastava, E. Klassen, S. H. Joshi, and I. H. Jermyn, “Shape analysis of elastic curves in euclidean spaces,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 7, pp. 1415–1428, 2010.
  • [77] H. Freeman, “Application of the generalized chain coding scheme to map data processing.” RENSSELAER POLYTECHNIC INST TROY NY DEPT OF ELECTRICAL AND SYSTEMS ENGINEERING, Tech. Rep., 1978.
  • [78] A. Zheng, G. Cheung, and D. Florencio, “Context tree-based image contour coding using a geometric prior,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 574–589, 2016.
  • [79] V. Jakhetiya, O. C. Au, S. Jaiswal, L. Jia, and H. Zhang, “Fast and efficient intra-frame deinterlacing using observation model based bilateral filter,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 5819–5823.
  • [80] J. Wu, W. Lin, G. Shi, and A. Liu, “Reduced-reference image quality assessment with visual information fidelity,” IEEE Transactions on Multimedia, vol. 15, no. 7, pp. 1700–1705, Nov 2013.
  • [81] F. Battisti, “3DSwIM Source Code,” http://www.comlab.uniroma3.it/3DSwIM.html, last accessed Aug. 30th 2017, [Online].
  • [82] H. W. Lilliefors, “On the Kolmogorov-Smirnov test for normality with mean and variance unknown,” Journal of the American Statistical Association, vol. 62, no. 318, pp. 399–402, 1967.
  • [83] P. Maragos and R. W. Schafer, “Morphological systems for multidimensional signal processing,” Proceedings of the IEEE, vol. 78, no. 4, pp. 690–710, 1990.
  • [84] C.-H. Chou and C.-W. Chen, “A perceptually optimized 3-D subband codec for video communication over wireless channels,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 2, pp. 143–156, April 1996.
  • [85] J. H. Westerink and K. Teunissen, “Perceived sharpness in complex moving images,” Displays, vol. 16, no. 2, pp. 89–97, 1995.
  • [86] A. K. Moorthy and A. C. Bovik, “Visual importance pooling for image quality assessment,” IEEE Journal of Selected Topics in Signal Processing, vol. 3, no. 2, pp. 193–201, 2009.
  • [87] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, “Salient object detection: A discriminative regional feature integration approach,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2013, pp. 2083–2090.
  • [88] P. Kovesi et al., “Image features from phase congruency,” Videre: Journal of Computer Vision Research, vol. 1, no. 3, pp. 1–26, 1999.
  • [89] B. Julesz, “Cyclopean perception and neurophysiology,” Investigative Ophthalmology & Visual Science, vol. 11, no. 6, pp. 540–548, 1972.
  • [90] P. C. Teo and D. J. Heeger, “Perceptual image distortion,” in Human Vision, Visual Processing, and Digital Display V, vol. 2179. International Society for Optics and Photonics, 1994, pp. 127–141.
  • [91] B. K. Patra, R. Launonen, V. Ollikainen, and S. Nandi, “A new similarity measure using Bhattacharyya coefficient for collaborative filtering in sparse data,” Knowledge-Based Systems, vol. 82, pp. 163–177, 2015.
  • [92] K. Gu, G. Zhai, W. Lin, X. Yang, and W. Zhang, “Visual saliency detection with free energy theory,” IEEE Signal Processing Letters, vol. 22, no. 10, pp. 1552–1555, 2015.
  • [93] L. Tang, L. Li, K. Gu, X. Sun, and J. Zhang, “Blind quality index for camera images with natural scene statistics and patch-based sharpness assessment,” Journal of Visual Communication and Image Representation, vol. 40, pp. 335–344, 2016.
  • [94] Y. Zhang, T. D. Phan, and D. M. Chandler, “Reduced-reference image quality assessment based on distortion families of local perceived sharpness,” Signal Processing: Image Communication, vol. 55, pp. 130–145, 2017.
  • [95] K. Gu, G. Zhai, W. Lin, X. Yang, and W. Zhang, “No-reference image sharpness assessment in autoregressive parameter space,” IEEE Transactions on Image Processing, vol. 24, no. 10, pp. 3218–3231, 2015.
  • [96] K. Gu, G. Zhai, W. Lin, X. Yang, and W. Zhang, “No-reference image sharpness assessment in autoregressive parameter space,” IEEE Transactions on Image Processing, vol. 24, no. 10, pp. 3218–3231, Oct 2015.
  • [97] X. Min, K. Gu, G. Zhai, J. Liu, X. Yang, and C. W. Chen, “Blind quality assessment based on pseudo-reference image,” IEEE Transactions on Multimedia, vol. 20, no. 8, pp. 2049–2062, Aug 2018.
  • [98] R. Ferzli and L. J. Karam, “A no-reference objective image sharpness metric based on the notion of just noticeable blur (JNB),” IEEE Transactions on Image Processing, vol. 18, no. 4, pp. 717–728, April 2009.
  • [99] A. Cohen, I. Daubechies, and J.-C. Feauveau, “Biorthogonal bases of compactly supported wavelets,” Communications on Pure and Applied Mathematics, vol. 45, no. 5, pp. 485–560, 1992.
  • [100] K. Gu, J. Zhou, J.-F. Qiao, G. Zhai, W. Lin, and A. C. Bovik, “No-reference quality assessment of screen content pictures,” IEEE Transactions on Image Processing, vol. 26, no. 8, pp. 4005–4018, 2017.
  • [101] K. Gu, W. Lin, G. Zhai, X. Yang, W. Zhang, and C. W. Chen, “No-reference quality metric of contrast-distorted images based on information maximization,” IEEE Transactions on Cybernetics, vol. 47, no. 12, pp. 4559–4565, 2016.
  • [102] K. Gu, J. Qiao, X. Min, G. Yue, W. Lin, and D. Thalmann, “Evaluating quality of screen content images via structural variation analysis,” IEEE Transactions on Visualization and Computer Graphics, vol. 24, no. 10, pp. 2689–2701, 2017.
  • [103] H. G. Kim and Y. M. Ro, “Measurement of critical temporal inconsistency for quality assessment of synthesized video,” in 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016, pp. 1027–1031.
  • [104] M. A. Saad, A. C. Bovik, and C. Charrier, “Blind image quality assessment: A natural scene statistics approach in the DCT domain,” IEEE Transactions on Image Processing, vol. 21, no. 8, pp. 3339–3352, 2012.
  • [105] M. A. Saad, A. C. Bovik, and C. Charrier, “DCT statistics model-based blind image quality assessment,” in 2011 18th IEEE International Conference on Image Processing (ICIP). IEEE, 2011, pp. 3093–3096.
  • [106] A. K. Moorthy and A. C. Bovik, “Blind image quality assessment: From natural scene statistics to perceptual quality,” IEEE Transactions on Image Processing, vol. 20, no. 12, pp. 3350–3364, 2011.
  • [107] L. Liu, H. Dong, H. Huang, and A. C. Bovik, “No-reference image quality assessment in curvelet domain,” Signal Processing: Image Communication, vol. 29, no. 4, pp. 494–505, 2014.
  • [108] C. Li, Y. Zhang, X. Wu, W. Fang, and L. Mao, “Blind multiply distorted image quality assessment using relevant perceptual features,” in 2015 IEEE International Conference on Image Processing (ICIP), Sep. 2015, pp. 4883–4886.
  • [109] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, vol. 2, July 2001, pp. 416–423.
  • [110] X. Yang, F. Li, and H. Liu, “A survey of DNN methods for blind image quality assessment,” IEEE Access, vol. 7, pp. 123788–123806, 2019.
  • [111] J. Kim, H. Zeng, D. Ghadiyaram, S. Lee, L. Zhang, and A. C. Bovik, “Deep convolutional neural models for picture-quality prediction: Challenges and solutions to data-driven image quality assessment,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 130–141, Nov 2017.
  • [112] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
  • [113] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 770–778.
  • [114] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds.   Curran Associates, Inc., 2014, pp. 2672–2680. [Online]. Available: http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
  • [115] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes challenge: A retrospective,” International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, Jan 2015. [Online]. Available: https://doi.org/10.1007/s11263-014-0733-5
  • [116] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1452–1464, June 2018.
  • [117] M. Manassi, B. Sayim, and M. H. Herzog, “When crowding of crowding leads to uncrowding,” Journal of Vision, vol. 13, no. 13, pp. 10–10, 2013.
  • [118] Video Quality Experts Group, “Final report from the video quality experts group on the validation of objective models of multimedia quality assessment,” VQEG, March 2008.
  • [119] L. Krasula, K. Fliegel, P. Le Callet, and M. Klíma, “On the accuracy of objective image and video quality models: New methodology for performance evaluation,” in 2016 Eighth International Conference on Quality of Multimedia Experience (QoMEX). IEEE, 2016, pp. 1–6.
  • [120] S. Tian, “Image quality assessment of 3D synthesized views,” Ph.D. dissertation, Rennes, INSA, 2019.
  • [121] A. Moorthy and A. Bovik, “A modular framework for constructing blind universal quality indices,” IEEE Signal Processing Letters, vol. 17, 2009.
  • [122] L. Fei-Fei, R. Fergus, and P. Perona, “One-shot learning of object categories,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 594–611, 2006.
  • [123] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in Advances in Neural Information Processing Systems, 2017, pp. 4077–4087.