Immersive multimedia has developed in leaps and bounds along with the emergence of more advanced technologies for capturing, processing and rendering. As a result, applications like Free-viewpoint TV (FTV), 3DTV, Virtual Reality (VR) and Augmented Reality (AR) have engaged a lot of users and become a hot topic in the multimedia field. In this sense, FTV, which allows users to immerse themselves in a scene by freely switching viewpoints as they do in the real world, enables Super Multi-View (SMV) and Free Navigation (FN) applications. On the one hand, SMV requires a horizontal set of more than 80 views (linearly or angularly arranged) to provide users with a 3D viewing experience with wide horizontal parallax and smooth transitions between adjacent views. On the other hand, FN requires only a limited set of input views, coming from sparse camera arrangements in large-baseline setups. In both cases, efficient compression techniques are essential to deal with such a huge amount of data for delivery and storage, together with robust view-synthesis algorithms, such as Depth-Image-Based Rendering (DIBR), which make it possible to reconstruct the Free-Viewpoint Video (FVV) content from a limited set of input views.
Importance and difficulties of video quality assessment in FVV: A video Quality Assessment Metric (VQM) is desirable for evaluating the performance of video systems, covering the whole processing chain from capturing to rendering. While hardware developments are leading the advances in capturing and rendering FVV, compression techniques and view-synthesis algorithms are the main focus of research, as reflected by the ongoing standardization activities within MPEG. This is mainly due to their importance for the perceived quality and thus for the success of the related applications and services.
Aside from the well-known compression artifacts, view-synthesis techniques (such as DIBR) have to deal with disoccluded regions. These arise when occluded regions, which are not visible in the reference views, become visible in the generated ones. Techniques to recover disoccluded regions often introduce geometric distortions and ghosting artifacts. These synthesis-related artifacts differ in nature from compression artifacts, since they mostly appear locally along the disoccluded areas, while compression artifacts are usually spread over the entire video. In addition, view-synthesis artifacts increase with the baseline distance (i.e., the number of synthesized views between two real views), up to a point where they may dominate over compression artifacts. Thus, it is very unlikely that a VQM proposed for compression-related distortions would be efficient for predicting the quality of sequences produced using synthesized views.
Impact of navigation scan-path on perceived quality: free navigation vs. predefined trajectories
Immersive media technologies offer users more freedom to explore the content than traditional media, allowing more interactive experiences. These new possibilities make the observers' behavior an important factor for the perceived quality.
Given that each observer can explore the content differently, two approaches can be adopted to study this factor in practice: (1) let the observers navigate the content freely; (2) let the observers watch the sequences along certain predefined navigation trajectories. With the first approach, one could obtain a common trajectory from all the observers' data. However, this common trajectory does not necessarily represent the critical one that stresses the system to the worst case. Moreover, if observers are allowed to navigate freely during the test, navigation becomes a new factor that increases the variability of the mean opinion score (MOS), on top of the observers' variability in forming quality judgments. As a result, more observers are likely to be required to obtain MOS values that can distinguish one system from another with statistical significance. The second approach (predefined trajectories) is not affected by this trajectory-related source of variability, but comes with the challenge of selecting the "right" trajectory. For system benchmarking, one could define the "right" trajectory as the most critical one, or weakest link, e.g., the one leading to the lowest perceived quality. Nevertheless, there is a good chance that this trajectory effect is highly dependent on content, some contents being more sensitive than others to the choice of trajectory. Identifying the impact of the navigation trajectory among different viewpoints on perceived quality for a given content is therefore of particular interest. For quality evaluation, it is useful to know how navigation affects the visual experience and which are the "worst" trajectories for the system, in order to carry out performance evaluations of the system under study in the most stressful cases. Consequently, computational tools to select the critical trajectories would be extremely useful.
Contribution: Based on the discussion above, this paper addresses two main research questions: (1) does the way observers navigate FVV content affect perceived quality; (2) if the trajectory affects quality, how can an objective metric be developed to indicate the "worst" trajectory. To answer these two questions, the contribution of this paper is twofold. Firstly, a subjective test is conducted to study the impact of the exploration trajectory on perceived quality in FVV application scenarios containing compression and view-synthesis artifacts. In this sense, the concept of Hypothetical Rendering Trajectory (HRT) is introduced. The annotated database obtained from this test is also released for research purposes in the field. Secondly, a full-reference Sketch-Token-based synthesized Video Quality Assessment Metric (ST-VQM) is proposed, which quantifies to what extent the classes of contours change due to synthesis. This metric is capable of predicting whether sequences based on a given trajectory are of higher or lower quality than sequences based on other trajectories, with respect to subjective scores.
The remainder of the paper is organized as follows. In Section II, an overview of the state of the art in subjective and objective quality evaluation for FVV scenarios is presented and discussed. In Section III, the details of the subjective experiment are described, while Section V introduces the proposed VQA metric based on a mid-level descriptor. The experimental results from the subjective experiment and the performance evaluation of the proposed objective metric are presented in Section IV and Section VI, respectively. Finally, conclusions are given in Section VII.
II Related Work
II-A Subjective studies
Although the technical aspects of FTV have been addressed for some years, the subjective evaluation of the QoE of such systems is still an open issue. The majority of the existing studies have been carried out using conventional screens and limiting the interactivity of the users by showing some representative content or predefined trajectories simulating the movement of the observers. In FVV systems, this is especially the case given the limited access to SMV or light-field displays, since only a few prototypes are available so far. Nevertheless, it is worth noting the preliminary subjective study that Dricot et al. carried out considering coding and view-synthesis artifacts using a light-field display.
In addition to compression techniques, the evaluation and understanding of view-synthesis algorithms is crucial for a successful development of FTV applications and is still an open issue. In this sense, some works that were carried out with previous technologies (e.g., multi-view video) should be taken into account when studying the effects of view synthesis in current FTV applications. Firstly, Bosc et al. carried out subjective studies to evaluate the visual quality of synthesized views using DIBR. In these studies, the quality performance of view synthesis was evaluated in different ways, such as: a) the quality of synthesized still images, b) the quality of videos showing a synthesized view of a Multi-View plus Depth (MVD) video sequence, and c) video sequences showing a smooth sweep across the different viewpoints of a static scene. These different approaches are represented in Fig. 1, showing that the first approach only considers spatial DIBR-related artifacts, the second approach also considers temporal distortions within the synthesized view, and the third approach considers spatial DIBR-related artifacts of all the views. To complete the evaluation, another use case should consider a view-sweep over the views in video sequences, as depicted in Fig. 1(d) (i.e., generating videos in which a sweep across the different viewpoints is shown, as if the observer were moving their head horizontally). This approach has recently been adopted in subjective studies with SMV, which were carried out to study different aspects of this technology, such as smoothness in view transitions, comfortable view-sweep speed, and the impact of coding artifacts. MPEG has adopted this alternative for its current standardization activities regarding the evaluation of compression techniques for FTV.
Furthermore, the availability of appropriate datasets resulting from subjective tests is crucial for research on both subjective and objective quality. Especially for supporting the development of objective quality metrics, databases containing suitable stimuli (images/videos) annotated with results from subjective tests are essential. Some efforts have already been made to publish datasets containing free-viewpoint content and the results of some of the aforementioned subjective tests. Nevertheless, none of these datasets has considered the effect of content-adapted trajectories in the "view-sweeping along time" scenario.
II-B Objective metrics
Some image quality metrics specifically designed to handle view-synthesis artifacts have recently been proposed. For instance, Battisti et al. proposed a metric based on statistical features of wavelet sub-bands. Furthermore, considering that multi-resolution approaches could increase the performance of image quality metrics, Sandić-Stanković et al. proposed to use morphological wavelet decomposition and multi-scale decomposition based on morphological pyramids. Later, reduced versions of these two metrics were presented, with the claim that PSNR is more consistent with human judgment when calculated at higher morphological decomposition scales.
All the aforementioned metrics are limited to the quality assessment of synthesized static images, so they do not explicitly consider temporal distortions that may appear in videos containing synthesized views. Some ad hoc video metrics have been proposed. Zhao and Yu proposed a measure that calculates the temporal artifacts that can be perceived by observers in the background regions of synthesized videos. Similarly, Ekmekcioglu et al. proposed a video quality measure using depth and motion information to take into account where the degradations are located. Moreover, another video metric was recently introduced by Liu et al., considering the spatio-temporal activity and the temporal flickering that appears in synthesized video sequences. However, these video quality measures are only able to predict the impact of view-synthesis degradations when comparing videos corresponding to one single view (as represented in Fig. 1(b)). In other words, switching among views (resulting from the possible movement of the viewers) and the related effects (e.g., inconsistencies among views, geometric flicker along the time and view dimensions, etc.) are not addressed. Hanhart et al. evaluated the performance of state-of-the-art 2D video quality measures on sequences generated by view-sweep (as depicted in Fig. 1(c)), thus considering viewpoint changes, and reported low performance of all measures in predicting the perceptual quality. An efficient objective video quality measure able to deal with the "view-sweeping along time" scenario is still needed.
III Subjective study of the impact of trajectory on perceived quality
As described in the introduction, the first research question of this paper is to identify the impact of the navigation trajectory among different viewpoints on perceived quality, taking content into account. To this end, a subjective study is conducted using content-related trajectories. A video quality database for FVV scenarios is built, including both compression and view-synthesis artifacts and containing the scores from the subjective assessment test described in the following. The videos in this database are generated by simulating exploration trajectories that observers may use in real scenarios, which are set by the Hypothetical Rendering Trajectory (HRT) defined in the following subsection. This database is named the 'Image, Perception and Interaction group Free-viewpoint video database' (IPI-FVV). (Footnote 1: The public link for downloading the database will be added in the final version of this paper.)
III-A Hypothetical Rendering Trajectory
A commonly used naming convention for subjective quality assessment studies was provided by the Video Quality Experts Group (VQEG), including: SRC (i.e., source or original sequences), HRC (i.e., Hypothetical Reference Circuit, or the processing applied to the SRC to obtain the test sequences, such as compression techniques), and PVS (i.e., Processed Video Sequence, or the test sequence resulting from applying an HRC to an SRC). In the context of FN, one should reflect another dimension of the system under test, related to interactivity (e.g., the use of exploration trajectories in the quality evaluation of immersive media). Towards this goal, we introduce the term Hypothetical Rendering Trajectory (HRT) to refer to the simulated exploration trajectory that is applied to a PVS (the result of an HRC on a given SRC) for rendering. It is worth mentioning that this term is general and applicable to all immersive media, from multiview video to VR, light fields, AR and point clouds.
III-B Test Material
Three different SMV sequences are used in our study: Champagne Tower (CT), Pantomime (P) and Big Buck Bunny Flowers (BBBF). The three SMV sequences are described in Table I. They were also selected as test materials in previous work. For each of the 3 SRC sequences, 20 HRCs were selected, covering 5 baselines and 4 rate-points (RP). In addition, 2 HRTs were included, generating 120 PVSs in total (3 SRCs x 20 HRCs x 2 HRTs). Details on these parameters, which were selected after a pretest with expert viewers, are described in the following subsections.
|Name|Views|Resolution|Fps|Seconds|Frames|QP values|Baseline distance|
|BBBF|91|1280 x 768|24|5|121|35, 45, 50|B0, B2, B5, B9, B13|
|CT|80|1280 x 960|29.4|10|300|37, 43, 50|B0, B4, B8, B12, B16|
|P|80|1280 x 960|29.4|10|300|37, 43, 50|B0, B2, B6, B12, B16|
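As a rough illustration of how the test conditions combine, the following minimal sketch enumerates the PVSs from the values in Table I. The dictionary layout, the HRT labels and the use of `None` for the uncompressed rate-point are assumptions made for this example only, not part of the actual database tooling.

```python
from itertools import product

# Per-content baselines and QP values, copied from Table I.
CONTENTS = {
    "BBBF": {"baselines": [0, 2, 5, 9, 13], "qps": [35, 45, 50]},
    "CT": {"baselines": [0, 4, 8, 12, 16], "qps": [37, 43, 50]},
    "P": {"baselines": [0, 2, 6, 12, 16], "qps": [37, 43, 50]},
}
# Hypothetical labels for the two trajectories used in the study.
HRTS = ["important-objects", "important-objects-stay"]


def enumerate_pvs():
    """Return one dict per PVS: SRC x baseline x rate-point x HRT."""
    pvs = []
    for src, cfg in CONTENTS.items():
        # None stands for the uncompressed rate-point (an assumption).
        rate_points = [None] + cfg["qps"]
        for b, qp, hrt in product(cfg["baselines"], rate_points, HRTS):
            pvs.append({"src": src, "baseline": b, "qp": qp, "hrt": hrt})
    return pvs


if __name__ == "__main__":
    # 3 SRCs x (5 baselines x 4 rate-points) x 2 HRTs
    print(len(enumerate_pvs()))
```

Each content contributes 5 x 4 = 20 HRCs, and each HRC is rendered along both trajectories.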
III-B1 Camera configuration
For each source sequence (SRC), 5 stereo baseline values, as summarized in Table I, are selected in the test, including the setting without synthesized views. The baseline is measured as the camera distance (gap) between the left and right real views. Here, $B_i$ denotes the stereo baseline distance used to generate the synthesized virtual views, where $i$ is the number of synthesized views between two reference views. For instance, in the camera setting shown in the upper part of Fig. 2, four virtual views are synthesized between each pair of views captured by original cameras (indicated by the two closest black cameras in the figure), which serve as left and right references. In this case, the baseline distance is 4, denoted as $B_4$. Fig. 2 illustrates the baseline setting used to generate the synthesized views in the subjective study. For example, in the lower part of Fig. 2, for $B_4$, a total of 4 virtual views are synthesized between each two transmitted encoded views.
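The relation between the baseline index and the view layout can be sketched as follows. This is a minimal illustration assuming views are indexed from 0 and the first view is a real one; the function name and the handling of views after the last real view are hypothetical.

```python
def split_views(n_views, baseline):
    """Split view indices into real (reference) and synthesized views.

    `baseline` is the number of synthesized views between two real
    views, so real views are spaced baseline + 1 indices apart.
    Views after the last real view, if any, are simply listed as
    synthesized (an assumption made for this sketch).
    """
    if baseline < 0:
        raise ValueError("baseline must be non-negative")
    step = baseline + 1
    real = list(range(0, n_views, step))
    real_set = set(real)
    synthesized = [v for v in range(n_views) if v not in real_set]
    return real, synthesized


if __name__ == "__main__":
    real, synthesized = split_views(11, 4)
    print(real)         # real views
    print(synthesized)  # views to be synthesized between them
```

With a baseline of 0 every view is a real one, which corresponds to the setting without synthesized views.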
III-B2 3D-HEVC configuration
In our experiment, HTM 13.0 in 3D High Efficiency Video Coding (3D-HEVC) mode was used to encode all the views of the three selected SMV sequences. These encoded views, along with the selected original views, are used as the reference views in the subsequent synthesis process and are also referred to as 'anchors'. The recommended configuration of the 3D-HEVC encoder is adopted in the experiment. Specifically, taking into account the contents and the limited duration of the subjective tests, 3 rate-points, as summarized in Table I, were selected for each SRC according to the results of the pretest. For each content, the original sequences without compression are also included in the experiment as an additional rate-point.
III-B3 Depth maps and virtual views generation
In this paper, reference software tools were used to prepare the synthesized views, namely the Depth Estimation Reference Software (DERS) and the View Synthesis Reference Software (VSRS), which have been developed throughout the MPEG-FTV video coding exploration and standardization activities. To generate virtual views from reference sequences taken by real cameras, depth maps and the related camera parameters are needed. For sequences 'CT' and 'P', since original depth maps were not provided, DERS version 6.1 is used to generate the depth map for each corresponding view. The related parameters are set as recommended in [46, 47]. For the generation of synthesized views, VSRS version 4.1 is applied, with the related parameters configured for each corresponding content.
III-B4 Navigation trajectory generation
One of the purposes of this subjective experiment is to check whether the semantic content of the videos and the navigation trajectory among views affect the perceived quality. Therefore, different HRTs are considered in this study, generating sweeps that focus more on important objects, since the human visual system tends to attach greater interest to 'Regions of Interest' (ROI) that contain important objects. Specifically, the following two HRTs, depicted in Fig. 3, are chosen from the pretest session, considering the fact that human observers may pay more attention to, and even stop navigating to observe, targeted objects in the video: (1) an 'important-objects HRT' that first scans from the left-most to the right-most view to observe the overall content of the video, then scans back to the views that contain the main objects and looks left and right around the central view that contains the objects several times, at a velocity of one frame per view (1 fpv); (2) an 'important-objects-stay HRT' that first scans from the left-most to the right-most view to observe the overall content of the video, then scans back to the views that contain the main objects at a velocity of 2 fpv, and finally stays in the central view that contains the main object. Due to limited resources, only two trajectories are considered in this study as an initial exploration.
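The two trajectories can be sketched as sequences of view indices, one per displayed frame. The code below is only an illustration under stated assumptions: the function names, the 1 fpv speed of the initial sweep, the oscillation width (`half_width`) and the number of oscillations (`n_loops`) are hypothetical parameters, not values taken from the study.

```python
def sweep(start, stop, frames_per_view=1):
    """View indices when moving from `start` to `stop` (inclusive),
    holding each view for `frames_per_view` frames."""
    step = 1 if stop >= start else -1
    path = []
    for view in range(start, stop + step, step):
        path.extend([view] * frames_per_view)
    return path


def important_objects_hrt(n_views, center, half_width=2, n_loops=3):
    """Full sweep, scan back to the object views, then look left and
    right around the central view at 1 fpv."""
    left, right = center - half_width, center + half_width
    if half_width < 1 or left < 0 or right > n_views - 1:
        raise ValueError("object views must lie inside the camera array")
    path = sweep(0, n_views - 1)
    path += sweep(n_views - 2, left)  # right-most view was just shown
    for _ in range(n_loops):
        path += sweep(left + 1, right)
        path += sweep(right - 1, left)
    return path


def important_objects_stay_hrt(n_views, center, n_frames):
    """Full sweep, scan back to the central view at 2 fpv, then stay."""
    if not 0 <= center <= n_views - 2:
        raise ValueError("central view must lie left of the last view")
    path = sweep(0, n_views - 1)
    path += sweep(n_views - 2, center, frames_per_view=2)
    path += [center] * max(0, n_frames - len(path))
    return path[:n_frames]
```

A PVS for a given HRT would then be assembled by taking, for frame `t`, the frame `t` of view `path[t]`.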
III-C Test Methodology
The Absolute Category Rating with hidden reference (ACR-HR) methodology was adopted for the subjective experiment. Thus, the observers watched the test videos sequentially and, after each one, provided a score using the five-level quality scale. For this, an interface with adjectives representing the whole scale was shown until the score was provided, and then the next test video was displayed. It is also worth noting that each test video was shown only once and that the test videos were shown to each observer in a different random order. At the beginning of the test session, an initial explanation was given to the participants indicating the purpose of the test and how to accomplish it. Then, a set of training videos was shown to the observers to familiarize them with the test methodology and the quality range of the content. The entire session lasted around 30 minutes for each observer.
III-D Environment and Observers
The test sequences were displayed on a professional TVLogic LVM401W screen, using a high-performance computer. Observers were provided with a tablet connected to the display computer for voting. The test room was set up according to ITU Recommendation BT.500, so the walls were covered by gray curtains and the lighting conditions were regulated accordingly to avoid annoying reflections. A viewing distance of 3H (H being the height of the screen) was chosen.
A total of 33 participants took part in the subjective test, including 21 females and 12 males, with ages varying from 19 to 42 (average age of 24). Before the test, the observers were screened for correct visual acuity and color vision using the Snellen chart and the Ishihara test, respectively, and all of them reported normal or corrected-to-normal vision. After the subjective test, the obtained scores were screened according to the procedure recommended by ITU-R BT.500 and the VQEG. As a result of this screening, four observers were removed.
IV Subjective Experiment Results and Analysis
The subjective results are shown in Fig. 4, where each sub-graph summarizes the mean opinion score (MOS), with confidence intervals, for each content and each virtual sweep. Apart from the MOS, the differential mean opinion score (DMOS) is also provided along with the database, computed from the hidden references. As required for a quality dataset, the MOS values are well distributed, covering almost the whole rating scale. In addition, in order to verify whether the Baselines (B), Rate-Points (RP) and, especially, virtual Trajectories (T) have a significant impact on perceived quality, a three-way analysis of variance (ANOVA) was performed. From the results of this test and those shown in Fig. 4, the following main conclusions can be drawn:
For the same configuration (i.e., baseline, rate-point and trajectory), the quality obtained with different contents is significantly different.
The effects of view-synthesis and compression artifacts are obvious, as shown by how the perceived quality changes with the baseline alone (for a given RP) or with the bitrate alone (for a fixed baseline). The accumulation of these effects can also be observed in the scores of the test sequences with combined degradations.
The three considered factors, especially the trajectory, have a significant impact on the perceived quality.
In terms of interaction among the considered factors, the interaction between baseline distance and coding quality has a significant effect on the MOS, as expected.
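For reference, the MOS, its confidence interval and the DMOS mentioned at the start of this section can be computed along the following lines. This is a hedged sketch rather than the exact procedure used for the database: the 95% interval uses a normal approximation, and the differential score follows the ACR-HR definition in ITU-T P.910, DV = V(PVS) - V(REF) + 5, which is assumed here because the reference used in the text is not recoverable.

```python
from math import sqrt
from statistics import mean, stdev


def mos_with_ci(scores, z=1.96):
    """Mean opinion score and half-width of its ~95% confidence
    interval (normal approximation, an assumption of this sketch)."""
    m = mean(scores)
    if len(scores) > 1:
        half = z * stdev(scores) / sqrt(len(scores))
    else:
        half = 0.0
    return m, half


def dmos(pvs_scores, ref_scores):
    """Differential MOS from per-observer ACR-HR votes.

    Uses DV = V(PVS) - V(REF) + 5 (ITU-T P.910); both lists must
    follow the same observer order.
    """
    if len(pvs_scores) != len(ref_scores):
        raise ValueError("need one PVS and one reference vote per observer")
    return mean(p - r + 5 for p, r in zip(pvs_scores, ref_scores))
```

With this definition, a DMOS of 5 means the PVS was rated as high as its hidden reference.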
A more detailed analysis of the impact of the trajectory on perceived quality follows:
The MOS averaged over contents ('CT', 'P', 'BBBF') and conditions is lower for one trajectory than for the other. Apart from the ANOVA test, to further confirm the impact of the trajectory on perceived quality, the database is divided into two sets, one per trajectory. A t-test is conducted on the pairs of sequences that share the same baseline and rate-point configuration but follow different trajectories. According to the result, there is a significant difference between the quality of these two sets.
Certain contents are more sensitive to certain trajectories. To further check whether the impact of a trajectory depends on the content of the sequences, another t-test is conducted. More specifically, for each content, pairs of sequences generated with the same baseline and rate-point but different trajectories are first formed. Then, a t-test is conducted by taking the individual subjective scores (opinion scores from all the observers) of each pair of sequences as input. According to the t-test results, for content 'P', 50% of the pairs are of significantly different perceived quality. However, for contents 'CT' and 'BBBF', only around 10% of the pairs are of significantly different quality. This indicates that the impact of the trajectory on quality is content dependent. In other words, the 'extreme trajectory' differs between videos with different contents.
Whether sequences following one trajectory are of higher quality than those following another also depends on the quality range (in terms of baseline and rate-point setting). The t-test on the individual subjective scores of each trajectory pair also shows that, for content 'P', videos following one trajectory are of better quality than those following the other when the quality is above a certain threshold (smaller baseline or smaller rate-point), and vice versa. For example, for content 'P', which trajectory yields the better quality changes once the rate-point exceeds a certain value.
In conclusion, the subjective study confirms that the navigation trajectory has an impact on perceived quality. It is found that a content-related trajectory is able to stress the system one step further, towards a more extreme situation. Therefore, an objective image/video metric that is able to indicate whether sequences following one trajectory are of better quality than those following another is required to better push the system to its limit according to the content. To fill this need, a video quality metric is introduced in the next section.
V Video Quality Measure for Free Viewpoint Videos
An objective quality measure that provides a more robust indication of the quality for a given HRT is required. Towards this goal, a Sketch-Token based Video Quality Measure (ST-VQM) is proposed to quantify the change of structure. The 'Sketch-Token' (ST) model is a bag-of-words approach that trains a dictionary for representing contours by contour categories. Considering that (1) a content-related trajectory is able to stress the system, (2) content is related to structure, and (3) geometric distortions, which disrupt structure, are the most annoying degradations introduced by view synthesis, the main idea of the proposed method is to assess the quality of the navigation videos by quantifying to what extent the classes of contours change due to view synthesis, compression and transitions among views. It is an extended version of our previous work (a quality measure for images), adapted to cope with the FVV scenario. In this version, the complex registration stage is replaced by a local region selection, and an ST-based temporal estimator is incorporated to quantify temporal artifacts.
The improved video quality metric consists of two parts: a spatial metric, ST-IQM, as shown in Fig. 5, and a temporal metric, ST-T, as shown in Fig. 6. Details of each part are given in the following subsections.
V-A Sensitive Region Selection based on Interest Points Matching and Registration
Sensitive region selection is important for the later quality evaluation of DIBR-based synthesized views, mainly for the following reasons: (1) instead of uniform distortions distributed equally throughout the entire frame, synthesized views contain mainly local, non-uniform geometric distortions; (2) distortions located around regions of interest are less tolerable for human observers than degradations located in inconspicuous areas. Meanwhile, 'poor' regions in an image are more likely to be perceived by humans, and with more severity, than 'good' ones, so images with even a small number of 'poor' regions are penalized more heavily; (3) the global and local shifting of objects introduced by DIBR algorithms is a big challenge for point-to-point metrics like PSNR because of the mismatched correspondences.
Interest point-based descriptors like SURF, which reveal an image's local properties and the local shape information of objects, are good candidates for selecting important local regions where local geometric DIBR artifacts could appear. Furthermore, the subsequent interest point matching can also be useful to compensate for consistent 'Shift of Objects' artifacts, which are, to some extent, acceptable for the human visual system.
The process of sensitive region selection is summarized by the red dashed bounding box in Fig. 5. First, SURF points are extracted in both the original and the synthesized frames. Then, the SURF points of the two frames are matched following the reference method (the original frame being considered as the reference for this matching process). Pairs of matched interest points whose coordinates differ significantly are discarded, as they are not considered plausible matched regions resulting from the synthesis process. The patches centered at the corresponding matched SURF points in the synthesized and original images are then considered. The size of these patches is set to match the ST formalism (see next section). The matching relation for all patches is encoded in a matching matrix that stores, for each patch, the coordinate of its SURF point in the reference frame and the coordinate of the matched SURF point in the synthesized frame.
To illustrate the capability of SURF for selecting sensitive regions, one example is presented in Fig. 5(e). The error maps are generated from the synthesized and the reference images. The darker a region, the more distortion it contains, as shown in the top part of the dashed green bounding box in Fig. 5(e). The red bounding boxes represent the sensitive regions extracted by the proposed process. It can be observed that, as desired, the majority of regions containing severe local distortions are well identified by this process.
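The step that discards implausible matches can be sketched as follows. This is a minimal illustration that assumes the matched SURF points are already available as coordinate pairs and that "significantly different" is read as a displacement above per-axis thresholds; the function name and the thresholds `max_dx` and `max_dy` are hypothetical, since the values actually used are not given in the text.

```python
def filter_matches(matches, max_dx, max_dy):
    """Keep matched point pairs whose coordinates stay close.

    `matches` is a list of ((x_ref, y_ref), (x_syn, y_syn)) tuples,
    e.g. produced by any SURF matcher. `max_dx` and `max_dy` are
    hypothetical thresholds on the horizontal and vertical shift.
    """
    kept = []
    for (xr, yr), (xs, ys) in matches:
        if abs(xr - xs) <= max_dx and abs(yr - ys) <= max_dy:
            kept.append(((xr, yr), (xs, ys)))
    return kept


if __name__ == "__main__":
    candidates = [((10, 10), (12, 11)), ((10, 10), (80, 10))]
    # The second pair is shifted by 70 pixels and is dropped.
    print(filter_matches(candidates, max_dx=5, max_dy=5))
```

The surviving pairs play the role of the matching matrix: each row links a patch centre in the reference frame to its counterpart in the synthesized frame.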
V-B Sketch-token based Spatial Dissimilarity
Structures convey critical visual information and are beneficial to scene understanding, particularly the fine structures (edges) and main structures (contours) [57, 58]. When synthesizing virtual views with DIBR methods, the key target is to make the regions occluded in the original view (mainly located at the contours of foreground objects) visible in the virtual view. Measuring the variations that occur at the contours is therefore highly related to the degradation of image quality in this use case. Consequently, a method that encodes contours well would be a good candidate. The local edge-based mid-level features called 'Sketch Tokens' have been proposed to capture and encode contour boundaries. They are based on the idea that the structure in an image patch can be described as a linear combination of 'contour' patches from a universal codebook (trained once and for all).
In the work of Lim et al., to train the codebook of contour patches, human subjects were asked to draw sketches as structural contours for each image in a training set. 151 classes of sketch tokens were formed by clustering pixel patches from the labeled training set. After extracting a set of low-level features from the patches, a random decision forest model was adopted to train 150 contour classifiers for the contours within patches. The output of each trained contour classifier is the likelihood that the corresponding contour exists in the patch. The remaining (151st) category is for patches that do not contain any structural contour, e.g., patches with only smooth texture. Its likelihood can be calculated as one minus the sum of the 150 contour likelihoods, since the 151 likelihoods sum to 1. Finally, the outputs of these 151 classifiers are concatenated to form the ST vector, so that the patch centered at a given pixel is represented by its ST vector, and the set of classifiers acts as the universal codebook.
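As a small illustration of how the ST vector is completed, the sketch below appends the no-contour likelihood to 150 contour-class outputs. It assumes the classifier outputs are already available as a list of probabilities; the function name and the choice to place the no-contour class last are hypothetical.

```python
def st_vector(contour_probs):
    """Append the no-contour probability to 150 contour-class outputs."""
    if len(contour_probs) != 150:
        raise ValueError("expected 150 contour-class probabilities")
    total = sum(contour_probs)
    if min(contour_probs) < 0.0 or total > 1.0 + 1e-9:
        raise ValueError("inputs must be non-negative and sum to at most 1")
    # Clamp at 0 to absorb floating-point rounding.
    no_contour = max(0.0, 1.0 - total)
    return list(contour_probs) + [no_contour]
```

The resulting 151-dimensional vector is a probability distribution, which is what makes a divergence-based comparison between two patches meaningful.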
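In our metric, we extract the ST vectors $ST_x$ and $ST_{x'}$ of the patches centred at each pair of matched SURF points $x$ and $x'$ in the matching matrix. The dissimilarity between each pair of matched contour vectors $ST_x$ and $ST_{x'}$ is then computed. As each vector contains probabilities $p_i$ whose sum equals 1, we propose to use the Jensen–Shannon divergence as the dissimilarity measure, which has the advantage of being bounded, as opposed to the original Kullback–Leibler divergence. The dissimilarity between the matched patches centred at $x$ and $x'$ respectively is then calculated as

$$D_{JS}(ST_x \,\|\, ST_{x'}) = \frac{1}{2} D_{KL}(ST_x \,\|\, M) + \frac{1}{2} D_{KL}(ST_{x'} \,\|\, M), \tag{1}$$

where $M = \frac{1}{2}(ST_x + ST_{x'})$, and $D_{KL}$ is the Kullback–Leibler divergence defined as

$$D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}. \tag{2}$$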
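The Jensen–Shannon dissimilarity between two ST vectors can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, the natural logarithm is used (so the value is bounded by $\ln 2$), and terms with zero probability are skipped, following the usual convention $0 \log 0 = 0$.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) for discrete distributions."""
    # Terms with p_i == 0 contribute nothing (0 * log 0 is taken as 0).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Mixture distribution; it is non-zero wherever p or q is non-zero,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Two toy 3-class vectors: identical structure vs. shifted structure.
same = js_divergence([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])
shifted = js_divergence([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
```

Unlike the Kullback–Leibler divergence, this measure is symmetric and stays finite even when one vector assigns zero probability to a class that the other does not.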
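In order to amplify error regions with larger dissimilarity, the Minkowski distance measure is used as the pooling strategy across sensitive regions. The spatial part of the proposed metric, ST-IQM, is then defined as

$$\text{ST-IQM} = \left( \frac{1}{N} \sum_{n=1}^{N} D_{JS,n}^{\,\beta} \right)^{1/\beta}, \tag{3}$$

where $D_{JS,n}$ is the Jensen–Shannon dissimilarity of the $n$-th pair of matched patches, $N$ is the total number of matched SURF points in the frame, and $\beta$ is the parameter of the $\beta$-norm defining the vector space.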
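The effect of Minkowski pooling on the per-patch dissimilarities can be illustrated with a short sketch. The function name `minkowski_pool` is hypothetical, and the normalisation by the number of matched points together with the final root are assumptions about the exact form of the pooling.

```python
def minkowski_pool(dissimilarities, beta):
    """Pool per-patch dissimilarities with exponent beta.

    Larger beta gives more weight to the patches with the largest
    dissimilarity, i.e. to the most distorted regions.
    """
    if not dissimilarities:
        raise ValueError("at least one matched patch is required")
    if beta <= 0:
        raise ValueError("beta must be positive")
    n = len(dissimilarities)
    mean_power = sum(d ** beta for d in dissimilarities) / n
    return mean_power ** (1.0 / beta)


# One strongly distorted patch among three clean ones.
scores = [0.0, 0.0, 0.0, 0.6]
mean_pooled = minkowski_pool(scores, beta=1)   # plain average
amplified = minkowski_pool(scores, beta=4)     # emphasises the outlier
```

With `beta=1` the pooling reduces to the arithmetic mean, so localized distortions are diluted by the clean patches; a larger exponent keeps the pooled score closer to the worst patch.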
V-C Sketch Token based Temporal Dissimilarity
Sweeping between views introduces and amplifies specific temporal artifacts, including flickering and temporal structure inconsistency. Among them, temporal structure inconsistency is usually the artifact to which human observers are most sensitive, since it is usually located around important moving objects and is easier to notice than other temporal artifacts.
To quantify temporal structure inconsistency, we further compute the dissimilarity score between each pair of consecutive frames using the Sketch Token model introduced in Section V-B. In the previous section, ST-IQM was used to quantify the difference in structure organization between two images (the original purpose of this framework). It can also be used to encode and describe how structures evolve from one frame to another along a given sequence. Temporal structure changes such as those observed in FVV should affect this description. This idea is exploited to refine the quality estimation for FVV so that temporal inconsistency is captured.
Fig. 6 is a diagram explaining how the Sketch Token based temporal distortion is calculated. More specifically, for each pair of consecutive frames $f_t$ and $f_{t+1}$ of a sequence $S$, one can compute $\text{ST-IQM}(f_t, f_{t+1})$ using equation (3). A vector $V_S$ can then be formed by considering all frames of the sequence, each component of the vector corresponding to $\text{ST-IQM}(f_t, f_{t+1})$. We define the Sketch Token based temporal dissimilarity (ST-T) between the original and the synthesized sequences as the Euclidean distance between the two temporal vectors of the original and the synthesized sequence:

$$\text{ST-T} = d_E\left(V_{S_o}, V_{S_s}\right), \tag{4}$$

where $S_o$ and $S_s$ denote the original and the synthesized sequence, and $d_E$ is the Euclidean distance function.
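The temporal score can be sketched as below. This is an illustration only: `frame_dissimilarity` stands in for the frame-to-frame spatial score and is passed in as a hypothetical callable, and the toy "frames" are plain numbers so that the example stays self-contained.

```python
import math


def temporal_vector(frames, frame_dissimilarity):
    """Dissimilarity between every pair of consecutive frames."""
    return [frame_dissimilarity(a, b) for a, b in zip(frames, frames[1:])]


def st_t(ref_frames, syn_frames, frame_dissimilarity):
    """Euclidean distance between the temporal vectors of two sequences."""
    v_ref = temporal_vector(ref_frames, frame_dissimilarity)
    v_syn = temporal_vector(syn_frames, frame_dissimilarity)
    if len(v_ref) != len(v_syn):
        raise ValueError("sequences must have the same number of frames")
    return math.sqrt(sum((r - s) ** 2 for r, s in zip(v_ref, v_syn)))


# Toy stand-in: each "frame" is a number and the dissimilarity is |a - b|.
toy_dissimilarity = lambda a, b: abs(a - b)
reference = [0, 1, 2, 3]    # smooth evolution: vector [1, 1, 1]
synthesized = [0, 1, 4, 3]  # one abrupt change: vector [1, 3, 1]
score = st_t(reference, synthesized, toy_dissimilarity)
```

A synthesized sequence whose structure evolves exactly like the reference gives a score of zero, while abrupt structural changes that are absent from the reference increase the distance.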
The spatial Sketch Token based score (ST-IQM) and the temporal Sketch Token based score (ST-T) are then combined to produce an overall score. The final quality score of a synthesized sequence is defined as

$$\text{ST-VQM} = w_1 \cdot \text{ST-IQM} + w_2 \cdot \text{ST-T} + b, \tag{5}$$

where $w_1$ and $w_2$ are two parameters used to balance the relative contributions of the spatial and temporal scores, and $b$ is a bias term. The selection and the influence of these parameters are discussed in Section VI.
VI Experimental Results of the Proposed ST-VQM
The IPI-FVV database described in Section III is adopted to evaluate the performance of the objective measures. For comparison, only image/video measures designed for quality evaluation of view-synthesis artifacts are tested, since commonly used metrics fail to quantify geometric distortions, as already reported in [28, 16, 32, 31]. To compare the performance of the proposed measure with the state of the art, we first used the common criteria of Pearson correlation coefficient (PCC), Spearman's rank order correlation coefficient (SCC) and root mean squared error (RMSE) between the subjective scores and the objective ones (after applying a non-linear mapping to the objective scores). For image quality measures, the spatial objective scores are first calculated frame-wise, and the final objective score is computed by averaging these spatial scores.
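The three agreement criteria can be computed as in the following sketch. It is a plain-Python illustration with hypothetical function names; the non-linear mapping applied to the objective scores before computing PCC and RMSE is deliberately left out, and the toy score lists are made up for the example.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PCC)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank order correlation (SCC): Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


def rmse(x, y):
    """Root mean squared error between two score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Made-up subjective and (already mapped) objective scores.
subjective = [1.0, 2.0, 3.0, 4.0]
objective = [1.0, 4.0, 9.0, 16.0]
pcc, scc = pearson(subjective, objective), spearman(subjective, objective)
```

In this toy case the objective scores are a monotonic but non-linear function of the subjective ones, so SCC is 1 while PCC is lower, which is the reason a non-linear mapping is normally fitted before PCC and RMSE are reported.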
[Table II. Row groups: Image Quality Measures | Video Quality Measures]
The overall results are summarized in Table II, where the best performance values are marked in bold. As can be observed from Table II, ST-VQM and Liu-VQM are the two best performing metrics, with PCC equal to 0.9509 and 0.9286 respectively. To analyze whether the difference between those values is significant, a t-test was carried out on the prediction residuals, i.e. the differences between DMOS and the scores predicted by Liu-VQM, and the differences between DMOS and the scores predicted by ST-VQM. The results showed that our proposed metric significantly outperforms the second best performing metric, Liu-VQM. The performance of the image metrics, including MW-PSNR and MP-PSNR, is very limited, which can be explained by two facts: (1) they over-penalize consistent shifting artifacts, and (2) they do not take temporal distortions into account.
As verified by the subjective experimental results, navigation scan-paths affect the perceived quality. It is therefore important for an objective metric to indicate whether the perceived quality obtained with a given trajectory is better than that obtained with other trajectories, so that the metric can be used to evaluate the limits of a system in the worst navigation situations. To this end, the Krasula performance criteria [26, 27] are used to assess the ability of objective measures to estimate whether one trajectory is better than another, in terms of perceived quality, under the same rate-point and baseline configuration. Pairs of sequences in the dataset that were generated with the same configuration but with the two different trajectories are selected to calculate the area under the ROC curve of the 'Better vs. Worse' categories (AUC-BW), the area under the ROC curve of the 'Different vs. Similar' categories (AUC-DS), and the percentage of correct classification (CC) (see [26, 27] for more details). Since the pairs differ only in trajectory, a metric that obtains a higher AUC-BW is more capable of indicating that sequences with a certain trajectory are better or worse than those with another. Similarly, a metric that obtains a higher AUC-DS can better tell whether the quality of sequences with one trajectory is different from or similar to that of sequences with another trajectory. Results are reported in Table III. The proposed metric obtains the best performance on all three evaluation measures. This indicates that the proposed ST-VQM is able to quantify the temporal artifacts introduced by view switching. More importantly, ST-VQM is the most promising metric for telling which trajectory yields sequences of better quality.
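The three pair-based criteria can be illustrated with the simplified sketch below. It is not the reference procedure of [26, 27]: the statistical test that labels a pair as significantly different is assumed to have been run beforehand, a higher objective score is assumed to mean better quality, and all names and toy values are hypothetical.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve, computed as the probability that a
    positive sample scores higher than a negative one (ties count 0.5)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(positive_scores) * len(negative_scores))


def pair_analysis(pairs):
    """pairs: tuples (subjective_diff, objective_diff, is_significant),
    one per pair of sequences that differ only in trajectory."""
    different = [(s, o) for s, o, sig in pairs if sig]
    similar = [(s, o) for s, o, sig in pairs if not sig]
    # Different vs. Similar: can |objective difference| separate the groups?
    auc_ds = auc([abs(o) for _, o in different], [abs(o) for _, o in similar])
    # Better vs. Worse: can the signed objective difference tell the direction?
    better = [o for s, o in different if s > 0]
    worse = [o for s, o in different if s < 0]
    auc_bw = auc(better, worse)
    # Correct classification: sign agreement on significantly different pairs.
    cc = sum(1 for s, o in different if s * o > 0) / len(different)
    return auc_ds, auc_bw, cc


toy_pairs = [
    (1.0, 0.4, True),     # first sequence clearly better, metric agrees
    (-0.8, -0.3, True),   # first sequence clearly worse, metric agrees
    (0.9, -0.1, True),    # clearly better, but the metric disagrees
    (0.1, 0.05, False),   # no significant subjective difference
]
auc_ds, auc_bw, cc = pair_analysis(toy_pairs)
```

For a dissimilarity-type metric, where a higher score means lower quality, the objective differences would be negated before calling `pair_analysis`.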
[Table III. Row groups: Image Quality Metrics | Video Quality Metrics]
VI-A Selection of Parameters
It is desirable that the performance of a VQM does not vary significantly with a slight change of its parameters. In this section, an analysis of the selection of the parameters of the proposed metric is presented. In order to properly select the two weights and the bias term in equation (5), and to check how the performance depends on them, a cross-validation repeated 1000 times is conducted. More specifically, the entire database is randomly split into a training set (80%) and a testing set (20%) 1000 times, and the most frequently occurring value is selected for each parameter. Before the validation test, two of the parameters are multiplied by constant factors so that the differences between the parameter values are smaller, which makes the later visualization easier (this operation does not change the performance). The values of the three parameters and the corresponding PCC values across the 1000 runs are shown in Fig. 7 (d). It can be observed that neither the values of the three parameters nor the performance change significantly across the 1000 runs, which indicates that the performance of the metric does not change dramatically when the parameters are modified. Fig. 7 (a)-(b) depict the frequency histograms of the three parameters' values. The most frequent value of each of the three parameters across the 1000 runs is selected and fixed for reporting the final performance in Tables II and III. The mean values of PCC, SCC and RMSE of the proposed metric across the 1000 runs are 0.9513, 0.9264 and 0.2895 respectively, which are close to the performance values reported in Table II with the selected configuration.
The dependency of the performance of the proposed algorithm on the exponent in equation (3) and on the choice of distance measure has already been reported and examined in previous work. Therefore, in this paper, the same exponent and the Jensen–Shannon divergence are selected.
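The parameter selection procedure can be sketched as follows, assuming that equation (5) is a weighted sum of the spatial and temporal scores plus a bias and that the parameters are fitted by ordinary least squares on each training split. Both points are assumptions made for illustration, as are all function names, the rounding used to bin the fitted values, and the synthetic data.

```python
import random
from collections import Counter


def solve(matrix, vector):
    """Solve a small linear system by Gauss-Jordan elimination."""
    n = len(vector)
    aug = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        if abs(aug[col][col]) < 1e-12:
            raise ValueError("singular system")
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_linear(rows):
    """Least-squares fit of dmos ~ w1 * spatial + w2 * temporal + b."""
    ata = [[0.0] * 3 for _ in range(3)]
    aty = [0.0] * 3
    for spatial, temporal, dmos in rows:
        a = (spatial, temporal, 1.0)
        for i in range(3):
            aty[i] += a[i] * dmos
            for j in range(3):
                ata[i][j] += a[i] * a[j]
    return solve(ata, aty)


def select_parameters(rows, n_rounds=1000, train_fraction=0.8,
                      decimals=2, seed=0):
    """Refit on repeated random splits and keep, for each parameter,
    the (rounded) value that occurs most often."""
    rng = random.Random(seed)
    n_train = int(round(train_fraction * len(rows)))
    counters = [Counter(), Counter(), Counter()]
    for _ in range(n_rounds):
        train = rng.sample(rows, n_train)  # remaining rows form the test set
        for counter, value in zip(counters, fit_linear(train)):
            counter[round(value, decimals)] += 1
    return tuple(c.most_common(1)[0][0] for c in counters)


# Synthetic, noise-free data generated with w1=0.5, w2=0.3, b=0.1.
data = [(float(i), float(i * i % 7), 0.5 * i + 0.3 * (i * i % 7) + 0.1)
        for i in range(20)]
selected = select_parameters(data, n_rounds=20)
```

A full evaluation would also score each fitted model on the held-out 20% (for instance with PCC) to check that the performance stays stable across splits; that step is omitted here to keep the sketch short.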
In this paper, with the aim of better quantifying the specific distortions in sequences generated for FVV systems, both subjective and objective analyses have been conducted. On one side, in the subjective study, different configurations of compression and view synthesis have been considered, which are the two main sources of degradation in FVV. In addition, following the approach of simulating the navigation trajectories that users of immersive media may employ to explore the content, two different trajectories (referred to as Hypothetical Rendering Trajectories) have been used to study their impact on the perceived quality. Knowing these possible effects may help to identify critical trajectories that are more suitable for quality evaluation studies aimed at benchmarking systems in the worst cases. It must also be pointed out that the sweeps generated in this test focus on views that contain regions of interest (e.g. moving objects), since human observers are more interested in them and may even stop navigating once these regions show up. By analyzing the subjective results, we find that the way the trajectories are generated does affect the perceived quality. In addition, the dataset generated for the subjective tests (called IPI-FVV), along with the obtained scores, is made available to the research community in the field. On the other side, in the objective study, a Sketch-Token-based VQA metric is proposed, which checks how the classes of contours change between the reference and the degraded sequences, both spatially and temporally. The results of the experiments conducted on the IPI-FVV database have shown that the performance of the proposed ST-VQM is promising. More importantly, ST-VQM is the best performing metric in predicting whether sequences based on a given trajectory are of higher or lower quality than sequences based on other trajectories, with respect to subjective scores.
Finally, in future work, (1) the related subjective and objective studies will be extended to light-field applications; (2) ST-VQM will be extended into a no-reference metric.
-  M. Tanimoto, “Ftv: Free-viewpoint television,” Signal Processing: Image Communication, vol. 27, no. 6, pp. 555–570, 2012.
-  K. Muller, P. Merkle, and T. Wiegand, “3-d video representation using depth maps,” Proceedings of the IEEE, vol. 99, no. 4, pp. 643–656, 2011.
-  C. Fehn, R. De La Barre, and S. Pastoor, “Interactive 3-dtv-concepts and key technologies,” Proceedings of the IEEE, vol. 94, no. 3, pp. 524–538, 2006.
-  C. Fehn, “Depth-image-based rendering (dibr), compression, and transmission for a new approach on 3d-tv,” in Electronic Imaging 2004. International Society for Optics and Photonics, 2004, pp. 93–104.
-  L.-H. Wang, X.-J. Huang, M. Xi, D.-X. Li, and M. Zhang, “An asymmetric edge adaptive filter for depth generation and hole filling in 3dtv,” IEEE Transactions on Broadcasting, vol. 56, no. 3, pp. 425–431, 2010.
-  Y. Zhang, S. Kwong, S. Hu, and C.-C. J. Kuo, “Efficient multiview depth coding optimization based on allowable depth distortion in view synthesis,” IEEE Transactions on Image Processing, vol. 23, no. 11, pp. 4879–4892, 2014.
-  T.-Y. Chung, J.-Y. Sim, and C.-S. Kim, “Bit allocation algorithm with novel view synthesis distortion model for multiview video plus depth coding,” IEEE Transactions on Image Processing, vol. 23, no. 8, pp. 3254–3267, 2014.
-  P. Carballeira, J. Gutiérrez, F. Morán, J. Cabrera, and N. García, “Subjective evaluation of super multiview video in consumer 3d displays,” in International Workshop on Quality of Multimedia Experience (QoMEX), Costa Navarino, Greece, May 2015, pp. 1–6.
-  M. Domański, A. Dziembowski, A. Grzelka, and D. Mieloch, “Optimization of camera positions for free-navigation applications,” in Signals and Electronic Systems (ICSES), 2016 International Conference on. IEEE, 2016, pp. 118–123.
-  P. Hanhart, E. Bosc, P. Le Callet, and T. Ebrahimi, “Free-viewpoint video sequences: A new challenge for objective quality metrics,” in IEEE International Workshop on Multimedia Signal Processing (MMSP), Jakarta, Indonesia, Sep. 2014, pp. 1–6.
-  P. Merkle, Y. Morvan, A. Smolic, D. Farin, K. Mueller, P. de With, and T. Wiegand, “The effects of multiview depth video compression on multiview rendering,” Signal Processing: Image Communication, vol. 24, no. 1, pp. 73–88, 2009.
-  J. Kilner, J. Starck, J.-Y. Guillemaut, and A. Hilton, “Objective quality assessment in free-viewpoint video production,” Signal Processing: Image Communication, vol. 24, no. 1, pp. 3–16, 2009.
-  K.-J. Oh, S. Yea, A. Vetro, and Y.-S. Ho, “Virtual view synthesis method and self-evaluation metrics for free viewpoint television and 3d video,” International Journal of Imaging Systems and Technology, vol. 20, no. 4, pp. 378–390, 2010.
-  L. Do, S. Zinger, Y. Morvan, and P. H. de With, “Quality improving techniques in dibr for free-viewpoint video,” in 3DTV Conference: The True Vision-Capture, Transmission and Display of 3D Video, 2009. IEEE, 2009, pp. 1–4.
-  Y. Liu, S. Ma, Q. Huang, D. Zhao, W. Gao, and N. Zhang, “Compression-induced rendering distortion analysis for texture/depth rate allocation in 3d video compression,” in Data Compression Conference, 2009. DCC’09. IEEE, 2009, pp. 352–361.
-  X. Liu, Y. Zhang, S. Hu, S. Kwong, C.-C. J. Kuo, and Q. Peng, “Subjective and objective video quality assessment of 3d synthesized views with texture/depth compression distortion,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 4847–4861, 2015.
-  F. Battisti and P. Le Callet, “Quality Assessment in the context of FTV: challenges, first answers and open issues,” IEEE COMSOC MMTC Communications - Frontiers, vol. 11, no. 2, pp. 22–26, Mar. 2016.
-  A. Dricot, J. Jung, M. Cagnazzo, B. Pesquet, F. Dufaux, P. T. Kovács, and V. K. Adhikarla, “Subjective evaluation of Super Multi-View compressed contents on high-end light-field 3D displays,” Signal Processing: Image Communication, vol. 39, pp. 369–385, Nov. 2015.
-  O. Stankiewicz, K. Wegner, T. Senoh, G. Lafruit, V. Baroncini, and M. Tanimoto, “Revised summary of call for evidence on free-viewpoint television: Super-multiview and free navigation,” ISO/IEC JTC1/SC29/WG11 MPEG2016/N16523, Oct. 2016.
-  P. Carballeira, J. Gutiérrez, F. Morán, J. Cabrera, F. Jaureguizar, and N. García, “Multiview perceptual disparity model for super multiview video,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 1, pp. 113–124, 2017.
-  R. Recio, P. Carballeira, J. Gutierrez, and N. Garcia, “Subjective Assessment of Super Multiview Video with Coding Artifacts,” IEEE Signal Processing Letters, vol. 24, no. 6, pp. 868–871, Jun. 2017.
-  A. T. Hinds, D. Doyen, and P. Carballeira, “Toward the realization of six degrees-of-freedom with compressed light fields,” in 2017 IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, China, Jul. 2017, pp. 1171–1176.
-  E. Bosc, R. Pepion, P. Le Callet, M. Koppel, P. Ndjiki-Nya, M. Pressigout, and L. Morin, “Towards a new quality metric for 3-d synthesized view assessment,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 7, pp. 1332–1343, Nov. 2011.
-  E. Bosc, P. Le Callet, L. Morin, and M. Pressigout, “Visual Quality Assessment of Synthesized Views in the Context of 3D-TV,” in 3D-TV System with Depth-Image-Based Rendering. New York, NY: Springer New York, 2013, pp. 439–473.
-  E. Bosc, P. Hanhart, P. Le Callet, and T. Ebrahimi, “A quality assessment protocol for free-viewpoint video sequences synthesized from decompressed depth data,” in International Workshop on Quality of Multimedia Experience (QoMEX), Klagenfurt, Austria, Jul. 2013, pp. 100–105.
-  L. Krasula, K. Fliegel, P. Le Callet, and M. Klima, “On the accuracy of objective image and video quality models: New methodology for performance evaluation,” in International Workshop on Quality of Multimedia Experience (QoMEX), 2016.
-  H. Philippe, L. Krasula, P. Le Callet, and T. Ebrahimi, “How to benchmark objective quality metrics from paired comparison data?,” in International Workshop on Quality of Multimedia Experience (QoMEX), 2016.
-  D. Sandic-Stankovic, D. Kukolj, and P. Le Callet, “DIBR-synthesized image quality assessment based on morphological multi-scale approach,” EURASIP Journal on Image and Video Processing, 2016.
-  Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
-  P.-H. Conze, P. Robert, and L. Morin, “Objective view synthesis quality assessment,” in IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics, 2012, pp. 82 881M–82 881M.
-  F. Battisti, E. Bosc, M. Carli, P. Le Callet, and S. Perugia, “Objective image quality assessment of 3d synthesized views,” Signal Processing: Image Communication, vol. 30, pp. 78–88, 2015.
-  D. Sandić-Stanković, D. Kukolj, and P. Le Callet, “Dibr synthesized image quality assessment based on morphological wavelets,” in Quality of Multimedia Experience (QoMEX), 2015 Seventh International Workshop on. IEEE, 2015, pp. 1–6.
-  D. Sandic-Stankovic, D. Kukolj, and P. Le Callet, “Dibr synthesized image quality assessment based on morphological pyramids,” in 2015 3DTV-Conference: The True Vision-Capture, Transmission and Display of 3D Video (3DTV-CON). IEEE, 2015, pp. 1–4.
-  C.-T. Tsai and H.-M. Hang, “Quality assessment of 3d synthesized views with depth map distortion,” in Visual Communications and Image Processing (VCIP), 2013. IEEE, 2013, pp. 1–6.
-  Y. Zhao and L. Yu, “A perceptual metric for evaluating quality of synthesized sequences in 3dv system,” in Proc. SPIE, vol. 7744, 2010, p. 77440X.
-  E. Ekmekcioglu, S. Worrall, D. De Silva, A. Fernando, and A. M. Kondoz, “Depth based perceptual quality assessment for synthesized camera viewpoints,” in International Conference on User Centric Media. Springer, 2010, pp. 76–83.
-  “DIBR Videos Quality Database.” [Online]. Available: http://ivc.univ-nantes.fr/en/databases/DIBR_Videos/ (Last visited Jan. 2018).
-  “DIBR Images Quality Database.” [Online]. Available: http://ivc.univ-nantes.fr/en/databases/DIBR_Images/ (Last visited Jan. 2018).
-  “Free-Viewpoint Synthesized Videos Quality Database.” [Online]. Available: http://ivc.univ-nantes.fr/en/databases/Free-Viewpoint_synthesized_videos/ (Last visited Jan. 2018).
-  “SIAT Synthesized Video Quality Database.” [Online]. Available: http://codec.siat.ac.cn/SIATDatabase/index.html (Last visited Jan. 2018).
-  R. Song, H. Ko, and C. C. Kuo, “MCL-3D: A database for stereoscopic image quality assessment using 2D-image-plus-depth source,” Journal of Information Science and Engineering, vol. 31, no. 5, pp. 1593–1611, Mar. 2015.
-  “MCL 3D Database.” [Online]. Available: http://mcl.usc.edu/mcl-3d-database/ (Last visited Jan. 2018).
-  G. Lafruit, K. Wegner, and M. Tanimoto, “Call for evidence on free-viewpoint television: Super-multiview and free navigation,” MPEG N15348, Warsaw, 2015.
-  ——, “Draft call for evidence on ftv,” ISO/IEC JTC1/SC29/WG11 MPEG2015 N, vol. 15095, 2015.
-  “Nagoya university sequences,” http://www.fujii.nuee.nagoya-u.ac.jp/multiview-data/.
-  K. Wegner and O. Stankiewicz, “Ders software manual,” ISO/IEC JTC1/SC29/WG11 M, vol. 34302.
-  G. Lafruit, K. Wegner, T. Grajek, T. Senoh, P. Kovács, P. Goorts, L. Jorissen, B. Ceulemans, P. C. Lopez, S. G. Lobo et al., “Ftv software framework,” MPEG N15349, Warsaw, 2015.
-  U. Engelke and P. Le Callet, “Perceived interest and overt visual attention in natural images,” Signal Processing: Image Communication, vol. 39, pp. 386–404, 2015.
-  ——, “Perceived interest and overt visual attention in natural images,” Signal Processing: Image Communication, vol. 39, pp. 386–404, 2015.
-  ITU, “Methods for the subjective assessment of video quality, audio quality and audiovisual quality of internet video and distribution quality television in any environment,” ITU-T Recommendation P.913, 2014.
-  ——, “Methodology for the subjective assessment of the quality of television pictures,” Recommendation ITU-R BT.500, 2012.
-  VQEG, “Report on the Validation of Video Quality Models for High Definition Video Content,” Jun. 2010.
-  J. J. Lim, C. L. Zitnick, and P. Dollár, “Sketch tokens: A learned mid-level representation for contour and object detection,” in
-  S. Ling and P. Le Callet, “Image quality assessment for free viewpoint video based on mid-level contours feature,” in IEEE International Conference on Multimedia and Expo, Hong Kong, China, Jul. 2017, pp. 79–84.
-  A. Ninassi, O. Le Meur, P. Le Callet, and D. Barba, “Does where you gaze on an image affect your perception of quality? applying visual attention to image quality metric,” in Image Processing, 2007. ICIP 2007. IEEE International Conference on, vol. 2. IEEE, 2007, pp. II–169.
-  H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (surf),” Computer vision and image understanding, vol. 110, no. 3, pp. 346–359, 2008.
-  K. Gu, L. Li, H. Lu, X. Min, and W. Lin, “A fast reliable image quality predictor by fusing micro-and macro-structures,” IEEE Transactions on Industrial Electronics, vol. 64, no. 5, pp. 3903–3912, 2017.
-  A. Liu, W. Lin, and M. Narwaria, “Image quality assessment based on gradient similarity,” IEEE Transactions on Image Processing, vol. 21, no. 4, pp. 1500–1512, 2012.
-  P. Dollár, Z. Tu, P. Perona, and S. Belongie, “Integral channel features,” 2009.
-  E. Shechtman and M. Irani, “Matching local self-similarities across images and videos,” in 2007 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2007, pp. 1–8.
-  Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in Signals, Systems and Computers, 2004. Conference Record of the Thirty-Seventh Asilomar Conference on, vol. 2. IEEE, 2003, pp. 1398–1402.
-  M. H. Pinson and S. Wolf, “A new standardized method for objectively measuring video quality,” IEEE Transactions on broadcasting, vol. 50, no. 3, pp. 312–322, 2004.
-  “Video quality metric (vqm) software,” http://www.its.bldrdoc.gov/resources/video-quality-research/software.aspx.