Point clouds nowadays represent a prominent solution for the representation of 3D photo-realistic content in immersive applications. This development was enabled by advancements in depth sensing and graphics processing technologies over the past years, and more recently fuelled by relevant activities of the JPEG [Ebrahimi2016a] and MPEG [Schwarz2018a] standardization bodies. As a result of these efforts, MPEG has crafted two standards, namely, Geometry-based Point Cloud Compression (G-PCC) [MPEG-GPCC-standard] and Video-based Point Cloud Compression (V-PCC) [MPEG-VPCC-standard], which are tailored for static and dynamic point cloud contents, respectively, while JPEG has recently issued a Call for Evidence within the JPEG Pleno framework [N88014]. The released standards are expected to establish interoperability and facilitate the integration of point cloud technology in daily use cases, unlocking new possibilities for remote communication and extended reality (XR) experiences.
Point clouds offer advantages in terms of ease of acquisition, accurate modelling, and photo-realistic rendering. However, for a faithful representation of 3D visual information, a vast amount of data is required. Thus, lossy compression schemes are commonly employed, which achieve larger data size reductions than lossless counterparts, at the expense of perceptual distortions. Point clouds might also undergo signal deformations during processing, transmission, and/or rendering, which have a negative impact on the fidelity of the original content. Therefore, there is a need for mechanisms to quantify the induced visual impairments. To this end, subjective and objective quality assessment methodologies are essential. Subjective quality evaluations rely on human opinions, and provide ground truth ratings of visual quality for distorted stimuli. Although accurate, these approaches are expensive in terms of time and cost. Objective quality metrics refer to algorithms that computationally predict the visual quality of distorted stimuli. They are easily operated; yet, their performance depends on their ability to assess distortions in perceptual terms. The prediction accuracy of objective quality metrics is validated through benchmarking and, specifically, after comparison with subjective ground truth.
Objective quality metrics can be classified based on their requirement for the original content (i.e., reference) at execution time as full-reference, reduced-reference, and no-reference. In the first class, the presence of the reference content is necessary; in the second class, some related data are needed as input; whereas, in the third class, no reference information is required. Point cloud metrics can additionally be categorized as projection-based and point-based. The former refer to 2D solutions that are applied on projected views, capturing geometric and textural point cloud distortions as reflected upon rendering on planar arrangements. These predictors commonly adopt or extend techniques that were devised for images in the past; however, they are view- and rendering-dependent [Alexiou2019b]. Conversely, point-based quality metrics operate in the 3D point cloud domain and are rendering-agnostic. Initial attempts built on simple distances between individual points, whereas more recent contributions utilize richer features that capture local patterns of geometric or textural information. The effectiveness of the latest point-based metrics heavily relies on the accuracy of textural predictors [Meynet2020a, Alexiou2020b, Yang2020a], which compute statistics over spatial neighborhoods, thus carrying contributions from neighbors and capturing geometric distortions in an implicit manner. Yet, there is still a need for explicit evaluations of geometric degradations, in order to improve the performance and robustness of quality metrics. Current point-based solutions rely on a very small set of geometrical features, often capturing a specific property of the underlying surfaces, as reflected by the statistical distribution of point locations [Javaheri2021a], normal vectors [Tian2017a, Alexiou2018a, Alexiou2020a], or curvature values [Meynet2020a, Alexiou2020a].
In this paper, we propose the use of geometric descriptors based on Principal Component Analysis (PCA) to estimate structural distortions in point cloud contents. Such descriptors have been used with lidar data for urban classification [Chehata2009a], semantic interpretation [Weinmann2015a], semantic segmentation [Hackel2016b], and contour detection [Hackel2016a], while, more recently, a subset with well-behaving distributions was employed for no-reference objective quality assessment [Zhang2021a]. In order to better capture local variations in the distribution of the descriptors, we adopt statistical features that can estimate average trends and dispersion in a neighborhood. We complement our geometrical features by employing textural descriptors, based on perceptually-relevant luminance variations. Corresponding quality scores obtained in both domains are linearly combined into a global indication of visual degradation. Our results show the high performance of the proposed features in predicting the perceptual quality of point cloud contents under various impairments, with improvements over state-of-the-art solutions.
II Related work
The point-to-point and point-to-plane [Tian2017a] metrics denote the earliest attempts at the establishment of point-based objective quality metrics. The former measures the Euclidean distance between point coordinates, while the latter relies on the projected error of distorted points along reference normal vectors. In both metrics, the Mean Square Error (MSE) or the Hausdorff distance (HSD) is applied over the individual, per-point error values, in order to deliver a global degradation score. In [Javaheri2020a], the generalized Hausdorff distance is proposed to mitigate the sensitivity of the Hausdorff distance to outlying points, by excluding a percentage of the largest individual errors. The geometric Peak Signal-to-Noise Ratio (PSNR), defined in [M39966] for both metrics to account for differently scaled contents, was also revised in [Javaheri2020c]. Specifically, the voxel grid's diagonal, originally used as the peak value, was replaced by the average over distances between neighbors in 3D space or after projection onto local planes, representing the content's intrinsic or rendering resolution, respectively. The plane-to-plane metric is described in [Alexiou2018a] and estimates the angular similarity of tangent planes, as expressed through unoriented normals. The point-to-distribution metric, introduced in [Javaheri2020b], computes the Mahalanobis distance between a distorted point and a reference neighborhood. The PC-MSDM [Meynet2019a] evaluates the similarity of local curvature statistics, extracted after quadric fitting in support regions.
The PC-MSDM was extended to PCQM [Meynet2020a] by incorporating local statistical measurements from luminance, chrominance and hue components to evaluate textural impairments. The PointSSIM [Alexiou2020a] relies on the statistical dispersion of location, normal, curvature and luminance data. An optional pre-processing step of voxelization is proposed to enable different scaling effects and reduce intrinsic geometric resolution differences across contents. The VQA-CPC [Hua2020a] computes statistics upon Euclidean distances between every sample and the arithmetic mean of the point cloud, using geometric coordinates and color values. An extension is presented in [Hua2021a], involving a point cloud partition stage before the extraction of features per region. The geometric features consist of statistical moments applied on Euclidean distances, angular distortions and local densities, which are weighted according to the roughness of a region. The textural features rely on the same statistics, after conversion to the HSV color space. The above algorithms follow a similar operating principle to the SSIM [Wang2004a], in terms of identifying structural features that reflect local changes in either geometric or textural information.
In [Javaheri2021a], the point-to-distribution metric was extended to capture color degradations, by applying the same formula on the luminance and averaging the obtained error values. A graph signal processing-based approach is described in [Yang2020a], evaluating statistical moments of color gradients on keypoints of the reference content, which are identified after high-pass filtering of its topology. In [Diniz2020a], local binary patterns (LBP) are applied to the luminance channel in local neighborhoods. This work is extended in [Diniz2020b], considering the point-to-plane distance between point clouds and the point-to-point distance between feature maps. A variant descriptor called local luminance patterns (LLP) is proposed in [Diniz2020c], introducing a voxelization stage. In [Diniz2021a], a texture descriptor that compares neighboring color values using the CIEDE2000 distance is proposed. The color differences are coded as bit-based labels, which denote frequency values of pre-defined intervals. An extension is presented in [Diniz2021b], namely, BitDance, which incorporates bit-based labels from a geometric descriptor that relies on the comparison of neighbouring normal vectors.
The point-to-point metric has been adapted to measure the MSE or the PSNR for color-only degradations [M40522]. This effectively simulates corresponding 2D algorithms that have been extensively used with color images. More sophisticated paradigms of texture-only metrics have been proposed in [Viola2020a], which compute histograms or correlograms of luminance and chrominance in order to characterize color distributions.
The first reduced-reference objective quality metric is reported in [Viola2020b], and relies on global features that are extracted from location, color and normal data. More recently, a reduced reference metric for point clouds encoded with V-PCC is presented in [Liu2021a]. It is based on a linear model of geometry and color quantization parameters, with the model’s parameters determined by a local and a global color fluctuation feature.
Regarding projection-based approaches, the prediction accuracy of 2D quality metrics was initially examined in [Torlig2018a], over images obtained after projecting point clouds on the six faces of a surrounding cube. The influence of the number of viewpoints in denser camera arrangements and the exclusion of background pixels is explored in [Alexiou2019a], with a proposed weighting scheme based on user interactivity. In [Yang2020b], a weighted combination of global and local features extracted from texture and depth images is defined. The Jensen-Shannon divergence on the luminance channel serves as the global feature, whereas a depth-edge map, a texture similarity map, and an estimated content complexity factor account for the local features. In [He2021a], color and curvature values are projected on planar surfaces. Color impairments are evaluated using probabilities of local intensity differences, together with statistics of their residual intensities and similarity values between chromatic components. Geometric distortions are assessed based on statistics of curvature residuals.
A hybrid approach that makes use of both projection- and point-based algorithms is proposed in [Chen2021a]. The point clouds are divided into non-overlapping partitions, called layers, with a planarization process taking place at each layer before applying the IW-SSIM [Wang2011a] to assess geometric distortions. Color impairments are evaluated using RGB-based variants of the similarity measurements defined in [Meynet2020a].
III Description of PointPCA
The architecture of the proposed metric can be decomposed into five stages, namely, (a) fusion, (b) correspondence, (c) descriptors, (d) statistical features, and (e) comparison. An illustrative system diagram is provided in Figure 1. The metric requires the reference content during execution in order to provide a quality prediction for a distorted stimulus. Thus, a correspondence between the reference and distorted point cloud is computed, after a fusion step which removes duplicated points. Then, geometric and textural descriptors are computed per point, for both stimuli. For every descriptor, local relations are captured through statistics, which are estimated over support regions defined around points. This process is repeated for every descriptor, leading to a set of so-called statistical features, which are obtained per point. The statistical features extracted from both the reference and the distorted point clouds are eventually compared, given the correspondence function, and error maps are obtained. For each map, the error samples are pooled together and a quality prediction is obtained, per statistical feature. Finally, a total quality score for a distorted stimulus is computed as a linear combination of the individual quality predictions. Below, we detail each of the aforementioned stages separately.
III-A Fusion
In this step, duplicated coordinates in a point cloud are removed and the corresponding color values are averaged. Thus, points with unique locations are employed during neighborhood formulation, while redundant correspondences between the reference and the model under evaluation are avoided.
III-B Correspondence
Identifying matches between two sets of points is generally an ill-posed problem. To favor lower complexity, we make use of the nearest neighbor algorithm for the identification of correspondences between two point clouds, similarly to the majority of existing solutions. For this purpose, one model is selected as the reference, denoted by A, and the other, denoted by B, is set under evaluation. In particular, for every point b that belongs to the point cloud under evaluation (i.e., b ∈ B), a matching point a ∈ A is identified as its nearest neighbor in terms of Euclidean distance, and is registered as its correspondence. Formally, for a point cloud B under evaluation, a correspondence function is defined as m: B → A, with m(b) = argmin_{a ∈ A} ‖a − b‖. Note that different sets of matching points are obtained when iterating over the points of B to identify nearest neighbors in A, with respect to starting from A to find matches in B; that is, when setting A or B as the reference, respectively. In our case, we set both the pristine and the impaired models as reference, and we use a max operation in order to obtain a final prediction independent of the reference selection. This is commonly referred to in the literature as the symmetric error [Alexiou2018a, Javaheri2020a].
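The correspondence computation above can be sketched as follows. This is an illustrative implementation only (a brute-force nearest-neighbor search; in practice a k-d tree would be used for efficiency), and the function name is not from the reference software:

```python
import numpy as np

def correspondence(eval_pts, ref_pts):
    """For every point of the cloud under evaluation, return the index
    of its Euclidean nearest neighbor in the reference cloud.

    eval_pts: (n, 3) array, ref_pts: (m, 3) array.
    Brute-force O(n*m) search; a k-d tree is preferable at scale."""
    d2 = ((eval_pts[:, None, :] - ref_pts[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Symmetric usage: run the search in both directions and, as in the
# paper, keep the max of the two direction-specific quality predictions.
```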
III-C Descriptors
A set of geometric and textural descriptors is defined per point, in order to reflect local properties of a point cloud with respect to its topology and appearance, respectively.
[Table I: definitions of the geometric descriptors, including the individual eigenvalues and their sum.]
We propose the use of low-level shape descriptors that are extracted from quantities obtained after PCA. Given a query point p, we identify a surrounding support region that belongs to the same point cloud, forming a set of points S. The covariance matrix C of this set is computed, as shown in Equation 1,

C = (1/|S|) ∑_{p_i ∈ S} (p_i − c)(p_i − c)^T,    (1)

with |S| indicating the cardinality of S, and c the centroid of S, which is given in Equation 2:

c = (1/|S|) ∑_{p_i ∈ S} p_i.    (2)
Since the covariance matrix, also known as the 3D structure tensor, is a symmetric positive semi-definite matrix, its eigenvalues exist, are non-negative, and correspond to an orthogonal system of eigenvectors. The latter denote directions across which the data are mostly dispersed, whereas the former reflect the variance of the transformed data across the principal axes. Eigenvalues and eigenvectors are employed in measurements that estimate interpretable local 3D shape properties.
Let us assume that e_1, e_2, and e_3 denote the eigenvectors corresponding to the eigenvalues λ_1, λ_2, and λ_3, with λ_1 ≥ λ_2 ≥ λ_3. Moreover, let us define x, y, and z to depict unit vectors across the x, y, and z axes, respectively. In Table I, we present the definitions of the geometric descriptors. Intuitively, the first descriptors denote the individual eigenvalues (i.e., λ_1, λ_2, λ_3) and their aggregated sum (i.e., λ_1 + λ_2 + λ_3), that is, the dispersion magnitudes of the point distribution across the principal axes. A further group of descriptors reveals behaviors of a neighborhood's point arrangement, capturing the dimensionality of the local surface. Another measurement indicating the shape of the local point distribution focuses on the data variation across the 1st and the 3rd principal directions. Three descriptors provide an estimate of spread, uncertainty, and variation of the underlying surface, respectively, considering all principal axes. One descriptor depends on the third eigenvector, e_3, and quantifies the projected error of the queried point from the centroid of the set, across the linearly approximated local surface. Finally, the remaining descriptors measure the projected error of the estimated normal vector (i.e., e_3) across unit vectors parallel to the coordinate system axes where a point cloud is lying. Note that several of these descriptors have been proposed in [West2004a] and [Pauly2002a], and in [Demantke2012a] under the name verticality, in order to measure local geometric properties.
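As an illustration of the eigenvalue-based measurements, the sketch below computes the covariance matrix of a neighborhood and a representative subset of the classic PCA shape descriptors (linearity, planarity, sphericity, etc., in the spirit of [West2004a, Pauly2002a]); the exact definitions and naming of Table I may differ, so this is an approximation, not the paper's descriptor set:

```python
import numpy as np

def pca_descriptors(points):
    """Eigen-features of a local neighborhood given as an (n, 3) array.

    Returns a dict of classic PCA shape descriptors; a sketch only."""
    c = points.mean(axis=0)                 # centroid (Eq. 2)
    X = points - c
    C = X.T @ X / len(points)               # covariance matrix (Eq. 1)
    lam = np.linalg.eigh(C)[0]              # eigenvalues, ascending order
    l3, l2, l1 = lam                        # so that l1 >= l2 >= l3 >= 0
    eps = np.finfo(float).eps               # guard against division by zero
    return {
        "sum": l1 + l2 + l3,
        "linearity": (l1 - l2) / (l1 + eps),
        "planarity": (l2 - l3) / (l1 + eps),
        "sphericity": l3 / (l1 + eps),
        "anisotropy": (l1 - l3) / (l1 + eps),
        # abs() guards tiny negative round-off in degenerate neighborhoods
        "omnivariance": abs(l1 * l2 * l3) ** (1.0 / 3.0),
        "surface_variation": l3 / (l1 + l2 + l3 + eps),
    }
```

For instance, a perfectly collinear neighborhood yields linearity close to 1 and sphericity close to 0.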
We propose luminance-based measurements to capture distortions in the appearance of a point cloud, motivated by the high correlation of luminance with human perception, as shown in previous efforts [Meynet2020a, Alexiou2020b, Javaheri2021a]. Considering that RGB is the most common representation of color attributes, the ITU-R Recommendation BT.709 [ITURBT7096] is employed for conversion to the YCbCr color space, as shown in Equation 3; in particular, the luma component is given by

Y = 0.2126 R + 0.7152 G + 0.0722 B.    (3)

After transformation, the obtained luminance Y of every point serves as our textural descriptor.
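A minimal sketch of the luminance computation, using the BT.709 luma coefficients (the function name is illustrative):

```python
def bt709_luma(r, g, b):
    """ITU-R BT.709 luma from gamma-encoded RGB components in [0, 255].

    The three coefficients sum to 1, so white maps to full-scale luma."""
    return 0.2126 * r + 0.7152 * g + 0.0722 * b
```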
A support region around a point sample is required for the computation of the geometric descriptors. There are two alternatives widely employed to specify point cloud neighborhoods; that is, the k nearest neighbors and the range search algorithms, hereafter noted as k-nn and r-search, respectively. The former leads to neighborhoods of arbitrary extent and a fixed population, whereas the latter results in regions that span the same volume, with a varying number of samples.
The r-search is selected to estimate the descriptors. This choice is justified by our requirement to represent properties of equally sized surface areas in both the reference and the distorted stimuli. This behavior is granted by the r-search variant, as opposed to the k-nn algorithm, which is susceptible to differences in point density. For example, in the presence of down-sampling, there is no difference between the size of regions identified in the pristine and the impaired point clouds using the r-search. However, when using the k-nn alternative, larger regions are considered in the impaired point cloud. Thus, descriptor values would reflect properties of underlying surfaces of different sizes and, in turn, such a comparison would introduce unreliable measurements.
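The two neighborhood definitions can be contrasted with a brute-force sketch (function names are illustrative, not from the reference software):

```python
import numpy as np

def knn(points, query, k):
    """k-nn: fixed population of k samples, arbitrary spatial extent."""
    d = np.linalg.norm(points - query, axis=1)
    return np.argsort(d)[:k]

def rsearch(points, query, r):
    """r-search: fixed spatial extent of radius r, varying population."""
    d = np.linalg.norm(points - query, axis=1)
    return np.nonzero(d <= r)[0]
```

On a down-sampled cloud, `rsearch` simply returns fewer indices, while `knn` silently grows its spatial extent, which is exactly the behavior discussed above.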
III-D Statistical features
Statistical functions are applied on the geometric and textural descriptor values, which are computed per point, in order to capture relations between points that lie in the same neighborhood. In particular, the mean is computed in order to reflect a smoother estimate of the surface property, accounting for a broader region. The standard deviation is also obtained, in order to quantify the level of variation of a surface property in the surrounding area. Considering a query point p and a set of neighbors S, the first statistical feature is computed per Equation 4,

μ_p = (1/|S|) ∑_{p_i ∈ S} φ_i,    (4)

where φ_i is a row vector of concatenated geometric and textural descriptors that corresponds to p_i. The second statistical feature is then obtained from Equation 5,

σ_p = sqrt( (1/|S|) ∑_{p_i ∈ S} (φ_i − μ_p)² ),    (5)

with the squaring and the square root applied element-wise. A complete statistical features vector is composed of both terms, and is annotated by f_p = [μ_p, σ_p].
Statistical features are able to better capture dependencies within local neighborhoods, and provide measurements that are more perceptually relevant than those of single points. Specifically, they are well-aligned with primary characteristics of the human visual system, such as low-pass filtering and sensitivity to high frequencies. Applying the mean in local regions mimics the former, whereas the standard deviation provides an estimate of the latter. Moreover, statistical features computed on a point contain contributions from its surroundings, alleviating the negative effects of an erroneous correspondence or outlying descriptor values. That is, considering that distorted stimuli are characterized by point removal or displacement with respect to their reference positions, errors might be introduced by the matching algorithm, or individual descriptor values might be poorly estimated. Thus, for instance, comparing the means instead of the individual descriptor values mitigates the error.
The k-nn is selected in order to compute the statistical features. We argue that, in this case, the operating principle of this approach is beneficial towards revealing topological deformations. In particular, by appending neighboring samples until k of them are reached, we effectively extend to larger areas in case of lower point density, and recruit erroneous points in case of re-positioning. Thus, larger differences will be observed when compared to corresponding measurements taken from the pristine content. In simpler terms, using the k-nn enables us to penalize point sparsity and displacement.
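Assuming neighbor indices have already been gathered with k-nn, the per-point statistical features can be sketched as follows; a simplified single-descriptor version using population statistics (whether the query point is included among its own neighbors is an assumption left to the neighbor-gathering step):

```python
import numpy as np

def statistical_features(descriptors, neighbors):
    """Mean and standard deviation of a descriptor over k-nn neighborhoods.

    descriptors: (n,) array with one descriptor value per point
    neighbors:   (n, k) integer array of neighbor indices per point
    Returns two (n,) arrays: the per-point mean and standard deviation."""
    vals = descriptors[neighbors]      # (n, k) gathered neighborhood values
    mu = vals.mean(axis=1)             # smoother estimate of the property
    sd = vals.std(axis=1)              # level of local variation
    return mu, sd
```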
III-E Comparison
Given a correspondence function m, the j-th statistical feature of a point b ∈ B, namely f_{B,j}(b), is compared with the j-th statistical feature of its matching point m(b) ∈ A, namely f_{A,j}(m(b)), as shown in Equation 6,

e_{B,j}(b) = |f_{A,j}(m(b)) − f_{B,j}(b)| / (max(|f_{A,j}(m(b))|, |f_{B,j}(b)|) + ε),    (6)

where e_{B,j}(b) indicates the relative difference that corresponds to point b and statistical feature j, while ε represents a small constant to avoid undefined operations; in this case, we use the machine rounding error for floating point numbers. For every statistical feature j, a corresponding predictor q_{B,j} is exported after pooling across all points of B, as shown in Equation 7:

q_{B,j} = (1/|B|) ∑_{b ∈ B} e_{B,j}(b).    (7)
The same computations are repeated after setting the point cloud B as the reference and computing a correspondence function m′: A → B. In an analogous way, for every statistical feature j, a corresponding predictor q_{A,j} is derived, and a symmetric quantity is obtained after a max operation, as shown in Equation 8:

q_j = max(q_{A,j}, q_{B,j}).    (8)
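The comparison stage can be sketched as below, assuming a normalized absolute difference for the relative-difference term and average pooling; the function names are illustrative:

```python
import numpy as np

def relative_difference(f_ref, f_deg, eps=np.finfo(float).eps):
    """Per-point relative difference between matched statistical features,
    normalized by the larger magnitude; eps avoids division by zero."""
    return np.abs(f_ref - f_deg) / (np.maximum(np.abs(f_ref), np.abs(f_deg)) + eps)

def pool(errors):
    """Average pooling of an error map into a single predictor."""
    return float(np.mean(errors))

def symmetric(pred_ab, pred_ba):
    """Symmetric predictor, independent of the reference selection."""
    return max(pred_ab, pred_ba)
```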
III-F Quality score
Each predictor q_j essentially provides a quality rating based on the j-th statistical feature. A total quality score Q_d can be obtained as a linear combination of domain-specific, or across-domain predictors, as per Equation 9,

Q_d = ∑_{j ∈ J_d} w_j · q_j,    (9)

in which d denotes the geometry, texture, or geometry-plus-texture attribute domain, respectively, J_d indicates a subset of predictors belonging to that domain, and w_j the weight of each of the selected predictors. Such weights can be either manually set to specific values, e.g., giving equal weight to each predictor, or can be learned. In our case, the latter is performed through an optimization problem minimizing the distance between the predicted and the ground truth Mean Opinion Score (MOS):

w* = argmin_w ∑_s ( F(Q_d(s; w)) − MOS(s) )²,    (10)

with s indexing the stimuli of a training set, and F representing the selected fitting function to map the objective score to the MOS (see Section IV-B).
Additionally, in the case of domain-specific quality scores, a final quality score encompassing both domains can be computed using the following linear combination:

Q = α · Q_G + (1 − α) · Q_T,

with α ∈ [0, 1] a regularization term to drive the contributions of geometry and texture.
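The linear combination and weight learning can be sketched as follows. For simplicity, this example fits the weights by ordinary least squares against the MOS directly, omitting the logistic mapping F described in the paper; function names are illustrative:

```python
import numpy as np

def total_quality(predictors, weights):
    """Linear combination of per-feature predictors into a quality score."""
    return float(np.dot(predictors, weights))

def learn_weights(P, mos):
    """Least-squares fit of weights minimizing ||P w - mos||.

    P:   (stimuli, features) matrix of per-feature predictors
    mos: (stimuli,) vector of ground truth Mean Opinion Scores"""
    w, *_ = np.linalg.lstsq(P, mos, rcond=None)
    return w
```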
IV Benchmarking setup
IV-A Datasets
A total of three subjectively annotated datasets are recruited in order to evaluate the performance of the quality metrics under consideration, namely, ICIP2020 (D1) [Perry2020a], M-PCCD (D2) [Alexiou2019b] and SJTU (D3) [Yang2020b]. D1 contains six colored static point clouds that represent human figures, whose geometry and color are encoded using V-PCC and two G-PCC variants (i.e., Octree-plus-Lifting and TriSoup-plus-Lifting) at 5 degradation levels, for a total of 96 stimuli. D2 consists of eight colored static point clouds illustrating both human figures and inanimate objects, whose geometry and color are encoded using V-PCC and four G-PCC variants (i.e., Octree-plus-Lifting, Octree-plus-RAHT, TriSoup-plus-Lifting and TriSoup-plus-RAHT), resulting in 240 stimuli. Finally, D3 comprises nine colored point clouds depicting both human figures and inanimate objects, which are subject to octree-based compression, color noise, geometry Gaussian noise, down-scaling, and all superimposed combinations of the different types of degradation excluding compression, for a total of 387 stimuli.
IV-B Performance indexes
To evaluate how well an objective metric is able to estimate perceptual quality, the Mean Opinion Scores (MOS) computed from the ratings of subjects that participated in an experiment are required and serve as ground truth. The metrics are typically benchmarked after applying a regression model in order to map the objective scores to the subjective quality range, while also accounting for biases, non-linearities and saturations that might appear in subjective testing. In particular, let us define the result of the execution of a particular objective metric as a Predicted Quality Score (PQS). A predicted MOS, denoted as P(MOS), is estimated by applying a fitting function on the [PQS, MOS] data set. In our analysis, the Recommendation ITU-T J.149 [ITUTJ149] is followed, using the logistic fitting function type II. The Pearson Linear Correlation Coefficient (PLCC), the Spearman Rank-Order Correlation Coefficient (SROCC), and the Root Mean Square Error (RMSE) are computed between the P(MOS) and the MOS to conclude on the linearity, monotonicity, and accuracy of the objective quality predictors, respectively.
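The three performance indexes can be sketched as follows; the logistic fitting step of ITU-T J.149 is omitted, and the SROCC variant below assumes no tied values:

```python
import numpy as np

def plcc(x, y):
    """Pearson linear correlation coefficient (linearity)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x - x.mean(), y - y.mean()
    return float((xm @ ym) / np.sqrt((xm @ xm) * (ym @ ym)))

def srocc(x, y):
    """Spearman rank-order correlation (monotonicity); no-ties sketch."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return plcc(rank(x), rank(y))

def rmse(x, y):
    """Root mean square error (accuracy)."""
    return float(np.sqrt(np.mean((np.asarray(x) - np.asarray(y)) ** 2)))
```

Note that a monotonic but non-linear predictor yields a perfect SROCC while its PLCC stays below 1, which is why the indexes are reported jointly.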
IV-C Execution of objective quality metrics
State-of-the-art objective quality metrics are employed in our performance evaluation analysis for comparison purposes. In particular, we use the point-to-point, point-to-plane [Tian2017a] and plane-to-plane [Alexiou2018a] metrics, which are focused on geometry-only distortions. The color PSNR on the Y channel, as well as the histogram-based metric on luminance [Viola2020a] (DistY), are recruited to assess texture-only degradations. Finally, the joint point-to-distribution metric [Javaheri2021a] using logarithmic values, the PointSSIM [Alexiou2020a] applied on normal, curvature and luminance data separately, the PCQM [Meynet2020a], as well as the reduced-reference PCM_RR [Viola2020b] are also benchmarked.
To compute the point-to-point and point-to-plane metrics, the software version 0.13.5 [M40522] is used. For the latter, the required normal vectors are computed using quadric fitting with the r-search, with the radius set proportionally to the maximum length of the bounding box of the reference point cloud. For plane-to-plane, the normal vectors are computed based on quadric fitting and a nearest-neighbor search, in a variant of the settings employed in [Perry2020a]. In the point-to-distribution metric, neighborhoods consisting of a fixed number of point samples are considered. For PointSSIM, the default parameter settings are employed, with the variance as the selected estimator of statistical dispersion. Quality predictions based on luminance, curvature and normal vectors are computed, with the latter two obtained after quadric fitting with the r-search. In PCQM, the default configurations are used, and the proposed weighted combination of the computed features is employed. For the histogram metric and the reduced-reference point cloud metric, the script implementations with default settings are executed. Regarding the proposed metric, namely, PointPCA, the geometric descriptors are estimated using the r-search, while the statistical features are computed using the k-nn algorithm.
V-A Geometric descriptors
In Figure 2, the PLCC and SROCC of every statistical feature are illustrated in the form of bars, grouped per geometric descriptor, against the subjectively annotated datasets. It can be noticed that the prediction accuracy of the proposed features is generally high, reaching a different performance plateau per dataset. We observe that, for a given descriptor, one statistical function might perform better than the other in one dataset, whereas the other might be superior in the rest. For instance, for certain descriptors, the standard deviation exhibits higher accuracy in D1 and D2 with respect to the mean, with the opposite being true for D3. However, the differences remain limited. It is also remarked that there is no particular statistical feature performing poorly, or excelling, consistently across all datasets.
As the next step, we examine different combination approaches for the proposed geometric descriptors. In Figure 3, the ranking scores of every descriptor, computed as the average PLCC and SROCC among the corresponding statistical features, are shown with gray bars in descending order. The overlaying plots show how performance in terms of SROCC varies as new statistical features in ranking order are combined together, either with equal weights (solid line), or with optimized weights (dashed line). The latter is carried out by training a linear model using a leave-out cross-validation method over each dataset, separately. In particular, a dataset is split into two equal groups, ensuring that all versions of the same content are included in only one group. Optimal weights are extracted from every partition under the problem formulation given in Equation 10, with the final weights obtained as the average across all partitions. Each plot additionally reports cross-dataset validation results, i.e., performance obtained by applying the weights learned on one dataset to the rest.
Focusing on the ranking scores, we observe resemblances in the ranking patterns between D1 and D2, due to the similar types of contents and distortions both datasets exhibit; yet, for D3, the trends differ. For instance, the top five descriptors in D1 coincide with those in D2, although in a different order, while they are ranked low in D3. Similarly, the top three descriptors in D3 are ranked below the middle in D1 and D2. These results demonstrate that the visual quality of certain types of distortions and contents is more accurately captured by certain descriptors.
Regarding the effect of different combinations and weighting strategies, the performance of the statistical features varies depending on the dataset. In particular, considering all tested combinations and weightings, the SROCC spans the narrowest range of values for D1, a wider one for D2, and the widest for D3. Evidently, geometry-only predictors excel in D1, showing high resilience under any selected weighting approach. In D2, the performance decreases while maintaining good robustness, whereas in D3, the worst performance with the largest variability is found.
Considering a learning approach, we remark that, when weights are obtained from the same dataset used for testing, high performance is maintained across all examined combinations, with marginal differences (i.e., dashed blue, red and purple lines in Figures 3(a), 3(b) and 3(c), respectively). This indicates that optimal weights learned even on small sets of highly-performing descriptors of a dataset lead to good results. When the weights are learned on other datasets, performance reductions naturally occur; this depends not only on the combination of statistical features in use, but also on the intrinsic characteristics of the testing dataset. For instance, under any set of weights, small deviations are observed when testing on D1 and D2. On the contrary, evident reductions are noted in D3 under highly-ranked descriptors in D1 and D2 (which are less perceptually relevant for D3), implying the larger impact of a poorly tuned weighting scheme in more challenging datasets. Recruiting more statistical features, though, seems to be beneficial, as the performance progressively improves. The same trend is observed when applying equal weights. Note that the latter weighting scheme leads to performance reductions with respect to learning from the same testing dataset; yet, better results are obtained when compared to learning weights on other datasets.
Summarizing our analysis, high performance is achieved under a suitable subset of properly weighted geometric statistical features, per dataset. By incorporating a larger number of descriptors, lower performance variability across datasets is observed using either equal or learned weights, indicating higher generalization capabilities. Thus, we opt for the recruitment of all geometric descriptors. Corresponding performance results are presented in Table II, for equal weights, and for weights learned using every dataset.
V-B Textural descriptor
The performance of every statistical feature is reported individually in Table III, with indicating the mean and the standard deviation. Moreover, we report performance indexes obtained after combining the above using either equal weights (), or weights learned () on every dataset separately, similarly to the analysis of the geometric descriptors. We observe that the standard deviation performs better in D1 and D2, whereas the mean is superior in D3. Combining the two statistical functions with equal weights leads to performance gains in D2 with respect to using either of them individually. Conversely, the prediction accuracy drops for the remaining datasets, reaching values closer to the highest- and the lowest-performing statistic for D1 and D3, respectively. Using learned weights, we observe improvements when the learning and testing are carried out on the same dataset, with declines occurring when learning on the rest. The only exception is observed when training with D2, where very similar results are obtained compared to the equal weights; indeed, the learned weights assign to and to .
Independently of the weights in use, the textural descriptor performs overall better in D2, and worse in D1 and D3. Note that the prediction accuracy of any combination is bounded by the performance of the individual statistical features, shown in Table III. We remark a larger variability in performance for D1 with respect to D2 and D3, which indicates a higher sensitivity of the former to the weights assigned to the statistical functions. It is noteworthy that learning weights on D1 leads to worse performance compared to using only the standard deviation. We recall that optimal weights are obtained using leave--out and by averaging across partitions that involve different contents for training and testing. Based on our results, learning on any subset of the 8i contents [Eon2017a] leads to a dominant contribution of the mean, which steers the final weight allocation accordingly. This statistical function under-performs in D1, which explains the performance drop.
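The cross-content averaging recalled above can be sketched as follows. The least-squares fit is a hypothetical stand-in for the paper's optimiser, and the content names and toy data are illustrative only.

```python
import itertools
import numpy as np

def leave_p_out_weights(features_by_content, mos_by_content, p=1):
    """Learn weights on every split that holds out p contents, then
    average across partitions (a sketch: the paper's actual optimiser
    is replaced here by ordinary least squares)."""
    contents = sorted(features_by_content)
    weight_sets = []
    for held_out in itertools.combinations(contents, p):
        train = [c for c in contents if c not in held_out]
        F = np.vstack([features_by_content[c] for c in train])
        y = np.concatenate([mos_by_content[c] for c in train])
        w, *_ = np.linalg.lstsq(F, y, rcond=None)
        weight_sets.append(w)
    # if one statistic dominates every training split, it steers the
    # averaged weights, as observed with the 8i contents
    return np.mean(weight_sets, axis=0)

# toy data: four contents, MOS exactly linear in two features
rng = np.random.default_rng(1)
feats = {c: rng.normal(size=(5, 2)) for c in "abcd"}
mos = {c: feats[c] @ np.array([0.8, 0.2]) for c in "abcd"}
w = leave_p_out_weights(feats, mos)
```

Because the toy MOS is exactly linear in the features, every partition recovers the same weights and the average coincides with them.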
Summarizing, different statistical functions prove more effective in capturing visual impairments across datasets. Sub-optimal weight allocation leads to substantial performance drops, as witnessed for example with D1. This can be explained by considering that only one descriptor is adopted; that is, only a single aspect of visual appearance is examined, implying lower robustness and higher sensitivity to the weight assignment. Finally, comparing the results presented in Tables II and III, we note that the textural descriptor outperforms the geometric one in D2, while its performance is notably lower in D1 and D3. Higher variability in prediction accuracy is observed across different combinations of textural predictors, for the same reason mentioned above. We should additionally take into account that the benchmarking is conducted on datasets that involve both textural and geometric distortions.
V-C Geometric and textural descriptors
In Table IV, different combinations of geometry and texture descriptors are employed and the corresponding performance results are reported. In particular, geometric and textural features are merged into a new set, with equal weights assigned (); that is, . Moreover, using the same set, weights are learned per dataset (). Finally, geometry and texture quality scores with equal weights ( and , per Tables II and III) are combined under Equation 11 ().
We observe that using equal weights across all statistical features results in similar or better performance with respect to the same weighting on geometry-only features. In comparison to textural-only features, a performance decrease in D2 is observed; however, remarkable improvements are noted in D1 and D3. Our results suggest that, under equal weighting, geometric descriptors govern the outcome, while the integration of textural predictors remains beneficial. The former is justified by the substantially larger number of geometric features employed, compared to the textural ones (i.e., 30 against 2).
Combining attribute-specific quality scores with equal contributions leads to performance improvements in D2 and D3, while a slight decrease is observed in D1 with respect to . To understand how the optimal contribution of geometric and textural quality scores varies per dataset, we perform a grid search with values of . For , only textural quality scores () are considered, whereas for , only geometric quality scores are employed (). Results are shown for all datasets in Figure 4. The trends indicate that, for D1, the integration of texture-based predictors does not improve the results; thus, using geometry-only descriptors is optimal. For D2, texture-only descriptors perform better than geometry-only, while their combination can lead to enhancements. For D3, geometry-only descriptors are found better than texture-only, with notable improvements observed when considering both. Evidently, a balanced contribution, achieved with , provides a good compromise across all datasets.
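The grid search over the geometry/texture balance can be sketched as below. Names and toy data are illustrative, and SROCC is used as the selection criterion (an assumption, since the paper reports several indexes).

```python
import numpy as np
from scipy.stats import spearmanr

def fuse(q_geom, q_text, alpha):
    """Convex combination of geometry and texture quality scores;
    alpha = 1 keeps geometry only, alpha = 0 keeps texture only."""
    return alpha * np.asarray(q_geom, float) + (1.0 - alpha) * np.asarray(q_text, float)

def best_alpha(q_geom, q_text, mos, grid=np.linspace(0.0, 1.0, 21)):
    """Pick the grid value maximising SROCC against subjective scores."""
    rhos = [spearmanr(fuse(q_geom, q_text, a), mos).correlation for a in grid]
    best = int(np.argmax(rhos))
    return float(grid[best]), float(rhos[best])

# toy check: geometry alone ranks the stimuli perfectly here
alpha_star, rho_star = best_alpha([1, 2, 3, 4, 5], [5, 1, 4, 2, 3], [1, 2, 3, 4, 5])
```

In this toy case the search attains a perfect rank correlation, since the geometry-only scores already match the MOS ordering; on real data, intermediate values of alpha can outperform both extremes, as observed for D2 and D3.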
Finally, by learning weights, it can be seen that the overall performance improves when considering both geometric and textural descriptors, with respect to using descriptors from only one domain. As expected, performance reductions are observed when weights are learned on another dataset rather than on the testing dataset itself. However, combining geometry and texture descriptors marks an improvement in both cases, excluding D1 where we note decreases; in this dataset, any weight allocation that drives contributions towards texture is sub-optimal, as noted earlier.
The above observations indicate the benefits of accounting for both types of attributes, which improves both the accuracy and the resilience of the predictions. The best performance is attained by learning on the entire set of descriptors using D2. Thus, these weights, shown in Figure 5, are selected as the final weights in our metric to provide a quality score. The close-second performance achieved by is noteworthy, considering that no learning is involved in its computation. Thus, it provides a viable alternative that avoids concerns regarding over-fitting or optimistic biases, validating the efficiency of the proposed features.
V-D Comparison with state-of-the-art metrics
In Table V, we report performance evaluation results for existing point cloud quality metrics, for comparison purposes. It can be seen that the proposed metric achieves competitive, if not better, performance on all tested datasets. In the case of D1, it is worth noting that geometry-only approaches achieve very good performance, supporting our findings that the geometric descriptors exhibit stronger performance, as illustrated in Figure 4. In this dataset, PCQM achieves the best performance in terms of PLCC, SROCC, and RMSE. For D2, we observe that our proposed metric achieves the best performance for all indicators. PointSSIM using luminance, MMD-joint and PCQM also attain high correlation scores, with the former confirming our observations that luminance-based predictors are accurate in capturing visual distortions in this dataset. Finally, for D3 we remark that our proposed metric achieves the highest performance in terms of PLCC and RMSE, while reaching a close second in SROCC. In this dataset, PCQM is a direct competitor, with the normal-based and curvature-based variants of PointSSIM following, and plane-to-plane denoting the last member of the top five metrics. Note that the results of point-to-point and point-to-plane with PSNR are not provided, considering that correlation computations were not possible for 54 out of 378 stimuli.
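The three performance indexes used throughout the comparison can be computed as sketched below. Note that benchmarking protocols often first fit a monotonic mapping between objective scores and MOS before computing PLCC and RMSE; that step is omitted in this minimal version, and the function name is ours.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def performance_indexes(objective, mos):
    """PLCC, SROCC and RMSE between objective quality scores and MOS.
    (A monotonic regression onto the MOS scale, common in benchmarking
    protocols, is deliberately omitted from this sketch.)"""
    o = np.asarray(objective, dtype=float)
    m = np.asarray(mos, dtype=float)
    plcc = pearsonr(o, m)[0]                         # linearity
    srocc = spearmanr(o, m).correlation              # monotonicity
    rmse = float(np.sqrt(np.mean((o - m) ** 2)))     # accuracy
    return plcc, srocc, rmse

plcc, srocc, rmse = performance_indexes([1, 2, 3, 4], [1, 2, 3, 4])
```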
In this paper, we propose a point cloud objective quality metric that relies on PCA-based shape descriptors and luminance-based predictors to evaluate distortions in the geometry and color domain, respectively. Statistical functions are applied on the descriptor values in order to capture local relationships between point samples, which are compared between the pristine and the distorted stimuli in order to provide a quality prediction for the latter. The proposed features are evaluated individually, showing good overall performance, although they are not consistently ranked across datasets. To mitigate this behaviour, several weighting and combination strategies were examined, both within the same attribute domain of descriptors and by combining them across domains. Well-tuned configurations were shown to lead to substantial performance improvements and higher generalization capabilities. Learning weights over the entire set of geometric and textural descriptors was found to be the highest-performing approach. A balanced contribution of geometry and texture quality scores computed using equal weights denoted a close alternative, confirming the effectiveness of the proposed features. Our results show that the proposed metric achieves state-of-the-art performance, outperforming existing solutions in the majority of cases across all tested datasets. Considering that certain descriptors are more efficient against particular types of contents and degradations, future work will focus on the identification and adoption of the most perceptually relevant predictors, per case.
Appendix A Parameter selection
A-A Support regions for statistical features
In Figure 6, the mean and standard deviation of SROCC values achieved by every predictor are illustrated, under neighborhoods formed using k-nn, with . Recall that predictors - indicate the use of the mean, while for - the standard deviation is employed. For every statistic, the first 15 predictors refer to the geometry and the last to the texture domain. Regarding the neighborhood sizes, we choose the aforementioned values, considering that the point clouds included in the datasets under exam are voxelized, dense, and represent large models. Thus, we may assume that small point neighborhoods represent local regions, which in turn can be approximated by planar surfaces. The selected values of k represent the number of vertices in fully-occupied planar patches of side length equal to 2, 4, 6 and 8 times the distance between two voxels, as shown in Figure 7. Our results indicate that all predictors show high robustness against the tested neighborhood sizes.
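The reasoning behind these neighborhood sizes can be reproduced with a small helper, assuming a fully-occupied square patch on the voxel grid: a side spanning s inter-voxel distances holds s+1 grid points per row, hence (s+1)^2 vertices in total. The function name is ours, and the square-patch assumption is an interpretation of the description of Figure 7.

```python
def knn_size_for_planar_patch(side_in_voxel_steps):
    """Number of vertices in a fully-occupied square planar patch whose
    side spans the given number of inter-voxel distances: a side of s
    steps holds s + 1 grid points, hence (s + 1)**2 points overall."""
    s = side_in_voxel_steps
    return (s + 1) ** 2

# side lengths of 2, 4, 6 and 8 voxel steps, as described in the text
sizes = [knn_size_for_planar_patch(s) for s in (2, 4, 6, 8)]
```

Under this reading, the tested neighborhood sizes would be 9, 25, 49 and 81 points.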
It is worth noting that there is an inter-dependency between the support regions used for the estimation of geometric descriptors, and those used for the computation of the corresponding statistical features. For example, increasing the support region to estimate a descriptor leads to lower deviations between neighboring descriptor values; concurrently, decreasing the population to compute statistical features has a similar effect. In our case, the relationship between the two is examined by fixing the neighborhood size over which the descriptors are estimated (i.e., -search with ) and altering the population size (i.e., the values of k) used to compute statistical features. The obtained results confirm that the geometric descriptors' configuration leads to high prediction accuracy, with stable performance across all tested neighborhood sizes. In the proposed settings of our metric, we set .
A-B Color spaces
In Table VI, we present the performance indexes obtained from textural measurements computed identically, considering different color spaces. We aim at validating the selection of the luminance-only component, by examining the performance achieved with the proposed statistical features over alternative color spaces that are popular in the literature. In particular, we employ the YCbCr color space as a straightforward baseline, the default RGB as an anchor, and we additionally recruit the GCM [Geusebroek2001a], which is reported to correlate well with human perception, as well as CIELAB, recommended by the International Commission on Illumination in 1976 and designed for perceptual uniformity. In this analysis, we employ only textural information. As usual, we learn optimal weights per dataset and test their prediction accuracy on all. Our results show that luminance-only leads to higher robustness across datasets, with very similar, or substantially better, performance than the alternatives. Specifically, when training with D1 or D2, there are color spaces that perform slightly better in either of them; for instance, CIELAB performs better in both, while RGB and GCM are better in D2, with marginal differences in all cases. However, luminance-only outperforms the alternatives by substantial margins when testing on D3. By learning weights in D3, Y and YCbCr are found to attain almost equivalent performance. Our analysis indicates that the proposed predictors show similar performance under all tested color spaces; yet, the higher resilience observed using only the luminance component justifies our selection.
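For reference, extracting a luminance-only component from per-point RGB colors can be done as below. The ITU-R BT.709 coefficients are an assumption on our part; the text does not state which RGB-to-Y conversion is applied.

```python
import numpy as np

def rgb_to_luma(rgb):
    """Luminance (Y) from per-point RGB values using ITU-R BT.709
    coefficients (an assumption; other recommendations, e.g. BT.601,
    use different weights)."""
    rgb = np.asarray(rgb, dtype=float)
    return rgb @ np.array([0.2126, 0.7152, 0.0722])
```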
A-C Comparison methods
We analyze the impact of selecting a different comparison method between statistical features extracted from the reference and the distorted stimulus. In conventional 2D imaging formulas, the relative difference between pixel values has been proven more effective, as it follows Weber's law and, hence, accounts for the higher sensitivity of human visual perception in lower luminance regions. Initially, mesh [Lavoue2006a, Lavoue2011a] and, more recently, point cloud quality metrics [Meynet2019a, Meynet2020a, Alexiou2020a, Hua2020a, Hua2021a] have adopted relative difference formulas for both color and geometric measurements, without evidence, though, regarding how they compare to simpler alternatives. We shed light on the matter by illustrating performance differences between the descriptors proposed in the context of PointPCA. Specifically, we employ the comparison methods provided in Equations 12-17, with and indicating corresponding statistical features obtained from the reference and the distorted point clouds:
with set to the machine rounding error for floating-point numbers. In particular, the first two equations account for the absolute (AD) and the squared difference (SD) of the measurements, which are straightforward comparison methods. The subsequent RD1, RD2 and RD3 denote three different implementations of the relative difference, which lead to different penalization strategies; that is, RD1 has been used in all point cloud metric implementations [Meynet2019a, Meynet2020a, Alexiou2020a, Hua2020a, Hua2021a], RD2 has been widely used in conventional image metrics, such as in [Wang2004a, Zhang2011a], whereas RD3 is a simple alternative. Finally, RD4 is an ad-hoc implementation following the opposite behaviour to Weber's law; namely, penalizing equal differences more at larger magnitudes.
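Since Equations 12-17 are not reproduced in this copy, the snippet below gives plausible stand-ins for the six comparison methods, matching their described behaviour: AD/SD as plain differences, RD1-RD3 as Weber-like relative differences, and RD4 penalizing equal differences more at larger magnitudes. The exact formulas in the paper may differ from these sketches.

```python
import numpy as np

EPS = np.finfo(float).eps  # machine rounding error, as in the text

def ad(r, d):  return np.abs(r - d)        # absolute difference
def sd(r, d):  return (r - d) ** 2         # squared difference

def rd1(r, d):                             # relative difference, variant 1
    return np.abs(r - d) / (np.maximum(np.abs(r), np.abs(d)) + EPS)

def rd2(r, d):                             # SSIM-like relative form
    return 1 - (2 * r * d + EPS) / (r ** 2 + d ** 2 + EPS)

def rd3(r, d):                             # simple relative alternative
    return np.abs(r - d) / (np.abs(r) + np.abs(d) + EPS)

def rd4(r, d):                             # anti-Weber: larger magnitudes
    return np.abs(r - d) * np.maximum(np.abs(r), np.abs(d))
```

Note how rd4 assigns a larger penalty to the same absolute difference when the operand magnitudes are larger, the inverse of the Weber-law behaviour of rd1-rd3.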
In Figure 8, we show the SROCC attained using every predictor and the corresponding comparison methods, per dataset. We observe that RD1, RD2 and RD3 provide overall better performance, with marginal differences between them favoring RD1. Evidently, these results hold for both geometric and textural descriptors, and generalize, in principle, across all datasets. For RD4, obvious performance reductions are noted for all predictors, excluding (i.e., the mean of the textural descriptor) in D1 and D2. However, this behaviour does not generalize to D3, while using RD4 with (i.e., the standard deviation of the textural descriptor) also leads to lower accuracy in all datasets. Thus, despite the individual gains, the overall performance of RD4 is judged limited, as expected.
Considering AD and SD, we note that they bring some advantages to predictors that capture surface orientation (i.e., - and -) in D3. Moreover, gains are sporadically noted for the eigenvalue-based predictors linearity, planarity, sphericity and anisotropy (i.e., - and -); however, no consistent trends are reported, while the differences remain minor in all cases. It is noteworthy that both AD and SD perform consistently better with the textural predictors (i.e., and ) in D1. Their superior performance in comparison to the RD1-RD3 variants is explained by the color distribution of the contents of this dataset. In particular, “ricardo” and “sarah” denote two point clouds from the Microsoft Voxelized Upper Bodies dataset [Loop2016a] with large surfaces covered in dark colors, which leads the RD1-RD3 variants to penalize the distortions they exhibit more heavily. Hence, their generalization capabilities across the contents of this dataset are negatively affected. This observation justifies our previous findings regarding the low performance of the textural descriptor, under the proposed RD1 variant integrated in our metric, against this dataset. Future developments should consider mechanisms to compensate for such cross-content variations, for better performance.
Appendix B Performance evaluation
In Figure 9, we provide scatter plots illustrating the prediction accuracy of the proposed metric and of PCQM, as a competing baseline, against all datasets. We note that the objective scores from PointPCA span a larger range, showing higher discrimination power. In principle, though, very similar performance is observed between the two metrics, which is also reflected in the indicators reported in the paper. Regarding PointPCA, we observe that the correlation achieved in D1 is high. Based on previous findings, geometric predictors excel in this dataset, with textural ones being sub-optimal. From Figure 9(a), it is evident that these discrepancies are effectively introduced by the “ricardo” and “sarah” contents, for the reasons mentioned in section A-C. In D2, the performance remains high, achieving the highest linearity, monotonicity and accuracy compared to other existing solutions, based on the reported PLCC, SROCC and RMSE performance indexes. In D3, the performance reduces, with a larger spread of data points. However, this is mainly caused by the “unicorn” content, which shows an outlying behavior. In particular, when excluding it, the performance reaches a PLCC of 0.910, an SROCC of 0.891 and an RMSE of 0.988. In the same setting, PCQM achieves 0.865, 0.858 and 1.195 for the above indicators, respectively.
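The exclude-one-content analysis can be reproduced with a helper along these lines. The content labels and toy numbers are illustrative; the reported 0.910/0.891/0.988 values come from the paper's own data, not from this sketch.

```python
import numpy as np
from scipy.stats import pearsonr

def plcc_without_content(scores, mos, labels, excluded):
    """PLCC after dropping every stimulus of one content, mirroring the
    exclusion of an outlying content such as "unicorn"."""
    keep = np.array([lab != excluded for lab in labels])
    return pearsonr(np.asarray(scores, float)[keep],
                    np.asarray(mos, float)[keep])[0]

# toy data: content "b" behaves as an outlier
scores = [1.0, 2.0, 3.0, 4.0, 10.0]
mos    = [1.0, 2.0, 3.0, 4.0, 0.5]
labels = ["a", "a", "a", "a", "b"]
```

Dropping the outlying content restores a near-perfect linear correlation in this toy case, mirroring the jump observed when "unicorn" is excluded from D3.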