It has been shown that Mean Squared Error (MSE) is not a promising measure for image quality, fidelity, or similarity [wang2009mean]. The distortions of an image, or the dissimilarities of two images, can be divided into two main categories, i.e., structural and non-structural distortions [wang2004image]. Structural distortions, such as JPEG blocking distortion, Gaussian noise, and blurring, are the ones which are easily noticeable by the Human Visual System (HVS), whereas non-structural distortions, such as luminance enhancement and contrast change, do not have a large impact on the visual quality of the image.
The Structural Similarity Index (SSIM) [wang2004image, wang2006modern] has been shown to be an effective measure for image quality assessment. It treats luminance and contrast changes as non-structural distortions and other distortions as structural ones. Owing to its performance, it has recently been noticed and used in optimization problems [brunet2018optimizing] for tasks such as image denoising, image restoration, contrast enhancement, image quantization, and compression, noting that the distance based on SSIM is quasi-convex under certain conditions [brunet2012mathematical].
So far, the fields of manifold learning and machine learning have largely used MSE and Euclidean distance to develop algorithms for subspace learning; Principal Component Analysis (PCA), for example, is based on the Euclidean distance or norm. However, MSE is not as promising as SSIM for measuring image structure [wang2009mean, wang2006modern], making these algorithms not effective enough at capturing the structural features of images. In this paper, we introduce the new concept of the image structure subspace, a subspace which captures the intrinsic features of an image in terms of structural similarity and distortions, and which can discriminate between the various types of image distortions. This subspace can also be useful for parameter estimation for (or selection between) different denoising methods, but that topic will be dealt with in future work.
The outline and contributions of the paper are as follows: We begin by defining the background methods of SSIM and PCA. We then introduce ISCA using orthonormal bases and kernels by analogy to PCA, where ISCA can be seen as PCA which uses SSIM instead of the norm. We then describe an extensive set of experiments demonstrating the performance of ISCA on projection, reconstruction, and out-of-sample analysis tasks compared to various kernel PCA methods. The derivations of the expressions in this paper are detailed further in the supplementary material, which will be released on arXiv (https://arxiv.org).
2 Structural Similarity Index
The SSIM between two reshaped image blocks $x_1$ and $x_2 \in \mathbb{R}^d$, in color intensity range $[0, l]$, is [wang2004image, wang2006modern]:

$$\text{SSIM}(x_1, x_2) = \Big(\frac{2\mu_{x_1}\mu_{x_2} + c_1}{\mu_{x_1}^2 + \mu_{x_2}^2 + c_1}\Big)\Big(\frac{2\sigma_{x_1}\sigma_{x_2} + c_2}{\sigma_{x_1}^2 + \sigma_{x_2}^2 + c_2}\Big)\Big(\frac{\sigma_{x_1 x_2} + c_3}{\sigma_{x_1}\sigma_{x_2} + c_3}\Big), \tag{1}$$

where $\mu_{x_1} = \frac{1}{d}\sum_{i=1}^d x_1^{(i)}$, $\sigma_{x_1} = \big(\frac{1}{d-1}\sum_{i=1}^d (x_1^{(i)} - \mu_{x_1})^2\big)^{1/2}$, $\sigma_{x_1 x_2} = \frac{1}{d-1}\sum_{i=1}^d (x_1^{(i)} - \mu_{x_1})(x_2^{(i)} - \mu_{x_2})$, and $\mu_{x_2}$ and $\sigma_{x_2}$ are defined similarly for $x_2$. In this work, $l = 255$. The $c_1 = (0.01\, l)^2$, $c_2 = (0.03\, l)^2$, and $c_3 = c_2 / 2$ are for avoidance of singularity [wang2006modern] and $d$ is the dimensionality of the reshaped image patch. Note that since $c_3 = c_2/2$, we can simplify SSIM to $\text{SSIM}(x_1, x_2) = s_1(x_1, x_2)\, s_2(x_1, x_2)$, where $s_1 = \frac{2\mu_{x_1}\mu_{x_2} + c_1}{\mu_{x_1}^2 + \mu_{x_2}^2 + c_1}$ and $s_2 = \frac{2\sigma_{x_1 x_2} + c_2}{\sigma_{x_1}^2 + \sigma_{x_2}^2 + c_2}$. If the vectors $x_1$ and $x_2$ have zero mean, i.e., $\mu_{x_1} = \mu_{x_2} = 0$, the SSIM becomes $\text{SSIM}(x_1, x_2) = \frac{2\, x_1^\top x_2 + c}{\|x_1\|_2^2 + \|x_2\|_2^2 + c}$, where $c := (d-1)\, c_2$ [otero2014unconstrained]. We denote the reshaped vectors of the two images by $x_1$ and $x_2$, and a reshaped block in the two images by $\breve{x}_1$ and $\breve{x}_2$. The (squared) distance based on SSIM, which we denote by $\|\breve{x}_1 - \breve{x}_2\|_S^2$, is [otero2014unconstrained, brunet2012mathematical, brunet2011class]:

$$\|\breve{x}_1 - \breve{x}_2\|_S^2 = 1 - \text{SSIM}(\breve{x}_1, \breve{x}_2) = \frac{\|\breve{x}_1 - \breve{x}_2\|_2^2}{\|\breve{x}_1\|_2^2 + \|\breve{x}_2\|_2^2 + c}, \tag{2}$$

where $c = (d-1)\, c_2$. In ISCA, and in PCA which inspires ISCA, the data should be centered; therefore, the fact that $\breve{x}_1$ and $\breve{x}_2$ should be centered is useful.
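As a concrete check of the zero-mean simplification above, the following Python sketch (ours, with our own function names) computes the simplified SSIM and the SSIM-based squared distance for two centered block vectors, using the constant $c = (d-1)\, c_2$:

```python
import numpy as np

def ssim_zero_mean(x1, x2, l=255):
    """Simplified SSIM for two zero-mean (centered) block vectors."""
    d = x1.size
    c = (d - 1) * (0.03 * l) ** 2  # c = (d - 1) * c2 with c2 = (0.03 l)^2
    return (2 * (x1 @ x2) + c) / (x1 @ x1 + x2 @ x2 + c)

def ssim_dist_sq(x1, x2, l=255):
    """Squared SSIM-based distance: 1 - SSIM, for centered blocks."""
    return 1.0 - ssim_zero_mean(x1, x2, l)
```

For centered blocks, `ssim_dist_sq(x1, x2)` equals `||x1 - x2||^2 / (||x1||^2 + ||x2||^2 + c)`, which is the distance in Eq. (2).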
3 Principal Component Analysis
Since ISCA is inspired by PCA [jolliffe2011principal], we briefly review it here. Assume that the orthonormal columns of the matrix $U \in \mathbb{R}^{d \times p}$ are the vectors which span the PCA subspace. Then, the projection of the data $X \in \mathbb{R}^{d \times n}$ onto the PCA subspace and the reconstructed data are $U^\top X$ and $U U^\top X$, respectively. The squared length of the projected data is $\|U^\top X\|_F^2 = \mathbf{tr}(U^\top X X^\top U)$, where $\mathbf{tr}(\cdot)$ and $\|\cdot\|_F$ denote the trace and Frobenius norm of a matrix, respectively. Presuming that the data are already centered, $S := X X^\top$ is the covariance matrix; therefore: $\|U^\top X\|_F^2 = \mathbf{tr}(U^\top S U)$. Maximizing the squared length of the projection where the projection matrix is orthogonal is:

$$\underset{U}{\text{maximize}} \;\; \mathbf{tr}(U^\top S U), \quad \text{subject to} \;\; U^\top U = I. \tag{3}$$

The Lagrangian [boyd2004convex] is: $\mathcal{L} = \mathbf{tr}(U^\top S U) - \mathbf{tr}\big(\Lambda^\top (U^\top U - I)\big)$, where $\Lambda$ is a diagonal matrix including the Lagrange multipliers. Equating the derivative of $\mathcal{L}$ with respect to $U$ to zero gives us: $S U = U \Lambda$. Therefore, the columns of $U$ are the eigenvectors of the covariance matrix $S$.
PCA can be looked at from another point of view. The reconstruction error is $R := X - U U^\top X$, where $R$ is the matrix of residuals. We want to minimize the reconstruction error:

$$\underset{U}{\text{minimize}} \;\; \|X - U U^\top X\|_F^2, \quad \text{subject to} \;\; U^\top U = I. \tag{4}$$

The objective function is $\|X - U U^\top X\|_F^2 = \mathbf{tr}(X^\top X) - \mathbf{tr}(X^\top U U^\top X)$, using $U^\top U = I$. The Lagrangian [boyd2004convex] is: $\mathcal{L} = \mathbf{tr}(X^\top X) - \mathbf{tr}(X^\top U U^\top X) + \mathbf{tr}\big(\Lambda^\top (U^\top U - I)\big)$, where $\Lambda$ is a diagonal matrix including the Lagrange multipliers. Equating the derivative of $\mathcal{L}$ with respect to $U$ to zero gives: $S U = U \Lambda$, which is again the eigenvalue problem for the covariance matrix. Therefore, the PCA subspace is the best linear projection in terms of reconstruction error.
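The eigenvalue solution above can be sketched numerically. This minimal Python example (ours, not part of the paper) computes the leading PCA bases from the covariance matrix of centered data and forms the projection and reconstruction:

```python
import numpy as np

def pca_subspace(X, p):
    """Top-p PCA directions of centered data X (d x n), found by solving
    the eigenvalue problem S U = U Lambda with S = X X^T."""
    S = X @ X.T                                     # covariance (data centered)
    eigvals, eigvecs = np.linalg.eigh(S)            # ascending eigenvalues
    U = eigvecs[:, np.argsort(eigvals)[::-1][:p]]   # p leading eigenvectors
    return U

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))
X -= X.mean(axis=1, keepdims=True)  # center the data
U = pca_subspace(X, 2)
proj = U.T @ X                      # projected data
recon = U @ U.T @ X                 # reconstructed data
```

With `p = d`, the reconstruction is exact, illustrating that PCA truncation is what introduces the reconstruction error.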
As shown above, PCA [jolliffe2011principal] is based on the $\ell_2$ (or Frobenius) norm, which is not a promising measure for image quality assessment [wang2009mean]. In order to have both the minimization of reconstruction error, as in PCA, and a proper measure of image fidelity, we propose ISCA.
4 Image Structural Component Analysis (ISCA)
4.1 Orthonormal Bases for One Image
Our goal is to find a subspace spanned by $p$ directions for some desired $p$. Consider an image block $\breve{x} \in \mathbb{R}^q$ which is centered (its mean is removed). We want to project it onto a $p$-dimensional subspace and then reconstruct it back, where $p \le q$. Assume $U \in \mathbb{R}^{q \times p}$ is a matrix whose columns are the projection directions spanning the subspace. The projection and reconstruction of $\breve{x}$ are $U^\top \breve{x}$ and $U U^\top \breve{x}$, respectively. We want to minimize the reconstruction error with orthonormal bases of the subspace; therefore:

$$\underset{U}{\text{minimize}} \;\; \|\breve{x} - U U^\top \breve{x}\|_S^2, \quad \text{subject to} \;\; U^\top U = I. \tag{5}$$

According to Eq. (2) and noticing the orthonormality of the projection directions, $U^\top U = I$, we have:

$$f(U) := \|\breve{x} - U U^\top \breve{x}\|_S^2 = \frac{\breve{x}^\top \breve{x} - \breve{x}^\top U U^\top \breve{x}}{\breve{x}^\top \breve{x} + \breve{x}^\top U U^\top \breve{x} + c}. \tag{6}$$

The gradient of $f(U)$ is:

$$\frac{\partial f(U)}{\partial U} = \frac{-2\,(2\, \breve{x}^\top \breve{x} + c)\, \breve{x}\breve{x}^\top U}{(\breve{x}^\top \breve{x} + \breve{x}^\top U U^\top \breve{x} + c)^2}. \tag{7}$$
We partition a $d$-dimensional image into $b$ non-overlapping blocks, each of which is a reshaped vector $\breve{x}_i \in \mathbb{R}^q$. The parameter $p \le q$ is an upper bound on the desired dimensionality of the subspace of a block. This parameter should not be a very large number, due to the spatial variety of image statistics, yet also not very small, so as to be able to capture the image structure. Also note that $p$ is an upper bound on the rank of $U_i$.
We have $b$ instances of $p$-dimensional subspaces, one for each of the blocks. For projecting an image into the subspace and reconstructing it back, one can project and reconstruct every block of the image separately using the bases of that block's subspace. The overall bases of an image can be visualized in image form by putting the bases of the blocks next to each other (see the experiments in Section 6).
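The partitioning step can be sketched as follows; this Python helper (ours) cuts an image into non-overlapping square blocks, reshapes each one into a vector, and removes its mean, since centered blocks are required by the SSIM distance. It assumes the block size divides the image dimensions evenly:

```python
import numpy as np

def image_blocks(img, b=8):
    """Partition an image into non-overlapping b x b blocks, reshape each
    block to a q = b*b vector, and center it (remove its mean)."""
    h, w = img.shape
    blocks = []
    for i in range(0, h, b):
        for j in range(0, w, b):
            v = img[i:i + b, j:j + b].reshape(-1).astype(float)
            blocks.append(v - v.mean())
    return np.stack(blocks)  # shape: (number_of_blocks, q)
```

The block means are stored separately if the image is to be reconstructed later, since they are added back after reconstruction.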
Considering all the $b$ blocks in the image, the problem in Eq. (5) becomes:

$$\underset{\{U_i\}_{i=1}^b}{\text{minimize}} \;\; \sum_{i=1}^b \|\breve{x}_i - U_i U_i^\top \breve{x}_i\|_S^2, \quad \text{subject to} \;\; U_i^\top U_i = I, \;\; \forall i \in \{1, \dots, b\}, \tag{8}$$

where $\breve{x}_i$ and $U_i$ are the $i$-th block and the bases of its subspace, respectively. We can embed the constraint as an indicator function in the objective function [boyd2011distributed]:

$$\underset{U,\, V}{\text{minimize}} \;\; \sum_{i=1}^b f(U_i) + h(V), \quad \text{subject to} \;\; U = V, \tag{9}$$

where $f(U_i) := \|\breve{x}_i - U_i U_i^\top \breve{x}_i\|_S^2$ and $h(V) := \sum_{i=1}^b \mathbb{I}(V_i^\top V_i = I)$. The $\mathbb{I}(\cdot)$ denotes the indicator function, which is zero if its condition is satisfied and infinite otherwise. The $U$ and $V$ are defined as the union of partitions to form an image-form array, i.e., $U := [U_1, \dots, U_b]$ and $V := [V_1, \dots, V_b]$ [otero2018alternate].
Eq. (9) can be solved using the Alternating Direction Method of Multipliers (ADMM) [boyd2011distributed, otero2018alternate]. The augmented Lagrangian for Eq. (9) is: $\mathcal{L}_\rho = \sum_{i=1}^b f(U_i) + h(V) + (\rho/2)\,\|U - V + J\|_F^2 - (\rho/2)\,\|J\|_F^2$, where $\Lambda$ is the Lagrange multiplier, $\rho > 0$ is a parameter, and $J := \Lambda / \rho$ is the scaled multiplier. Note that the term $(\rho/2)\,\|J\|_F^2$ is a constant with respect to $U$ and $V$ and can be dropped. The updates of $U$, $V$, and $J$ are done as [boyd2011distributed, otero2018alternate]:

$$U_i^{(k+1)} := \underset{U_i}{\arg\min} \;\Big( f(U_i) + \frac{\rho}{2}\,\|U_i - V_i^{(k)} + J_i^{(k)}\|_F^2 \Big), \tag{10}$$

$$V^{(k+1)} := \underset{V}{\arg\min} \;\Big( h(V) + \frac{\rho}{2}\,\|U^{(k+1)} - V + J^{(k)}\|_F^2 \Big), \tag{11}$$

$$J^{(k+1)} := J^{(k)} + U^{(k+1)} - V^{(k+1)}. \tag{12}$$
Considering $\|A\|_F^2 = \mathbf{tr}(A^\top A)$ for a matrix $A$, the gradient of the objective function in Eq. (10) with respect to $U_i$ is $\frac{\partial f(U_i)}{\partial U_i} + \rho\,(U_i - V_i + J_i)$, where $\frac{\partial f(U_i)}{\partial U_i}$ is defined in Eq. (7). We can use the gradient descent method [boyd2004convex] for solving Eq. (10). Our experiments showed that even one iteration of gradient descent suffices for Eq. (10), because ADMM is itself iterative. Hence, we can replace this equation with one iteration of gradient descent.
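The gradient step of the $U$-update can be sketched in Python as follows (our code; `grad_f` implements the gradient of Eq. (7) for a centered block, and `u_step` adds the penalty term of the augmented Lagrangian):

```python
import numpy as np

def grad_f(U, x, c):
    """Gradient of the SSIM-based reconstruction error f(U) of a centered
    block x with respect to U (the quotient-rule gradient of Eq. (7))."""
    a = x @ U @ U.T @ x                         # x^T U U^T x
    denom = (x @ x + a + c) ** 2
    return -2 * (2 * (x @ x) + c) * np.outer(x, x) @ U / denom

def u_step(U, V, J, x, c, rho, eta):
    """One gradient-descent step for the U-update; one step suffices
    because ADMM is itself iterative."""
    return U - eta * (grad_f(U, x, c) + rho * (U - V + J))
```

A finite-difference check confirms the gradient formula on random inputs.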
The proximal operator is defined as [parikh2014proximal]:

$$\mathbf{prox}_{\lambda, g}(v) := \underset{x}{\arg\min} \;\Big( g(x) + \frac{1}{2\lambda}\,\|x - v\|_2^2 \Big), \tag{13}$$

where $\lambda$ is the proximal parameter and $g$ is the function that the proximal algorithm wants to minimize. According to Eq. (13), Eq. (11) is equivalent to $V^{(k+1)} := \mathbf{prox}_{1/\rho,\, h}(U^{(k+1)} + J^{(k)})$. As $h$ is an indicator function, its proximal operator is projection [parikh2014proximal]. Therefore, Eq. (11) is equivalent to $V^{(k+1)} := \Pi(U^{(k+1)} + J^{(k)})$, where $\Pi(\cdot)$ denotes projection onto the constraint set. Here, the variable of the proximal operator is a matrix and not a vector. According to [parikh2014proximal], if $g$ is a convex and orthogonally invariant function, and it works on the singular values of a matrix variable $A$, i.e., $g = \varphi \circ \sigma$, where the function $\sigma(A)$ gives the vector of singular values of $A$, then the proximal operator is:

$$\mathbf{prox}_{\lambda, g}(A) := Q\, \mathbf{diag}\big(\mathbf{prox}_{\lambda, \varphi}(\sigma(A))\big)\, \Omega^\top, \tag{14}$$

where $Q$ and $\Omega$ are the matrices of left and right singular vectors of $A$, respectively. In our constraint, $V_i^\top V_i = I$, the function deals with the singular values of $V_i$. The reason is that, writing $V_i = Q \Sigma \Omega^\top$, we want $V_i^\top V_i = \Omega \Sigma Q^\top Q \Sigma \Omega^\top = \Omega \Sigma^2 \Omega^\top = I$, where the simplifications hold because $Q$ and $\Omega$ are orthogonal matrices; this is satisfied when all singular values equal one. Therefore, we can use Eq. (14) for Eq. (11), where $\mathbf{prox}_{\lambda, \varphi}$ sets the singular values of $U_i^{(k+1)} + J_i^{(k)}$ to one. In summary, Eqs. (10), (11), and (12) can be restated as:

$$U_i^{(k+1)} := U_i^{(k)} - \eta\,\Big( \frac{\partial f(U_i^{(k)})}{\partial U_i} + \rho\,\big(U_i^{(k)} - V_i^{(k)} + J_i^{(k)}\big) \Big), \quad V_i^{(k+1)} := Q_i\, \Omega_i^\top, \quad J_i^{(k+1)} := J_i^{(k)} + U_i^{(k+1)} - V_i^{(k+1)}, \tag{15}$$
where the columns of $Q_i$ and $\Omega_i$ are the left and right singular vectors of $U_i^{(k+1)} + J_i^{(k)}$, and $\eta$ is the learning rate. Iteratively solving Eq. (15) until convergence gives us $U_i$ for the image blocks indexed by $i \in \{1, \dots, b\}$. The columns of $U_i$ are the bases for the ISCA subspace of the $i$-th block. Unlike in PCA, the ISCA bases do not have an order of importance, but as in PCA, they are orthogonal, capturing different features of the image structure. The $i$-th projected block is $U_i^\top \breve{x}_i \in \mathbb{R}^p$, where its dimensions are the image structural components. Note that $\breve{x}_i$, whether it is a block in a training image or an out-of-sample image, is centered. It is noteworthy that if we consider only one block in the images, the subscript $i$ is dropped from Eq. (15).
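The $V$-update described above, projection by setting singular values to one, can be sketched in a few lines of Python (our code, our function name):

```python
import numpy as np

def project_orthogonal(A):
    """V-update of the ADMM scheme: project A (here, U + J) onto the set
    of matrices with orthonormal columns by replacing its singular
    values with ones, i.e., A = Q Sigma W^T  ->  Q W^T."""
    Q, _, Wt = np.linalg.svd(A, full_matrices=False)
    return Q @ Wt
```

The result always has exactly orthonormal columns, regardless of how far the input has drifted from the constraint set.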
4.2 Orthonormal Bases for a Set of Images
So far, if we have a set of $n$ images, we can find the subspace bases for the $i$-th block in each of them using Eq. (15). Now, we want to find the subspace bases for the $i$-th block in all training images of the dataset. In other words, we want to find the subspace for the best reconstruction of the $i$-th block in all training images. For this goal, we can look at the optimization problem in Eq. (8) or (9) as an undercomplete auto-encoder neural network [goodfellow2016deep] with one hidden layer, where the input layer, hidden layer, and output layer have $q$, $p$, and $q$ neurons, respectively. The $U_i^\top$ and $U_i$ fill the role of applying the first and second weight matrices to the input, respectively; the weights of the two layers are tied. Therefore, we will have $b$ auto-encoders, each with one hidden layer.
For training the auto-encoders, we introduce the blocks of an image as the input to the networks and update the weights based on Eq. (15). Note that we do this update of weights with only one iteration of ADMM. Then, we move to the blocks of the next image and update the weights again by an iteration of Eq. (15). We do this for all images, one by one, until an epoch is completed, where an epoch is defined as introducing the blocks of all training images of the dataset to the networks. After the termination of an epoch, we start another epoch to tune the weights again. The epochs are repeated until convergence. The termination criterion can be a small average reconstruction error, $\frac{1}{nb}\sum_{j=1}^{n}\sum_{i=1}^{b} \|\breve{x}_i^{(j)} - U_i U_i^\top \breve{x}_i^{(j)}\|_S^2 \le \varepsilon$, where $\varepsilon$ is a small number and $\breve{x}_i^{(j)}$ is the $i$-th block in the $j$-th image. After training the networks, we have one $p$-dimensional subspace for every block position across all training images, where the columns of the weight matrix $U_i$ span the subspace of the $i$-th block. Note that because of ADMM, the $b$ auto-encoders are trained simultaneously and in parallel. Again, the columns of $U_i$ are the bases for the ISCA subspace of the $i$-th block.
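The epoch-based training of one such auto-encoder can be sketched as follows; this Python function (ours, with assumed default values for `rho`, `eta`, and `epochs`) runs a single-iteration ADMM update per image for one block position across a set of images:

```python
import numpy as np

def train_block_subspace(blocks, p, c, rho=1.0, eta=0.1, epochs=50):
    """Train a p-dimensional subspace for one block position across a set
    of images: one gradient step (U), an SVD projection (V), and a dual
    update (J) per image, repeated over epochs."""
    q = blocks.shape[1]
    rng = np.random.default_rng(0)
    U, _ = np.linalg.qr(rng.standard_normal((q, p)))  # random orthonormal init
    V = U.copy()
    J = np.zeros((q, p))
    for _ in range(epochs):
        for x in blocks:                               # one image's block at a time
            a = x @ U @ U.T @ x
            g = -2 * (2 * (x @ x) + c) * np.outer(x, x) @ U / (x @ x + a + c) ** 2
            U = U - eta * (g + rho * (U - V + J))      # one gradient step
            Q, _, Wt = np.linalg.svd(U + J, full_matrices=False)
            V = Q @ Wt                                 # singular values set to one
            J = J + U - V                              # dual update
    return V
```

By construction, the returned bases are exactly orthonormal after every update.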
5 Kernel Image Structural Component Analysis
We can map a block $\breve{x}$ to a higher-dimensional feature space, hoping to have the data fall close to a simpler-to-analyze manifold in the feature space. Suppose $\phi(\cdot)$ is a function which maps the data to the feature space, i.e., $\breve{x} \mapsto \phi(\breve{x})$. Let $t$ denote the dimensionality of the feature space, i.e., $\phi(\breve{x}) \in \mathbb{R}^t$; we usually have $t \gg q$. The kernel of the $i$-th block in images $j_1$ and $j_2$, which are $\breve{x}_i^{(j_1)}$ and $\breve{x}_i^{(j_2)}$, is $k\big(\breve{x}_i^{(j_1)}, \breve{x}_i^{(j_2)}\big) := \phi\big(\breve{x}_i^{(j_1)}\big)^\top \phi\big(\breve{x}_i^{(j_2)}\big)$ [hofmann2008kernel]. The kernel matrix for the $i$-th block among the $n$ images is $K_i := \Phi_i^\top \Phi_i \in \mathbb{R}^{n \times n}$, where $\Phi_i := \big[\phi(\breve{x}_i^{(1)}), \dots, \phi(\breve{x}_i^{(n)})\big]$. After calculating the kernel matrix, we normalize it [ah2010normalized] as $K_i(a, b) := K_i(a, b) / \sqrt{K_i(a, a)\, K_i(b, b)}$, where $K_i(a, b)$ denotes the $(a, b)$-th element of the kernel matrix. Afterwards, the kernel is double-centered as $K_i := H K_i H$, where $H := I - (1/n)\, \mathbf{1}\mathbf{1}^\top$. The reason for double-centering is that Eq. (2) requires the blocks, and thus the $\phi(\breve{x}_i)$, to be centered (see Eq. (16)). Therefore, in kernel ISCA, we center the kernel rather than centering $\phi(\breve{x}_i)$ directly.
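The kernel normalization and double-centering described above can be sketched in Python (our code; the kernel matrix is assumed to have strictly positive diagonal entries):

```python
import numpy as np

def normalize_kernel(K):
    """Normalize a kernel matrix: K(a,b) / sqrt(K(a,a) * K(b,b))."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def double_center(K):
    """Double-center a kernel: H K H with H = I - (1/n) 1 1^T, which
    makes the feature-space data effectively centered."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H
```

After double-centering, every row sum and column sum of the kernel is zero, the kernel-space analogue of removing the block means.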
According to representation theory [alperin1993local], the projection matrix $U_i$ can be expressed as a linear combination of the projected data points. Therefore, we have $U_i = \Phi_i \Theta_i$, where every column of $\Theta_i \in \mathbb{R}^{n \times p}$ is the vector of coefficients for expressing a projection direction as a linear combination of the projected image blocks.
As we did for ISCA, we first consider learning the subspaces for one image here. Considering $k_i := \Phi_i^\top \phi(\breve{x}_i) \in \mathbb{R}^n$ for the $i$-th block in the image, the objective function of Eq. (8) in the feature space is $f(\Theta_i) := \|\phi(\breve{x}_i) - U_i U_i^\top \phi(\breve{x}_i)\|_S^2$, where $U_i = \Phi_i \Theta_i$. Note that $\Phi_i$ includes the mapping of the $i$-th block in all the images, while $\phi(\breve{x}_i)$ is the mapping of the $i$-th block in the image we are considering. The constraint of Eq. (8) in the feature space is $U_i^\top U_i = \Theta_i^\top K_i \Theta_i = I$. Therefore, Eq. (8) in the feature space is:

$$\underset{\Theta_i}{\text{minimize}} \;\; \|\phi(\breve{x}_i) - U_i U_i^\top \phi(\breve{x}_i)\|_S^2, \quad \text{subject to} \;\; \Theta_i^\top K_i \Theta_i = I. \tag{16}$$

Noticing the constraint and using Eq. (2), we have:

$$f(\Theta_i) = \frac{\phi(\breve{x}_i)^\top \phi(\breve{x}_i) - k_i^\top \Theta_i \Theta_i^\top k_i}{\phi(\breve{x}_i)^\top \phi(\breve{x}_i) + k_i^\top \Theta_i \Theta_i^\top k_i + c}, \tag{17}$$

where $k_i = \Phi_i^\top \phi(\breve{x}_i)$. The gradient of $f(\Theta_i)$ is:

$$\frac{\partial f(\Theta_i)}{\partial \Theta_i} = \frac{-2\,\big(2\, \phi(\breve{x}_i)^\top \phi(\breve{x}_i) + c\big)\, k_i k_i^\top \Theta_i}{\big(\phi(\breve{x}_i)^\top \phi(\breve{x}_i) + k_i^\top \Theta_i \Theta_i^\top k_i + c\big)^2}. \tag{18}$$
We can simplify the constraint $\Theta_i^\top K_i \Theta_i = I$. As the kernel is positive semi-definite, we can decompose it as:

$$K_i = \Delta_i^\top \Delta_i,$$

where $\Delta_i \in \mathbb{R}^{n \times n}$. Therefore, the constraint can be written as: $(\Delta_i \Theta_i)^\top (\Delta_i \Theta_i) = I$. In Eq. (16), if we embed the constraint in the objective function [boyd2011distributed], we have:

$$\underset{\Theta,\, V}{\text{minimize}} \;\; \sum_{i=1}^b f(\Theta_i) + h(V), \quad \text{subject to} \;\; \Delta_i \Theta_i = V_i, \;\; \forall i \in \{1, \dots, b\}, \tag{19}$$

where $h(V) := \sum_{i=1}^b \mathbb{I}(V_i^\top V_i = I)$, $\Theta := [\Theta_1, \dots, \Theta_b]$, and $V := [V_1, \dots, V_b]$. Taking $\Psi_i := \Delta_i \Theta_i$, we can restate Eq. (19) as: minimizing $\sum_{i=1}^b f(\Theta_i) + h(V)$, subject to $\Psi = V$, where $\Psi := [\Psi_1, \dots, \Psi_b]$. The ADMM solution to this optimization problem is [boyd2011distributed, otero2018alternate]:

$$\Theta_i^{(k+1)} := \underset{\Theta_i}{\arg\min} \;\Big( f(\Theta_i) + \frac{\rho}{2}\,\|\Delta_i \Theta_i - V_i^{(k)} + J_i^{(k)}\|_F^2 \Big), \tag{20}$$

$$V^{(k+1)} := \underset{V}{\arg\min} \;\Big( h(V) + \frac{\rho}{2}\,\|\Psi^{(k+1)} - V + J^{(k)}\|_F^2 \Big), \tag{21}$$

$$J^{(k+1)} := J^{(k)} + \Psi^{(k+1)} - V^{(k+1)}. \tag{22}$$
With explanations similar to those we had for Eq. (15), we have:

$$\Theta_i^{(k+1)} := \Theta_i^{(k)} - \eta\,\Big( \frac{\partial f(\Theta_i^{(k)})}{\partial \Theta_i} + \rho\, \Delta_i^\top \big(\Delta_i \Theta_i^{(k)} - V_i^{(k)} + J_i^{(k)}\big) \Big), \quad V_i^{(k+1)} := Q_i\, \Omega_i^\top, \quad J_i^{(k+1)} := J_i^{(k)} + \Delta_i \Theta_i^{(k+1)} - V_i^{(k+1)}, \tag{23}$$

where the columns of $Q_i$ and $\Omega_i$ are the left and right singular vectors of $\Delta_i \Theta_i^{(k+1)} + J_i^{(k)}$. Iteratively solving Eq. (23) until convergence gives us $\Theta_i$ for the image blocks indexed by $i \in \{1, \dots, b\}$. The columns of $\Theta_i$ are the bases for the kernel ISCA subspace of the $i$-th block. The $i$-th projected block is $U_i^\top \phi(\breve{x}_i) = \Theta_i^\top k_i \in \mathbb{R}^p$, and its dimensions are the kernel image structural components. Note that the kernel, whether it is computed over a block in a training image or an out-of-sample image, is normalized and centered. Also, note that we had considered the blocks of only one image for Eq. (23). Again, with the auto-encoder approach, we can solve these equations in successive epochs in order to find the subspaces over all the training images.
6 Experiments
Training Dataset: We formed a dataset out of the standard Lena image. Six different types of distortions were applied to the original Lena image (see Fig. 1), each of which contributes images with different MSE values to the dataset; the training set consists of these distorted images together with the original image. For every type of distortion, different levels of MSE were generated, so as to have images on equal-MSE, or iso-error, hyperspheres [wang2006modern].
Training: In our experiments, suitable values of the parameters $\rho$ and the learning rate $\eta$ were used for ISCA and kernel ISCA, and the images were partitioned into non-overlapping $8 \times 8$ blocks (a choice inspired by [otero2014unconstrained, otero2018alternate]). One of the dimensions of the trained $U$, $V$, and $J$ for ISCA is shown in Fig. 2. The dual variable $J$ has captured the edges, because edges carry much of the structural information. As expected, $U$ and $V$ are close to each other (Lena can be discerned in them upon close inspection). Note that the variables in kernel ISCA do not have the dimensionality of the image blocks and thus cannot be displayed in image form.
Projections and Comparisons:
In order to evaluate the trained ISCA and kernel ISCA subspaces, we projected the training images onto these subspaces. For projecting an image, each of its blocks is projected onto the subspace of that block. After projecting all the images, we used the 1-Nearest Neighbor (1NN) classifier to recognize the distortion type of every block. The 1NN is useful for evaluating the subspace by the closeness of the projected distortions. The distortion type of an image comes from a majority vote among its blocks. The linear, Radial Basis Function (RBF), and sigmoid kernels were tested for kernel ISCA. The confusion matrices for distortion recognition are shown in Fig. 3. Mostly, kernel ISCA performed better than ISCA because it works in the feature space, although ISCA performed better for some distortions, such as contrast stretching and blurring. Moreover, we compared with PCA and kernel PCA. PCA showed weakness for contrast stretching. The RBF and sigmoid kernels in kernel PCA do not perform well for JPEG distortion and contrast stretching, respectively.
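The block-wise 1NN classification with a majority vote over blocks can be sketched as follows (our Python code, with illustrative names; labels and projections are assumed to be precomputed):

```python
import numpy as np
from collections import Counter

def predict_distortion(test_proj, train_proj, train_labels):
    """Classify each projected block of a test image by its nearest
    projected training block (1NN), then take a majority vote over the
    image's blocks to decide the image's distortion type."""
    votes = []
    for t in test_proj:                                   # one projected block per row
        dists = np.linalg.norm(train_proj - t, axis=1)
        votes.append(train_labels[int(np.argmin(dists))])
    return Counter(votes).most_common(1)[0][0]
```

In practice, `train_proj` holds the projections of blocks from all training images and `train_labels` their distortion types.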
Out-of-sample Projections: For out-of-sample projection, we created test images, some having a single distortion and some having a combination of different distortions (see Fig. 4). We did the same 1NN classification for these images. Table 1 reports the top two votes of blocks for every image, with the percentage of blocks voting for those distortions. ISCA did not recognize luminance enhancement well enough because, for Eq. (2), the block is centered, while in kernel ISCA, the block is centered in the feature space. Overall, both ISCA and kernel ISCA performed compellingly, even in recognizing combinations of distortions.
Table 1: Top two votes of blocks (with percentages) for every out-of-sample image.

| distortion | C | G | L | B | I | J | B, G | B, L | I, L | J, G | J, L | J, C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ISCA | 69.3% O / 30.2% C | 49.1% G / 27.2% I | 69.7% O / 29.6% C | 99.8% B / 0.2% J | 30.3% G / 23.8% I | 96.4% J / 3.6% B | 55.2% B / 19.9% G | 98.7% B / 1.3% J | 48.9% G / 33.3% I | 39.4% J / 32.9% B | 96.4% J / 3.6% B | 97.9% J / 2.1% B |
| kernel ISCA (linear) | 88.1% C / 11.2% I | 59.2% G / 25.2% I | 99.8% L / 0.1% O | 95.9% B / 3.4% J | 37.4% G / 32.3% I | 80.4% J / 17.6% B | 40.8% G / 33.4% B | 93.4% B / 5.8% J | 45.6% I / 39.4% G | 38.7% G / 21.7% J | 70.2% J / 27.0% B | 74.1% J / 25.0% B |
| kernel ISCA (RBF) | 72.0% C / 10.9% I | 79.1% G / 5.0% B | 99.2% L / 0.5% G | 70.6% B / 13.1% C | 39.6% I / 36.4% C | 74.3% J / 13.5% C | 44.8% G / 28.5% B | 88.1% L / 6.6% B | 48.2% L / 37.7% G | 33.1% G / 21.6% L | 82.8% L / 8.9% G | 43.1% J / 30.0% B |
| kernel ISCA (sigmoid) | 80.3% C / 7.6% I | 76.3% G / 6.8% I | 99.6% L / 0.2% G,B | 76.2% B / 10.6% J | 38.5% I / 36.5% C | 79.3% J / 10.6% C | 47.9% G / 24.6% B | 81.7% L / 9.8% B | 52.1% L / 26.6% G | 37.9% G / 19.8% L | 80.7% L / 11.3% G | 43.9% J / 31.0% B |
Reconstruction: The images can be reconstructed after the projection onto the ISCA subspace. For reconstruction, every block is reconstructed as $U_i U_i^\top \breve{x}_i$, where the mean of the block should then be added back to the reconstruction. Similar to kernel PCA, reconstruction cannot be done in kernel ISCA, because the reconstruction lives in the feature space and the explicit mapping $\phi(\cdot)$, and hence its pre-image, is not available. Figure 5 shows reconstructions of some of the training and out-of-sample images. As expected, the reconstructed images, for both training and out-of-sample images, are very similar to the original images.
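The per-block reconstruction just described can be sketched as follows (our Python helper; the stored block mean is the one removed when the block was centered):

```python
import numpy as np

def reconstruct_block(U, block_vec):
    """Reconstruct one block from its ISCA subspace bases U: center the
    block, project with U^T, map back with U, and add the mean back."""
    mu = block_vec.mean()
    x = block_vec - mu
    return U @ (U.T @ x) + mu
```

A full image is reconstructed by applying this to every block with that block's own bases and reassembling the blocks.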
7 Conclusion and Future Direction
This paper introduces the concept of an image structure subspace which captures the structure of an image and discriminates the distortion types. We hope this will open a broad new field for research in this area and build a greatly needed bridge between the worlds of image quality assessment and manifold learning.
For image structure subspace learning, ISCA and kernel ISCA were proposed, taking inspiration from PCA. As future work, we can consider designing deeper auto-encoders [goodfellow2016deep] with non-linear activation functions for image structure subspace learning.
8 Supplementary Material
This section is the supplementary material for the paper "Principal Component Analysis Using Structural Similarity Index for Images". Here, the derivations of the mathematical expressions which were not completely detailed in the main paper are explained. We explain the derivations of Eqs. (6), (7), (15), (17), (18), and (23).
8.1 Derivation of Eq. (6)
In the following, we mention the derivation of Eq. (6):

$$\|\breve{x} - U U^\top \breve{x}\|_S^2 \overset{(a)}{=} \frac{\|\breve{x} - U U^\top \breve{x}\|_2^2}{\|\breve{x}\|_2^2 + \|U U^\top \breve{x}\|_2^2 + c},$$

where $(a)$ is because of Eq. (2). The numerator of the fraction is simplified as:

$$\|\breve{x} - U U^\top \breve{x}\|_2^2 = \breve{x}^\top \breve{x} - 2\, \breve{x}^\top U U^\top \breve{x} + \breve{x}^\top U U^\top U U^\top \breve{x} \overset{(b)}{=} \breve{x}^\top \breve{x} - \breve{x}^\top U U^\top \breve{x},$$

where $(b)$ is because of the constraint $U^\top U = I$ in Eq. (5).