Nonnegative data matrices appear in many data analysis applications. For instance, in image analysis, image pixel values are nonnegative and the associated nonnegative image data matrices can be formed for clustering and recognition [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. In text mining, the frequencies of terms in documents are nonnegative and the resulting nonnegative term-to-document data matrices can be constructed for clustering [13, 14, 15, 16]. In bioinformatics, nonnegative gene expression values are studied and nonnegative gene expression data matrices are generated for disease and gene classification [17, 18, 19, 20, 21]. Low rank matrix approximation for nonnegative matrices plays a key role in all these applications. Its main purpose is to identify a latent feature space for object representation. Classification, clustering, or recognition analysis can then be performed using these latent features.
Nonnegative Matrix Factorization (NMF) was introduced in 1994 by Paatero and Tapper for performing environmental data analysis. The purpose of NMF is to decompose an input $m$-by-$n$ nonnegative matrix $A$ into an $m$-by-$r$ nonnegative matrix $B$ and an $r$-by-$n$ nonnegative matrix $C$ such that $A \approx BC$, and more precisely to solve
$$\min_{B,\, C \ge 0} \| A - BC \|_F^2, \qquad (1)$$
where $B, C \ge 0$ means that each entry of $B$ and $C$ is nonnegative, $\|\cdot\|_F$ is the Frobenius norm of a matrix, and $r$ (the low rank value) is smaller than $m$ and $n$. Several researchers have proposed and developed algorithms for computing such nonnegative matrix factorizations in the literature [1, 8, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32]. Lee and Seung proposed and developed NMF algorithms, and demonstrated that NMF yields a part-based representation which can be used for intuitive perceptual interpretation. For the development of NMF, we refer to the recently edited book .
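For concreteness, the following sketch implements the multiplicative updates of Lee and Seung for this minimization; the matrix size, rank, iteration count, and random data below are illustrative choices, not values from the paper.

```python
import numpy as np

def nmf_mu(A, r, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative updates for min ||A - B C||_F^2 s.t. B, C >= 0.
    Entries stay nonnegative because each update multiplies by a
    nonnegative ratio; eps guards against division by zero."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    B = rng.random((m, r)) + eps
    C = rng.random((r, n)) + eps
    for _ in range(n_iter):
        C *= (B.T @ A) / (B.T @ B @ C + eps)  # update C with B fixed
        B *= (A @ C.T) / (B @ C @ C.T + eps)  # update B with C fixed
    return B, C

A = np.abs(np.random.default_rng(1).standard_normal((20, 15)))
B, C = nmf_mu(A, r=5)
rel_err = np.linalg.norm(A - B @ C) / np.linalg.norm(A)
```

Under these updates the objective is nonincreasing, which is the property that made the multiplicative scheme popular.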
Nowadays, data from many fields are more naturally represented as multi-dimensional arrays, referred to as tensors; examples include video data, hyperspectral data, and fMRI data. For more details and applications, we refer to the review paper . To handle tensor data, there are two well-known and widely used decompositions, the Canonical Polyadic decomposition (CPD) and the Tucker decomposition. Because of the interpretability of NMF, many nonnegative tensor decompositions and models have been proposed and developed. Most models focus on CPD and Tucker decomposition with nonnegativity constraints.
For the CPD with nonnegativity constraints, known as Nonnegative Tensor Factorization (NTF), Welling et al. introduced a fixed point algorithm as early as 2001 in . Cichocki et al. [26, 36] developed a class of optimized local algorithms known as Hierarchical ALS (HALS), and also extended the approach to other cost functions using the Alpha and Beta divergences for NTF computation. In , Park et al. proposed an NTF algorithm based on the alternating nonnegativity-constrained least squares method. In , the block principal pivoting method is further introduced to solve nonnegative constrained least squares (NNLS) problems, overcoming some difficulties that arise in NNLS problems with a large number of variables. Xu et al.  studied regularized block multiconvex optimization and proposed a block coordinate descent method which can be applied to the NTF problem.
In , Kim and Choi studied the Tucker decomposition with nonnegativity constraints. Multiplicative updating algorithms for NMF are extended to solve the nonnegative Tucker decomposition. Zhou et al.  transformed this problem into a series of NMF problems as well, and used the MU and HALS algorithms on the unfolding matrices for the Tucker decomposition calculation. Other constraints, such as orthogonality of the factor matrices, have also been considered and studied [42, 43]. For instance, in , Pan et al. proposed an orthogonal nonnegative Tucker decomposition and applied the alternating direction method of multipliers (ADMM), aiming to obtain clustering information from the factor matrices and their joint connection weights from the core tensor. Besides these two standard and well-known decompositions, other models have been designed for specific applications. For instance, for the blind unmixing of hyperspectral images, Qian et al.  proposed a model derived from the block term decomposition (BTD) and applied an alternating least squares method to solve it.
1.1 The Contribution
The main aim of this paper is to propose and study Nonnegative Low Rank Tensor Approximation (NLRT) for applications involving multi-dimensional images. Our approach is completely different from classical NTF, which has been studied for many years. For a given nonnegative tensor, the classical NTF approach determines a nonnegative low rank tensor by computing factor matrices or tensors (for example, CPD finds factor matrices while Tucker decomposition finds a core tensor and factor matrices) such that the distance between this nonnegative low rank tensor and the given tensor is as small as possible. The proposed NLRT approach is different from classical NTF: it determines a nonnegative low rank tensor directly, without using decompositions or factorizations. The minimized distance obtained by the proposed NLRT method can be smaller than that obtained by NTF methods, which implies that the proposed NLRT method can obtain a better low rank tensor approximation. The proposed NLRT algorithm is derived by using alternating projections on the product of low rank matrix manifolds together with the nonnegativity constraint. We show the convergence of the alternating projections by constructing two projections: one combining the projections onto the low rank matrix manifolds with the nonnegative projection, and one taking the average of tensors. Experimental results on synthetic data and multi-dimensional images are presented to demonstrate the above-mentioned advantages of the proposed NLRT method compared with the NTF methods.
The paper is organized as follows. In Section 2, we review tensor operations and decompositions, and present our algorithm. In Section 3, we show the convergence of the proposed alternating projections. In Section 4, numerical results are presented to demonstrate the proposed algorithm. Finally, some concluding remarks and future research directions are given in Section 5.
2 Nonnegative Low Rank Tensor Approximation
2.1 Tensor Operations and Decompositions
An $N$-dimensional tensor $\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_N}$ is a multi-dimensional array. We denote its $(i_1, i_2, \ldots, i_N)$-th entry by $x_{i_1 i_2 \cdots i_N}$. The inner product of two same-sized tensors $\mathcal{X}$ and $\mathcal{Y}$ is defined as
$$\langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1=1}^{n_1} \sum_{i_2=1}^{n_2} \cdots \sum_{i_N=1}^{n_N} x_{i_1 i_2 \cdots i_N}\, y_{i_1 i_2 \cdots i_N}.$$
The Frobenius norm of an $N$-dimensional tensor $\mathcal{X}$ is then defined as
$$\|\mathcal{X}\|_F = \sqrt{\langle \mathcal{X}, \mathcal{X} \rangle}.$$
The $i$-th unfolding of $\mathcal{X}$ is the matrix $X_{(i)} \in \mathbb{R}^{n_i \times \prod_{j \ne i} n_j}$ whose rows are indexed by the $i$-th mode and whose columns enumerate all the remaining modes. In the following discussion, we introduce two well-known tensor decompositions.
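The unfolding and its inverse can be realized in a few lines; the row/column ordering below is one common convention (orderings differ across the literature):

```python
import numpy as np

def unfold(X, i):
    """i-th unfolding: mode i becomes the rows, the remaining
    modes are merged into the columns."""
    return np.moveaxis(X, i, 0).reshape(X.shape[i], -1)

def fold(M, i, shape):
    """Inverse of unfold for a tensor of the given shape."""
    rest = [s for j, s in enumerate(shape) if j != i]
    return np.moveaxis(M.reshape([shape[i]] + rest), 0, i)

X = np.arange(24.0).reshape(2, 3, 4)
M = unfold(X, 1)              # a 3-by-8 matrix
X_back = fold(M, 1, X.shape)  # recovers the original tensor
```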
The CP decomposition expresses $\mathcal{X}$ as a sum of rank-one tensors:
$$\mathcal{X} = \sum_{s=1}^{R} a_s^{(1)} \circ a_s^{(2)} \circ \cdots \circ a_s^{(N)}, \qquad (2)$$
where $\circ$ denotes the outer product of vectors and $a_s^{(i)} \in \mathbb{R}^{n_i}$. The minimal value of $R$ such that the CP decomposition exists is called the CP rank of $\mathcal{X}$.
The Tucker decomposition expresses $\mathcal{X}$ as
$$\mathcal{X} = \mathcal{G} \times_1 U^{(1)} \times_2 U^{(2)} \cdots \times_N U^{(N)}, \qquad (4)$$
where $\mathcal{G} \in \mathbb{R}^{r_1 \times r_2 \times \cdots \times r_N}$ is the core tensor, $U^{(i)}$ is an $n_i$-by-$r_i$ matrix whose columns are mutually orthogonal, and $\times_i$ denotes the $i$-mode matrix product of a tensor, defined by
$$\big( \mathcal{G} \times_i U^{(i)} \big)_{j_1 \cdots j_{i-1}\, k\, j_{i+1} \cdots j_N} = \sum_{j_i = 1}^{r_i} g_{j_1 j_2 \cdots j_N}\, u^{(i)}_{k j_i}.$$
The Tucker (or multilinear) rank of $\mathcal{X}$ is defined by the tuple $(\mathrm{rank}(X_{(1)}), \mathrm{rank}(X_{(2)}), \ldots, \mathrm{rank}(X_{(N)}))$ of ranks of its unfolding matrices.
When nonnegativity constraints are imposed on the factor matrices generated by the CPD or Tucker decompositions, the models given in (2) and (4) are referred to as nonnegative tensor factorization and nonnegative Tucker decomposition, respectively. Nonnegative tensor factorization [26, 36, 38, 39] and nonnegative Tucker decomposition [40, 41, 42, 43, 44] models and algorithms were developed to deal with nonnegative tensor applications.
2.2 The Proposed Model
In [50, 51], Song and Ng proposed a new algorithm for computing nonnegative low rank matrix approximations of nonnegative matrices. Instead of using the matrix factorization in (1), their method determines a rank-$r$ nonnegative matrix $X$ such that the distance between $X$ and the given nonnegative matrix $A$ is as small as possible:
$$\min_{X \ge 0,\ \mathrm{rank}(X) = r} \| A - X \|_F^2.$$
They have shown the convergence of the proposed algorithm based on the projection onto the manifold of rank-$r$ matrices and the projection onto the set of nonnegative matrices.
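A minimal sketch of this alternating scheme for the matrix case (truncated SVD followed by entrywise clipping; the matrix size, rank, and iteration count are illustrative, and the final clipped iterate may have rank slightly larger than $r$):

```python
import numpy as np

def nlrm(A, r, n_iter=100):
    """Alternate between the projection onto rank-r matrices
    (Eckart-Young truncated SVD) and the projection onto the
    nonnegative cone (entrywise max with 0)."""
    X = A.copy()
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U[:, :r] * s[:r]) @ Vt[:r, :]  # rank-r projection
        X = np.maximum(X, 0)                # nonnegative projection
    return X

A = np.random.default_rng(0).random((10, 8))
X = nlrm(A, r=3)
```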
The main purpose of this paper is to develop nonnegative low rank tensor approximations for nonnegative tensors. The idea is to find a nonnegative low rank tensor $\mathcal{X}$ such that the distance between $\mathcal{X}$ and the given nonnegative tensor $\mathcal{A}$ is as small as possible. Mathematically, this can be formulated as the following optimization problem:
$$\min_{\mathcal{X} \ge 0} \| \mathcal{A} - \mathcal{X} \|_F^2 \quad \text{subject to} \quad \mathrm{rank}(X_{(i)}) = r_i, \quad i = 1, \ldots, N, \qquad (7)$$
where $X_{(i)}$ and $A_{(i)}$ are the $i$-th mode unfolding matrices of $\mathcal{X}$ and $\mathcal{A}$, respectively. The sizes of $X_{(i)}$ and $A_{(i)}$ are $n_i$-by-$\prod_{j \ne i} n_j$.
According to the singular value decomposition of $X_{(i)}$,
$$X_{(i)} = U_i \Sigma_i V_i^T,$$
where $U_i$ is $n_i$-by-$r_i$ with orthonormal columns, $\Sigma_i$ is a diagonal matrix of size $r_i$-by-$r_i$, and $V_i$ is $\prod_{j \ne i} n_j$-by-$r_i$ with orthonormal columns ($V_i^T$ is the transpose of $V_i$), a nonnegative low rank tensor $\mathcal{X}$ can be expressed by folding $U_i \Sigma_i V_i^T$ back along each mode, and thus has multilinear rank $(r_1, r_2, \ldots, r_N)$.
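In code, folding back the truncated SVD of one unfolding looks as follows (a sketch; the tensor size, mode, and rank are illustrative):

```python
import numpy as np

def project_mode_rank(X, i, r):
    """Best rank-r approximation of the i-th unfolding of X
    (truncated SVD), folded back into a tensor of X's shape."""
    shape = X.shape
    M = np.moveaxis(X, i, 0).reshape(shape[i], -1)  # i-th unfolding
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    M_r = (U[:, :r] * s[:r]) @ Vt[:r, :]            # U_i Sigma_i V_i^T
    rest = [sz for j, sz in enumerate(shape) if j != i]
    return np.moveaxis(M_r.reshape([shape[i]] + rest), 0, i)

X = np.random.default_rng(0).random((4, 5, 6))
Y = project_mode_rank(X, 0, 2)
```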
2.3 The Proposed Algorithm
We note from (7) that there are $N$ manifolds of low rank matrices, as well as the nonnegativity constraint, to be considered in the optimization. In this subsection, we propose to solve (7) by using the averaged projection method, which replaces alternating projections over many manifolds and the nonnegativity constraint. Let
$$\mathcal{M}_i = \{ \mathcal{X} \in \mathbb{R}^{n_1 \times \cdots \times n_N} : \mathrm{rank}(X_{(i)}) = r_i \}$$
denote the set of tensors whose $i$-mode unfolding matrices have fixed rank $r_i$, and
$$\mathcal{M}_+ = \{ \mathcal{X} \in \mathbb{R}^{n_1 \times \cdots \times n_N} : \mathcal{X} \ge 0 \}$$
denote the set of nonnegative tensors.
Denote by “fold$_i$” the operator that folds a matrix into a tensor along the $i$-th mode. Then, by the Eckart-Young-Mirsky theorem , the projection $\mathcal{P}_i$ that projects a given tensor $\mathcal{X}$ onto $\mathcal{M}_i$ can be expressed as follows:
$$\mathcal{P}_i(\mathcal{X}) = \mathrm{fold}_i\Big( \sum_{s=1}^{r_i} \sigma_s u_s v_s^T \Big), \qquad (10)$$
where $X_{(i)}$ is the $i$-mode unfolding matrix of $\mathcal{X}$, $\sigma_s$ is the $s$-th singular value of $X_{(i)}$, and $u_s$ and $v_s$ are the corresponding left and right singular vectors. The projection $\mathcal{P}_+$ that projects a given tensor onto the set of nonnegative tensors can be expressed as follows:
$$\mathcal{P}_+(\mathcal{X}) = \max(\mathcal{X}, 0), \qquad (11)$$
where the maximum is taken entrywise.
In our proposed algorithm, we consider the following two sets of tensors in the product space $(\mathbb{R}^{n_1 \times \cdots \times n_N})^N$ ($N$ times):
$$\Omega_1 = \{ (\mathcal{X}, \mathcal{X}, \ldots, \mathcal{X}) : \mathcal{X} \in \mathcal{M}_+ \}, \qquad (12)$$
$$\Omega_2 = \mathcal{M}_1 \times \mathcal{M}_2 \times \cdots \times \mathcal{M}_N. \qquad (13)$$
Note that $\mathcal{M}_+$ is a convex set and the diagonal set is an affine manifold. As the $\mathcal{M}_i$ are manifolds (Example 2 in ), $\Omega_2$ is a product of manifolds. Now we define the two projections onto $\Omega_1$ and $\Omega_2$ in the alternating projections algorithm. The first projection is given by
$$\mathcal{P}_{\Omega_1}(\mathcal{X}_1, \ldots, \mathcal{X}_N) = \big( \mathcal{P}_+(\bar{\mathcal{X}}), \ldots, \mathcal{P}_+(\bar{\mathcal{X}}) \big), \quad \bar{\mathcal{X}} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{X}_i,$$
where $\mathcal{P}_+$ is defined in (11). The second projection is given by
$$\mathcal{P}_{\Omega_2}(\mathcal{X}_1, \ldots, \mathcal{X}_N) = \big( \mathcal{P}_1(\mathcal{X}_1), \ldots, \mathcal{P}_N(\mathcal{X}_N) \big),$$
where $\mathcal{P}_i$ ($i = 1, \ldots, N$) are defined in (10). The proposed algorithm is summarized in Algorithm 1. With an input nonnegative tensor $\mathcal{A}$, the algorithm computes the two projections alternately until convergence. Note that the dominant computational cost of Algorithm 1 per iteration is the computation of the SVDs of the unfolding matrices of sizes $n_i$-by-$\prod_{j \ne i} n_j$, $i = 1, \ldots, N$.
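A compact sketch of how Algorithm 1 might be realized under this reading (per-mode truncated SVDs, then averaging and clipping; the tensor size, ranks, and iteration count are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def nlrt(A, ranks, n_iter=50):
    """Alternate between (i) projecting each copy onto its fixed-rank
    manifold via truncated SVD of the corresponding unfolding and
    (ii) averaging the copies and clipping to be nonnegative."""
    N = A.ndim
    X = [A.copy() for _ in range(N)]
    for _ in range(n_iter):
        for i in range(N):                       # rank projections
            M = np.moveaxis(X[i], i, 0).reshape(A.shape[i], -1)
            U, s, Vt = np.linalg.svd(M, full_matrices=False)
            r = ranks[i]
            M_r = (U[:, :r] * s[:r]) @ Vt[:r, :]
            rest = [sz for j, sz in enumerate(A.shape) if j != i]
            X[i] = np.moveaxis(M_r.reshape([A.shape[i]] + rest), 0, i)
        avg = np.maximum(np.mean(X, axis=0), 0)  # average + nonnegativity
        X = [avg.copy() for _ in range(N)]
    return avg

A = np.random.default_rng(0).random((6, 7, 8))
X_hat = nlrt(A, ranks=(2, 2, 2))
```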
3 The Convergence Analysis
The framework of this algorithm is the same as in the convex case of finding a point in the intersection of several closed sets, although the projection sets here are two product manifolds. In , Lewis and Malick proved that a sequence of alternating projections converges locally linearly if the two projected sets are -manifolds intersecting transversally. Lewis et al.  proved local linear convergence when the two projected sets intersect nontangentially in the sense of linear regularity and one of the sets is super-regular. Later, Bauschke et al. [55, 56] investigated the case of nontangential intersection further and proved linear convergence under weaker regularity and transversality hypotheses. In , Noll and Rondepierre generalized the existing results by studying the intersection condition of the two projected sets. They established local convergence of alternating projections between subanalytic sets under a mild regularity hypothesis on one of the sets. Here we analyze the convergence of the alternating projections algorithm by using the results in .
We remark that the sets given in (12) and (13) are smooth manifolds which are not closed. The convergence therefore cannot be derived directly by applying the convergence results of alternating projections between two closed subanalytic sets. Using results from variational analysis and differential geometry, the main convergence results are shown in the following theorem.
Let  and  be two sets of points in a Hilbert space equipped with the inner product  and the norm . Let . The set  is -Hölder regular with respect to  at  if there exist a neighborhood  of  and a constant  such that for every  and every , one has
where . Note that  is the projection of  onto  and  is the projection of  onto , with respect to the norm. We say that  is Hölder regular with respect to  if it is -Hölder regular with respect to  for every .
Let  and  be two sets of points in a Hilbert space equipped with the inner product  and the norm . We say  intersects  separably at  with exponent  and constant  if there exists a neighborhood  of  such that for every building block in , the condition
holds, i.e., it is equivalent to
where is a projection point of onto , is a projection point of onto , and is the angle between and .
This separable intersection is a new geometric concept which generalizes the transversal intersection , the linearly regular intersection , and the intrinsic transversal intersection . It has been shown that each of these three kinds of intersection implies the separable intersection.
Proof of Theorem 1. It is clear that finding a point in is equivalent to finding a point in the intersection of and .
The first task is to show that intersects separably at with exponent . Define as
It follows from the definition of  that  and  is a critical point of .
Recall that  and  are two manifolds. Then  is locally Lipschitz continuous, i.e., for each , there is an  such that  is Lipschitz continuous on the open ball centered at  with radius . Assume that  is a local smooth chart of  around  with bounded . Therefore,  is bounded by the fact that  is locally Lipschitz continuous. According to the definition of semi-algebraic functions , we can deduce that  is also semi-algebraic. Then the Kurdyka-Łojasiewicz inequality  for  holds for . It implies that there exist  and a concave function  such that
for all with , we have
Moreover,  is analytic on , thus  is continuous on , where  is the differential operator. For every compact subset of , there exists , where  denotes the operator norm. Suppose that  is an open set containing  in  such that  is compact (the overline denotes the closure). Then, for every  with , we have
where  is the Fréchet subdifferential of . We see that the Kurdyka-Łojasiewicz inequality is satisfied for  given in (17).
Here we construct a function which satisfies (i)-(iv). Because , (18) becomes
Since , there always exists a neighborhood of such that , i.e.,
for all and every .
In Algorithm 1, we construct the following sequences according to Definition 2:
Here  is the projection  and  is the projection . Suppose  and  are in ; then we get the proximal normal cone to  at :
According to the definition of Fréchet subdifferential, if and only if for every of the form .
for every . It follows that
Let the angle be the angle between the iterations, which can be defined as the angle between and .
Let us consider two cases.
(i) When ,
Substitute it into (20), then
Note that , we have
As the numerator tends to , the denominator must go to zero, which implies that , i.e., . Therefore,  intersects  separably with exponent , and the corresponding constant can be set to .
On the other hand, that  intersects  separably can be proved by a similar argument.
It follows from Theorem 1 and Corollary 4 in  that there exists a neighborhood of such that every sequence of alternating projections that enters converges to . The convergence rate is and with . The result follows.
In the next section, we test our method and the nonnegative tensor decomposition methods on synthetic and real-world data, and show that the performance of the proposed alternating projections method is better than that of the other methods.
4 Experimental Results
The state-of-the-art methods for the nonnegative tensor decompositions reviewed in Section 2.1 and compared in our experiments are as follows.
Nonnegative CP decomposition (NCPD). For the CP scheme approximation, we will use the hierarchical ALS algorithm (referred to as “NCPD-HALS”) proposed in [26, 36], a fixed point (FP) algorithm (referred to as “NCPD-FP”) in , and a block coordinate descent (BCD) method (referred to as “NCPD-BCD”) from .
Nonnegative Tucker decomposition (NTD). For the Tucker scheme approximation, we apply the HALS algorithm (referred to as “NTD-HALS”), the accelerated proximal gradient algorithm (referred to as “NTD-APG”), and the plain multiple updating algorithm (referred to as “NTD-MU”) in  and the block coordinate descent (BCD) method (referred to as “NTD-BCD”) from .
In the following, we list the computational costs of these methods. The cost per iteration of the proposed NLRT method is about the same as that of the NTD-type methods. As the NCPD-type methods involve the calculation of nonnegative vectors only, their cost per iteration is smaller than that of the proposed NLRT method.
| Method | Complexity | Details of the most expensive computations |
| --- | --- | --- |
| NCPD-BCD |  | Khatri-Rao product and unfolding matrices times Khatri-Rao product. |
| NCPD-HALS |  | Khatri-Rao product and unfolding matrices times Khatri-Rao product. |
| NCPD-FP |  | Khatri-Rao product and unfolding matrices times Khatri-Rao product. |
| NTD-BCD |  | The tensor-matrix multiplication and the matrix multiplication between the -th unfolding matrix of  and its transpose. |
| NTD-APG |  | The tensor-matrix multiplications among a) the -th factor matrix, b) the transpose of the -th unfolding matrix of , and c) the -th unfolding matrix of . |
| NTD-HALS |  | HALS on the unfolding matrices. |
| NTD-MU |  | MU on the unfolding matrices. |
| NLRT |  | SVDs of the unfolding matrices. |
The stopping criterion for the proposed method and the comparison methods is that the relative difference between successive iterates is smaller than . All experiments are conducted on an Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz with 32GB of RAM using Matlab.
4.1 Synthetic Datasets
We first test different methods on synthetic datasets. We generate two kinds of synthetic data as follows:
Test 1 (Low rank nonnegative tensor): We generate low rank nonnegative tensors in two steps. First, a core tensor of size  (i.e., the multilinear rank is ) and factor matrices of sizes  are generated with entries uniformly distributed in . Second, these factor matrices are multiplied with the core tensor via the tensor-matrix product, and the resulting low rank nonnegative tensors of size  are normalized to . Finally, we add truncated Gaussian noise (i.e., negative noisy values are set to 0) to the generated tensor with different signal-to-noise ratios (SNR). Given the multilinear rank (), the CP rank cannot be larger than . Therefore, we set the CP rank in the NCPD methods to one of three possible values: . In the results, we report the best relative approximation error among these choices for the NCPD methods.
Test 2 (Nonnegative tensor): We randomly generate nonnegative tensors of size  whose entries follow a uniform distribution between 0 and 1. The low rank minimizer is unknown in this setting. In the CP decomposition methods, the CP rank is set to . In the Tucker decomposition methods, the multilinear rank is set to .
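The construction of Test 1 can be sketched as follows; the tensor size, multilinear rank, and the exact form of the SNR-scaled truncated noise are our assumptions, since the specific values are not recoverable from the text:

```python
import numpy as np

def make_test1_tensor(shape=(20, 20, 20), ranks=(3, 3, 3), snr_db=30, seed=0):
    """Nonnegative Tucker-form tensor (core times nonnegative factors),
    normalized, plus Gaussian noise with negative values clipped to 0."""
    rng = np.random.default_rng(seed)
    X = rng.random(ranks)                      # nonnegative core
    for i, (n, r) in enumerate(zip(shape, ranks)):
        U = rng.random((n, r))                 # nonnegative factor
        # i-mode product: contract mode i of X with the columns of U
        X = np.moveaxis(np.tensordot(U, X, axes=(1, i)), 0, i)
    X /= X.max()                               # normalize to [0, 1]
    sigma = np.sqrt(np.mean(X ** 2) / 10 ** (snr_db / 10))
    noisy = np.maximum(X + sigma * rng.standard_normal(shape), 0)
    return noisy, X

noisy, clean = make_test1_tensor()
```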
We report the relative approximation error, which is defined as
to measure the approximation quality. The ground truth tensor is the generated tensor without noise. The relative approximation errors of the results by the different methods in Test 1 are reported in Table 2. The reported entries for all the comparison methods are averages over ten different initial guesses of the CP decomposition vectors and Tucker decomposition matrices. In contrast, the results of the proposed NLRT method are deterministic once the given nonnegative tensor is fixed. We see from Table 2 that the proposed NLRT method achieves the best performance and is also quite robust to different SNRs. The relative approximation errors in Test 2 are plotted in Fig. 1 with respect to different values of the rank . Again, the reported entries are averages over ten different initial guesses for the comparison methods. From Fig. 1, we can see that the proposed NLRT method and NTD-BCD perform better than the other methods. For tensors of size , the superiority of our method over NTD-BCD is evident when the rank is between 27 and 39.
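The relative approximation error used throughout this section is straightforward to compute:

```python
import numpy as np

def relative_error(X_approx, X_true):
    """||X_approx - X_true||_F / ||X_true||_F, the quality measure
    reported in the tables and figures."""
    return np.linalg.norm(X_approx - X_true) / np.linalg.norm(X_true)

X = np.ones((4, 4, 4))
```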
[Table 2: relative approximation errors in Test 1, grouped by tensor size and multilinear rank.]
4.2 Video Data
In this subsection, we select four videos (available at http://trace.eas.asu.edu/yuv/ and https://sites.google.com/site/jamiezeminzhang/publications) to test our method on the approximation task. Three videos (named “foreman”, “coastguard”, and “news”) are of size  (height × width × frame) and one (named “basketball”) is of size . First, we set the multilinear rank to  and the CP rank to . We test our method on approximating these four videos with the rank parameter varying from 5 to 100. Moreover, we add Gaussian noise to the video “coastguard” with different noise levels (SNR = 20, 30, 40, 50), and test the approximation ability on the noisy data. We plot the relative approximation errors with respect to the rank on the four videos in Fig. 2. It can be seen that the approximation errors of the results by our method are the lowest. Fig. 3 shows the relative approximation errors on the noisy video “coastguard”. Similarly, our method achieves the lowest approximation errors.
[Fig. 3 panels: SNR = 20, SNR = 30, SNR = 40, SNR = 50.]
4.3 Hyperspectral Data
In this subsection, we test the different methods on hyperspectral data. We consider four hyperspectral images (HSIs): a subimage of the Pavia City Center dataset (available at http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes) of size  (height × width × band), a subimage of the Washington DC Mall dataset (available at https://engineering.purdue.edu/~biehl/MultiSpec/hyperspectral.html) of size , the RemoteImage (available at https://www.cs.rochester.edu/~jliu/code/TensorCompletion.zip) of size , and a subimage of the Cuprite dataset (available at https://aviris.jpl.nasa.gov/data/free_data.html) of size .
Figs. 4 and 5 report the relative approximation errors with respect to different values of the rank . It is evident that the relative approximation errors of our NLRT are the lowest among all the methods. It is interesting to note that the gap between our method and NTD-BCD (the second best comparison method) is more significant than that on the synthetic data. In Fig. 6, we display the pseudo-color images of the results on the Washington DC Mall dataset with multilinear rank (100,100,100) and CP rank 100. The pseudo-color image is composed of the 145-th, 127-th, and 92-nd bands as the red, green, and blue channels, respectively. In Fig. 6, we also report two image quality assessments: the mean peak signal-to-noise ratio over all spectral bands (MPSNR, see https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio) and the mean structural similarity index over all spectral bands (MSSIM) . It can be seen from Fig. 6 that both the visual quality and the quality assessments of the proposed NLRT method are better than those of the other comparison methods.
[Figs. 4 and 5 panels: Pavia City Center; Washington DC Mall.]
[Fig. 6 quality indices (MPSNR / MSSIM) for the compared methods: 29.16 / 0.84, 29.15 / 0.84, 28.37 / 0.81, 29.74 / 0.85, 28.25 / 0.80, 27.25 / 0.76, 28.77 / 0.82, 34.56 / 0.94, and Inf / 1 for the original image.]
4.4 Selection of Features
One advantage of the proposed NLRT method is that it provides a significance index based on the singular values of the unfolding matrices , which can be used to identify the important singular basis vectors in the approximation. Here we take the HSI Washington DC Mall as an example. We compute the low rank approximations of the proposed NLRT method and the other comparison methods with multilinear rank  and CP rank  for . For the approximation results of the NCPD methods, we normalize the basis vectors in (2) such that the norms of ,  and  are equal to 1, and rearrange the resulting values in descending order in the CP decomposition. In Fig. 7, we plot  with respect to , where . Similarly, for the results of the NTD methods, we also plot  with respect to , where , and  indicates a vector composed of the indices corresponding to the largest norms of the columns of . For the results of our method, we plot  with respect to , where ,  is the -th singular value of , and  is the third-mode unfolding matrix of . The third mode is chosen for NTD and our NLRT because we are interested in observing how many indices are required in the spectral mode of the given hyperspectral data.
In Fig. 7, we can see that when the number of components (namely ) increases, the relative residual decreases. Our NLRT provides a significance index based on singular values to identify the important singular basis vectors for the approximation. Thus, the relative residuals of the proposed NLRT algorithm are significantly smaller than those of the tested NTD and NCPD algorithms. Similar phenomena can be observed in Fig. 8, in which  and  are computed using the number of indices in the first or second modes of .
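The precise index plotted in Figs. 7 and 8 is not fully recoverable from this text; one natural quantity consistent with the description, sketched here as an assumption, is the tail-energy ratio of the singular values of a chosen unfolding:

```python
import numpy as np

def singular_value_residuals(X, mode=2):
    """res[k] = sqrt(sum_{j >= k} sigma_j^2) / ||X||_F for the
    singular values sigma_j of the mode-`mode` unfolding of X;
    it decreases as more singular basis vectors are kept."""
    M = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
    s = np.linalg.svd(M, compute_uv=False)            # descending order
    tails = np.sqrt(np.cumsum((s ** 2)[::-1])[::-1])  # tail energies
    return tails / tails[0]

X = np.random.default_rng(0).random((5, 6, 7))
res = singular_value_residuals(X)
```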
[Fig. 9 legend (land cover classes) includes: no till, min till, pasture, trees, pasture-mowed, windrowed, clean, Trees-Drives, Steel-Towers.]
4.5 Image Classification
An advantage of the proposed NLRT method is that the important singular basis vectors can be identified. Such basis vectors can provide useful information for image recognition tasks such as classification. Here we conduct hyperspectral image classification experiments on the Indian Pines dataset (data available at https://engineering.purdue.edu/biehl/MultiSpec/hyperspectral.html). This dataset was captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines test site in northwestern Indiana in June 1992. After removing 20 bands that cover the water absorption region, this HSI is of size . The ground truth contains 16 land cover classes, as shown in Fig. 9. Therefore, we set the multilinear rank to  and the CP rank to 16 for all the tested methods. We randomly choose  of the available labelled samples, as exhibited in Table 3. The chosen labelled samples from each class are used for training and the remaining samples are used for testing.
After obtaining the low rank approximations, the 16 singular vectors corresponding to the largest 16 singular values of the unfolding matrix of the tensor approximation along the spectral mode (the third mode) are employed for classification. We apply the -nearest neighbour (-NN, =