Multimodal and heterogeneous data are increasingly generated in a wide range of fields. Tensors, as a natural representation of high-dimensional data, have attracted growing attention in many real-world applications. In contrast to traditional methods that transform multimodal and heterogeneous data into vectors, tensor representations naturally preserve the spatial coherency and consistency of high-dimensional data and reduce the number of parameters in learning phan2010tensor. Tensor decomposition methods are currently used in machine learning, signal processing, compressed sensing, and related areas. Many studies have shown that high-dimensional data, such as images and video sequences, actually reside on an underlying nonlinear geometric structure cai2007learning; cruceru2020computationally; li2016mr. However, tensor decomposition based models for dimensionality reduction implicitly assume that the underlying structure is global and multilinear; as a result, they usually fail to capture local and nonlinear structures.
To learn the nonlinear geometry of high-dimensional data, manifold learning offers an alternative approach that maintains topological structures during dimensionality reduction. It is based on the assumption that high-dimensional data reside on a low-dimensional manifold. Since isometric mapping (ISOMAP), one of the first manifold learning methods, was proposed by tenenbaum290global in 2000, manifold learning has been extensively studied in many application fields, and a great number of algorithms have been proposed. For instance, roweis2000nonlinear introduced locally linear embedding (LLE), which maintains the local linearity of nearest neighbors by minimizing the reconstruction error in a low-dimensional space. LLE assumes that the local neighborhood of a point on the manifold can be well approximated by the affine subspace spanned by the neighbors of the point, and finds a low-dimensional embedding of the data based on these affine approximations. However, LLE is very sensitive to the choice of the number of nearest neighbors. Instead of local linearity, Laplacian Eigenmaps (LE) preserves the similarity between nearest neighbors by constructing the similarity matrix of a graph Laplacian belkin2003laplacian. LLE is a locally linear and globally nonlinear algorithm, whereas LE is nonlinear both globally and locally. However, most of the aforementioned algorithms do not take into account the tangent space projection of manifold geometry. Local tangent space alignment (LTSA) is closely related to the concept of manifold geometry zhang2004principal: it projects each data point onto the tangent space of its local neighborhood and then aligns these local coordinates in a low-dimensional space. Later, its variant of locally nonlinear alignment zhang2008patch, as well as many other methods such as manifold regularization non-negative tensor decomposition li2016mr, gained great attention in feature extraction and clustering. Preserving non-negativity enhances the physical interpretability of the non-negative attributes of the original data.
In this paper, we incorporate a hypergraph into non-negative tensor factorization to preserve the high-order correlations and non-negative attributes in tensor decomposition. The proposed method is called Hypergraph Regularized Nonnegative Tensor Factorization (HyperNTF). The main contributions of this paper are as follows.
HyperNTF incorporates a hypergraph into nonnegative tensor factorization (NTF) and uses the factor matrix of the last mode as the low-dimensional representation. In this way, it largely reduces the storage space and computational complexity.
The hypergraph can effectively unfold smooth curved manifolds from a 3-dimensional space to a lower-dimensional one. Our experimental results on multiple synthetic datasets (e.g. Punctured Sphere, Gaussian surface, Twin Peaks, and Toroidal Helix) demonstrate that the hypergraph uncovers the local neighborhoods of nonlinear geometry in dimensionality reduction.
HyperNTF achieves state-of-the-art (SOTA) performance in clustering analysis. The numerical results on multiple real datasets (e.g. COIL20, ETH80-1, Faces94 male, MNIST, USPS, Olivetti) suggest that HyperNTF reliably achieves better performance regardless of the number of clusters.
The paper is organized as follows. Section 2 reviews related work on nonnegative tensor decomposition and hypergraphs. Section 3 presents the notations and basic operations, including tensor decomposition and hypergraphs. Section 4 develops the HyperNTF model and the solution of its cost function. Section 5 reports numerical comparisons with state-of-the-art algorithms on various synthetic and real-world benchmarks, which suggest a superior performance of HyperNTF. Section 6 summarizes our work and discusses directions for future work.
2 Related Work
In this section, we briefly review methods for non-negative tensor decomposition or factorization, and introduce the hypergraph for the purpose of dimensionality reduction.
Traditional methods usually vectorize high-dimensional data before optimizing the learning objective. They run a high risk of breaking the natural structures and correlations in the data, and fail to encode the higher-order nonlinearity in local neighborhoods. To address this issue, a great number of algorithms that represent data as tensors, rather than as vectors or matrices, have been proposed. A canonical method is the higher order singular value decomposition (HOSVD), which extends the matrix SVD to higher-order arrays and is widely used in subspace learning de2000best. One application of HOSVD is the clustering of handwritten digits savas2007handwritten. However, since the orthogonality constraint of HOSVD does not guarantee a unique solution, it is often necessary to impose further constraints, such as sparsity, smoothness, or low rank, to meet physical interpretations and to exploit prior knowledge.
A popular instance of constrained decomposition is nonnegative tensor decomposition cichocki2007nonnegative, which aims to handle nonnegative data such as spectra, probabilities and energies. The family of nonnegative tensor decompositions includes nonnegative tensor factorization (NTF) cichocki2007nonnegative, nonnegative Tucker decomposition (NTD) kim2007nonnegative, manifold regularization NTD (MR-NTD) li2016mr, and other variants.
There is an increasing trend to take the heterogeneity and multimodality of data into account when constructing the learning objective. Following this idea, sun2015heterogeneous proposed heterogeneous tensor decomposition (HTD-Multinomial). HTD-Multinomial exploits the constraint space of the learning objective to convert the constrained optimization problem into an unconstrained one, which can then be solved by Riemannian manifold optimization using a second-order trust-region method. However, it has two major limitations: 1) in cluster analysis, HTD-Multinomial requires the number of clusters to be known before the learning objective is optimized; 2) it suffers from tiny update steps, and consequently slow convergence, when the learning objective is sparse and singular.
Motivated by HTD-Multinomial, low-rank regularized heterogeneous tensor decomposition (LRRHTD) was recently proposed for subspace clustering zhang2017low. It assumes that the factor matrix of the last mode is low-rank and that the projection matrices of the other modes are orthogonal. Although it is suitable for high-dimensional data with a low-rank property, its low-dimensional representation is a square matrix whose size equals the sample size, resulting in particularly high storage and computation costs.
Like the aforementioned methods, most existing approaches do not take higher-order geometry into account in tensor decomposition. The hypergraph, as a natural extension of the 2-order graph to high-order relations, has great potential to uncover local geometry by learning local group information in tensor decomposition hu2014eigenvectors; hu2015laplacian. Many studies have shown that, compared to the traditional 2-order graph, the hypergraph provides better performance in both clustering zhou2007learning and classification sun2008hypergraph, especially for data with complex relations, such as gene expression tian2009hypergraph, social networks yang2017hypergraph, and neuronal networks feng2019hypergraph. These findings motivate us to incorporate the hypergraph into tensor factorization in this paper.
3 Notations and Basic Operations
This section reviews the basic notations and operations used in tensor decomposition. The terminology used in this paper is kept as consistent as possible with the previous literature cohen2016environmental; de2000multilinear; zafeiriou2010nonnegative; zhang2008patch.
A tensor is regarded as a multi-index numerical array. The order of a tensor is the number of its dimensions or modes, which may include space, time, frequency, subject and class. In this paper, matrices are denoted by bold capital letters, e.g. $\mathbf{X}$, while tensors with more than two dimensions are denoted by calligraphic letters, e.g., $\mathcal{X}$. The elementwise division of two same-sized matrices $\mathbf{A}$ and $\mathbf{B}$ is denoted by $\mathbf{A} \oslash \mathbf{B}$, and their elementwise product, or Hadamard product, by $\mathbf{A} \circledast \mathbf{B}$. The transpose of a matrix $\mathbf{A}$ is denoted by $\mathbf{A}^{\top}$. If the entries of a tensor are arranged as a matrix, the result is termed a mode unfolding of the tensor cichocki2015tensor. The inner product of two same-sized tensors $\mathcal{X}$ and $\mathcal{Y}$ is the sum of the products of their corresponding entries, denoted by $\langle \mathcal{X}, \mathcal{Y} \rangle$. The norm of a tensor $\mathcal{X}$ is defined by $\|\mathcal{X}\| = \sqrt{\langle \mathcal{X}, \mathcal{X} \rangle}$. Table 1 lists the fundamental symbols defined in this paper.
|tensor, matrix, vector, scalar|
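As a concrete illustration of the operations described in this section, the following NumPy sketch computes a mode unfolding, the inner product, the norm, and the two elementwise matrix operations. The helper names (`unfold`, `inner`, `norm`) are illustrative and not taken from the paper, and the column ordering of the unfolding follows NumPy's row-major layout, which may differ from the ordering used in cichocki2015tensor.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode unfolding: axis `mode` indexes the rows, the remaining axes the columns."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def inner(a, b):
    """Inner product: sum of the products of corresponding entries."""
    return float(np.sum(a * b))

def norm(a):
    """Tensor norm induced by the inner product."""
    return float(np.sqrt(inner(a, a)))

X = np.arange(24, dtype=float).reshape(2, 3, 4)   # a small 3-order tensor
Y = np.ones_like(X)

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[2.0, 4.0], [6.0, 8.0]])
hadamard = A * B      # elementwise (Hadamard) product
quotient = A / B      # elementwise division

print(unfold(X, 1).shape)   # (3, 8)
print(inner(X, Y))          # 276.0
```

The later sketches in this paper reuse the same `unfold` convention so that the unfolded identities line up with it.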
3.1 Tensor Decomposition
Dimensionality reduction refers to transforming high-dimensional data into an equivalent low-dimensional representation while maximally maintaining the underlying structures, such as non-negativity, topological structure, and other attributes. Tensor decomposition is widely used in dimensionality reduction. The two most prominent approaches are the Tucker and Canonical Polyadic (CP) decompositions.
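For a given tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, Tucker decomposition can be expressed as a core tensor multiplied by a series of projection matrices:
$$\mathcal{X} = \mathcal{G} \times_1 \mathbf{U}^{(1)} \times_2 \mathbf{U}^{(2)} \cdots \times_N \mathbf{U}^{(N)}, \tag{1}$$
where $\mathcal{G} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_N}$ is the so-called core tensor, and $\mathbf{U}^{(n)} \in \mathbb{R}^{I_n \times J_n}$, $n = 1, \dots, N$, are the projection matrices.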
Note that the generic Tucker decomposition is not unique, and the projection matrices are not restricted to be column-wise orthogonal. To obtain a unique decomposition, it is necessary to impose additional constraints on the projection matrices, such as orthogonality, smoothness, sparsity, or low rank.
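Tucker decomposition can also be expressed equivalently in terms of the mode-$n$ unfolding, using the Kronecker products of the projection matrices $\mathbf{U}^{(n)}$,
$$\mathbf{X}_{(n)} = \mathbf{U}^{(n)} \mathbf{G}_{(n)} \left( \mathbf{U}^{(N)} \otimes \cdots \otimes \mathbf{U}^{(n+1)} \otimes \mathbf{U}^{(n-1)} \otimes \cdots \otimes \mathbf{U}^{(1)} \right)^{\top}, \tag{2}$$
where $\mathbf{X}_{(n)}$ and $\mathbf{G}_{(n)}$ denote the mode-$n$ unfoldings of the data tensor and of the core tensor, respectively.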
When all the projection matrices are constrained to be orthonormal, i.e., $\mathbf{U}^{(n)\top} \mathbf{U}^{(n)} = \mathbf{I}$, $n = 1, \dots, N$, the Tucker decomposition can be computed by the well-known higher order orthogonal iteration (HOOI) de2000best. More restrictively, if we constrain the projection matrices to be column-wise orthogonal and the core tensor to be all-orthogonal as well, Tucker decomposition becomes the higher order singular value decomposition (HOSVD) de2000multilinear.
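To make the Tucker model and its orthogonal variant concrete, the sketch below rebuilds a tensor from a core and projection matrices, checks the unfolded Kronecker form numerically, and computes a truncated HOSVD. It is a minimal NumPy illustration rather than the implementation used in the experiments; the function names are hypothetical. The Kronecker factors are taken in increasing mode order to match NumPy's row-major unfolding; other unfolding conventions reverse this order.

```python
import numpy as np
from functools import reduce

def unfold(tensor, mode):
    """Mode unfolding: axis `mode` indexes the rows (row-major column order)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` (shape J x I_mode) along axis `mode`."""
    out = np.tensordot(matrix, tensor, axes=(1, mode))  # new axis comes first
    return np.moveaxis(out, 0, mode)

def tucker_to_tensor(core, factors):
    """Core tensor multiplied by one projection matrix per mode."""
    out = core
    for mode, U in enumerate(factors):
        out = mode_product(out, U, mode)
    return out

def hosvd(tensor, ranks):
    """Truncated HOSVD: leading left singular vectors of each unfolding, then project."""
    factors = []
    for mode, rank in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(U[:, :rank])
    core = tensor
    for mode, U in enumerate(factors):
        core = mode_product(core, U.T, mode)
    return core, factors

rng = np.random.default_rng(0)
G = rng.random((2, 3, 2))                                   # core tensor
Us = [rng.random((4, 2)), rng.random((5, 3)), rng.random((6, 2))]
X = tucker_to_tensor(G, Us)                                 # shape (4, 5, 6)

# Unfolded form along mode 1 (0-based indexing).
mode = 1
kron = reduce(np.kron, [U for m, U in enumerate(Us) if m != mode])
X_unf = Us[mode] @ unfold(G, mode) @ kron.T

# HOSVD with the true multilinear ranks reproduces X up to rounding error.
core, factors = hosvd(X, ranks=(2, 3, 2))
X_hat = tucker_to_tensor(core, factors)
```

Because the example tensor is generated with multilinear rank (2, 3, 2), truncating the HOSVD at those ranks loses nothing; for real data the truncation is an approximation.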
An important special case of Tucker decomposition is the CP factorization, in which the nonzero entries of the core tensor lie only on the super-diagonal, as follows
If is a 3-order super-diagonal tensor, then
where is actually an identity diagonal tensor of size that contains the nonzero elements of unit one, and is an identity matrix. The operator means to transform a matrix into a tensor along the mode. Then, the equivalent formulation of tensor mode unfolding can be formulated as follows
wherein the shorthand notation is to simplify the presentation, standing for a series of Khatri-Rao products in all modes except mode.
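The relation between CP, the super-diagonal Tucker core, and the Khatri-Rao form of the unfolding can be checked numerically. The sketch below is a small NumPy illustration for a 3-order tensor; `khatri_rao`, `cp_to_tensor` and `cp_unfolding` are hypothetical helper names, and the Khatri-Rao factors are ordered to match NumPy's row-major unfolding rather than any particular convention from the literature.

```python
import numpy as np
from functools import reduce

def unfold(tensor, mode):
    """Mode unfolding: axis `mode` indexes the rows (row-major column order)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x R) and B (J x R), shape (I*J) x R."""
    return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

def cp_to_tensor(factors):
    """Sum of R rank-one tensors built from the columns of three factor matrices."""
    A, B, C = factors
    return np.einsum('ir,jr,kr->ijk', A, B, C)

def cp_unfolding(factors, mode):
    """Unfolded CP model: one factor times the Khatri-Rao product of the others."""
    others = [F for m, F in enumerate(factors) if m != mode]
    return factors[mode] @ reduce(khatri_rao, others).T

rng = np.random.default_rng(1)
R = 3
factors = [rng.random((4, R)), rng.random((5, R)), rng.random((6, R))]
X = cp_to_tensor(factors)

# The same tensor written as a Tucker model with a super-diagonal core of ones.
core = np.zeros((R, R, R))
core[np.arange(R), np.arange(R), np.arange(R)] = 1.0
X_tucker = np.einsum('abc,ia,jb,kc->ijk', core, *factors)
```

Both constructions give the same tensor, and `cp_unfolding(factors, n)` matches `unfold(X, n)` for every mode `n`.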
Notably, CP and Tucker decompositions are used in different situations. Specifically, CP factorizes data into a sum of rank-one tensors whose components usually have an interpretable meaning, whereas Tucker decomposition compresses high-dimensional data into a smaller core tensor as a low-dimensional representation. Thus, CP is often used for factor analysis, whereas Tucker decomposition is usually used for subspace learning.
3.2 Hypergraph
The main distinction between a hypergraph and an ordinary 2-order graph is that an edge of a hypergraph (called a hyperedge) is a subset of the vertices and may connect more than two vertices, whereas an edge of a 2-order graph connects exactly two vertices tian2009hypergraph.
Let $G = (V, E, \mathbf{W})$ be a weighted hypergraph, where $V$ is the set of vertices, a finite set of objects, and $E$ is the set of hyperedges, each hyperedge $e \in E$ being a subset of $V$. The relationship between vertices and hyperedges can be expressed by an indicator matrix $\mathbf{H} \in \mathbb{R}^{|V| \times |E|}$, whose entries are defined as follows
$$h(v, e) = \begin{cases} 1, & \text{if } v \in e, \\ 0, & \text{otherwise}, \end{cases} \tag{6}$$
where a vertex $v$ and a hyperedge $e$ are called incident if $h(v, e) = 1$. Figure 1 is an example of a hypergraph.
The degree of a hyperedge $e$, namely $\delta(e)$, is the number of vertices incident with $e$,
$$\delta(e) = \sum_{v \in V} h(v, e). \tag{7}$$
The degree of each vertex $v$ is the sum of the weights $w(e)$ of the hyperedges incident with $v$,
$$d(v) = \sum_{e \in E} w(e)\, h(v, e). \tag{8}$$
The weight associated with each hyperedge is defined by
where is the 2-order combination of number ; the parameter is the scaling factor; and is the collection of edges. Let be the low-dimensional representation of the sample labels to be learnt. The cost function is defined to minimize the discrepancy between the sample labels of the same class, as follows
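where $\mathbf{D}_v$ and $\mathbf{D}_e$ are the diagonal matrices of vertex degrees and hyperedge degrees, respectively, and $\mathbf{S} = \mathbf{H} \mathbf{W} \mathbf{D}_e^{-1} \mathbf{H}^{\top}$, with $\mathbf{W}$ the diagonal matrix of hyperedge weights. Finally, $\mathbf{L}_{Hyper} = \mathbf{D}_v - \mathbf{S}$ is the so-called hypergraph Laplacian.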
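The hypergraph quantities of this section can be assembled in a few lines. The sketch below builds an indicator matrix in which every sample and its k nearest neighbours form one hyperedge, then computes the hyperedge degrees, the vertex degrees and an unnormalized hypergraph Laplacian. Several choices here are assumptions made only for illustration: the k-nearest-neighbour construction, the unit hyperedge weights (the weighting with a scaling factor described above is not reproduced), and the unnormalized Laplacian form.

```python
import numpy as np

def knn_hypergraph(points, k):
    """Indicator matrix H (vertices x hyperedges): hyperedge e = sample e plus its k neighbours."""
    n = len(points)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    H = np.zeros((n, n))
    for e in range(n):
        members = np.argsort(d2[e])[: k + 1]   # the sample itself has distance 0
        H[members, e] = 1.0
    return H

def hypergraph_laplacian(H, w):
    """Unnormalized Laplacian D_v - H W D_e^{-1} H^T for hyperedge weights w."""
    edge_degree = H.sum(axis=0)                # number of vertices in each hyperedge
    vertex_degree = H @ w                      # summed weights of incident hyperedges
    S = (H * (w / edge_degree)) @ H.T
    return np.diag(vertex_degree) - S

rng = np.random.default_rng(0)
points = rng.random((12, 3))                   # 12 samples in a 3-dimensional space
H = knn_hypergraph(points, k=3)
w = np.ones(H.shape[1])                        # assumption: unit weights
L = hypergraph_laplacian(H, w)

# Smoothness of an embedding Y on the hypergraph: small when neighbours stay close.
Y = rng.random((12, 2))
smoothness = np.trace(Y.T @ L @ Y)
```

With this construction the Laplacian is symmetric, its rows sum to zero, and it is positive semi-definite, so the smoothness value is never negative.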
4 Hypergraph Regularized Nonnegative Tensor Factorization (HyperNTF)
4.1 The cost function of HyperNTF
Here we propose a hypergraph regularized NTF model, called HyperNTF, to factorize nonnegative tensors. In general, CP factorization of sparse tensors is essential for dimensionality reduction of large-scale data. Specifically, for a given non-negative tensor data , , the corresponding CP factorization is approximated by an identity diagonal tensor multiplied by a chain of factor matrices on each mode. Thus, our goal is to use the hypergraph to learn the factor matrix of the last mode, i.e., , as the low-dimensional representation of the high-dimensional data. In other words, the hypergraph is incorporated into the general framework of non-negative tensor factorization to minimize the following cost function,
in which is a row vector of all ones; is an identity diagonal tensor; is the hypergraph that is used to characterize the local geometrical structure of the tensor data; and is a tradeoff parameter that helps prevent overfitting. In practice, the choice of is determined by the attributes of the specific dataset.
4.2 The learning algorithm for HyperNTF
To solve the learning objective in Eq.(11), we adopt the method of Multiplicative Updating Rules (MUR) lee2001algorithms; lee1999learning; li2016mr.
First, we reformulate Eq.(11) as follows
Then, by using matrix properties, the first term of Eq.(12) can be expanded as follows
We introduce the Lagrange multipliers , corresponding to the constraint of each factor matrix, i.e., , , and the cost function of Eq.(12) becomes the following
The relaxed problem above can be solved by alternately updating , . To this end, we have to derive the partial derivative of Eq.(14) with respect to . The first term of Eq.(14) is a constant that can be omitted. Then the second term of Eq.(14) can be computed as follows
Subsequently, the third term of Eq.(14) is computed as follows
Note that, .
Therefore, by using the matrix properties, i.e., and petersen2012matrix, we can obtain the partial derivative of Eq.(14) with respect to as follows
Moreover, by taking into account the KKT conditions, i.e., , we can derive the solution of Eq.(17) as follows
After a few steps of computation, the learning rule for the factor matrix is given by
For the calculation of Eq.(19), the Khatri-Rao product of the involved results in a matrix of size , and can get very costly in terms of computation and memory requirements when and are very large. To address this problem, we adopt a useful approach, named matricized tensor-times Khatri-Rao product (MTTKRP) kaya2017high, which computes the result through mode-wise tensor-vector multiplications,
Clearly, the solution of is computed column by column, which efficiently reduces the computational and storage cost: the computation amounts to multiplying the tensor by vectors times.
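The column-by-column idea can be sketched as follows. The explicit version forms the full Khatri-Rao product, whose number of rows is the product of all other mode sizes, whereas the column-wise version only ever contracts the tensor with one vector per mode. This is a generic NumPy illustration of MTTKRP, not the optimized scheme of kaya2017high; the function names are hypothetical.

```python
import numpy as np
from functools import reduce

def unfold(tensor, mode):
    """Mode unfolding: axis `mode` indexes the rows (row-major column order)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x R) and B (J x R)."""
    return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

def mttkrp_explicit(X, factors, mode):
    """Reference version: materializes the Khatri-Rao product of the other factors."""
    others = [F for m, F in enumerate(factors) if m != mode]
    return unfold(X, mode) @ reduce(khatri_rao, others)

def mttkrp_columnwise(X, factors, mode):
    """Column r is the tensor contracted with the r-th column of every other factor."""
    rank = factors[0].shape[1]
    out = np.empty((X.shape[mode], rank))
    for r in range(rank):
        partial = X
        # Contract from the highest mode down so lower axis numbers stay valid.
        for m in reversed(range(X.ndim)):
            if m != mode:
                partial = np.tensordot(partial, factors[m][:, r], axes=(m, 0))
        out[:, r] = partial
    return out

rng = np.random.default_rng(2)
X = rng.random((3, 4, 5))
factors = [rng.random((dim, 2)) for dim in X.shape]
M_explicit = mttkrp_explicit(X, factors, mode=0)
M_columnwise = mttkrp_columnwise(X, factors, mode=0)
```

Both functions return the same matrix; the column-wise variant trades one large matrix product for a sequence of tensor-vector contractions and avoids storing the Khatri-Rao product.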
Finally, we derive the partial derivative of Eq. (14) with respect to . Compared with the other factor matrices, only the regularization term of Eq. (14) additionally needs to be taken into account; thus we obtain the following equation
To update , the iterative formula also needs to be expressed as a tensor mode multiplication with a chain of vectors or, equivalently, in the following matrix form
All the derivations needed for the iterative updates of the factor matrices , and are now complete. Starting from an initialization of the factor matrices, Eq. (19) and Eq. (22) are used to update each variable until the termination criterion is met. After all matrices , and have been updated, convergence is checked at the end of each iteration, subject to a maximum number of iterations. The pseudo-code of HyperNTF is given in Algorithm 1.
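The overall alternating scheme can be illustrated with a compact NumPy sketch for a 3-order tensor. It should be read as an assumption-laden illustration rather than as Algorithm 1: the cost is taken to be 0.5*||X - CP||^2 + 0.5*alpha*tr(A^T (D - S) A) on the last factor A, the updates are the standard multiplicative rules for that cost (not copied from Eq. (19) and Eq. (22)), the stopping rule is a fixed number of iterations, and all names (`hyper_ntf_sketch`, `alpha`, `S`, `D`) are hypothetical.

```python
import numpy as np

def cp_to_tensor(factors):
    A, B, C = factors
    return np.einsum('ir,jr,kr->ijk', A, B, C)

def mttkrp(X, factors, mode):
    """Unfolded tensor times the Khatri-Rao product of the other two factors."""
    A, B, C = factors
    if mode == 0:
        return np.einsum('ijk,jr,kr->ir', X, B, C)
    if mode == 1:
        return np.einsum('ijk,ir,kr->jr', X, A, C)
    return np.einsum('ijk,ir,jr->kr', X, A, B)

def objective(X, factors, S, D, alpha):
    """Assumed cost: reconstruction error plus hypergraph smoothness of the last factor."""
    A_last = factors[-1]
    fit = 0.5 * np.sum((X - cp_to_tensor(factors)) ** 2)
    reg = 0.5 * alpha * np.trace(A_last.T @ (D - S) @ A_last)
    return fit + reg

def hyper_ntf_sketch(X, S, D, rank, alpha=0.1, n_iter=200, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) for dim in X.shape]
    history = [objective(X, factors, S, D, alpha)]
    last = X.ndim - 1
    for _ in range(n_iter):
        for mode in range(X.ndim):
            # Gram matrix of the Khatri-Rao product = Hadamard product of Gram matrices.
            gram = np.ones((rank, rank))
            for m, F in enumerate(factors):
                if m != mode:
                    gram *= F.T @ F
            numer = mttkrp(X, factors, mode)
            denom = factors[mode] @ gram
            if mode == last:   # the hypergraph term only acts on the sample mode
                numer = numer + alpha * (S @ factors[mode])
                denom = denom + alpha * (D @ factors[mode])
            factors[mode] = factors[mode] * numer / (denom + eps)
        history.append(objective(X, factors, S, D, alpha))
    return factors, history

# Toy data: a rank-3 nonnegative tensor with 8 samples in the last mode.
rng = np.random.default_rng(42)
X = cp_to_tensor([rng.random((6, 3)), rng.random((5, 3)), rng.random((8, 3))])

# Toy hypergraph on the 8 samples: 4 hyperedges of 3 vertices each, unit weights.
H = np.zeros((8, 4))
for e in range(4):
    H[[2 * e, 2 * e + 1, (2 * e + 2) % 8], e] = 1.0
w = np.ones(4)
S = (H * (w / H.sum(axis=0))) @ H.T
D = np.diag(H @ w)

factors, history = hyper_ntf_sketch(X, S, D, rank=3)
```

The rows of `factors[-1]` then serve as the low-dimensional representation of the samples. Because every update multiplies a nonnegative factor by a ratio of nonnegative terms, nonnegativity is preserved without an explicit projection step.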
5 Experiments
We conduct two types of tests, namely the manifold unfolding test and the clustering test, to validate the hypergraph for topological preservation and HyperNTF for clustering; each consists of a set of numerical experiments.
In the manifold unfolding test, we apply the hypergraph to unfold some common manifolds and then visualize the unfolded results. In the clustering test, we compare HyperNTF with six existing methods, including higher order singular value decomposition (HOSVD) de2000multilinear; savas2007handwritten, nonnegative Tucker decomposition (NTD) kim2007nonnegative, nonnegative tensor factorization (NTF) cichocki2007nonnegative, heterogeneous tensor decomposition (HTD-Multinomial) sun2015heterogeneous, low-rank regularized heterogeneous tensor decomposition (LRRHTD) zhang2017low, and graph-Laplacian Tucker decomposition (GLTD) jiang2018image.
All numerical experiments are conducted on a desktop with an Intel Core i5-5200U CPU at 2.20 GHz and 8.00 GB of RAM, and are repeated 10 times with randomly selected images each time.
5.1 Manifold unfolding test
The manifold unfolding test is based on simulated data qiao2012explicit. We first embed the simulated manifolds (i.e. Punctured Sphere, Gaussian surface, Twin Peaks, and Toroidal Helix) in a three-dimensional ambient space, using the Matlab demo (mani.m). On each manifold, 1000 data samples are randomly generated for training; the numbers of nearest neighbors are for Punctured Sphere, Gaussian surface, Twin Peaks, and Toroidal Helix, respectively. The polynomial degree is set to .
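For readers without access to the Matlab demo, two of the test manifolds can be sampled with a few lines of NumPy. The parametrizations below are common textbook forms of the Toroidal Helix and Twin Peaks surfaces; they are assumptions for illustration and are not claimed to reproduce the exact output or parameter settings of mani.m.

```python
import numpy as np

def toroidal_helix(n, turns=8, noise=0.0, seed=0):
    """Curve winding `turns` times around a torus; returns an (n, 3) array."""
    rng = np.random.default_rng(seed)
    t = 2 * np.pi * np.arange(1, n + 1) / n
    radius = 2 + np.cos(turns * t)
    pts = np.column_stack([radius * np.cos(t), radius * np.sin(t), np.sin(turns * t)])
    return pts + noise * rng.standard_normal(pts.shape)

def twin_peaks(n, height=1.0, seed=0):
    """Surface z = height * sin(pi * x) * tanh(3 * y) sampled at random (x, y)."""
    rng = np.random.default_rng(seed)
    xy = 1 - 2 * rng.random((n, 2))            # (x, y) in the square (-1, 1]^2
    z = height * np.sin(np.pi * xy[:, 0]) * np.tanh(3 * xy[:, 1])
    return np.column_stack([xy, z])

helix = toroidal_helix(1000)
peaks = twin_peaks(1000)
```

Each call returns 1000 points in the three-dimensional ambient space, which can then be passed to any of the unfolding methods compared in this test.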
We then apply the LE, LLE and Hypergraph methods to unfold these 3D data and visualize the local geometry. The unfolding results of Punctured Sphere, Gaussian surface, Twin Peaks, and Toroidal Helix are shown in Figures 2-5, respectively.
As shown in Figure 2, Hypergraph preserves the topological structure of the Punctured Sphere in dimensionality reduction. In comparison, although the contour obtained by LLE is generally preserved, the data points in the central part are very sparse, suggesting that LLE loses certain important information. More severely, LE almost fails to uncover the high-dimensional structure in the low-dimensional space. Therefore, provided that an appropriate parameter is chosen, Hypergraph is superior to LE and LLE in unfolding the Punctured Sphere. Likewise, unfolding the Gaussian surface with Hypergraph, LLE and LE shows a similar pattern in Figure 3.
As shown in Figure 4, although all three methods largely recover the symmetric structure in the low-dimensional space, Hypergraph preserves the most information in dimensionality reduction. LLE has the worst performance in unfolding the Twin Peaks.
As shown in Figure 5, Hypergraph and LE obtain nearly the same performance in the dimensionality reduction of the Toroidal Helix, whereas the contour obtained by LLE is distorted.
To summarize, with an appropriate selection of the number of neighbors, Hypergraph can effectively uncover the nonlinear geometry of nearest neighborhoods in high-dimensional data. All the results consistently suggest that Hypergraph is superior to LE and LLE in unraveling the higher-order correlations and recovering underlying structures with two degrees of freedom. Therefore, the hypergraph is an effective approach for dimensionality reduction.
5.2 Datasets for clustering test
Table 2 gives a general description of the datasets used in the clustering test. Specifically, six image datasets (i.e. COIL20, Faces94 male, ETH80, MNIST Digits, Olivetti Faces, and USPS) are involved. The data are randomly shuffled, and the gray values of the pixels are normalized to unit. refers to the raw image size, while the indicates the size of the dataset after dimensionality reduction.
The COIL20 dataset contains 1440 grayscale images of 20 objects viewed from 72 equally spaced orientations. The images contain pixels. The Faces94 male dataset consists of 113 male individuals. The original image resolution is ; each image was resized to pixels, for a total of 600 images. ETH80 is a multi-view image dataset for object categorization, which includes eight categories corresponding to apple, car, cow, cup, dog, horse, pear and tomato. Each category contains ten objects, and each object is represented by 41 images of different views. The original image resolution is ; we resized each image to pixels, for a total of 3280 images. The MNIST dataset contains 60000 grayscale images of handwritten digits. For our experiments, we randomly selected 3000 of the images for computational reasons. The digit images have pixels. The Olivetti faces dataset consists of images of 40 individuals with small variations in viewpoint, large variations in expression, and occasional addition of glasses. The dataset consists of 400 images (10 per individual) of size pixels, and is labeled according to identity. USPS is a handwritten digits dataset; it contains a total of 2000 images of size pixels.
Each dataset used in our clustering test has ground-truth class labels. For evaluation, we first reduce the dimension of the tensor data, and then cluster them with the algorithm. We use 3-order tensors in our numerical experiments: the first two modes are associated with the image pixels, and the last mode indexes the images.
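The tensor layout described above, with pixels in the first two modes and samples in the last mode, can be sketched as follows. The function name `images_to_tensor` is hypothetical, and the normalization is assumed here to mean scaling the gray values to the range [0, 1] by the largest value in the dataset; the paper's exact normalization may differ.

```python
import numpy as np

def images_to_tensor(images):
    """Stack equally sized grayscale images (H x W) into an (H, W, n) tensor in [0, 1]."""
    stack = np.stack([np.asarray(img, dtype=float) for img in images], axis=-1)
    peak = stack.max()
    return stack / peak if peak > 0 else stack

# Three constant toy images with 8-bit gray values.
images = [np.full((4, 3), value) for value in (0, 128, 255)]
T = images_to_tensor(images)
print(T.shape)   # (4, 3, 3)
```

Dividing by a nonnegative constant keeps all entries nonnegative, so the resulting tensor is a valid input for a nonnegative factorization.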
5.3 Experimental results of clustering
Here we present the numerical results of the cluster analysis. Since HyperNTF involves two essential parameters, the regularization parameter and the number of neighbors, we test to what extent the performance relies on the selection of and neighbors. We vary from to , and the corresponding from to , respectively. To this end, we first perform clustering using algorithms. We run clustering with random initialization 10 times and report the averaged results as the final clustering results. We use clustering accuracy (ACC) and normalized mutual information (NMI) as the evaluation metrics. The results in Figure 6 suggest that the performance of HyperNTF is robust to the selection of and neighbors.
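The two evaluation metrics can be computed with the Python standard library alone. The sketch below is one common way to define them and makes two assumptions that the text does not fix: ACC uses the best one-to-one mapping between predicted clusters and true classes, found here by brute force over permutations (practical only for a handful of clusters; the Hungarian algorithm is the usual choice for larger numbers), and NMI is normalized by the geometric mean of the two entropies.

```python
from collections import Counter
from itertools import permutations
from math import log, sqrt

def clustering_accuracy(y_true, y_pred):
    """Fraction of samples matched under the best cluster-to-class mapping."""
    counts = Counter(zip(y_pred, y_true))
    pred_ids = sorted(set(y_pred))
    true_ids = sorted(set(y_true))
    # Pad with None so that surplus predicted clusters map to "no class".
    targets = true_ids + [None] * max(0, len(pred_ids) - len(true_ids))
    best = 0
    for assignment in permutations(targets, len(pred_ids)):
        hits = sum(counts[(p, t)] for p, t in zip(pred_ids, assignment))
        best = max(best, hits)
    return best / len(y_true)

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def normalized_mutual_information(y_true, y_pred):
    """Mutual information divided by the geometric mean of the label entropies."""
    n = len(y_true)
    pu, pv = Counter(y_true), Counter(y_pred)
    joint = Counter(zip(y_true, y_pred))
    mi = sum(c / n * log(c * n / (pu[u] * pv[v])) for (u, v), c in joint.items())
    hu, hv = entropy(y_true), entropy(y_pred)
    if hu == 0 or hv == 0:
        return 1.0 if hu == hv else 0.0
    return mi / sqrt(hu * hv)

y_true = [0, 0, 1, 1, 2, 2]
perfect = [1, 1, 0, 0, 2, 2]      # same partition, permuted cluster names
noisy = [0, 0, 0, 1, 2, 2]        # one sample assigned to the wrong cluster

acc_perfect = clustering_accuracy(y_true, perfect)
acc_noisy = clustering_accuracy(y_true, noisy)
nmi_perfect = normalized_mutual_information(y_true, perfect)
nmi_noisy = normalized_mutual_information(y_true, noisy)
```

Both metrics are invariant to how the clusters are named, which is why the permuted labelling scores as a perfect result.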
According to Figure 6, for the following analyses, we set the parameters , for COIL20; , for Faces94 male; , for ETH80; , for MNIST ; , for Olivetti; and , for USPS dataset, respectively.
Tables 3-8 compare the clustering results of HyperNTF with those of the traditional nonnegative Tucker decomposition (NTD) kim2007nonnegative and nonnegative tensor factorization (NTF) cichocki2007nonnegative methods. These results suggest that, compared to NTD and NTF, HyperNTF reliably clusters the data into the labeled classes regardless of the number of clusters .
Table 9 lists the comparisons with some state-of-the-art (SOTA) methods, including low-rank regularized heterogeneous tensor decomposition (LRRHTD) zhang2017low, heterogeneous tensor decomposition (HTD-Multinomial) sun2015heterogeneous, higher order singular value decomposition (HOSVD) de2000multilinear; savas2007handwritten, and graph-Laplacian Tucker decomposition (GLTD) jiang2018image. The results show that HyperNTF outperforms the SOTA methods on most datasets, except Faces94 male.
6 Conclusion
To sum up, our proposed algorithm, HyperNTF, achieves state-of-the-art (SOTA) performance in both the dimensionality reduction and the clustering tests. The manifold unfolding tests obtained reliable embeddings from to dimensions under an optimal choice of value (Figures 2-5). These results indicate that the hypergraph helps maintain the higher-order correlations among data points. Moreover, in the clustering tests (Tables 3-9), HyperNTF is superior to the compared methods (HTD-Multinomial, LRRHTD, GLTD and HOSVD) regardless of the number of clusters. It thus has the distinct advantage of not requiring the class number to be known in advance. Despite these merits, some issues of HyperNTF require further investigation, such as the optimal initialization, the discrepancy measure, and the optimal stopping criterion.
This research was supported by the National Natural Science Foundation of China (No.62001205), Guangdong Natural Science Foundation Joint Fund (No.2019A1515111038), High-level University Fund (No.G02386301, G02386401).