Efficient Nonnegative Tucker Decompositions: Algorithms and Uniqueness

Nonnegative Tucker decomposition (NTD) is a powerful tool for extracting nonnegative, parts-based, and physically meaningful latent components from high-dimensional tensor data while preserving the natural multilinear structure of the data. However, as the data tensor often has multiple modes and is large-scale, existing NTD algorithms suffer from very high computational complexity in terms of both storage and computation time, which has been one of the major obstacles to practical applications of NTD. To overcome these disadvantages, we show how low (multilinear) rank approximation (LRA) of tensors significantly simplifies the computation of the gradients of the cost function, upon which a family of efficient first-order NTD algorithms is developed. Besides dramatically reducing the storage complexity and running time, the new algorithms are quite flexible and robust to noise, because any well-established LRA approach can be applied. We also show how nonnegativity, combined with sparsity, substantially improves the uniqueness property and partially alleviates the curse of dimensionality of Tucker decompositions. Simulation results on synthetic and real-world data justify the validity and high efficiency of the proposed NTD algorithms.


I Introduction

Finding information-rich and task-relevant variables hidden behind observation data is a fundamental task in data analysis and has been widely studied in the fields of signal and image processing and machine learning. Although the observation data can be very large, a much smaller number of latent variables or components can often capture the most significant features of the original data. By revealing such components, we achieve objectives such as dimensionality reduction and feature extraction and obtain a highly relevant and compact representation of high-dimensional data. This topic has been studied extensively over the last several decades, as witnessed by the great success of blind source separation (BSS) techniques [1]. In these methods, observation data are modeled as a linear combination of latent components that possess specific diversities such as statistical independence, temporal structure, sparsity, and smoothness. By properly exploiting such diversities, a large family of matrix-factorization-based methodologies has been proposed and successfully applied to a variety of areas. In many applications, however, the data are more naturally represented by tensors, e.g., color images, video clips, and fMRI data. Methodologies that matricize the data and then apply matrix factorization give only a flattened view of the data and often cause a loss of internal structure information; hence, it is preferable to process such data in their own domain, i.e., the tensor domain, to obtain a multi-perspective, stereoscopic view of the data rather than a flattened one. For this reason, tensor decomposition methods have been proposed and widely applied to high-order tensors. As one of the most widely used methods, the Tucker decomposition has been applied to pattern recognition [2, 3], clustering analysis [4], image denoising [5], etc., and has achieved great success.

Very often, the observation data and latent components are naturally nonnegative, e.g., text, images, spectra, probability matrices, the adjacency matrices of graphs, web data based on impressions and clicks, and financial time series. For such data, the extracted components may lose most of their physical meaning if nonnegativity is not preserved. In this regard, nonnegative matrix factorization (NMF) has been demonstrated to be a powerful tool for analyzing nonnegative matrix data because it gives physically meaningful and more interpretable results; in particular, it has the ability to learn the local parts of objects [6]. As a result, NMF has received extensive study in the last decade [4, 6] and has found many important applications, including clustering analysis [7, 8, 9], sparse coding [10], dependent source separation [11], etc.

Fig. 1: An illustration of how NTD is able to give a local parts-based representation of tensor objects for the PIE face database. By NTD, each face is represented by a linear combination of a set of very sparse basis faces. In contrast to the basis images extracted by NMF, these basis faces possess a multilinear structure (i.e., each basis face is the outer product of lower-dimensional vectors).

$\mathbf{A}$, $\mathbf{a}_j$, $a_{ij}$: A matrix, the $j$th column, and the $(i,j)$th entry of the matrix $\mathbf{A}$, respectively.
$\mathbf{1}$, $\mathbf{0}$: The vector/matrix with all of its elements equal to one, zero.
$\mathbb{N}_N$: The index set of positive integers no larger than $N$ in ascending order, i.e., $\mathbb{N}_N = \{1, 2, \ldots, N\}$.
$\mathbb{R}_+^{I_1 \times I_2 \times \cdots \times I_N}$: The set of $N$th-order nonnegative tensors (matrices) of size $I_1$-by-$I_2$-$\cdots$-by-$I_N$.
$\mathbf{A} \ge 0$: A nonnegative matrix, i.e., $a_{ij} \ge 0$ for all $i, j$.
$[\mathbf{A}]_+$: An operator yielding the nonnegative part of $\mathbf{A}$, i.e., $([\mathbf{A}]_+)_{ij} = \max(a_{ij}, 0)$.
$\underline{\mathbf{Y}}$, $\mathbf{Y}_{(n)}$: A tensor and the mode-$n$ matricization of the tensor $\underline{\mathbf{Y}}$.
$\circledast$, $\oslash$: Element-wise (Hadamard) product and division of matrices or tensors.
$\otimes$: Kronecker product; $\mathbf{A} \otimes \mathbf{B}$ is the block matrix with blocks $a_{ij}\mathbf{B}$.
$\odot$: Khatri–Rao (column-wise Kronecker) product, $\mathbf{A} \odot \mathbf{B} = [\mathbf{a}_1 \otimes \mathbf{b}_1 \;\; \mathbf{a}_2 \otimes \mathbf{b}_2 \;\; \cdots]$.
Sparsity: The number of zeros in a matrix/tensor, with the sparsity defined as the proportion of zero-valued entries.
Uniform random entries: Elements drawn from independent uniform distributions between 0 and 1.

TABLE I: Notation and Definitions.

For nonnegative tensor data analysis, nonnegative Tucker decomposition (NTD) has also gained importance in recent years [4, 12, 13]. NTD not only inherits all of the advantages of NMF but also provides an additional multiway structured representation of data. Fig. 1 illustrates how NTD is able to give a parts-based representation of the face images in the PIE database (available at http://vasc.ri.cmu.edu/idb/html/face/). In the figure, a sample face image is represented as a linear combination of a set of sparse basis images that possess a multilinear structure. Unconstrained Tucker decompositions are often criticized for their lack of uniqueness and for the curse of dimensionality, i.e., the size of the core tensor increases exponentially with the order of the tensor. Compared with unconstrained Tucker decompositions, NTD is more likely to be unique and provides physically meaningful components. Moreover, the core tensor in NTD is often very sparse, which allows us to discover the most significant links between components and to partially alleviate the curse of dimensionality. Unfortunately, existing algorithms generally apply the NMF update rules directly, without fully exploiting the special multilinear structure of the Tucker model, and in turn suffer from very high computational complexity in terms of both space and time, especially when the data tensor is large-scale. It is therefore crucial to develop more efficient NTD algorithms that are able to yield satisfactory results within a tolerable time. Taking into account that unconstrained Tucker decompositions are significantly faster than NTD, in this paper we propose a new framework for efficient NTD that is based on a preceding unconstrained Tucker decomposition of the data tensor. As such, frequent access to the original big tensor is avoided, thereby leading to a considerably reduced computational complexity for NTD. Although the basic idea of NTD based on a preceding low (multilinear) rank approximation (LRA) was briefly introduced in our recent overview paper [14], the detailed derivations are presented in this paper together with new results on the uniqueness of NTD.

The rest of the paper is organized as follows. In Section II, the basic notation and NTD models are introduced. In Section III, the first-order NTD algorithms are reviewed. In Section IV, flexible and efficient NTD algorithms based on the low-rank approximation of data are introduced, and unique and sparse NTD is discussed in Section V. Simulations on synthetic and real-world data are presented in Section VI to demonstrate the high efficiency of the proposed algorithms, and conclusions are presented in Section VII.

The notation used in this paper is listed in TABLE I, and more details can be found in [15, 4].

II NTD Models

II-A Notation and Basic Multilinear Operators

Definitions. For an Nth-order tensor $\underline{\mathbf{Y}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, we define

  • Fibers. A mode-n fiber of the tensor $\underline{\mathbf{Y}}$ is a vector obtained by fixing all indices but the nth one, e.g., $\underline{\mathbf{Y}}(i_1, \ldots, i_{n-1}, :, i_{n+1}, \ldots, i_N)$ by using the MATLAB colon operator.

  • Matricization. The mode-n matricization (unfolding) of $\underline{\mathbf{Y}}$ is an $I_n$-by-$\prod_{p \neq n} I_p$ matrix denoted by $\mathbf{Y}_{(n)}$ whose columns consist of all mode-n fibers of $\underline{\mathbf{Y}}$.

  • Mode-n product. The mode-n product $\underline{\mathbf{Y}} = \underline{\mathbf{G}} \times_n \mathbf{A}$ of a tensor $\underline{\mathbf{G}} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$ and an $I_n$-by-$R_n$ matrix $\mathbf{A}$ yields an Nth-order tensor $\underline{\mathbf{Y}} \in \mathbb{R}^{R_1 \times \cdots \times R_{n-1} \times I_n \times R_{n+1} \times \cdots \times R_N}$ such that $y_{r_1 \cdots r_{n-1}\, i_n\, r_{n+1} \cdots r_N} = \sum_{r_n=1}^{R_n} g_{r_1 r_2 \cdots r_N}\, a_{i_n r_n}$.

Useful properties. The following properties concerning the mode-n product will be used frequently:

  1. If $\underline{\mathbf{Y}} = \underline{\mathbf{G}} \times_n \mathbf{A}$, then $\mathbf{Y}_{(n)} = \mathbf{A}\,\mathbf{G}_{(n)}$;

  2. $\underline{\mathbf{G}} \times_n \mathbf{A} \times_n \mathbf{B} = \underline{\mathbf{G}} \times_n (\mathbf{B}\mathbf{A})$;

  3. $\underline{\mathbf{G}} \times_n \mathbf{A} \times_m \mathbf{B} = \underline{\mathbf{G}} \times_m \mathbf{B} \times_n \mathbf{A}$ for $m \neq n$.

  4. If $\underline{\mathbf{Y}} = \underline{\mathbf{G}} \times_1 \mathbf{A}^{(1)} \times_2 \mathbf{A}^{(2)} \cdots \times_N \mathbf{A}^{(N)}$, then

    $\mathbf{Y}_{(n)} = \mathbf{A}^{(n)} \mathbf{G}_{(n)} \big(\mathbf{A}^{(N)} \otimes \cdots \otimes \mathbf{A}^{(n+1)} \otimes \mathbf{A}^{(n-1)} \otimes \cdots \otimes \mathbf{A}^{(1)}\big)^{\mathsf{T}},$    (1)

    where $\mathbf{Y}_{(n)}$ and $\mathbf{G}_{(n)}$ are the mode-n matricizations of $\underline{\mathbf{Y}}$ and $\underline{\mathbf{G}}$, respectively.

Note that, owing to Property 3), the mode products in $\underline{\mathbf{G}} \times_1 \mathbf{A}^{(1)} \times_2 \mathbf{A}^{(2)} \cdots \times_N \mathbf{A}^{(N)}$ can be performed in any order of the modes. However, in (1), the Kronecker products must be performed in the reverse order of the index set.
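The definitions above translate directly into a few lines of NumPy. The sketch below is illustrative only: the helper names (`unfold`, `mode_n_product`) are our own, and the C-order column ordering of the unfolding is one of several valid conventions (it pairs with Kronecker products in ascending mode order, unlike (1), which uses the reverse order); any fixed, consistent choice works.

```python
import numpy as np

def unfold(T, n):
    """Mode-n matricization: the columns are the mode-n fibers of T (one common ordering)."""
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def mode_n_product(T, M, n):
    """Mode-n product T x_n M for a matrix M with M.shape[1] == T.shape[n]."""
    out = (M @ unfold(T, n)).reshape((M.shape[0],) + tuple(np.delete(T.shape, n)))
    return np.moveaxis(out, 0, n)

# small sanity checks on a random 3rd-order tensor
rng = np.random.default_rng(0)
T = rng.standard_normal((4, 5, 6))
M1, M2, M3 = rng.standard_normal((3, 4)), rng.standard_normal((2, 5)), rng.standard_normal((7, 3))

# products applied in different modes commute
assert np.allclose(mode_n_product(mode_n_product(T, M1, 0), M2, 1),
                   mode_n_product(mode_n_product(T, M2, 1), M1, 0))
# two successive products in the same mode collapse into one: (T x_0 M1) x_0 M3 = T x_0 (M3 @ M1)
assert np.allclose(mode_n_product(mode_n_product(T, M1, 0), M3, 0),
                   mode_n_product(T, M3 @ M1, 0))
```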

II-B Nonnegative Tucker Decomposition Models

II-B1 General NTD Model

By Tucker decomposition, a given Nth-order tensor $\underline{\mathbf{Y}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is approximated as

$\underline{\mathbf{Y}} \approx \underline{\mathbf{G}} \times_1 \mathbf{A}^{(1)} \times_2 \mathbf{A}^{(2)} \cdots \times_N \mathbf{A}^{(N)},$    (2)

where the $\mathbf{A}^{(n)} \in \mathbb{R}^{I_n \times R_n}$, $n = 1, 2, \ldots, N$, are the factor (component) matrices, and $\underline{\mathbf{G}} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$ is the core tensor whose entries reflect the interactions and connections between the components (columns) in different mode matrices. We assume that $R_n \le I_n$ (typically $R_n \ll I_n$), as high-dimensional data can often be well approximated by lower-rank representations.

In NTD, both the core tensor and the factor matrices are required to be element-wise nonnegative. The nonnegativity of the factors has two key effects: the resulting representation is purely additive, involving no subtractions, and the factors are often sparse, as they may contain many zero entries. These two effects equip nonnegative factorization methods with the ability to learn localized parts of objects.
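As a toy illustration of the model in (2) and of the purely additive nature of NTD, the following snippet builds a nonnegative tensor from a small nonnegative core and nonnegative factor matrices; all sizes and names are arbitrary choices for this sketch, written for a 3rd-order tensor with einsum.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.random((3, 4, 2))                 # small nonnegative core tensor
A1 = rng.random((20, 3))                  # nonnegative factor matrices
A2 = rng.random((30, 4))
A3 = rng.random((10, 2))

# Y = G x_1 A1 x_2 A2 x_3 A3, written with einsum for a 3rd-order tensor
Y = np.einsum('abc,ia,jb,kc->ijk', G, A1, A2, A3)
assert Y.min() >= 0                       # purely additive: no subtractive cancellations occur
```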

II-B2 Population NTD

In the above NTD, if we fix the Nth factor matrix $\mathbf{A}^{(N)}$ to be the identity matrix (or equivalently, $\mathbf{A}^{(N)}$ is absorbed into the core tensor [4]), we obtain the population NTD model:

$\underline{\mathbf{Y}} \approx \underline{\mathbf{G}} \times_1 \mathbf{A}^{(1)} \times_2 \mathbf{A}^{(2)} \cdots \times_{N-1} \mathbf{A}^{(N-1)}.$    (3)

Population NTD is important because it has a broad range of applications in machine learning and signal processing [4]. To understand its key idea, consider simultaneously performing NTD on a set of (N-1)th-order sample tensors with common component matrices. As such, each sample tensor can be represented as [4]

(4)

where is the core tensor associated with , or equivalently,

(5)

where each column of the data matrix is a vectorized sample, and this matrix is just a matricization of the Nth-order tensor obtained by concatenating all of the samples. As such, each sample is represented as a linear combination of a common set of basis vectors (i.e., columns built from the common factor matrices via Kronecker/outer products), and the core contains the extracted features.

In the case of N = 3, the sample tensors in (4) are just matrices, and (4) can be written as

(6)

which has been studied under the name of population value decomposition (PVD) [16], without nonnegativity constraints. Hence, population NTD is an extension of PVD for extracting nonnegative common components from multiblock higher-dimensional data, equipped with the extra ability of learning localized parts of objects [17].

An alternative method for performing such feature extraction tasks, namely nonnegative matrix factorization (NMF), vectorizes each sample to form a sample matrix. By using NMF, this matrix is represented as

(7)

or equivalently, by using tensor notation,

(8)

An intuitive difference between population NTD and NMF is that the basis vectors of the former are outer products of lower-dimensional vectors, as shown in (5); this involves far fewer free parameters and gives a multilinear representation. Such multilinear representations have been widely exploited to overcome the over-fitting problem in discriminant analysis [3, 18], and they substantially improve the sparsity of the basis vectors, which will be discussed later.
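To see why the multilinear basis vectors of population NTD involve far fewer free parameters and tend to be much sparser than unstructured NMF basis vectors, consider the small numerical sketch below; the vector sizes and the way sparsity is induced are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
# two sparse mode-wise vectors, each with roughly half of its entries equal to zero
a1 = np.maximum(rng.standard_normal(32), 0)
a2 = np.maximum(rng.standard_normal(32), 0)

b = np.kron(a1, a2)                # one multilinear basis vector of length 32 * 32 = 1024
print(a1.size + a2.size, b.size)   # 64 free parameters describe a 1024-dimensional basis vector
print(np.mean(b == 0))             # the Kronecker structure compounds the zeros (about 75% here)
```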

III Overview of First-Order Methods for NTD

In NTD, we need to minimize the following cost function:

$D\big(\underline{\mathbf{G}}, \{\mathbf{A}^{(n)}\}\big) = \tfrac{1}{2}\,\big\|\underline{\mathbf{Y}} - \underline{\mathbf{G}} \times_1 \mathbf{A}^{(1)} \times_2 \mathbf{A}^{(2)} \cdots \times_N \mathbf{A}^{(N)}\big\|_F^2,$    (9)

where the component matrices $\mathbf{A}^{(n)}$ and the core tensor $\underline{\mathbf{G}}$ are both required to be element-wise nonnegative.

Following the analysis in [19] and similarly defining , we can straightforwardly obtain the following proposition:

Proposition 1

Let . Then, the infimum

is attained.

Owing to Proposition 1 and the equivalence of different norms, a global optimal solution of problem (9) always exists. Below, we focus on the optimization algorithms.

To solve the optimization problem, we generally use a block coordinate descent framework: we minimize the cost function with respect to only part of the parameters (e.g., one factor matrix, or even only one column of it) at a time while fixing the others. To optimize a factor matrix, we consider an equivalent form of (9) obtained from the mode-n matricizations of the data tensor and its approximation:

(10)

where and is defined as in (1). To optimize , considering the vectorization of and , (9) becomes

(11)

where

(12)

Both (10) and (11) are nonnegative least squares (NLS) problems that have been extensively studied in the context of NMF; well-known solvers include the multiplicative update (MU) algorithm [20], the hierarchical alternating least squares (HALS) method [4], active-set methods [21, 22], and the NMF algorithm based on the accelerated proximal gradient (APG) [23]. These algorithms use only first-order information and do not require a search for the (learning) step size. To extend these methods to NTD, we need to compute the following gradients of the cost function with respect to the factor matrices and the core tensor:

(13)

where , and

(14)

or equivalently,

(15)

On the basis of (13)–(15) and the existing NMF algorithms, a set of first-order NTD algorithms can be developed; for example, the NTD algorithms based on the MU and HALS algorithms have been developed in [13, 12, 24].

The above-mentioned algorithms are all based on the gradients in (13)–(15). The major problem is that the matrices involved in these gradients are quite large, and the cost of computing them is correspondingly high. Hence, a direct implementation of the above methods is extremely demanding in both time and space, especially for large-scale problems.

IV NTD Based on Low-Rank Approximations

IV-A Efficient Computation of Gradients by Using LRA

To reduce the computational complexity, we consider the following two-step approach to perform NTD:

  1. The LRA step. Obtain an LRA of the data tensor such that

    (16)

    where the multilinear rank of the approximation controls the approximation error and is not necessarily equal to that of the NTD.

  2. The NTD step. Perform NTD by minimizing the cost function in which the approximation obtained in Step 1 serves as the target nonnegative tensor.

The effects of LRA are twofold: it reduces the noise in the observation data, and it reduces the subsequent computational complexity in terms of both space and time. In fact, the LRA consumes much less storage than the original data tensor when the multilinear ranks are much smaller than the data dimensions: the former stores only the small core tensor and the factor matrices, whereas the latter stores every entry of the data. For an intuitive comparison, for moderately large data sizes and much smaller ranks, the memory consumed by the LRA can be as little as 0.014% of that consumed by the original tensor.
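The two-step procedure is easy to prototype. Below is a minimal sketch in which Step 1 uses a plain truncated HOSVD; any other LRA routine (ME-HOSVD, CUR, randomized Tucker, etc.) could be substituted. The function name, the sizes, and the ranks are assumptions made for illustration only.

```python
import numpy as np

def unfold(T, n):
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def truncated_hosvd(Y, ranks):
    """Step 1 (LRA): Y is approximated by a small core S expanded by orthonormal factors U."""
    U = [np.linalg.svd(unfold(Y, n), full_matrices=False)[0][:, :r]
         for n, r in enumerate(ranks)]
    S = Y
    for n, Un in enumerate(U):             # core = data projected onto the mode subspaces
        S = np.moveaxis(np.tensordot(Un.T, S, axes=(1, n)), 0, n)
    return S, U

# illustration of the storage argument in the text (small sizes so that it runs quickly)
rng = np.random.default_rng(0)
Y = rng.random((60, 60, 60))
S, U = truncated_hosvd(Y, (10, 10, 10))
full = Y.size
compressed = S.size + sum(u.size for u in U)
print(f"{compressed} numbers instead of {full} ({100 * compressed / full:.2f}%)")
# Step 2 then runs any of the NTD algorithms of Section IV-B on (S, U) rather than on Y itself.
```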

Now, we show how the gradients with respect to the factor matrices and the core tensor can be efficiently computed after the data tensor is replaced with its low-rank approximation. First, we have

(17)

where is the mode- matricization of the tensor

(18)

Here, this term can be computed efficiently, as it only involves products of the small core tensor with small matrices. In particular, the memory-efficient (ME) tensor-times-matrix procedure proposed in [25] can be applied to further reduce the memory consumption significantly.

Regarding the term , we have

(19)

where is the mode- matricization of the tensor

(20)

Furthermore, from (18), (20), and , we have

(21)

and

(22)

These two auxiliary tensors can be computed very efficiently, as they only involve multiplications between very small core tensors and matrices. On the basis of the above analysis, the gradients in (13) and (15) can be computed as

(23)

We count the floating-point multiplications to measure the computational complexity, under the condition that the NTD and LRA ranks are much smaller than the data dimensions; the results in TABLE II show that the LRA-based versions are significantly faster than the original ones.

TABLE II: Floating-point Multiplication Counts for the Computation of the Gradients, With and Without LRA.
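To make the structure of (17)–(23) concrete, the sketch below forms the two gradient terms for one factor matrix through the two small auxiliary tensors defined in (18) and (20) (named `Phi` and `Psi` here; these names, the sizes, and the C-order unfolding convention are our own choices), and then checks the result against the direct, expensive computation on a tiny example.

```python
import numpy as np

def unfold(T, n):
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def mm_prod(T, mats, skip=None):
    """Multiply mats[p] into mode p of T for every p (optionally skipping one mode)."""
    for p, M in enumerate(mats):
        if p != skip:
            T = np.moveaxis(np.tensordot(M, T, axes=(1, p)), 0, p)
    return T

# toy sizes (assumed): data dimension 15, LRA ranks 4, NTD ranks 3, order 3
rng = np.random.default_rng(0)
I, R, J, N = 15, 4, 3, 3
U = [np.linalg.qr(rng.standard_normal((I, R)))[0] for _ in range(N)]   # LRA factors
S = rng.standard_normal((R,) * N)                                      # LRA core
A = [rng.random((I, J)) for _ in range(N)]                             # NTD factors
G = rng.random((J,) * N)                                               # NTD core

n = 0
Phi = mm_prod(S, [a.T @ u for a, u in zip(A, U)], skip=n)   # small tensor, cf. (18)
Psi = mm_prod(G, [a.T @ a for a in A], skip=n)              # small tensor, cf. (20)
term1 = U[n] @ unfold(Phi, n) @ unfold(G, n).T              # data-dependent gradient term
term2 = A[n] @ unfold(Psi, n) @ unfold(G, n).T              # model-dependent gradient term
grad_An = term2 - term1                                     # gradient w.r.t. the nth factor

# sanity check against the direct computation, which touches the full-size tensors
Ytilde = mm_prod(S, U)
B = mm_prod(G, A, skip=n)
direct = A[n] @ unfold(B, n) @ unfold(B, n).T - unfold(Ytilde, n) @ unfold(B, n).T
assert np.allclose(grad_An, direct)
```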

Low-rank approximation, or unconstrained Tucker decomposition, of the data tensor plays a key role in the proposed two-step framework. The higher-order singular value decomposition (HOSVD) [26] often serves as the workhorse for this purpose. Although it provides a good trade-off between accuracy and efficiency, it involves the eigenvalue decomposition of very large matrices; hence, it is not suitable for very large-scale problems [27]. Its memory-efficient variant proposed in [25] is an iterative method and provides improved scalability. CUR decomposition [28] and the MACH method [29], which are based on sampling and sparsification, respectively, provide favorable scalability and are suitable for large-scale problems. In [30], a highly scalable Tucker decomposition algorithm was developed on the basis of distributed computing and randomization. All of these methods assume that the noise is Gaussian; robust tensor decomposition methods [31, 32] are recommended if the data tensor is contaminated by outliers. In practice, which method is best depends on many factors, e.g., whether the data are sparse or dense, the scale of the data, the noise distribution, etc.

IV-B Efficient NTD Algorithms Based on LRA

IV-B1 LRANTD Based on MU Rules (MU-NTD)

The standard MU method performs updates of the form

(24)

with a clever choice of the step size, so that the cost function remains nonincreasing and the factors remain nonnegative. As the corresponding term in (19) may contain some negative elements after LRA, we apply the method proposed in [33] to (24), in which the problematic term in the descent direction is replaced by its nonnegative part, thereby leading to the following MU formula:

(25)

Similarly, we obtain the MU rule for as

(26)

See Algorithm 1 for the pseudocode of the NTD algorithm based on the MU rules in (25) and (26). These rules differ from the algorithms proposed in [13, 12, 24], which update the parameters in a rather inefficient manner: each parameter is updated only once per main iteration. In our case, multiple inner updates are used to achieve a sufficient decrease of the cost function and thus improve the overall efficiency, motivated by the work on NMF in [34]. Of course, to maintain high efficiency, exact convergence of each subproblem is generally unnecessary during the iterations.

0:  The data tensor and the multilinear ranks of the NTD.
1:  Initialization of the factor matrices and the core tensor.
2:  Perform the LRA of the data tensor as in (16).
3:  while not converged do
4:     Execute the loop in lines 5–8 for n = 1, 2, ..., N:
5:     while not converged do
6:        Update the auxiliary tensors defined in (18) and (20).
7:        Update the nth factor matrix using (25).
8:     end while
9:     while not converged do
10:        Update the core tensor using (26).
11:     end while
12:  end while
13:  return  the core tensor and the factor matrices.
Algorithm 1 The MU-NTD Algorithm
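A compact NumPy rendering of Algorithm 1 might look as follows. This is a sketch under our own naming, initialization, and unfolding conventions, not the authors' reference implementation; the fixed inner and outer iteration counts stand in for proper convergence tests, and the auxiliary tensors are computed once per mode because they do not depend on the factor being updated.

```python
import numpy as np

def unfold(T, n):
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def mm_prod(T, mats, skip=None):
    for p, M in enumerate(mats):
        if p != skip:
            T = np.moveaxis(np.tensordot(M, T, axes=(1, p)), 0, p)
    return T

def mu_ntd(S, U, ranks, n_outer=100, n_inner=5, eps=1e-12, seed=0):
    """MU-NTD sketch: fit a nonnegative core G and factors A to the LRA given by (S, U)."""
    rng = np.random.default_rng(seed)
    A = [rng.random((Un.shape[0], J)) for Un, J in zip(U, ranks)]
    G = rng.random(tuple(ranks))
    for _ in range(n_outer):
        for n in range(len(A)):                                          # factor updates, rule (25)
            Phi = mm_prod(S, [a.T @ u for a, u in zip(A, U)], skip=n)
            Psi = mm_prod(G, [a.T @ a for a in A], skip=n)
            num = np.maximum(U[n] @ unfold(Phi, n) @ unfold(G, n).T, 0)  # [.]_+ absorbs negatives
            den = unfold(Psi, n) @ unfold(G, n).T                        # introduced by the LRA
            for _ in range(n_inner):
                A[n] *= num / (A[n] @ den + eps)
        AtU = [a.T @ u for a, u in zip(A, U)]
        AtA = [a.T @ a for a in A]
        num_G = np.maximum(mm_prod(S, AtU), 0)
        for _ in range(n_inner):                                         # core update, rule (26)
            G *= num_G / (mm_prod(G, AtA) + eps)
    return G, A
```

In practice the fixed iteration counts would be replaced by a stopping rule based on the change of the fit, and the columns of the factors can be normalized after each outer iteration if desired.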

IV-B2 NTD Based on HALS (HALS-NTD)

HALS-NTD updates only one column of a factor matrix at a time [24] by minimizing

(27)

where , . By using the Lagrange multiplier method [33], we obtain the update rule for :

(28)

where the scalars and vectors involved are, respectively, the corresponding entries and columns of the matrices appearing in (27). The MU rule in (26) can be used to update the core tensor.
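For reference, a column-wise HALS update in the spirit of (28) can be sketched as below. Here `num` and `Q` stand for the data-times-basis term and the basis Gram matrix of the factor subproblem (10); both can be formed cheaply through the same small auxiliary tensors as in the gradient sketch above. The names and the tiny demo sizes are assumptions.

```python
import numpy as np

def hals_update_factor(An, num, Q, eps=1e-12):
    """One HALS sweep over the columns of a factor matrix (a sketch of rule (28))."""
    An = An.copy()
    for j in range(An.shape[1]):
        # exact nonnegative LS solution for column j with all other columns held fixed
        col = An[:, j] + (num[:, j] - An @ Q[:, j]) / max(Q[j, j], eps)
        An[:, j] = np.maximum(col, 0)
    return An

# tiny demo with random (not structurally meaningful) inputs
rng = np.random.default_rng(0)
B = rng.random((3, 50))                    # plays the role of the mode-n unfolded basis
Y = rng.random((15, 50))                   # plays the role of the mode-n unfolded data
A0 = rng.random((15, 3))
A1 = hals_update_factor(A0, Y @ B.T, B @ B.T)
```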

IV-B3 NTD Based on the Accelerated Proximal Gradient (APG-NTD)

Following the analysis in [23], we have the following:

Proposition 2

Both and in (13), (15) and (23) are Lipschitz continuous with the Lipschitz constants and .

Hence, the APG method can be applied to update the factor matrices and the core tensor. For example, we use Algorithm 2 to update the core tensor provided that all of the factor matrices are fixed. Similarly, we can obtain the update rules for the factor matrices. We call this algorithm APG-NTD; see Algorithm 2.

0:  , .
1:  Initialize , , , .
2:  while not converged do
3:     ,
4:     ,
5:     ,
6:     .
7:  end while
8:  return  .
Algorithm 2 The APG Algorithm for the Core Tensor
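An accelerated proximal gradient (FISTA-style) update for the core, in the spirit of Algorithm 2, is sketched below. The gradient and its Lipschitz constant are formed only from small matrices (the Gram matrices of the factors and their products with the LRA factors); the function and variable names are our own, and the fixed iteration count stands in for a convergence test.

```python
import numpy as np

def mm_prod(T, mats):
    for p, M in enumerate(mats):
        T = np.moveaxis(np.tensordot(M, T, axes=(1, p)), 0, p)
    return T

def apg_core(S, U, A, G0, n_iter=50):
    """Accelerated proximal gradient sketch for the nonnegative core, with all factors fixed."""
    AtU = [a.T @ u for a, u in zip(A, U)]
    AtA = [a.T @ a for a in A]
    C = mm_prod(S, AtU)                                  # constant part of the gradient
    L = np.prod([np.linalg.norm(M, 2) for M in AtA])     # Lipschitz constant of the gradient
    G = G_prev = G0.copy()
    t = t_prev = 1.0
    for _ in range(n_iter):
        Z = G + ((t_prev - 1.0) / t) * (G - G_prev)      # extrapolated point
        grad = mm_prod(Z, AtA) - C                       # gradient at Z (small tensors only)
        G_prev, G = G, np.maximum(Z - grad / L, 0.0)     # projected gradient step
        t_prev, t = t, 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
    return G
```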

IV-B4 Active Set Method (AS-NTD)

The active-set methods proposed for NMF in [21, 22] can be applied to NTD to solve (10) and (11). Roughly speaking, these methods solve the corresponding inverse problems under nonnegativity constraints; among them, block principal pivoting (BPP) achieves the best performance, as multiple columns are updated simultaneously [22]. Active-set-based NMF approaches converge very fast, but their stability has been questioned in some cases [23].

IV-B5 Alternating Least Squares (ALS) and Semi-NTD

Sometimes some component matrices and/or the core tensor are not required to be nonnegative, which yields a natural extension of semi-NMF [35]. These factors can be updated by the least-squares (LS) solutions of the corresponding linear systems, which results in

(29)

Note that if we apply an additional nonnegativity projection to (29), we obtain a very simple ALS-based NTD algorithm (ALS-NTD):

(30)

Similar to ALS-based NMF, the ALS-NTD method generally has no convergence guarantee. However, extensive experimental results show that it works quite well when the decomposition ranks are relatively small and the factors are very sparse.
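A direct reading of the ALS scheme in (29)–(30), i.e., unconstrained least-squares solutions followed by projection onto the nonnegative orthant, is sketched below. It is written against the (possibly LRA-reconstructed) data tensor for clarity, and all names are our own.

```python
import numpy as np

def unfold(T, n):
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def mm_prod(T, mats, skip=None):
    for p, M in enumerate(mats):
        if p != skip:
            T = np.moveaxis(np.tensordot(M, T, axes=(1, p)), 0, p)
    return T

def als_ntd_sweep(Y, G, A):
    """One ALS-NTD sweep: LS update of every factor and of the core, each clipped at zero."""
    for n in range(Y.ndim):
        B = mm_prod(G, A, skip=n)                                  # basis tensor for mode n
        An, *_ = np.linalg.lstsq(unfold(B, n).T, unfold(Y, n).T, rcond=None)
        A[n] = np.maximum(An.T, 0)                                 # project onto nonnegativity
    G = mm_prod(Y, [np.linalg.pinv(a) for a in A])                 # LS core, then project
    return np.maximum(G, 0), A
```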

IV-C Error Bounds

An important question is how the LRA will affect the accuracy of NTD. The following proposition provides an approximate error bound:

Proposition 3

Let tensor be a low-rank approximation of with . Suppose that and are the optimal NTDs of and , respectively, and . Then

(31)
Proof:

As and are respectively the optimal NTDs of and , we have

(32)

Obviously, if the LRA is exact, there is no difference between the direct NTD and the NTD based on LRA. In summary, the quality of the LRA can be crucial for achieving satisfactory accuracy.

IV-D NTD with Missing Values (Weighted NTD)

In practice, some entries of the data tensor may be severely contaminated by noise and hence unusable, or they may simply be missing. In such cases, the intrinsic low-rank structure of the data often allows the missing values to be recovered from the incomplete data. NTD with missing values can be formulated as the following general weighted NTD problem:

(33)

where the entries of the weight tensor are between 0 and 1. If the entries of the weight tensor can only be either 0 or 1, (33) is the problem of NTD with missing values. Although many methods proposed for tensor/matrix decompositions with missing values can be straightforwardly extended to NTD, an ad hoc two-step solution can also be applied: in Step 1, a weighted Tucker decomposition is performed by minimizing a weighted version of the approximation error; then, in Step 2, NTD is performed on the completed tensor yielded by Step 1.

Notice that weighted Tucker decomposition approaches also allow us to obtain low-rank approximations by accessing only randomly sampled entries (fibers) of a high-dimensional tensor, which is a very useful technique for large-scale problems [28]. Although both the approaches proposed for the missing-values problem and those based on random sampling attempt to find an optimal approximation of the original data by using only partial data, there is a subtle difference: in the first category, the data samples used are fixed, whereas in the second category, the data samples must be carefully selected in order to achieve satisfactory accuracy with high probability. By using the above two-step framework, NTD can be scaled up to large-scale problems, and the error is governed by the quality of the LRA in Step 1, as stated in Proposition 3.
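A bare-bones version of Step 1 of the ad hoc two-step idea is sketched below: alternate between imputing the unobserved entries with the current low multilinear-rank model and refitting that model by truncated HOSVD. The mask semantics (1 = observed), the initialization, and the iteration count are assumptions; the completed tensor would then be passed to any of the NTD algorithms above as Step 2.

```python
import numpy as np

def unfold(T, n):
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def weighted_tucker_completion(Y, W, ranks, n_iter=30):
    """EM-style sketch of Step 1 for (33): impute, refit a truncated HOSVD model, repeat."""
    X = np.where(W > 0, Y, Y[W > 0].mean())            # fill the gaps with the observed mean
    for _ in range(n_iter):
        # mode subspaces of the current completion
        U = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :r]
             for n, r in enumerate(ranks)]
        Xhat = X
        for n, Un in enumerate(U):                     # project onto the mode subspaces
            Xhat = np.moveaxis(np.tensordot(Un @ Un.T, Xhat, axes=(1, n)), 0, n)
        X = np.where(W > 0, Y, Xhat)                   # keep observed entries, impute the rest
    return X
```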

V Unique and Sparse NTD

Tucker decompositions are often criticized for two major disadvantages: the curse of dimensionality and the lack of uniqueness. The former means that the size of the core tensor increases exponentially with respect to the order of the tensor, whereas the latter is due to the fact that unconstrained Tucker decompositions essentially estimate only a subspace of each mode. In this section, we discuss how nonnegativity, particularly when combined with sparsity, can help overcome these two limitations of Tucker decompositions. To the best of our knowledge, although several NTD algorithms have been developed [4], a theoretical analysis of the uniqueness of NTD is still missing.

V-A Uniqueness of NTD

The following notation will be used in the uniqueness analysis:

Nonnegative rank: The nonnegative rank of a nonnegative matrix, denoted by $\operatorname{rank}_+(\cdot)$, is the minimal number of nonnegative rank-one matrices whose sum equals the matrix. Obviously, $\operatorname{rank}(\mathbf{A}) \le \operatorname{rank}_+(\mathbf{A})$.

Multilinear rank and nonnegative multilinear rank: The vector $(R_1, R_2, \ldots, R_N)$ with $R_n = \operatorname{rank}(\mathbf{Y}_{(n)})$ is called the multilinear rank of the tensor $\underline{\mathbf{Y}}$. The vector $(R_1, R_2, \ldots, R_N)$ is called the nonnegative multilinear rank of a nonnegative tensor $\underline{\mathbf{Y}}$ if $R_n = \operatorname{rank}_+(\mathbf{Y}_{(n)})$ for all $n$.

Essential uniqueness: We call the NTD essentially unique if, for any other NTD of the same tensor, each factor matrix differs from the corresponding one only by a permutation matrix and a nonnegative diagonal (scaling) matrix. (On the basis of the relationship between NTD and NMF described in (7)–(8), the definition of the essential uniqueness of NMF can be obtained analogously.)

Below, we suppose that is the NTD of with the nonnegative multilinear rank , i.e., , . First, we have the following:

Proposition 4

For any , holds.

Proof:

Note that , where . If there exists such that , we simply let , , and = for all . Then, forms another NTD of with , which contradicts the assumption of .

Corollary 1

Let , and the NTD is essentially unique. If , then .

In Corollary 1, the condition means that the matrix in question is not a trivial one, i.e., not the product of a permutation matrix and a nonnegative scaling matrix. Following the proof of Proposition 4, the proof of Corollary 1 is straightforward.

Proposition 5

If the NTD is essentially unique, then is the unique NMF of matrix for all .

Proof:

Suppose that there exists and a non-trivial matrix such that is another NMF of ; then, and are two different NTDs of , with , = for , and . This contradicts the assumption that the NTD of is essentially unique.

Fig. 2: Evolution of the Fit values versus the iteration number for each algorithm in five runs. In each run, all of the algorithms started from the same initial settings. The data tensor was generated from a nonnegative Tucker model whose core tensor and factor matrices had entries drawn from i.i.d. exponential distributions; the entries of the additive noise were drawn from a standard Gaussian distribution with an SNR of 0 dB.

Proposition 6

For the population NTD , if has an essentially unique NMF with the positive rank equal to , then the population NTD of is essentially unique.

Proof:

As has an essentially unique NMF and , where , both and can actually be essentially uniquely estimated. Without loss of generality, we suppose that

where is an estimate of , is a permutation matrix, and is a nonnegative diagonal matrix. In other words,

(34)

where , and are permutation matrices and diagonal matrices such that and . Below, we only need to show that can be essentially uniquely estimated from , which in turn results in the essential uniqueness of .

Motivated by the method proposed in [36], we appropriately arrange the elements of and reshape it to form a tensor such that