 # On Tensor Train Rank Minimization: Statistical Efficiency and Scalable Algorithm

Tensor train (TT) decomposition provides a space-efficient representation for higher-order tensors. Despite its advantage, we face two crucial limitations when we apply the TT decomposition to machine learning problems: the lack of statistical theory and of scalable algorithms. In this paper, we address the limitations. First, we introduce a convex relaxation of the TT decomposition problem and derive its error bound for the tensor completion task. Next, we develop an alternating optimization method with a randomization technique, in which the time complexity is as efficient as the space complexity is. In experiments, we numerically confirm the derived bounds and empirically demonstrate the performance of our method with a real higher-order tensor.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Tensor decomposition is an essential tool for dealing with data represented as multidimensional arrays, or simply, tensors. Through tensor decomposition, we can determine latent factors of an input tensor in a low-dimensional multilinear space, which saves the storage cost and enables predicting missing elements. Note that, a different multilinear interaction among latent factors defines a different tensor decomposition model, which yields a ton of variations of tensor decomposition. For general purposes, however, either Tucker decomposition  or CANDECOMP/PARAFAC (CP) decomposition  model is commonly used.

In the past three years, an alternative tensor decomposition model, called tensor train (TT) decomposition  has actively been studied in machine learning communities for such as approximating the inference on a Markov random field 

, modeling supervised learning

[19, 24]

, analyzing restricted Boltzmann machine



, and compressing deep neural networks

. A key property is that, for higher-order tensors, TT decomposition provides more space-saving representation called TT format while preserving the representation power. Given an order- tensor (i.e., a -dimensional tensor), the space complexity of Tucker decomposition is exponential in , whereas that of TT decomposition is linear in . Further, on TT format, several mathematical operations including the basic linear algebra operations can be performed efficiently .

Despite its potential importance, we face two crucial limitations when applying this decomposition to a much wider class of machine learning problems. First, its statistical performance is unknown. In Tucker decomposition and its variants, many authors addressed the generalization error and derived statistical bounds (e.g. [28, 27]). For example, Tomioka et al. clarify the way in which using the convex relaxation of Tucker decomposition, the generalization error is affected by the rank (i.e., the dimensionalities of latent factors), dimension of an input, and number of observed elements. In contrast, such a relationship has not been studied for TT decomposition yet. Second, standard TT decomposition algorithms, such as alternating least squares (ALS) [6, 30]

, require a huge computational cost. The main bottleneck arises from the singular value decomposition (SVD) operation to an “unfolding” matrix, which is reshaped from the input tensor. The size of the unfolding matrix is huge and the computational cost grows exponentially in

.

In this paper, we tackle the above issues and present a scalable yet statistically-guaranteed TT decomposition method. We first introduce a convex relaxation of the TT decomposition problem and its optimization algorithm via the alternating direction method of multipliers (ADMM). Based on this, a statistical error bound for tensor completion is derived, which achieves the same statistical efficiency as the convex version of Tucker decomposition does. Next, because the ADMM algorithm is not sufficiently scalable, we develop an alternative method by using a randomization technique. At the expense of losing the global convergence property, the dependency of on the time complexity is reduced from exponential to quadratic. In addition, we show that a similar error bound is still guaranteed. In experiments, we numerically confirm the derived bounds and empirically demonstrate the performance of our method using a real higher-order tensor.

## 2 Preliminaries

### 2.1 Notation

Let be the space of order- tensors, where denotes the dimensionality of the -th mode for . For brevity, we define ; similarly, and

are defined. For a vector

, denotes the -th element of . Similarly, denotes the elements of a tensor . Let denote an -dimensional vector called the mode- fiber. For a vector , denotes the -norm and denotes the max norm. For tensors , an inner product is defined as and denotes the Frobenius norm. For a matrix , denotes the Schatten-1 norm, where is a -th singular value of .

### 2.2 Tensor Train Decomposition

Let us define a tuple of positive integers and an order- tensor for each . Here, we set . Then, TT decomposition represents each element of as follows:

 Xi1,…,iK=[G1]i1,:,:[G2]i2,:,:⋯[GK]iK,:,:. (1)

Note that is an matrix. We define as a set of the tensors, and let be a tensor whose elements are represented by as (1). The tuple controls the complexity of TT decomposition, and it is called a Tensor Train (TT) rank. Note that TT decomposition is universal, i.e., any tensor can be represented by TT decomposition with sufficiently large TT rank .

When we evaluate the computational complexity, we assume the shape of is roughly symmetric. That is, we assume there exist such that for and for .

### 2.3 Tensor Completion Problem

Suppose there exists a true tensor that is unknown, and a part of the elements of is observed with some noise. Let be a set of indexes of the observed elements and be the number of observations. Let be an -th element of for , and denote -th observation from with noise. We consider the following observation model:

 yi=[X∗]j(i)+ϵi, (2)

where

is i.i.d. noise with zero mean and variance

. For simplicity, we introduce observation vector , noise vector , and rearranging operator that randomly picks the elements of . Then, the model (2) is rewritten as follows:

 Y=X(X∗)+E.

The goal of tensor completion is to estimate the true tensor

from the observation vector

. Because the estimation problem is ill-posed, we need to restrict the degree of freedom of

, such as rank. Because the direct optimization of rank is difficult, its convex surrogation is alternatively used [2, 3, 11, 31, 22]. For tensor completion, the convex surrogation yields the following optimization problem [5, 14, 23, 26]:

 minX∈Θ[12n∥Y−X(X)∥2+λn∥X∥s∗], (3)

where is a convex subset of , is a regularization coefficient, and is the overlapped Schatten norm defined as . Here, is the -unfolding matrix defined by concatenating the mode- fibers of . The overlapped Schatten norm regularizes the rank of in terms of Tucker decomposition [16, 28]. Although the Tucker rank of is unknown in general, the convex optimization adjusts the rank depending on .

To solve the convex problem (3), the ADMM algorithm is often employed [1, 26, 28]. Since the overlapped Schatten norm is not differentiable, the ADMM algorithm avoids the differentiation of the regularization term by alternatively minimizing the augmented Lagrangian function iteratively.

## 3 Convex Formulation of TT Rank Minimization

To adopt TT decomposition to the convex optimization problem as (3), we need the convex surrogation of TT rank. For that purpose, we introduce the Schatten TT norm  as follows:

 ∥X∥s,T:=1K−1K−1∑k=1∥Qk(X)∥s:=1K−1K−1∑k=1∑jσj(Qk(X)), (4)

where is a reshaping operator that converts a tensor to a large matrix where the first modes are combined into the rows and the rest modes are combined into the columns. Oseledets et al. shows that the matrix rank of can bound the -th TT rank of , implying that the Schatten TT norm surrogates the sum of the TT rank. Putting the Schatten TT norm into (3), we obtain the following optimization problem:

 minX∈X[12n∥Y−X(X)∥2+λn∥X∥s,T]. (5)

To solve (5), we consider the augmented Lagrangian function , where is the vectorization of , is a reshaped matrices with size , and is a coefficient for constraints. Given initial points , the -th step of ADMM is written as follows:

 x(ℓ+1)=(˜ΩTY+nη1K−1K−1∑k=1(Vk(Z(ℓ)k)−α(ℓ)k))/(1+nηK), Z(ℓ+1)k=proxλn/η(V−1k(x(ℓ+1)+α(ℓ)k)),  k=1,…,K, α(ℓ+1)k=α(ℓ)k+(x(ℓ+1)−Vk(Z(ℓ+1)k)),  k=1,…,K.

Here, is an matrix that works as the inversion mapping of ; is a vectorizing operator of an matrix; is the shrinkage operation of the singular values as , where is the singular value decomposition of ;

is a hyperparameter for a step size. We stop the iteration when the convergence criterion is satisfied (e.g. as suggested by Tomioka

et al.). Since the Schatten TT norm (4) is convex, the sequence of the variables of ADMM is guaranteed to converge to the optimal solution (Theorem 5.1, ). We refer to this algorithm as TT-ADMM.

TT-ADMM requires huge resources in terms of both time and space. For the time complexity, the proximal operation of the Schatten TT norm, namely the SVD thresholding of , yields the dominant, which is time. For the space complexity, we have variables of size , which requires space.

## 4 Alternating Minimization with Randomization

We first consider reducing the space complexity of the ADMM approach. The idea is simple: we only maintain the TT format of the input tensor rather than the input tensor itself. This leads the following optimization problem:

 minG[12n∥Y−X(X(G))∥2+λn∥X(G)∥s,T]. (6)

Remember that is the set of TT components and is the tensor given by the TT format with . Now we only need to store the TT components , which drastically improves the space efficiency.

### 4.1 Randomized TT Schatten norm

Next, we approximate the optimization of the Schatten TT norm. To avoid the computation of exponentially large-scale SVDs in the Schatten TT norm, we employ a technique called the “very sparse random projection” 

. The main idea is that, if the size of a matrix is sufficiently larger than its rank, then its singular values (and vectors) are well preserved even after the projection by a sparse random matrix. This motivates us to use the Schatten TT norm over the random projection.

Preliminary, we introduce tensors for the random projection. Let be the size of the matrix after projection. For each and parameters, let be a tensor whose elements are independently and identically distributed as follows:

 [Πk,1]d1,i1,…,ik=⎧⎪⎨⎪⎩+√s/d1~{}with probability~{}1/2s,0~{}with probability~{}1−1/s,−√s/d1~{}with probability~{}1/2s, (7)

for and . Here, is a hyperparameter controlling sparsity. Similarly, we introduce a tensor that is defined in the same way as . With and , let be a random projection operator whose element is defined as follows:

 [Pk(X)]d1,d2 =I1∑j1=1⋯IK∑jK=1[Πk,1]d1,j1,…,jk[X]j1,…,jK[Πk,2]d2,jk+1,…,jK. (8)

Note that we can compute the above projection by using the facts that has the TT format and the projection matrices are sparse. Let be a set of indexes of non-zero elements of . Then, using the TT representation of , (8) is rewritten as

 [Pk(X(G))]d1,d2= ∑(j1,…,jk)∈π(k)1[Πk,1]d1,j1,…,jk[G1]j1⋯[Gk]jk ∑(jk+1,…,jK)∈π(k)2[Gk]jk+1⋯[GK]jK[Πk,2]d2,jk+1,…,jK,

If the projection matrices have only nonzero elements (i.e., ), the computational cost of the above equation is .

The next theorem guarantees that the Schatten-1 norm of approximates the original one.

###### Theorem 1.

Suppose has TT rank . Consider the reshaping operator in (4), and the random operator as (8) with tensors and defined as (7). If , and all the singular vectors of are well-spread as , we have

 1−ϵRk∥Qk(X)∥s≤∥Pk(X)∥s≤(1+ϵ)∥Qk(X)∥s,

with probability at least .

Note that the well-spread condition can be seen as a stronger version of the incoherence assumption which will be discussed later.

### 4.2 Alternating Minimization

Note that the new problem (6) is non-convex because does not form a convex set on . However, if we fix except for , it becomes convex with respect to . Combining with the random projection, we obtain the following minimization problem:

 minGk[12n∥Y−X(X(G))∥2+λnK−1K−1∑k′=1∥Pk′(X(G))∥s]. (9)

We solve this by the ADMM method for each . Let be the vectorization of , and be a matrix for the randomly projected matrix. The augmented Lagrangian function is then given by , where are the Lagrange multipliers. Starting from initial points , the -th ADMM step is written as follows:

 g(ℓ+1)k=(ΩTΩ/n+ηK−1∑k′=1ΓTk′Γk′)−1(ΩTY/n+1K−1K−1∑k′=1ΓTk′(η˜Vk(W(ℓ)k′)−β(ℓ)k′)), W(ℓ+1)k′=proxλn/η(˜V−1k(Γk′g(ℓ+1)k+β(ℓ)k′)),  k′=1,…,K−1, β(ℓ+1)k′=β(ℓ)k′+(Γk′g(ℓ+1)k−˜Vk(W(ℓ+1)k′)),  k′=1,…,K−1.

Here, is the matrix imitating the mapping , is a vectorizing operator of matrix, and is an matrix of the operator with respect to . Similarly to the convex approach, we iterate the ADMM steps until convergence. We refer to this algorithm as TT-RALS, where RALS stands for randomized ALS.

The time complexity of TT-RALS at the -th iteration is ; the details are deferred to Supplementary material. The space complexity is , where is for and is for the parameters.

## 5 Theoretical Analysis

In this section, we investigate how the TT rank and the number of observations affect to the estimation error. Note that all the proofs of this section are deferred to Supplementary material.

### 5.1 Convex Solution

To analyze the statistical error of the convex problem (5), we assume the incoherence of the reshaped version of .

###### Assumption 2.

(Incoherence Assumption) There exists such that a matrix has orthogonal singular vectors satisfying

 max1≤i≤I

with some . Here, and are linear projections onto spaces spanned by and ; is the natural basis.

Intuitively, the incoherence assumption requires that the singular vectors for the matrix are well separated. This type of assumption is commonly used in the matrix and tensor completion studies [2, 3, 31]. Under the incoherence assumption, the error rate of the solution of (5) is derived.

###### Theorem 3.

Let be a true tensor with TT rank , and let be the minimizer of (3). Suppose that and that Assumption 2 for some is satisfied. If

 n≥Cm′μ2k′max{I≤k′,Ik′<}Rk′log3max{I≤k′,Ik′<}

with a constant , then with probability at least and with a constant ,

 ∥ˆX−X∗∥F≤CXλnKK−1∑k=1√Rk.

Theorem 3 states that the bound for the statistical error gets larger as the TT rank increases. In other words, completing a tensor is relatively easy as long as the tensor has small TT rank. Also, when we set as increases, we can state the consistency of the minimizer.

The result of Theorem 3 is similar to that obtained from the studies on matrix completion [3, 16] and tensor completion with the Tucker decomposition or SVD [28, 31]. Note that, although Theorem 3 is for tensor completion, the result can easily be generalized to other settings such as the tensor recovery or the compressed sensing problems.

### 5.2 TT-RALS Solution

Prior to the analysis, let be the true TT components such that . For simplification, we assume that the elements of are normalized, i.e., , and an matrix reshaped from has a row rank.

In addition to the incoherence property (Assumption 2), we introduce an additional assumption on the initial point of the ALS iteration.

###### Assumption 4.

(Initial Point Assumption) Let be the initial point of the ALS iteration procedure. Then, there exists a finite constant that satisfies

 maxk∈{1,…,K}∥Ginitk−G∗k∥F≤Cγ.

Assumption 4 requires that the initial point is sufficiently close to the true solutions . Although the ALS method is not guaranteed to converge to the global optimum in general, Assumption 4 guarantees the convergence to the true solutions . Now we can evaluate the error rate of the solution obtained by TT-RALS.

###### Theorem 5.

Let be the true tensor generated by with TT rank , and be the solution of TT-RALS at the -th iteration. Further, suppose that Assumption 2 for some and Assumption 4 are satisfied, and suppose that Theorem (1) holds with for . Let be be some constants. If

 n≥Cmμ2k′Rk′max{I≤k′,Ik′<}log3max{I≤k′,Ik′<},

and the number of iterations satisfies , then with probability at least and for ,

 ∥X(ˆG)−X(G∗)∥F≤CA(1+ϵ)λnK−1∑k=1√Rk. (10)

Again, we can obtain the consistency of TT-RALS by setting as increases. Since the setting of corresponds to that of Theorem 3, the speed of convergence of TT-RALS in terms of is equivalent to the speed of TT-ADMM.

By comparing with the convex approach (Theorem 3), the error rate becomes slightly worse. Here, the term in (10) comes from the estimation by the alternating minimization, which linearly increases by . This is because there are optimization problems and their errors are accumulated to the final solution. The term in (10) comes from the random projection. The size of the error can be arbitrary small by controlling the parameters of the random projection and .

## 6 Related Work

To solve the tensor completion problem with TT decomposition, Wang et al. and Grasedyck et al. developed algorithms that iteratively solve minimization problems with respect to for each . Unfortunately, the adaptivity of the TT rank is not well discussed.  assumed that the TT rank is given. Grasedyck et al. proposed a grid search method. However, the TT rank is determined by a single parameter (i.e., ) and the search method lacks its generality. Furthermore, the scalability problem remains in both methods—they require more than space.

Phien et al.(2016)  proposed a convex optimization method using the Schatten TT norm. However, because they employed an alternating-type optimization method, the global convergence of their method is not guaranteed. Moreover, since they maintain directly and perform the reshape of several times, their method requires time.

Table 1 highlights the difference between the existing and our methods. We emphasize that our study is the first attempt to analyze the statistical performance of TT decomposition. In addition, TT-RALS is only the method that both time and space complexities do not grow exponentially in .

## 7 Experiments Figure 1: Synthetic data: the estimation error ∥ˆX−X∗∥F against SRR ∑k√Rk with the order-4 tensor (K=4) and the order-5 tensor (K=5). For each rank and λn, we measure the error by 10 trials with different random seeds, which affect both the missing pattern and the initial points.

### 7.1 Validation of Statistical Efficiency

Using synthetic data, we verify the theoretical bounds derived in Theorems 3 and 5. We first generate TT components ; each component is generated as where each element of

is sampled from the i.i.d. standard normal distribution. Then we generate

by following the generative model (2) with the observation ratio and the noise variance . We prepare two tensors of different size: an order- tensor of size and an order- tensor of size . At the order- tensor, the TT rank is set as where . At the order- tensor, the TT rank is set as where . For estimation, we set the size of and as , which is larger than the true TT rank. The regularization coefficient is selected from . The parameters for random projection are set as and .

Figure 1 shows the relation between the estimation error and the sum of root rank (SRR) . The result of TT-ADMM shows that the empirical errors are linearly related to SSR which is shown by the theoretical result. The result of TT-RALS roughly replicates the theoretical relationship.

### 7.2 Higher-Order Markov Chain for Electricity Data

We apply the proposed tensor completion methods for analyzing the electricity consumption data . The dataset contains time series measurements of household electric power consumption for every minutes from December 2006 to November 2010 and it contains over observations.

The higher-order Markov chain is a suitable method to represent long-term dependency, and it is a common tool of time-series analysis

. Let

be discrete-time random variables take values in a finite set

, and the order- Markov chain describes the conditional distribution of with given as . As increases, the conditional distribution of can include more information from the past observations.

We complete the missing values of -th Markov transition of the electricity dataset. We discretize the value of the dataset into values and set . Next, we empirically estimate the conditional distribution of size using observations. Then, we create by randomly selecting elements from the the conditional distribution and regarding the other elements as missing. After completion by the TT methods, the prediction error is measured. We select hyperparameters using a grid search with cross-validation.

Figure 2 compares the prediction error and the runtime. When , the rank adaptive methods achieve low estimation errors. As increases, however, all the methods except for TT-RALS suffers from the scalability issue. Indeed, at , only TT-RALS works and the others does not due to exhausting memory.

## 8 Conclusion

In this paper, we investigated TT decomposition from the statistical and computational viewpoints. To analyze its statistical performance, we formulated the convex tensor completion problem via the low-rank TT decomposition using the TT Schatten norm. In addition, because the optimization of the convex problem is infeasible, we developed an alternative algorithm called TT-RALS by combining with the ideas of random projection and alternating minimization. Based on this, we derived the error bounds of estimation for both the convex minimizer and the solution obtained by TT-RALS. The experiments supported our theoretical results and demonstrated the scalability of TT-RALS.

## Appendix A Proof of Theorem 1

The theorem is obtained immediately by combining Li et al.  and Mu et al. .

Let be a sparse random projection defined by

 Πij=⎧⎪⎨⎪⎩+√s/kprobability 1/2s,0probability 1−1/s,−√s/kprobability 1/2s. (11)

Then the following theorem holds.

###### Lemma 6 (Lemma 4 in ).

Let be a unit vector. Then and in law with the convergence rate

 |P(√k(Πu)i

Suppose that . Then

 |P(∥Πu∥22∈[1−ϵ,1+ϵ])−P(χ2k∈[k(1−ϵ),k(1+ϵ)])|≤e−kϵ2/4. (13)

By using

 P(χ2k∈[k(1−ϵ),k(1+ϵ)])≤2e−k(ϵ2−ϵ3)/4, (14)

we have

 P(∥Πu∥22∈[1−ϵ,1+ϵ])≤3e−k(ϵ2−ϵ3)/4. (15)

The preservation of norm implies the preservation of the Schatten- norm as follows.

###### Lemma 7 (Restatement of Theorem 1 in ).

Let be an matrix with rank . If and for all singular vectors of , we have

 √(1−ϵ)/r∥Z∥s≤∥ΠZ∥s≤√1+ϵ∥Z∥s. (16)

Now we prove Theorem 1. Let be a matrix obtained by reshaping tensor . Since is a TT of rank , the rank of matrix is at most . By applying Lemma 7 twice, we obtain

 1−ϵRk∥Z∥s≤∥ΠZ∥s≤(1+ϵ)∥Z∥s (17)

with some probability. If for all singular vectors of , the probability is at least since there are singular vectors.

## Appendix B Proof of Theorem 3

###### Proof.

Since is the minimizer of the optimization problem, we have the following basic inequality

 12n∥Y−X(ˆX)∥2+λnK−1∑k=1∥Qk(ˆX)∥s≤12n∥Y−X(ˆX∗)∥2+λnK−1∑k=1∥Qk(X∗)∥s.

Using the relation that

 ∥Y−X(ˆX)∥2 =∥(Y−X(X∗))−(X(ˆX)−X(X∗))∥2 =∥Y−X(X∗)∥2+∥X(ˆX)−X(X∗)∥2−2⟨Y−X(X∗),X(ˆX)−X(X∗)⟩,

we rewrite the basic inequality as

 12n∥X(ˆX)−X(X∗)∥2≤1n⟨X(ˆX)−X(X∗),E⟩+λnK−1∑k=1(∥Qk(X∗)∥s−∥Qk(ˆX)∥s).

Define the error . Applying the Hölder’s inequality, we have

 1n⟨X(Δ),E⟩=1n⟨Δ,X∗(E)⟩=1n1K−1K−1∑k=1⟨Qk(Δ),X∗(E)⟩ ≤1n1K−1K−1∑k=1∥E∥∞∥Qk(Δ)∥s≤λnK−1K−1∑k=1∥Qk(Δ)∥s,

where is an adjoint operator of and the last inequality holds by the setting of . Also, the triangle inequality and the linearity of yield

 ∥Qk(X∗)∥s−∥Qk(ˆX)∥s≤∥Qk(Δ)∥s.

Then, we bound the inequality as

 12n∥X(Δ)∥2 ≤λnK−1K−1∑k=1∥Qk(Δ)∥s+λnK−1K−1∑k=1∥Qk(Δ)∥s=2λnK−1K−1∑k=1∥Qk(Δ)∥s.

To bound , we apply the result of Lemma 1 in  and Lemma 2 in . Along with proof of the lemmas, we obtain the property that a rank of is bounded by , thus the Cauchy-Schwartz inequality implies

 ∥Qk(Δ)∥s≤√2Rk∥Qk(Δ)∥