# Legendre Tensor Decomposition

We present a novel nonnegative tensor decomposition method, called Legendre decomposition, which factorizes an input tensor into a multiplicative combination of parameters. Thanks to the well-developed theory of information geometry, the reconstructed tensor is unique and always minimizes the KL divergence from an input tensor. We empirically show that Legendre decomposition can more accurately reconstruct tensors than nonnegative CP and Tucker decompositions.

• 20 publications
• 3 publications
• 13 publications
04/12/2021

### Characterization of Decomposition of Matrix Multiplication Tensors

In this paper, the canonical polyadic (CP) decomposition of tensors that...
10/27/2021

### Nonnegative Tucker Decomposition with Beta-divergence for Music Structure Analysis of audio signals

Nonnegative Tucker Decomposition (NTD), a tensor decomposition model, ha...
10/12/2020

### Graph Regularized Nonnegative Tensor Ring Decomposition for Multiway Representation Learning

Tensor ring (TR) decomposition is a powerful tool for exploiting the low...
01/12/2018

### A unifying Perron-Frobenius theorem for nonnegative tensors via multi-homogeneous maps

Inspired by the definition of symmetric decomposition, we introduce the ...
12/31/2021

### Computing the dominant eigenpair of an essentially nonnegative tensor via a homotopy method

The theory of eigenvalues and eigenvectors is one of the fundamental and...
05/22/2021

### Tensor products of coherent configurations

A Cartesian decomposition of a coherent configuration X is defined as a ...
01/02/2020

### On Large-Scale Dynamic Topic Modeling with Nonnegative CP Tensor Decomposition

There is currently an unprecedented demand for large-scale temporal data...

## 1 Introduction

Matrix and tensor decomposition is a fundamental technique in machine learning to analyze data represented in the form of multi-dimensional arrays, which is used in a wide range of applications such as computer vision

(Vasilescu and Terzopoulos, 2002, 2007), recommender systems (Symeonidis, 2016), signal processing (Cichocki et al., 2015), and neuroscience (Beckmann and Smith, 2005). The current standard approaches include NMF (nonnegative matrix factorization) (Lee and Seung, 1999, 2001) for matrices and CANDECOMP/PARAFAC (CP) decomposition (Harshman, 1970) or Tucker decomposition (Tucker, 1966) for tensors. CP decomposition compresses an input tensor into a sum of rank-one components and Tucker decomposition approximates an input tensor by a core tensor multiplied by matrices. To date, matrix and tensor decomposition has been extensively analyzed and there are a number of variations of such decompositions (Kolda and Bader, 2009), where the common goal is to approximate an given tensor by a smaller number of components, or parameters, in an efficient manner.

Despite the recent advances of decomposition techniques, a statistical theory that can systematically define decompositions for any

order tensors including vectors and matrices is still under development. Moreover, it is well known that CP and Tucker tensor decompositions include non-convex optimization and the global convergence is not guaranteed. Although there are a number of extensions to transform the problem into convex

(Liu et al., 2013; Tomioka and Suzuki, 2013)

, one needs additional assumptions on data, such as a bounded variance.

Here we present a new paradigm of matrix and tensor decomposition, called Legendre decomposition, based on the information geometry (Amari, 2016)

, which solves the above open problems of matrix and tensor decompositions. In our formulation, an arbitrary order tensor is treated as a probability vector, or a discrete probability distribution, in a statistical manifold as long as it is nonnegative, and Legendre decomposition is realized as a

projection of the input tensor onto a submanifold of reconstructable tensors. The key to introduce the formulation is to use the partial order (Davey and Priestley, 2002; Gierz et al., 2003) of indices, which allows us to treat any order tensors as a probability distribution in the information geometric framework.

The well-developed theory of information geometry leads to the following remarkable properties of our formulation: (1) the reconstructed tensor is always unique and is the minimizer of the Kullback–Leibler (KL) divergence from the original input tensor; (2) decomposition can be formulated as convex optimization, hence we can directly use gradient descent that always guarantees the global convergence, and it can be further speeded up by natural gradient as demonstrated in our experiments; (3) our formulation is flexible as it can decompose partial tensors with removing arbitrary entries beforehand, which are for examples zero or missing entries. Moreover, our formulation has a close relationship with statistical models, and can be interpreted as an extension of Boltzmann machine learning (Ackley et al., 1985)

. This gives a new insight into the relationship between tensor decomposition and graphical models including Boltzmann machines

(Chen et al., 2018; Yılmaz et al., 2011; Yılmaz and Cemgil, 2012). In addition, we show that the proposed formulation belongs to the exponential family, where parameters used in our decomposition are natural parameters and used as constraints of decomposition are expectations of the exponential family.

The reminder of this paper is organized as follows. We first introduce the process of the Legendre decomposition in Section 2. We formulate the decomposition in Section 2.1, describe algorithms in Section 2.2, and discuss the relationship with other statistical models in Section 2.3. Then we present theoretical support of Legendre decomposition in Section 3. We empirically examine performance of our method in Section 4, and summarize our contribution in Section 5.

## 2 The Legendre Decomposition

We introduce Legendre decomposition for matrices, which will be generalized to tensor decomposition later in this section (see Figure 1 for the overview of our method).

### 2.1 Formulation

Given a nonnegative matrix . To treat as a probability distribution in our formulation, we normalize by dividing each entry by the sum of all entries, yielding . In the following, we always work with a normalized matrix . We denote by .

We formulate Legendre decomposition, which always find the best approximation of . Our strategy is to choose an index set as a decomposition basis, where we assume for a technical reason, and approximate the normalized matrix by a multiplicative combination of parameters associated with . For example, one can choose or to decompose into a pair of vectors composed of the first row and the first column. Mathematically, we define that a matrix is fully decomposable with if each entry with is represented in the form of

 yij=   ∏\mathclap(k,l)∈↓(i,j)  exp(θkl),↓(i,j)=\Set(k,l)∈Bk≤i and l≤j, (1)

using parameters for each . The domain of indices corresponds to the outcome space if is viewed as a discrete probability distribution and should be satisfied. While setting is natural, our formulation allows us to use any subset as long as holds. Hence, for example, we can remove unnecessary indices such as missing or zero entries of from .

Our method finds the fully decomposable matrix with a basis that satisfies the condition

 ηij=^ηij∀(i,j)∈B, (2)

where the constraint variable for a matrix is defined as

 ηij= ∑\mathclap(k,l)∈↑(i,j)qkl,↑(i,j)=\Set(k,l)∈Ωk≥i and l≥j

and is obtained by replacing with in the above equation, which is determined by the input matrix . For better understanding, we illustrate and in Figure 2. Since the parameter and the constraint are connected via the Legendre transformation, which is the key theoretical property to our method and will be introduced in Equation (6) in Section 3, we call our decomposition method Legendre decomposition. The only requirement on is that for all , which can be easily satisfied if we choose a basis such that with the set of indices of nonzero entries of .

Thanks to the dually flat structure of the set of normalized matrices shown in Section 3, for any input matrix , the fully decomposable matrix satisfying Equation (2) has the following three desirable theoretical properties:

1. always exists,

2. is unique, and

3. is the best approximation in the sense of the Kullback–Leibler (KL) divergence, that is,

 Q=argminR∈SBDKL(P,R),SB=\SetR∈Rm×n≥0R is fully decomposable with B.

Note that each matrix is treated as a probability distribution with the outcome space to define the KL divergence, i.e.,

 DKL(P,Q)=∑(i,j)∈Ωxijlogxijyij.
###### Example 1.

There are two extreme cases for a choice of a basis : If is the singleton , a fully decomposable is always uniform, that is, for all . In contrast, if , any input itself becomes decomposable, hence it always follows that for all , where there is no dimension reduction.

The Legendre decomposition can be viewed as low rank approximation of not an input matrix but its entry-wise logarithm . This is why the rank of with the fully decomposable matrix coincides with the parameter matrix such that if and otherwise, which is obtained by filling zeros to entries not in . Then we have

 logqij=∑\mathclap(k,l)∈↓(i,j)tkl,

meaning that the rank of coincides with . Therefore if we use a decomposition basis which includes only rows (or columns), always holds.

The straightforward generalization from matrices to multi-dimensional arrays, or tensors, is immediately possible. Given a nonnegative th-order tensor . To simplify the notation, we write the entry by with . For a decomposition basis with , we introduce a fully decomposable tensor represented as

 qv=∏u∈↓vexp(θu),↓v=\Setu∈Bu≤v, (3)

with parameters , where means that , , , for and . Let is the set of fully decomposable tensors with . Legendre decomposition finds the tensor such that

 ηv=^ηv,∀v∈B, (4)

where is given as

 ηv=∑u∈↑vqu,↑v=\Setu∈Ωu≥v (5)

and is obtained by replacing with in the above equation. Equivalent to the matrix decomposition case, the tensor satisfying Equation (4) always exists and is unique, and minimizes the KL divergence, that is,

 Q=argminR∈SBDKL(P,R).

The rank of coincides with the parameter tensor such that if and otherwise.

### 2.2 Algorithm

Here we present our two gradient-based continuous optimization algorithms to solve the above KL divergence minimization. Since the KL divergence is convex with respect to , the standard gradient descent shown in Algorithm 1 can always find the global optimum, where is a learning rate. Starting with some initial parameter , the algorithm iteratively updates for each until convergence. The gradient of is obtained as

 ∂∂θvDKL(P,Q)=^ηv−ηv,

which will be proved in Equation (7). The time complexity of each iteration is as that of computing from (line 1 in Algorithm 1) is and computing from (line 1 in Algorithm 1) is . Thus the total complexity is with the number of iterations until convergence.

Although gradient descent is an efficient approach, in Legendre decomposition, we need to repeat “decoding” from and “encoding” to in each iteration, which may lead to the loss of efficiency if the number of iterations is large. To reduce the number of iterations to gain more efficiency, we propose to use natural gradient (Amari, 1998), which is the second order optimization method, shown in Algorithm 2. Again, since the KL divergence is convex with respect to , natural gradient can always find the global optimum. This is why our natural gradient algorithm is an instance of the Bregman algorithm onto a convex region, which is well known to always converge to the global solution (Censor and Lent, 1981). Let , , and . In each update of the current to , the natural gradient method uses the relationship:

 Δη=−GΔθ,{Δη=η−^η,Δθ=θnext−θ

which leads to the update formula

 θnext=θ−G−1Δη,

where is the Fisher information matrix given in Equation (8). Note that natural gradient coincides with Newton’s method in our case as the Fisher information matrix corresponds to the (negative) Hessian matrix as

 ∂2∂θvi∂θvjDKL(P,Q)=−∂ηvi∂θvj=−gij.

The time complexity of each iteration is , where is needed to compute from and to compute the inverse of , resulting in the total complexity with the number of iterations until convergence.

### 2.3 Relationship to Statistical Models

We demonstrate interesting relationships with Legendre decomposition and statistical models, including the exponential family and the Boltzmann (Gibbs) distributions, and show that our decomposition method can be viewed as generalization of Boltzmann machine learning (Ackley et al., 1985). Although the connection between tensor decomposition and graphical model has been analyzed by Chen et al. (2018); Yılmaz et al. (2011); Yılmaz and Cemgil (2012), our analysis adds a new insight as we focus on not the graphical model itself but the outcome space of distributions generated by the model.

#### 2.3.1 Exponential family

We show that the set of normalized tensors

 S=\SetP∈RI1×I2×⋯×IN>0∑v∈Ωpv=1,

is included in the exponential family. Note that if , where all tensors are fully decomposable. Since the exponential family is defined as

 p(x,θ)=exp(∑θiki(x)+r(x)−A(θ))

for natural parameters , our model in Equation (3) is clearly in the exponential family with

 pv =∏u∈↓vexp(θu)=exp(∑u∈Ωθ(u)ku(v)+ψ(θ)),

where

 ψ(θ)=−θ(1,…,1),ku(v) ={1if (1,…,1)

and . Thus used in our method corresponds to natural parameters of the exponential family. Moreover, we can obtain by taking the expectation of :

 E[ku(v)]=∑v∈Ωku(v)pv=∑v∈↑upv=ηu,

Thus Legendre decomposition of is understood to finding of fully decomposable that has the same expectation with with respect to a basis .

#### 2.3.2 Boltzmann Machines

A Boltzmann machine is represented as an undirected graph with a vertex set and an edge set , where we assume that without loss of generality. A Boltzmann machine defines a probability distribution , where each probability of an -dimensional binary vector is given as

 p(x)=1Z(θ)∏\mathclapi∈Vexp(θixi)∏\mathclap{i,j}∈Eexp(θijxixj),

where is a bias, is a weight, and is a partition function.

To translate a Boltzmann machine into our formulation, let and suppose that

 B(V) =\Set(ia1,…,iaN)∈Ωa∈V, ial={2if l=a,1otherwise, B(E) =\Set(iab1,…,iabN)∈Ω{a,b}∈E, iabl={2if l∈{a,b},1otherwise.

Then it is clear that the set of probability distributions that can be represented by the Boltzmann machine is exactly the same as with and , that is, the set of fully decomposable th-order tensors defined by Equation (3) with the basis . Moreover, let a given th-order tensor

be an empirical distribution estimated from data, where each

is the probability for a binary vector . The tensor obtained by Legendre decomposition with coincides with the distribution learned by the Boltzmann machine . The condition in Equation (4) corresponds to the well-known learning equation of Boltzmann machines, where and correspond to the expectation of the data and model distributions, respectively.

Therefore our Legendre decomposition is a generalization of Boltzmann machine learning in the following three aspects:

1. The domain is not limited to binary but can be ordinal, i.e., is extended to for any .

2. The basis with which parameters are associated is not limited to but any subset of , which means that higher-order interactions (Sejnowski, 1986) can be included.

3. The outcome space of probability distributions is not limited to but can be any subset as long as , which enables us to perform efficient computation with removing unnecessary entries such as missing values.

Hidden variables are often used in Boltzmann machines to increase the representation power such as restricted Boltzmann machines (RBMs)

(Smolensky, 1986; Hinton, 2002) and deep Boltzmann machines (DBMs) (Salakhutdinov and Hinton, 2009, 2012). In our Legendre decomposition, including a hidden variable corresponds to including an additional dimension. Hence if we include hidden variables, the fully decomposition tensor has the order of . This is an interesting extension to our method and a ongoing research topic, but it is not a focus of this paper since our main aim is to find a lower dimensional representation of a given tensor .

## 3 Theory on Legendre Decomposition

To theoretically analyze our proposed Legendre decomposition, we rewrite it using information geometry and prove its properties. In particular, we use the log-linear model proposed in (Sugiyama et al., 2017), which is an extension of information geometry on hierarchy of distributions (Amari, 2001; Nakahara and Amari, 2002). Since it has not been applied to tensor decomposition yet, this paper is the first to apply the log-linear model to tensor decomposition.

### 3.1 Information Geometry on Log-Linear Model

Let

be a discrete distribution of a random variable

with the possible outcome space , in which probability for each and . The outcome space is required to be a poset (partially ordered set), where a partial order satisfies the following three properties for all : (1) (reflexivity), (2) , (antisymmetry), and (3) , (transitivity). We assume that the least element (bottom) always exists, where for all . We denote by .

On the poset , two functions of the zeta function and the Möbius function play the central role (Rota, 1964)(Kung et al., 2009, Chapter 3.1). The zeta function is defined as if and otherwise, and the Möbius function is its convolutional inverse, that is,

 ∑v∈Ωζ(u,v)μ(v,w)=∑v∈Ωμ(u,v)ζ(v,w)=δ(u,w),

where is the Kronecker delta such that if and otherwise. The Möbius function can be explicitly defined in an inductive manner as if , , and otherwise.

In Legendre decomposition for th-order tensors, it is clear that the outcome space is a poset with respect to the order between indices and . Hence each distribution directly corresponds to a (normalized) tensor in our formulation.

The log-linear model on a poset introduced by Sugiyama et al. (2017) is defined as

 logpv=∑u∈Ωζ(u,v)θu,

which corresponds to the fully decomposable tensor in Equation (3) with as , and is dually given as

 θv=∑u∈Ωμ(u,v)logpu.

Moreover, is introduced as

 ηv=∑u∈Ωζ(v,u)pu=∑u∈↑vpu,pv=∑u∈Ωμ(v,u)ηu,

corresponding to the constraint variables in Equation (5).

The set of distributions always becomes the dually flat statistical manifold (Sugiyama et al., 2017), which is the central structure in information geometry (Amari, 2016, Chapter 6), where two parameters and are dual coordinate systems of connected with the Legendre transformation, that is,

 θ=∇φ(η),η=∇ψ(θ) (6)

with two convex functions

 ψ(θ)=−θ⊥=−logp⊥,φ(η)=∑v∈Ωpvlogpv,

and and are orthogonal:

 E[∂logpw∂θu∂logpw∂ηv]=δ(u,v).

Thus it follows that

 ∂∂θvDKL(P,Q)=∂∂θv∑u∈Ωpulogqu=∂∂θv∑u∈Ωpu∑\mathclapw∈↓u∖{⊥}θw−∂∂θv∑u∈Ωpuψ(θ)=^ηv−ηv. (7)

Moreover, the Riemannian metric such that

 g(θ)=∇∇ψ(θ),g(η)=∇∇φ(η)

is the gradient of or and coincides with the Fisher information matrix, which is analytically obtained as

 guv(θ) =∂ηu∂θv=E[∂logpw∂θu∂logpw∂θv]=∑w∈Ωζ(u,w)ζ(v,w)pw−ηuηv, (8) guv(η) =∂θu∂ηv=E[∂logpw∂ηu∂logpu∂ηv]=∑w∈Ωμ(w,u)μ(w,v)p−1w (9)

for all . We can use the gradient in our natural gradient method (Algorithm 2).

Let us consider the following two submanifolds of :

 ST =\SetP∈Sθv=Tv for all v∈ΩT, SH =\SetP∈Sηv=Hv for all v∈ΩH

specified by two constraints with . The former submanifold has constraints on the coordinate while the latter has those on . It is known in information geometry that is -flat and is -flat, respectively (Amari, 2016, Chapter 2.4). If and are satisfied, the intersection is always a singleton , which means that the distribution satisfying and always uniquely exists. Then the Pythagorean theorem holds with respect to the KL divergence: For any and , we have

 DKL(P,R) =DKL(P,Q)+DKL(Q,R). (10)

### 3.2 Legendre Decomposition as Projection

Here we newly formulate our Legendre decomposition as projection of a tensor onto a submanifold. Let be the submanifold such that

 SB=\SetQ∈Sθv=0 for all v∈Ω∖B,

which coincides with the set of fully decomposable tensors and is an -flat submanifold as it has constraints on the coordinate. Given a (normalized) tensor , the fully decomposable tensor is Legendre decomposition of if and only if satisfies the following conditions.

 {θv=0if v∈Ω∖B,ηv=^ηvif v∈B, (11)

where is determined by . The first condition ensures that is fully decomposable and the second condition is the requirement in Equation (4).

Since the second condition corresponds to the -flat submanifold:

 S′B=\SetQ∈Sηv=^ηv for % all v∈B,

the Legendre decomposition can be interpreted as finding the intersection given . This is known as -projection of onto the submanifold in information geometry (Figure 3). The dually flat structure of with the dual coordinate systems and guarantees that always exists and is unique, and is the minimizer of the KL divergence (Amari, 2009, Theorem 3):

 Q=argminR∈SBDKL(P,R).

In contrast, if some fully decomposable tensor is given, finding the tensor satisfying Equation (11) is called -projection of onto the submanifold . In practice, Algorithm 1 and Algorithm 2 perform this -projection because the number of parameters to be optimized is in -projection while in -projection, and usually holds. Global convergence of the proposed algorithms is always guaranteed by the convexity of submanifolds and .

While we assume that a given tensor in the above projection, happens if is not included by the set of indices of nonzero entries. Nevertheless, our Legendre decomposition still valid if for all is satisfied, and it can obtain the unique that minimizes the KL divergence from . This is why, for any , there exists such that and for all , which means that also holds for the tensor obtained by the -projection of onto .

## 4 Experiments

We empirically examine the efficiency and the effectiveness of Legendre decomposition using synthetic and real-world datasets. We used Amazon Linux AMI release 2017.09 and ran all experiments on 2.3 GHz Intel Xeon CPU E7-8880 v3 and 256 GB of memory. The Legendre decomposition was implemented in C++ and compiled with icpc 18.0.0.

Throughout the experiments, we use the decomposition basis in the form of

 B=B1∪B2,{B1={(i,j,1)∣i≤l or j≤l},B2={(i,1,k)∣i≤l or k≤l},

which is determined by the single number , whose shape is illustrated in Figure 1. The cardinality corresponds to the number of parameters used in the decomposition. We use to vary the number of parameters for decomposition.

### 4.1 Results on Synthetic Data

First we compare our two algorithms, gradient descent in Algorithm 1 and natural gradient in Algorithm 2, to evaluate the efficiency of these optimization algorithms. We randomly generate a third order tensor with the size

from the uniform distribution and obtain the running time and the number of iterations with increasing the number of parameters

. In Algorithm 2, we take the outer loop (from line 1 to 1) as one iteration for fair comparison and fix the learning rate .

Results are plotted in Figure 4, which clearly show that natural gradient is dramatically faster than gradient descent. When the number of parameters is around , natural gradient is more than six orders of magnitude faster than gradient descent. The speed up comes from reduction of iterations. Natural gradient requires only two or three iterations until convergence in all cases, while gradient descent requires more than iterations to get the same result. In the following, we consistently use natural gradient.

### 4.2 Results on Real Data

Next we examine the effectiveness of Legendre decomposition on real-world datasets of third-order tensors. We compare Legendre decomposition with two standard nonnegative tensor decomposition techniques, nonnegative Tucker decomposition (Kim and Choi, 2007) and nonnegative CANDECOMP/PARAFAC (CP) decomposition (Shashua and Hazan, 2005), to examine the effectiveness of tensor decomposition. We use the TensorLy implementation (Kossaifi et al., 2016) for nonnegative Tucker and CP decompositions. In nonnegative Tucker decomposition, we always employ rank- Tucker decomposition with the single number and we use rank- decomposition in nonnegative CP decomposition. Thus rank- Tucker decomposition has parameters and rank- CP decomposition has parameters. We evaluate the quality of decomposition by the RMSE (Root Mean Squared Error) between the input and the reconstructed tensors.

First we examine Legendre decomposition and two competing methods on the face image dataset111This dataset is originally distributed at http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html and also available from the R rTensor package (https://CRAN.R-project.org/package=rTensor).. We pick up the first entry from the fourth mode (corresponds to lighting) from the dataset, resulting in a single third-order tensor with the size of , where the first two modes correspond to image pixels and the third mode to individuals. For every decomposition method, we gradually increase and check the performance in terms of RMSE.

Results are plotted in Figure 5. It clearly shows that Legendre decomposition is superior to competing methods if the number of parameters is small (up to ), and is competitive with nonnegative Tucker decomposition for larger number of parameters. The reason might be that Legendre decomposition always offers the optimal solution in terms of the KL divergence.

Next, we use the MNIST dataset (LeCun et al., 1998)

, which consists of a collection of images of handwritten digits and used as the standard benchmark datasets in a number of recent studies such as deep learning. We pick up the first

images for each digit, resulting in third-order tensors with the size of , where the first two modes correspond to image pixels. Again, for every decomposition method, we vary the number of parameters by increasing and evaluate the performance in terms of RMSE.

Results are plotted in Figure 6. For all digits except for “1”, Legendre decomposition always shows the best scores, and it has also the best score for “1” if the number of parameters is smaller than . This is why our method can obtain the best approximation of the given tensor in terms of the KL divergence. In contrast, nonnegative CP decomposition shows always the worst score. The reason might be that it always decomposes to the sum of rank-1 tensors and cannot treat different rank tensors.

## 5 Conclusion

In this paper, we have proposed Legendre decomposition with incorporating the tensor structure into information geometry. A given tensor is converted into two parameters and via the Legendre transformation, and the optimization is performed in the parameter space instead of directly treating tensors. We have theoretically showed the desired properties of the Legendre decomposition, namely, the well-definedness, the uniqueness, and the optimality, which always minimizes the KL divergence from the input tensor. We have also showed the connection between Legendre decomposition and Boltzmann machine learning.

We have experimentally showed that Legendre decomposition can more accurately reconstruct input tensors than two standard tensor decomposition methods, nonnegative Tucker decomposition and nonnegative CP decomposition, with the same number of parameters. Since the shape of the decomposition basis is arbitrary, Legendre decomposition still has a potential to achieve more accurate decomposition. For example, one can incorporate the domain knowledge to set such that specific entries of the input tensor are known to dominate the other entries.

Our work opens the door to both of further theoretical investigation of information geometric algorithms for tensor analysis and a number of practical applications such as missing value imputation of tensors and assessing statistical significance of parameters in decomposition.

## References

• Ackley et al. (1985) D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
• Amari (1998) S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
• Amari (2001) S. Amari. Information geometry on hierarchy of probability distributions. IEEE Transactions on Information Theory, 47(5):1701–1711, 2001.
• Amari (2009) S. Amari. Information Geometry and Its Applications: Convex Function and Dually Flat Manifold, pages 75–102. Springer, 2009.
• Amari (2016) S. Amari. Information Geometry and Its Applications. Springer, 2016.
• Beckmann and Smith (2005) C. F. Beckmann and S. M. Smith.

Tensorial extensions of independent component analysis for multisubject FMRI analysis.

NeuroImage, 25(1):294–311, 2005.
• Censor and Lent (1981) Y. Censor and A. Lent. An iterative row-action method for interval convex programming. Journal of Optimization Theory and Applications, 34(3):321–353, 1981.
• Chen et al. (2018) J. Chen, S. Cheng, H. Xie, L. Wang, and T. Xiang. Equivalence of restricted Boltzmann machines and tensor network states. Physical Review B, 97:085104, Feb 2018.
• Cichocki et al. (2015) A. Cichocki, D. Mandic, L. De Lathauwer, G. Zhou, Q. Zhao, C. Caiafa, and H. A. Phan. Tensor decompositions for signal processing applications: From two-way to multiway component analysis. IEEE Signal Processing Magazine, 32(2):145–163, 2015.
• Davey and Priestley (2002) B. A. Davey and H. A. Priestley. Introduction to Lattices and Order. Cambridge University Press, 2 edition, 2002.
• Gierz et al. (2003) G. Gierz, K. H. Hofmann, K. Keimel, J. D. Lawson, M. Mislove, and D. S. Scott. Continuous Lattices and Domains. Cambridge University Press, 2003.
• Harshman (1970) R. A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multi-modal factor analysis. Technical report, UCLA Working Papers in Phonetics, 1970.
• Hinton (2002) G. E. Hinton.

Training products of experts by minimizing contrastive divergence.

Neural Computation, 14(8):1771–1800, 2002.
• Kim and Choi (2007) Y. D. Kim and S. Choi. Nonnegative Tucker decomposition. In

2007 IEEE Conference on Computer Vision and Pattern Recognition

, pages 1–8, June 2007.
• Kolda and Bader (2009) T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
• Kossaifi et al. (2016) J. Kossaifi, Y. Panagakis, and M. Pantic. TensorLy: Tensor learning in Python. arXiv:1610.09555, 2016.
• Kung et al. (2009) J. P. S. Kung, G.-C. Rota, and C. H. Yan. Combinatorics: The Rota Way. Cambridge University Press, 2009.
• LeCun et al. (1998) Y. LeCun, C. Cortes, and C. J. C. Burges.

The MNIST database of handwritten digits, 1998.

• Lee and Seung (1999) D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
• Lee and Seung (2001) D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information processing Systems 13, pages 556–562, 2001.
• Liu et al. (2013) J. Liu, P. Musialski, P. Wonka, and J. Ye. Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):208–220, 2013.
• Nakahara and Amari (2002) H. Nakahara and S. Amari. Information-geometric measure for neural spikes. Neural Computation, 14(10):2269–2316, 2002.
• Rota (1964) G.-C. Rota. On the foundations of combinatorial theory I: Theory of Möbius functions. Z. Wahrseheinlichkeitstheorie, 2:340–368, 1964.
• Salakhutdinov and Hinton (2009) R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In

Proceedings of the 12th International Conference on Artificial Intelligence and Statistics

, pages 448–455, 2009.
• Salakhutdinov and Hinton (2012) R. Salakhutdinov and G. E. Hinton. An efficient learning procedure for deep Boltzmann machines. Neural Computation, 24(8):1967–2006, 2012.
• Sejnowski (1986) T. J. Sejnowski. Higher-order Boltzmann machines. In AIP Conference Proceedings, volume 151, pages 398–403, 1986.
• Shashua and Hazan (2005) A. Shashua and T. Hazan. Non-negative tensor factorization with applications to statistics and computer vision. In Proceedings of the 22nd International Conference on Machine Learning, pages 792–799, 2005.
• Smolensky (1986) P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart, J. L. McClelland, and PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, pages 194–281. MIT Press, 1986.
• Sugiyama et al. (2017) M. Sugiyama, H. Nakahara, and K. Tsuda. Tensor balancing on statistical manifold. In Proceedings of the 34th International Conference on Machine Learning, pages 3270–3279, 2017.
• Symeonidis (2016) P. Symeonidis. Matrix and tensor decomposition in recommender systems. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 429–430, 2016.
• Tomioka and Suzuki (2013) R. Tomioka and T. Suzuki. Convex tensor decomposition via structured schatten norm regularization. In C.j.c. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1331–1339. 2013.
• Tucker (1966) L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.
• Vasilescu and Terzopoulos (2002) M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: TensorFaces. In Proceedings of The 7th European Conference on Computer Vision (ECCV), volume 2350 of LNCS, pages 447–460, 2002.
• Vasilescu and Terzopoulos (2007) M. A. O. Vasilescu and D. Terzopoulos. Multilinear (tensor) image synthesis, analysis, and recognition [exploratory dsp]. IEEE Signal Processing Magazine, 24(6):118–123, 2007.
• Yılmaz et al. (2011) K. Y. Yılmaz, A. T. Cemgil, and U. Simsekli. Generalised coupled tensor factorisation. In Advances in Neural Information Processing Systems 24, pages 2151–2159. 2011.
• Yılmaz and Cemgil (2012) Y. K. Yılmaz and A. T. Cemgil. Algorithms for probabilistic latent tensor factorization. Signal Processing, 92(8):1853–1863, 2012.