I Introduction
Tensors¹ (of order higher than two) are arrays indexed by three or more indices, say $(i,j,k,\ldots)$ – a generalization of matrices, which are indexed by two indices, say $(r,c)$ for (row, column). Matrices are two-way arrays, and there are three- and higher-way arrays (or higher-order tensors).

¹The term has a different meaning in physics; however, it has been widely adopted across various disciplines in recent years to refer to what was previously known as a multi-way array.
Tensor algebra has many similarities but also many striking differences with matrix algebra – e.g., low-rank tensor factorization is essentially unique under mild conditions; on the other hand, determining tensor rank is NP-hard, and the best low-rank approximation of a higher-rank tensor may not even exist. Despite such apparent paradoxes and the learning curve needed to digest tensor algebra notation and data manipulation, tensors have already found many applications in signal processing (speech, audio, communications, radar, biomedical), machine learning (clustering, dimensionality reduction, latent factor models, subspace learning), and well beyond. Psychometrics (loosely defined as mathematical methods for the analysis of personality data) and later Chemometrics (likewise, for chemical data) have historically been two important application areas driving theoretical and algorithmic developments. Signal processing followed, in the 1990s, but the real spark that popularized tensors came when the computer science community (notably those in machine learning, data mining, and computing) discovered the power of tensor decompositions, roughly a decade ago [1, 2, 3]. There are nowadays many hundreds, perhaps thousands of papers published each year on tensor-related topics. Signal processing applications include, e.g., unsupervised separation of unknown mixtures of speech signals [4] and code-division communication signals without knowledge of their codes [5]; and emitter localization for radar, passive sensing, and communication applications [6, 7]. There are many more applications of tensor techniques that are not immediately recognized as such, e.g., the analytical constant modulus algorithm [8, 9]. Machine learning applications include face recognition, mining musical scores, and detecting cliques in social networks – see [10, 11, 12] and references therein. More recently, there has been considerable work on tensor decompositions for learning latent variable models, particularly topic models [13], and connections between orthogonal tensor decomposition and the method of moments for estimating Latent Dirichlet Allocation (LDA – a widely used topic model).
After two decades of research on tensor decompositions and applications, the senior co-authors still couldn't point their new graduate students to a single 'point of entry' to begin research in this area. This article has been designed to address this need: to provide a fairly comprehensive and deep overview of tensor decompositions that will enable someone having taken first graduate courses in matrix algebra and probability to get started doing research and/or developing related algorithms and software. While no single reference fits this bill, there are several very worthy tutorials and overviews that offer different points of view in certain aspects, and we would like to acknowledge them here. Among them, the highly-cited and clearly-written tutorial [14] that appeared 7 years ago in SIAM Review is perhaps the one closest to this article. It covers the basic models and algorithms (as of that time) well, but it does not go deep into uniqueness, advanced algorithmic, or estimation-theoretic aspects. The target audience of [14] is applied mathematics (SIAM). The recent tutorial [11] offers an accessible introduction, with many figures that help ease the reader into three-way thinking. It covers most of the bases and includes many motivating applications, but it also covers a lot more beyond the basics and thus stays at a high level. The reader gets a good roadmap of the area, without delving into it enough to prepare for research. Another recent tutorial on tensors is [15], which adopts a more abstract point of view of tensors as mappings from a linear space to another, whose coordinates transform multilinearly under a change of bases. This article is more suited for people interested in tensors as a mathematical concept, rather than how to use tensors in science and engineering. It includes a nice review of tensor rank results and a brief account of uniqueness aspects, but nothing in the way of algorithms or tensor computations. An overview of tensor techniques for large-scale numerical computations is given in [16, 17], geared towards a scientific computing audience; see [18] for a more accessible introduction. A gentle introduction to tensor decompositions can be found in the highly-cited Chemometrics tutorial [19] – a bit outdated but still useful for its clarity – and the more recent book [20]. Finally, [21] is an upcoming tutorial with emphasis on scalability and data fusion applications – it does not go deep into tensor rank, identifiability, decomposition under constraints, or statistical performance benchmarking. None of the above offers a comprehensive overview that is sufficiently deep to allow one to appreciate the underlying mathematics, the rapidly expanding and diversifying toolbox of tensor decomposition algorithms, and the basic ways in which tensor decompositions are used in signal processing and machine learning – and the two areas are quite different.
Our aim in this paper is to give the reader a tour that goes 'under the hood' on the technical side, and, at the same time, serve as a bridge between the two areas. Whereas we cannot include detailed proofs of some of the deepest results, we do provide insightful derivations of simpler results and sketch the line of argument behind more general ones. For example, we include a one-page self-contained proof of Kruskal's condition when one factor matrix is full column rank, which illuminates the role of Kruskal-rank in proving uniqueness. We also 'translate' between the signal processing (SP) and machine learning (ML) points of view. In the context of the canonical polyadic decomposition (CPD), also known as parallel factor analysis (PARAFAC), SP researchers (and chemists) typically focus on the columns of the factor matrices $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$ and the associated rank-1 factors $\mathbf{a}_f \circ \mathbf{b}_f \circ \mathbf{c}_f$ of the decomposition (where $\circ$ denotes the outer product, see Section II-C), because they are interested in separation. ML researchers often focus on the rows of $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, because they think of them as parsimonious latent space representations. For a user × item × context ratings tensor, for example, a row of $\mathbf{A}$ is a representation of the corresponding user in latent space, and likewise a row of $\mathbf{B}$ ($\mathbf{C}$) is a representation of the corresponding item (context) in the same latent space. The inner product of these three vectors is used to predict that user's rating of the given item in the given context. This is one reason why ML researchers tend to use inner (instead of outer) product notation. SP researchers are interested in model identifiability because it guarantees separability; ML researchers are interested in identifiability to be able to interpret the dimensions of the latent space. In co-clustering applications, on the other hand, the rank-1 tensors $\mathbf{a}_f \circ \mathbf{b}_f \circ \mathbf{c}_f$ capture latent concepts that the analyst seeks to learn from the data (e.g., cliques of users buying certain types of items in certain contexts). SP researchers are trained to seek optimal solutions, which is conceivable for small to moderate data; they tend to use computationally heavier algorithms. ML researchers are nowadays trained to think about scalability from day one, and thus tend to choose much more lightweight algorithms to begin with. There are many differences, but also many similarities and opportunities for cross-fertilization. Being conversant in both communities allows us to bridge the ground between them and help SP and ML researchers better understand each other.

I-A Roadmap
The rest of this article is structured as follows. We begin with some matrix preliminaries, including matrix rank and low-rank approximation, and a review of some useful matrix products and their properties. We then move to rank and rank decomposition for tensors. We briefly review bounds on tensor rank, multilinear (mode) ranks, and the relationship between tensor rank and multilinear rank. We also explain the notions of typical, generic, and border rank, and discuss why low-rank tensor approximation may not be well-posed in general. Tensors can be viewed as data or as multilinear operators, and while we are mostly concerned with the former viewpoint in this article, we also give a few important examples of the latter. Next, we provide a fairly comprehensive account of uniqueness of low-rank tensor decomposition. This is the most advantageous difference when one goes from matrices to tensors, and therefore understanding uniqueness is important in order to make the most out of the tensor toolbox. Our exposition includes two stepping-stone proofs: one based on eigendecomposition, the other bearing Kruskal's mark ("down-converted to baseband" in terms of difficulty). The Tucker model and multilinear SVD come next, along with a discussion of their properties and connections with rank decomposition. A thorough discussion of algorithmic aspects follows, including a detailed discussion of how different types of constraints can be handled, how to exploit data sparsity, scalability, how to handle missing values, and different loss functions. In addition to basic alternating optimization strategies, a host of other solutions are reviewed, including gradient descent, line search, Gauss–Newton, alternating direction method of multipliers, and stochastic gradient approaches. The next topic is statistical performance analysis, focusing on the widely-used Cramér–Rao bound and its efficient numerical computation.
This section contains novel results and derivations that are of interest well beyond our present context – e.g., they can also be used to characterize estimation performance for a broad range of constrained matrix factorization problems. The final main section of the article presents motivating applications in signal processing (communication and speech signal separation, multidimensional harmonic retrieval) and machine learning (collaborative filtering, mixture and topic modeling, classification, and multilinear subspace learning). We conclude with some pointers to online resources (toolboxes, software, demos), conferences, and some historical notes.
II Preliminaries
II-A Rank and rank decomposition for matrices
Consider an $I \times J$ matrix $\mathbf{X}$, and let $\text{colrank}(\mathbf{X})$ be the number of linearly independent columns of $\mathbf{X}$, i.e., the dimension of the range space of $\mathbf{X}$, $\text{colrank}(\mathbf{X}) := \dim\,\text{range}(\mathbf{X})$. $\text{colrank}(\mathbf{X})$ is the minimum $k \in \mathbb{N}$ such that $\mathbf{X} = \mathbf{A}\mathbf{B}^T$, where $\mathbf{A}$ is an $I \times k$ basis of $\text{range}(\mathbf{X})$, and $\mathbf{B}^T$ is $k \times J$ and holds the corresponding coefficients. This is because if we can generate all columns of $\mathbf{X}$, by linearity we can generate anything in $\text{range}(\mathbf{X})$, and vice versa. We can similarly define $\text{rowrank}(\mathbf{X})$, the number of linearly independent rows of $\mathbf{X}$, which is the minimum $\ell$ such that $\mathbf{X}^T = \mathbf{C}\mathbf{D}^T$, i.e., $\mathbf{X} = \mathbf{D}\mathbf{C}^T$, where $\mathbf{C}$ is $J \times \ell$ and $\mathbf{D}$ is $I \times \ell$.
Noting that
$$\mathbf{X} = \mathbf{A}\mathbf{B}^T = \sum_{f=1}^{k} \mathbf{a}_f \mathbf{b}_f^T,$$
where $\mathbf{a}_f$ stands for the $f$-th column of $\mathbf{A}$ (and likewise $\mathbf{b}_f$ for $\mathbf{B}$), we have
$$\text{rank}(\mathbf{X}) := \min\left\{ m \,:\, \mathbf{X} = \sum_{f=1}^{m} \mathbf{a}_f \mathbf{b}_f^T \right\},$$
where $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_m]$ and $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_m]$. It follows that $\text{rank}(\mathbf{X}) = \text{colrank}(\mathbf{X}) = \text{rowrank}(\mathbf{X})$, so the three definitions actually coincide – but only in the matrix (two-way tensor) case, as we will see later. Note that, per the definition above, $\mathbf{a}_f \mathbf{b}_f^T$ is a rank-1 matrix that is 'simple' in the sense that every column (or row) is proportional to any other column (row, respectively). In this sense, rank can be thought of as a measure of complexity. Note also that $\text{rank}(\mathbf{X}) \le \min(I, J)$, because obviously $\mathbf{X} = \mathbf{X}\,\mathbf{I}_J$, where $\mathbf{I}_J$ is the $J \times J$ identity matrix.
II-B Low-rank matrix approximation
In practice $\mathbf{X}$ is usually full-rank, e.g., due to measurement noise, and we observe $\mathbf{X} = \mathbf{L} + \mathbf{N}$, where $\mathbf{L} = \mathbf{A}\mathbf{B}^T$ is low-rank and $\mathbf{N}$ represents noise and 'unmodeled dynamics'. If the elements of $\mathbf{N}$ are sampled from a jointly continuous distribution, then $\mathbf{X}$ will be full rank almost surely – for the determinant of any square submatrix of $\mathbf{X}$ is a polynomial in the matrix entries, and a polynomial that is nonzero at one point is nonzero at every point except for a set of measure zero. In such cases, we are interested in approximating $\mathbf{X}$ with a low-rank matrix, i.e., in
$$\min_{\mathbf{L} \,:\, \text{rank}(\mathbf{L}) = \ell} \|\mathbf{X} - \mathbf{L}\|_F^2.$$
The solution is provided by the truncated SVD of $\mathbf{X}$: with $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$, set $\mathbf{L} = \mathbf{U}(:,1\!:\!\ell)\,\boldsymbol{\Sigma}(1\!:\!\ell,1\!:\!\ell)\,(\mathbf{V}(:,1\!:\!\ell))^T$, where $\mathbf{U}(:,1\!:\!\ell)$ denotes the matrix containing columns $1$ to $\ell$ of $\mathbf{U}$. However, this factorization is non-unique, because $\mathbf{A}\mathbf{B}^T = \mathbf{A}\mathbf{M}\mathbf{M}^{-1}\mathbf{B}^T = (\mathbf{A}\mathbf{M})(\mathbf{B}\mathbf{M}^{-T})^T$ for any nonsingular $\ell \times \ell$ matrix $\mathbf{M}$. In other words: the factorization of the approximation is highly non-unique (when $\ell = 1$, there is only scaling ambiguity, which is usually inconsequential). As a special case, when $\mathbf{X} = \mathbf{L}$ (noise-free), the low-rank decomposition of $\mathbf{X}$ is non-unique.
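As a concrete illustration of the truncated-SVD solution described above, here is a small NumPy sketch; the sizes, seed, and variable names are ours, for illustration only:

```python
import numpy as np

# Best rank-r approximation of a noisy low-rank matrix via the truncated SVD.
rng = np.random.default_rng(0)
I, J, r = 50, 40, 3
L = rng.standard_normal((I, r)) @ rng.standard_normal((r, J))  # low-rank part
X = L + 0.01 * rng.standard_normal((I, J))                     # noisy, full rank a.s.

U, s, Vt = np.linalg.svd(X, full_matrices=False)
L_hat = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]                  # truncated SVD

# Truncation suppresses the noise living in the discarded singular directions,
# so L_hat is much closer to the underlying L than the raw observation X is.
print(np.linalg.norm(X - L), np.linalg.norm(L_hat - L))
```

Note that `L_hat` has rank exactly `r` by construction, even though `X` itself is full rank almost surely.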
IiC Some useful products and their properties
In this section we review some useful matrix products and their properties, as they pertain to tensor computations.
Kronecker product: The Kronecker product of $\mathbf{A}$ ($I \times J$) and $\mathbf{B}$ ($K \times L$) is the $IK \times JL$ matrix
$$\mathbf{A} \otimes \mathbf{B} := \begin{bmatrix} \mathbf{B}\,a(1,1) & \cdots & \mathbf{B}\,a(1,J) \\ \vdots & \ddots & \vdots \\ \mathbf{B}\,a(I,1) & \cdots & \mathbf{B}\,a(I,J) \end{bmatrix}.$$
The Kronecker product has many useful properties. From its definition, it follows that $\mathbf{b}^T \otimes \mathbf{a} = \mathbf{a}\mathbf{b}^T$. For an $I \times J$ matrix $\mathbf{X}$, define
$$\text{vec}(\mathbf{X}) := \begin{bmatrix} \mathbf{X}(:,1) \\ \vdots \\ \mathbf{X}(:,J) \end{bmatrix},$$
i.e., the $IJ \times 1$ vector obtained by vertically stacking the columns of $\mathbf{X}$. By definition of $\text{vec}(\cdot)$ it follows that $\text{vec}(\mathbf{a}\mathbf{b}^T) = \mathbf{b} \otimes \mathbf{a}$.
Consider the product $\mathbf{A}\mathbf{M}\mathbf{B}^T$, where $\mathbf{A}$ is $I \times K$, $\mathbf{M}$ is $K \times L$, and $\mathbf{B}$ is $J \times L$. Note that
$$\mathbf{A}\mathbf{M}\mathbf{B}^T = \sum_{k=1}^{K}\sum_{l=1}^{L} M(k,l)\, \mathbf{a}_k \mathbf{b}_l^T.$$
Therefore, using $\text{vec}(\mathbf{a}_k \mathbf{b}_l^T) = \mathbf{b}_l \otimes \mathbf{a}_k$ and linearity of the $\text{vec}(\cdot)$ operator,
$$\text{vec}\!\left(\mathbf{A}\mathbf{M}\mathbf{B}^T\right) = (\mathbf{B} \otimes \mathbf{A})\,\text{vec}(\mathbf{M}).$$
This is useful when dealing with linear least squares problems of the following form
$$\min_{\mathbf{M}} \left\| \mathbf{X} - \mathbf{A}\mathbf{M}\mathbf{B}^T \right\|_F^2 \iff \min_{\mathbf{m}} \left\| \text{vec}(\mathbf{X}) - (\mathbf{B} \otimes \mathbf{A})\,\mathbf{m} \right\|_2^2,$$
where $\mathbf{m} := \text{vec}(\mathbf{M})$.
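The vectorization identity is easy to verify numerically; a NumPy sketch (our variable names; $\text{vec}(\cdot)$ is column-major stacking, i.e., Fortran order in NumPy):

```python
import numpy as np

# Check vec(A M B^T) = (B kron A) vec(M), with vec() stacking columns.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))   # I x K
M = rng.standard_normal((3, 5))   # K x L
B = rng.standard_normal((2, 5))   # J x L

vec = lambda Z: Z.reshape(-1, order="F")   # column-major vectorization
lhs = vec(A @ M @ B.T)                     # IJ x 1
rhs = np.kron(B, A) @ vec(M)               # (B kron A) is IJ x KL
print(np.allclose(lhs, rhs))  # True
```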
Khatri–Rao product: Another useful product is the Khatri–Rao (column-wise Kronecker) product of two matrices with the same number of columns (see [20, p. 14] for a generalization). That is, with $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_F]$ ($I \times F$) and $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_F]$ ($J \times F$), the Khatri–Rao product of $\mathbf{A}$ and $\mathbf{B}$ is $\mathbf{A} \odot \mathbf{B} := [\mathbf{a}_1 \otimes \mathbf{b}_1, \ldots, \mathbf{a}_F \otimes \mathbf{b}_F]$ ($IJ \times F$). It is easy to see that, with $\mathbf{D}$ being a diagonal matrix with vector $\mathbf{d}$ on its diagonal (we will write $\mathbf{D} = \text{Diag}(\mathbf{d})$ and $\mathbf{d} = \text{diag}(\mathbf{D})$, where we have implicitly defined operators $\text{Diag}(\cdot)$ and $\text{diag}(\cdot)$ to convert one to the other), the following property holds:
$$\text{vec}\!\left(\mathbf{A}\mathbf{D}\mathbf{B}^T\right) = (\mathbf{B} \odot \mathbf{A})\,\mathbf{d},$$
which is useful when dealing with linear least squares problems of the following form
$$\min_{\mathbf{D} = \text{Diag}(\mathbf{d})} \left\| \mathbf{X} - \mathbf{A}\mathbf{D}\mathbf{B}^T \right\|_F^2 \iff \min_{\mathbf{d}} \left\| \text{vec}(\mathbf{X}) - (\mathbf{B} \odot \mathbf{A})\,\mathbf{d} \right\|_2^2.$$
It should now be clear that $\mathbf{A} \odot \mathbf{B}$ is a subset of columns from $\mathbf{A} \otimes \mathbf{B}$. Whereas $\mathbf{A} \otimes \mathbf{B}$ contains the 'interaction' (Kronecker product) of any column of $\mathbf{A}$ with any column of $\mathbf{B}$, $\mathbf{A} \odot \mathbf{B}$ contains the Kronecker product of any column of $\mathbf{A}$ with only the corresponding column of $\mathbf{B}$.
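The Khatri–Rao property above can likewise be checked numerically; a NumPy sketch with a hand-rolled Khatri–Rao product (names are ours, for illustration):

```python
import numpy as np

# Check vec(A Diag(d) B^T) = (B kr A) d, with vec() stacking columns.
rng = np.random.default_rng(2)
I, J, F = 4, 5, 3
A = rng.standard_normal((I, F))
B = rng.standard_normal((J, F))
d = rng.standard_normal(F)

def kr(P, Q):
    # Khatri-Rao product P (kr) Q: column f is kron(P[:, f], Q[:, f])
    return np.einsum("pf,qf->pqf", P, Q).reshape(P.shape[0] * Q.shape[0], -1)

lhs = (A @ np.diag(d) @ B.T).reshape(-1, order="F")  # vec(A Diag(d) B^T)
rhs = kr(B, A) @ d
print(np.allclose(lhs, rhs))  # True
```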
Additional properties:

- $(\mathbf{A} \otimes \mathbf{B}) \otimes \mathbf{C} = \mathbf{A} \otimes (\mathbf{B} \otimes \mathbf{C})$ (associative); so we may simply write $\mathbf{A} \otimes \mathbf{B} \otimes \mathbf{C}$. Note though that $\mathbf{A} \otimes \mathbf{B} \ne \mathbf{B} \otimes \mathbf{A}$, so the Kronecker product is non-commutative.
- $(\mathbf{A} \otimes \mathbf{B})^T = \mathbf{A}^T \otimes \mathbf{B}^T$ (note that the order does not change, unlike $(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T\mathbf{A}^T$).
- $(\mathbf{A} \otimes \mathbf{B})^* = \mathbf{A}^* \otimes \mathbf{B}^*$ and $(\mathbf{A} \otimes \mathbf{B})^H = \mathbf{A}^H \otimes \mathbf{B}^H$, where $*$ and $H$ stand for conjugation and Hermitian (conjugate) transposition, respectively.
- $(\mathbf{A} \otimes \mathbf{B})(\mathbf{E} \otimes \mathbf{F}) = (\mathbf{A}\mathbf{E}) \otimes (\mathbf{B}\mathbf{F})$ (the mixed product rule). This is very useful – as a corollary, if $\mathbf{A}$ and $\mathbf{B}$ are square nonsingular, then it follows that $(\mathbf{A} \otimes \mathbf{B})^{-1} = \mathbf{A}^{-1} \otimes \mathbf{B}^{-1}$, and likewise for the pseudo-inverse. More generally, if $\mathbf{A} = \mathbf{U}_1 \boldsymbol{\Sigma}_1 \mathbf{V}_1^H$ is the SVD of $\mathbf{A}$, and $\mathbf{B} = \mathbf{U}_2 \boldsymbol{\Sigma}_2 \mathbf{V}_2^H$ is the SVD of $\mathbf{B}$, then it follows from the mixed product rule that $\mathbf{A} \otimes \mathbf{B} = (\mathbf{U}_1 \otimes \mathbf{U}_2)(\boldsymbol{\Sigma}_1 \otimes \boldsymbol{\Sigma}_2)(\mathbf{V}_1 \otimes \mathbf{V}_2)^H$ is the SVD of $\mathbf{A} \otimes \mathbf{B}$. It follows that
- $\text{rank}(\mathbf{A} \otimes \mathbf{B}) = \text{rank}(\mathbf{A})\,\text{rank}(\mathbf{B})$.
- $\text{tr}(\mathbf{A} \otimes \mathbf{B}) = \text{tr}(\mathbf{A})\,\text{tr}(\mathbf{B})$, for square $\mathbf{A}$, $\mathbf{B}$.
- $\det(\mathbf{A} \otimes \mathbf{B}) = \det(\mathbf{A})^{n}\,\det(\mathbf{B})^{m}$, for $m \times m$ $\mathbf{A}$ and $n \times n$ $\mathbf{B}$.
The Khatri–Rao product has the following properties, among others:

- $(\mathbf{A} \odot \mathbf{B}) \odot \mathbf{C} = \mathbf{A} \odot (\mathbf{B} \odot \mathbf{C})$ (associative); so we may simply write $\mathbf{A} \odot \mathbf{B} \odot \mathbf{C}$. Note though that $\mathbf{A} \odot \mathbf{B} \ne \mathbf{B} \odot \mathbf{A}$, so the Khatri–Rao product is non-commutative.
- $(\mathbf{A} \otimes \mathbf{B})(\mathbf{E} \odot \mathbf{F}) = (\mathbf{A}\mathbf{E}) \odot (\mathbf{B}\mathbf{F})$ (mixed product rule).
Tensor (outer) product: The tensor product or outer product of vectors $\mathbf{a}$ ($I \times 1$) and $\mathbf{b}$ ($J \times 1$) is defined as the $I \times J$ matrix $\mathbf{a} \circ \mathbf{b}$ with elements $(\mathbf{a} \circ \mathbf{b})(i,j) = a(i)b(j)$, $\forall i, j$. Note that $\mathbf{a} \circ \mathbf{b} = \mathbf{a}\mathbf{b}^T$. Introducing a third vector $\mathbf{c}$ ($K \times 1$), we can generalize to the outer product of three vectors, which is an $I \times J \times K$ three-way array or third-order tensor $\mathbf{a} \circ \mathbf{b} \circ \mathbf{c}$ with elements $(\mathbf{a} \circ \mathbf{b} \circ \mathbf{c})(i,j,k) = a(i)b(j)c(k)$. Note that the element-wise definition of the outer product naturally generalizes to three- and higher-way cases involving more vectors, but one loses the 'transposition' representation that is familiar in the two-way (matrix) case.
III Rank and rank decomposition for tensors: CPD / PARAFAC
We know that the outer product of two vectors is a 'simple' rank-1 matrix – in fact we may define matrix rank as the minimum number of rank-1 matrices (outer products of two vectors) needed to synthesize a given matrix $\mathbf{X}$. We can express this in different ways: $\text{rank}(\mathbf{X}) = F$ if and only if (iff) $F$ is the smallest integer such that $\mathbf{X} = \mathbf{A}\mathbf{B}^T$ for some $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_F]$ and $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_F]$, or, equivalently, $\mathbf{X}(i,j) = \sum_{f=1}^{F} a_f(i) b_f(j)$, $\forall i, j \iff \mathbf{X} = \sum_{f=1}^{F} \mathbf{a}_f \circ \mathbf{b}_f$.
A rank-1 third-order tensor $\mathbf{X}$ of size $I \times J \times K$ is an outer product of three vectors: $\mathbf{X}(i,j,k) = a(i)b(j)c(k)$, $\forall i \in \{1,\ldots,I\}$, $j \in \{1,\ldots,J\}$, and $k \in \{1,\ldots,K\}$; i.e., $\mathbf{X} = \mathbf{a} \circ \mathbf{b} \circ \mathbf{c}$ – see Fig. 1. A rank-1 $N$-th order tensor $\mathbf{X}$ is likewise an outer product of $N$ vectors: $\mathbf{X}(i_1, \ldots, i_N) = a_1(i_1)\cdots a_N(i_N)$; i.e., $\mathbf{X} = \mathbf{a}_1 \circ \cdots \circ \mathbf{a}_N$. In the sequel we mostly focus on third-order tensors for brevity; everything naturally generalizes to higher-order tensors, and we will occasionally comment on such generalization, where appropriate.
The rank of tensor $\mathbf{X}$ is the minimum number of rank-1 tensors needed to produce $\mathbf{X}$ as their sum – see Fig. 2 for a tensor of rank three. Therefore, a tensor of rank at most $F$ can be written as
$$\mathbf{X} = \sum_{f=1}^{F} \mathbf{a}_f \circ \mathbf{b}_f \circ \mathbf{c}_f \iff \mathbf{X}(i,j,k) = \sum_{f=1}^{F} a_f(i)\,b_f(j)\,c_f(k),$$
where $\mathbf{A} := [\mathbf{a}_1, \ldots, \mathbf{a}_F]$ ($I \times F$), $\mathbf{B} := [\mathbf{b}_1, \ldots, \mathbf{b}_F]$ ($J \times F$), and $\mathbf{C} := [\mathbf{c}_1, \ldots, \mathbf{c}_F]$ ($K \times F$). It is also customary to use $a_{i,f} := \mathbf{A}(i,f)$, so $\mathbf{X}(i,j,k) = \sum_{f=1}^{F} a_{i,f}\,b_{j,f}\,c_{k,f}$. For brevity, we sometimes also use the notation $\mathbf{X} = [\![\mathbf{A}, \mathbf{B}, \mathbf{C}]\!]$ to denote this relationship.
Let us now fix $k = 1$ and look at the frontal slab $\mathbf{X}(:,:,1)$ of $\mathbf{X}$. Its elements can be written as
$$\mathbf{X}(i,j,1) = \sum_{f=1}^{F} a_f(i)\,b_f(j)\,c_f(1) \iff \mathbf{X}(:,:,1) = \mathbf{A}\,\text{Diag}(\mathbf{C}(1,:))\,\mathbf{B}^T,$$
where we note that the elements of the first row of $\mathbf{C}$ weigh the rank-1 factors (outer products of corresponding columns of $\mathbf{A}$ and $\mathbf{B}$). We will denote $\mathbf{D}_k(\mathbf{C}) := \text{Diag}(\mathbf{C}(k,:))$ for brevity. Hence, for any $k$,
$$\mathbf{X}(:,:,k) = \mathbf{A}\,\mathbf{D}_k(\mathbf{C})\,\mathbf{B}^T.$$
Applying the vectorization property of $\odot$ it now follows that
$$\text{vec}(\mathbf{X}(:,:,k)) = (\mathbf{B} \odot \mathbf{A})\,(\mathbf{C}(k,:))^T,$$
and by parallel stacking, we obtain the matrix unfolding (or, matrix view)
$$\mathbf{X}_1 := [\text{vec}(\mathbf{X}(:,:,1)), \ldots, \text{vec}(\mathbf{X}(:,:,K))] = (\mathbf{B} \odot \mathbf{A})\,\mathbf{C}^T, \qquad (IJ \times K). \tag{1}$$
We see that, when cast as a matrix, a third-order tensor of rank $F$ admits factorization in two matrix factors, one of which is specially structured – being the Khatri–Rao product of two smaller matrices. One more application of the vectorization property of $\odot$ yields the $IJK \times 1$ vector
$$\mathbf{x} = (\mathbf{C} \odot \mathbf{B} \odot \mathbf{A})\,\mathbf{1},$$
where $\mathbf{1}$ is an $F \times 1$ vector of all 1's. Hence, when converted to a long vector, a tensor of rank $F$ is a sum of $F$ structured vectors, each being the Khatri–Rao / Kronecker product of three vectors (in the three-way case; or more vectors in higher-way cases).
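These unfolding identities are easy to verify numerically. The following NumPy sketch builds a rank-$F$ tensor from its factors and checks the matricization and long-vector forms (our conventions: $\text{vec}(\cdot)$ is column-major stacking, i.e., NumPy's Fortran order):

```python
import numpy as np

# Synthesize X(i,j,k) = sum_f A(i,f)B(j,f)C(k,f) and check its unfoldings.
rng = np.random.default_rng(3)
I, J, K, F = 4, 5, 6, 3
A, B, C = (rng.standard_normal((n, F)) for n in (I, J, K))
X = np.einsum("if,jf,kf->ijk", A, B, C)

def kr(P, Q):
    # Khatri-Rao product P (kr) Q: column f is kron(P[:, f], Q[:, f])
    return np.einsum("pf,qf->pqf", P, Q).reshape(P.shape[0] * Q.shape[0], -1)

X1 = X.reshape(I * J, K, order="F")   # column k is vec(X(:,:,k))
x  = X.reshape(-1, order="F")         # the IJK x 1 long vector
print(np.allclose(X1, kr(B, A) @ C.T))                 # X1 = (B kr A) C^T
print(np.allclose(x, kr(C, kr(B, A)) @ np.ones(F)))    # x = (C kr B kr A) 1
```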
In the same vein, we may consider lateral or horizontal slabs², e.g.,
$$\mathbf{X}(:,j,:) = \mathbf{A}\,\mathbf{D}_j(\mathbf{B})\,\mathbf{C}^T.$$
Hence
$$\mathbf{X}_2 := [\text{vec}(\mathbf{X}(:,1,:)), \ldots, \text{vec}(\mathbf{X}(:,J,:))] = (\mathbf{C} \odot \mathbf{A})\,\mathbf{B}^T, \qquad (IK \times J), \tag{2}$$
and similarly³ $\mathbf{X}(i,:,:) = \mathbf{B}\,\mathbf{D}_i(\mathbf{A})\,\mathbf{C}^T$, so
$$\mathbf{X}_3 := [\text{vec}(\mathbf{X}(1,:,:)), \ldots, \text{vec}(\mathbf{X}(I,:,:))] = (\mathbf{C} \odot \mathbf{B})\,\mathbf{A}^T, \qquad (JK \times I). \tag{3}$$

²A warning for Matlab aficionados: due to the way that Matlab stores and handles tensors, one needs to use the 'squeeze' operator, i.e., $\text{squeeze}(\mathbf{X}(:,j,:)) = \mathbf{A}\,\mathbf{D}_j(\mathbf{B})\,\mathbf{C}^T$.
³One needs to use the 'squeeze' operator here as well.
III-A Low-rank tensor approximation
We are in fact ready to get a first glimpse of how we can go about estimating $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$ from (possibly noisy) data $\mathbf{X}$. Adopting a least squares criterion, the problem is
$$\min_{\mathbf{A}, \mathbf{B}, \mathbf{C}} \left\| \mathbf{X} - \sum_{f=1}^{F} \mathbf{a}_f \circ \mathbf{b}_f \circ \mathbf{c}_f \right\|_F^2,$$
where $\|\mathbf{X}\|_F^2$ is the sum of squares of all elements of $\mathbf{X}$ (the subscript $F$ in $\|\cdot\|_F$ stands for Frobenius (norm), and it should not be confused with the number of factors $F$ in the rank decomposition – the difference will always be clear from context). Equivalently, we may consider
$$\min_{\mathbf{A}, \mathbf{B}, \mathbf{C}} \left\| \mathbf{X}_1 - (\mathbf{B} \odot \mathbf{A})\,\mathbf{C}^T \right\|_F^2.$$
Note that the above model is nonconvex (in fact trilinear) in $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$; but fixing $\mathbf{A}$ and $\mathbf{B}$, it becomes (conditionally) linear in $\mathbf{C}$, so that we may update
$$\mathbf{C} \leftarrow \arg\min_{\mathbf{C}} \left\| \mathbf{X}_1 - (\mathbf{B} \odot \mathbf{A})\,\mathbf{C}^T \right\|_F^2,$$
and, using the other two matrix representations of the tensor, update
$$\mathbf{B} \leftarrow \arg\min_{\mathbf{B}} \left\| \mathbf{X}_2 - (\mathbf{C} \odot \mathbf{A})\,\mathbf{B}^T \right\|_F^2,$$
and
$$\mathbf{A} \leftarrow \arg\min_{\mathbf{A}} \left\| \mathbf{X}_3 - (\mathbf{C} \odot \mathbf{B})\,\mathbf{A}^T \right\|_F^2,$$
until convergence. The above algorithm, widely known as Alternating Least Squares (ALS), is a popular way of computing approximate low-rank models of tensor data. We will discuss algorithmic issues in depth at a later stage, but it is important to note that ALS is very easy to program, and we encourage the reader to do so – this exercise helps a lot in terms of developing the ability to 'think three-way'.
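To make this concrete, here is a minimal NumPy sketch of ALS for a third-order CPD; the variable names and unfolding conventions are ours, and there are no convergence checks, normalization, or safeguards, so this is a starting point rather than production code:

```python
import numpy as np

# Minimal ALS for fitting a rank-F CPD to a third-order tensor X.
rng = np.random.default_rng(4)
I, J, K, F = 6, 7, 8, 3
A0, B0, C0 = (rng.standard_normal((n, F)) for n in (I, J, K))
X = np.einsum("if,jf,kf->ijk", A0, B0, C0)     # noise-free rank-F tensor

def kr(P, Q):  # Khatri-Rao (column-wise Kronecker) product
    return np.einsum("pf,qf->pqf", P, Q).reshape(P.shape[0] * Q.shape[0], -1)

# The three unfoldings: X1 = (B kr A) C^T, X2 = (C kr A) B^T, X3 = (C kr B) A^T
X1 = X.reshape(I * J, K, order="F")
X2 = X.transpose(0, 2, 1).reshape(I * K, J, order="F")
X3 = X.transpose(1, 2, 0).reshape(J * K, I, order="F")

A, B, C = (rng.standard_normal((n, F)) for n in (I, J, K))   # random init

def fit_err(A, B, C):   # relative Frobenius fitting error
    return np.linalg.norm(X - np.einsum("if,jf,kf->ijk", A, B, C)) / np.linalg.norm(X)

err0 = fit_err(A, B, C)
for _ in range(100):    # each conditional update is an exact LS solve
    C = np.linalg.lstsq(kr(B, A), X1, rcond=None)[0].T
    B = np.linalg.lstsq(kr(C, A), X2, rcond=None)[0].T
    A = np.linalg.lstsq(kr(C, B), X3, rcond=None)[0].T
print(err0, fit_err(A, B, C))
```

Because each conditional update solves its least squares subproblem exactly, the fit error is monotonically non-increasing across sweeps; on benign noise-free data like the above it typically drops to near machine precision, though ALS can stagnate ('swamp') in less friendly cases.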
III-B Bounds on tensor rank
For an $I \times J$ matrix $\mathbf{X}$, we know that $\text{rank}(\mathbf{X}) \le \min(I,J)$, and $\text{rank}(\mathbf{X}) = \min(I,J)$ almost surely, meaning that rank-deficient real (complex) matrices are a set of Lebesgue measure zero in $\mathbb{R}^{I \times J}$ ($\mathbb{C}^{I \times J}$). What can we say about $I \times J \times K$ tensors $\mathbf{X}$? Before we get to this, a retrospective on the matrix case is useful. Considering $\mathbf{X} = \mathbf{A}\mathbf{B}^T$ where $\mathbf{A}$ is $I \times F$ and $\mathbf{B}$ is $J \times F$, the size of such a parametrization (the number of unknowns, or degrees of freedom (DoF) in the model) of $\mathbf{X}$ is⁴ $(I + J - 1)F$. The number of equations in $\mathbf{X} = \mathbf{A}\mathbf{B}^T$ is $IJ$, and equations-versus-unknowns considerations suggest that $F$ of order $\min(I,J)$ may be needed – and this turns out being sufficient as well.

⁴Note that we have taken away $F$ DoF due to the scaling / counter-scaling ambiguity, i.e., we may always multiply a column of $\mathbf{A}$ and divide the corresponding column of $\mathbf{B}$ with any nonzero number without changing $\mathbf{A}\mathbf{B}^T$.
For third-order tensors, the DoF in the low-rank parametrization $\mathbf{X} = \sum_{f=1}^{F} \mathbf{a}_f \circ \mathbf{b}_f \circ \mathbf{c}_f$ is⁵ $(I + J + K - 2)F$, whereas the number of equations is $IJK$. This suggests that $F \ge \frac{IJK}{I+J+K-2}$ may be needed to describe an arbitrary tensor of size $I \times J \times K$, i.e., that third-order tensor rank can potentially be as high as $\min(IJ, JK, IK)$. In fact this turns out being sufficient as well. One way to see this is as follows: any frontal slab $\mathbf{X}(:,:,k)$ can always be written as $\mathbf{X}(:,:,k) = \mathbf{A}_k \mathbf{B}_k^T$, with $\mathbf{A}_k$ and $\mathbf{B}_k$ having at most $\min(I,J)$ columns. Upon defining $\mathbf{A} := [\mathbf{A}_1, \ldots, \mathbf{A}_K]$, $\mathbf{B} := [\mathbf{B}_1, \ldots, \mathbf{B}_K]$, and $\mathbf{C} := \mathbf{I}_K \otimes \mathbf{1}_{\min(I,J)}^T$ (where $\mathbf{I}_K$ is an identity matrix of size $K \times K$, and $\mathbf{1}_{\min(I,J)}$ is a vector of all 1's of size $\min(I,J) \times 1$), we can synthesize $\mathbf{X}$ as $\mathbf{X} = [\![\mathbf{A}, \mathbf{B}, \mathbf{C}]\!]$. Noting that $\mathbf{A}$ and $\mathbf{B}$ have at most $\min(I,J)K$ columns, it follows that we need at most $\min(I,J)K$ columns in $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$. Using 'role symmetry' (switching the names of the 'ways' or 'modes'), it follows that we in fact need at most $\min(IJ, JK, IK)$ columns in $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, and thus the rank of any three-way array is bounded above by $\min(IJ, JK, IK)$. Another (cleaner but perhaps less intuitive) way of arriving at this result is as follows. Looking at the matrix unfolding
$$\mathbf{X}_1 = (\mathbf{B} \odot \mathbf{A})\,\mathbf{C}^T,$$
and noting that $(\mathbf{B} \odot \mathbf{A})$ is $IJ \times F$ and $\mathbf{C}^T$ is $F \times K$, the issue is what is the maximum inner dimension $F$ that we need to be able to express an arbitrary $IJ \times K$ matrix $\mathbf{X}_1$ on the left (corresponding to an arbitrary tensor $\mathbf{X}$) as a Khatri–Rao product of two $I \times F$, $J \times F$ matrices, times another $F \times K$ matrix. The answer can be seen as follows:
$$\text{vec}(\mathbf{X}(:,:,k)) = \text{vec}(\mathbf{A}_k \mathbf{B}_k^T) = (\mathbf{B}_k \odot \mathbf{A}_k)\,\mathbf{1},$$
and thus we need at most $\min(I,J)$ columns per column of $\mathbf{X}_1$, which has $K$ columns – QED.

⁵Note that here we can scale, e.g., $\mathbf{a}_f$ and $\mathbf{b}_f$ at will, and counter-scale $\mathbf{c}_f$, which explains the $-2F$ correction.
This upper bound on tensor rank is important because it spells out that tensor rank is finite, and not much larger than the equations-versus-unknowns bound that we derived earlier. On the other hand, it is also useful to have lower bounds on rank. Towards this end, concatenate the frontal slabs one next to each other:
$$\left[\mathbf{X}(:,:,1) \,\cdots\, \mathbf{X}(:,:,K)\right] = \mathbf{A}\left[\mathbf{D}_1(\mathbf{C})\mathbf{B}^T \,\cdots\, \mathbf{D}_K(\mathbf{C})\mathbf{B}^T\right],$$
since $\mathbf{X}(:,:,k) = \mathbf{A}\,\mathbf{D}_k(\mathbf{C})\,\mathbf{B}^T$. Note that $\mathbf{A}$ is $I \times F$, and it follows that $F$ must be greater than or equal to the dimension of the column span of $\mathbf{X}$, i.e., the number of linearly independent columns needed to synthesize any of the $JK$ columns $\mathbf{X}(:,j,k)$ of $\mathbf{X}$. By role symmetry, and upon defining
$$R_1 := \dim\,\text{colspan}(\mathbf{X}) := \dim\,\text{span}\{\mathbf{X}(:,j,k)\}_{\forall j,k},$$
$$R_2 := \dim\,\text{rowspan}(\mathbf{X}) := \dim\,\text{span}\{\mathbf{X}(i,:,k)\}_{\forall i,k},$$
$$R_3 := \dim\,\text{fiberspan}(\mathbf{X}) := \dim\,\text{span}\{\mathbf{X}(i,j,:)\}_{\forall i,j},$$
we have that $F \ge \max(R_1, R_2, R_3)$. $R_1$ is the mode-1 or mode-$\mathbf{A}$ rank of $\mathbf{X}$, and likewise $R_2$ and $R_3$ are the mode-2 or mode-$\mathbf{B}$ and mode-3 or mode-$\mathbf{C}$ ranks of $\mathbf{X}$, respectively. $R_1$ is sometimes called the column rank, $R_2$ the row rank, and $R_3$ the fiber or tube rank of $\mathbf{X}$. The triple $(R_1, R_2, R_3)$ is called the multilinear rank of $\mathbf{X}$.
At this point it is worth noting that, for matrices we have that column rank = row rank = rank, i.e., in our current notation, for a matrix $\mathbf{M}$ (which can be thought of as an $I \times J \times 1$ third-order tensor) it holds that $R_1 = R_2 = \text{rank}(\mathbf{M})$; but for nontrivial tensors, $R_1$, $R_2$, $R_3$ and $\text{rank}(\mathbf{X})$ are in general different, with $\text{rank}(\mathbf{X}) \ge \max(R_1, R_2, R_3)$. Since $R_1 \le I$, $R_2 \le J$, and $R_3 \le K$, it follows that $\text{rank}(\mathbf{M}) \le \min(I,J)$ for matrices, but $\text{rank}(\mathbf{X})$ can be $> \max(I, J, K)$ for tensors.
Now, going back to the first way of explaining the upper bound we derived on tensor rank, it should be clear that we only need $\min(R_1, R_2)$ rank-1 factors to describe any given frontal slab of the tensor, and so we can describe all slabs with at most $\min(R_1, R_2)K$ rank-1 factors; with a little more thought, it is apparent that $\min(R_1, R_2)R_3$ is enough. Appealing to role symmetry, it then follows that $F \le \min(R_1 R_2, R_2 R_3, R_1 R_3)$, where $F := \text{rank}(\mathbf{X})$. Dropping the explicit dependence on $\mathbf{X}$ for brevity, we have
$$\max(R_1, R_2, R_3) \le F \le \min(R_1 R_2, R_2 R_3, R_1 R_3).$$
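The mode ranks are simply the ranks of the three matrix unfoldings, so they are cheap to compute; a NumPy sketch (names and conventions are ours):

```python
import numpy as np

# Multilinear (mode) ranks as ranks of the three unfoldings; each mode rank
# lower-bounds the tensor rank F. A generic rank-3 tensor with all dimensions
# >= 3 has multilinear rank (3, 3, 3).
rng = np.random.default_rng(7)
I, J, K, F = 4, 5, 6, 3
A, B, C = (rng.standard_normal((n, F)) for n in (I, J, K))
X = np.einsum("if,jf,kf->ijk", A, B, C)

R = [int(np.linalg.matrix_rank(np.moveaxis(X, m, 0).reshape(X.shape[m], -1)))
     for m in range(3)]
print(R)  # [3, 3, 3]
```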
III-C Typical, generic, and border rank of tensors
Consider a $2 \times 2 \times 2$ tensor $\mathbf{X}$ whose elements are i.i.d., drawn from the standard normal distribution $\mathcal{N}(0,1)$ (randn(2,2,2) in Matlab). The rank of $\mathbf{X}$ over the real field, i.e., when we consider
$$\mathbf{X} = \sum_{f=1}^{F} \mathbf{a}_f \circ \mathbf{b}_f \circ \mathbf{c}_f, \quad \mathbf{a}_f, \mathbf{b}_f, \mathbf{c}_f \in \mathbb{R}^2, \,\forall f,$$
is [22]
$$\text{rank}(\mathbf{X}) = \begin{cases} 2, & \text{with probability } \tfrac{\pi}{4}, \\ 3, & \text{with probability } 1 - \tfrac{\pi}{4}. \end{cases}$$
This is very different from the matrix case, where rank 2 occurs with probability 1. To make matters more (or less) curious, the rank of the same $\mathbf{X}$ is in fact 2 with probability 1 when we instead consider decomposition over the complex field, i.e., using $\mathbf{a}_f, \mathbf{b}_f, \mathbf{c}_f \in \mathbb{C}^2$. As another example [22], for a $3 \times 3 \times 2$ tensor with i.i.d. $\mathcal{N}(0,1)$ entries,
$$\text{rank}(\mathbf{X}) = \begin{cases} 3, & \text{with probability } \tfrac{1}{2}, \\ 4, & \text{with probability } \tfrac{1}{2} \end{cases} \quad \text{over } \mathbb{R},$$
whereas the rank is 3 with probability 1 over $\mathbb{C}$.
To understand this behavior, consider the $2 \times 2 \times 2$ case. We have two frontal slabs, $\mathbf{S}_1 := \mathbf{X}(:,:,1)$ and $\mathbf{S}_2 := \mathbf{X}(:,:,2)$. For $\mathbf{X}$ to have $\text{rank}(\mathbf{X}) = 2$, we must be able to express these two slabs as
$$\mathbf{S}_1 = \mathbf{A}\,\mathbf{D}_1(\mathbf{C})\,\mathbf{B}^T, \quad \text{and} \quad \mathbf{S}_2 = \mathbf{A}\,\mathbf{D}_2(\mathbf{C})\,\mathbf{B}^T,$$
for some $2 \times 2$ real or complex matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$, depending on whether we decompose over the real or the complex field. Now, if $\mathbf{X} = \text{randn}(2,2,2)$, then both $\mathbf{S}_1$ and $\mathbf{S}_2$ are nonsingular matrices, almost surely (with probability 1). It follows from the above equations that $\mathbf{A}$, $\mathbf{B}$, $\mathbf{D}_1(\mathbf{C})$, and $\mathbf{D}_2(\mathbf{C})$ must all be nonsingular too. Denoting $\tilde{\mathbf{A}} := \mathbf{A}\,\mathbf{D}_1(\mathbf{C})$ and $\mathbf{D} := \mathbf{D}_2(\mathbf{C})(\mathbf{D}_1(\mathbf{C}))^{-1}$, it follows that $\mathbf{B}^T = \tilde{\mathbf{A}}^{-1}\mathbf{S}_1$, and substituting in the second equation we obtain $\mathbf{S}_2 = \tilde{\mathbf{A}}\,\mathbf{D}\,\tilde{\mathbf{A}}^{-1}\mathbf{S}_1$, i.e., we obtain the eigenproblem
$$\mathbf{S}_2\mathbf{S}_1^{-1} = \tilde{\mathbf{A}}\,\mathbf{D}\,\tilde{\mathbf{A}}^{-1}.$$
It follows that for $\text{rank}(\mathbf{X}) = 2$ over $\mathbb{R}$, the matrix $\mathbf{S}_2\mathbf{S}_1^{-1}$ should have two real eigenvalues; but complex conjugate eigenvalues do arise with positive probability. When they do, we have $\text{rank}(\mathbf{X}) = 3$ over $\mathbb{R}$, but $\text{rank}(\mathbf{X}) = 2$ over $\mathbb{C}$ – and it turns out that rank 2 over $\mathbb{C}$ is enough.
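The eigenvalue argument can be checked by simulation: drawing many Gaussian $2 \times 2 \times 2$ tensors and testing whether $\mathbf{S}_2\mathbf{S}_1^{-1}$ has complex eigenvalues empirically reproduces the rank probabilities quoted from [22]. A NumPy sketch (names are ours):

```python
import numpy as np

# Monte Carlo: fraction of Gaussian 2x2x2 tensors for which S2 @ inv(S1)
# has complex eigenvalues, i.e., rank 3 over R. Should be near 1 - pi/4.
rng = np.random.default_rng(5)
trials = 20000
complex_count = 0
for _ in range(trials):
    T = rng.standard_normal((2, 2, 2))
    S1, S2 = T[:, :, 0], T[:, :, 1]
    eigvals = np.linalg.eigvals(S2 @ np.linalg.inv(S1))
    complex_count += np.abs(eigvals.imag).max() > 1e-10
frac = complex_count / trials
print(frac)  # roughly 0.215 = 1 - pi/4
```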
We see that the rank of a tensor for decomposition over $\mathbb{R}$ is a random variable that can take more than one value with positive probability. These values are called typical ranks. For decomposition over $\mathbb{C}$ the situation is different: $\text{rank}(\text{randn}(2,2,2)) = 2$ with probability 1, so there is only one typical rank. When there is only one typical rank (that occurs with probability 1 then) we call it generic rank. All these differences with the usual matrix algebra may be fascinating – and they don't end here either. Consider
$$\mathbf{X} = \mathbf{u} \circ \mathbf{u} \circ \mathbf{v} + \mathbf{u} \circ \mathbf{v} \circ \mathbf{u} + \mathbf{v} \circ \mathbf{u} \circ \mathbf{u},$$
where $\|\mathbf{u}\| = \|\mathbf{v}\| = 1$, with $|\mathbf{u}^H\mathbf{v}| \ne 1$, where $\mathbf{u}^H\mathbf{v}$ stands for the inner product. This tensor has rank equal to 3; however, it can be arbitrarily well approximated [23] by the following sequence of rank-two tensors (see also [14]):
$$\mathbf{X}_n = n\left(\mathbf{u} + \tfrac{1}{n}\mathbf{v}\right) \circ \left(\mathbf{u} + \tfrac{1}{n}\mathbf{v}\right) \circ \left(\mathbf{u} + \tfrac{1}{n}\mathbf{v}\right) - n\,\mathbf{u} \circ \mathbf{u} \circ \mathbf{u}$$
$$= \mathbf{X} + \tfrac{1}{n}\left(\mathbf{u} \circ \mathbf{v} \circ \mathbf{v} + \mathbf{v} \circ \mathbf{u} \circ \mathbf{v} + \mathbf{v} \circ \mathbf{v} \circ \mathbf{u}\right) + \tfrac{1}{n^2}\,\mathbf{v} \circ \mathbf{v} \circ \mathbf{v},$$
so $\mathbf{X}_n \to \mathbf{X}$ as $n \to \infty$: $\mathbf{X}$ has rank equal to 3, but border rank equal to 2 [15]. It is also worth noting that $\mathbf{X}_n$ contains two diverging rank-1 components that progressively cancel each other approximately, leading to an ever-improving approximation of $\mathbf{X}$. This situation is actually encountered in practice when fitting tensors of border rank lower than their rank. Also note that the above example shows clearly that the low-rank tensor approximation problem
$$\min_{\{\mathbf{a}_f, \mathbf{b}_f, \mathbf{c}_f\}_{f=1}^{F}} \left\| \mathbf{X} - \sum_{f=1}^{F} \mathbf{a}_f \circ \mathbf{b}_f \circ \mathbf{c}_f \right\|_F^2$$
is ill-posed in general, for there is no minimum if we pick $F$ equal to the border rank of $\mathbf{X}$ – the set of tensors of a given rank is not closed. There are many ways to fix this ill-posedness, e.g., by adding constraints such as element-wise non-negativity of the factors [24, 25] in cases where $\mathbf{X}$ is element-wise non-negative (and these constraints are physically meaningful), or orthogonality [26] – any application-specific constraint that prevents terms from diverging while approximately canceling each other will do. An alternative is to add norm regularization to the cost function, such as $\lambda\left(\|\mathbf{A}\|_F^2 + \|\mathbf{B}\|_F^2 + \|\mathbf{C}\|_F^2\right)$. This can be interpreted as coming from a Gaussian prior on the sought parameter matrices; yet, if not properly justified, regularization may produce artificial results and a false sense of security.
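The diverging-but-canceling behavior is easy to observe numerically. In the sketch below, with $\mathbf{u}$, $\mathbf{v}$ taken as the standard basis of $\mathbb{R}^2$ for concreteness, the rank-two approximant gets ever closer to the rank-three target while its two rank-1 terms blow up:

```python
import numpy as np

# Border-rank example: X = u.u.v + u.v.u + v.u.u (rank 3) is approximated
# arbitrarily well by the rank-2 sequence X_n, whose two terms diverge.
u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
outer3 = lambda x, y, z: np.einsum("i,j,k->ijk", x, y, z)

X = outer3(u, u, v) + outer3(u, v, u) + outer3(v, u, u)
errs = []
for n in [1, 10, 100, 1000]:
    w = u + v / n
    Xn = n * outer3(w, w, w) - n * outer3(u, u, u)   # two diverging rank-1 terms
    errs.append(np.linalg.norm(Xn - X))
print(errs)  # decreases roughly like sqrt(3)/n
```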
Some useful results on maximal and typical rank for decomposition over $\mathbb{R}$ are summarized in Tables I, II, III – see [14, 27] for more results of this kind, as well as original references. Notice that, for a tensor of a given size, there is always one typical rank over $\mathbb{C}$, which is therefore generic. For $I \times J \times K$ tensors, this generic rank is the value $\lceil \frac{IJK}{I+J+K-2} \rceil$ that can be expected from the equations-versus-unknowns reasoning, except for the so-called defective cases (i) $I > JK - J - K + 2$ (assuming w.l.o.g. that the first dimension $I$ is the largest), (ii) the third-order case of dimension $(4,4,3)$, (iii) the third-order cases of dimension $(2p+1, 2p+1, 3)$, $p \in \mathbb{N}$, and (iv) the fourth-order cases of dimension $(p, p, 2, 2)$, $p \in \mathbb{N}$, where it is 1 higher.⁶ Also note that the typical rank may change when the tensor is constrained in some way; e.g., when the frontal slabs are symmetric, we have the results in Table III, so symmetry may restrict the typical rank. Also, one may be interested in symmetric or asymmetric rank decomposition (i.e., symmetric or asymmetric rank-1 factors) in this case, and therefore symmetric or regular rank. Consider, for example, a fully symmetric tensor, i.e., one such that $\mathbf{X}(i,j,k) = \mathbf{X}(i,k,j) = \mathbf{X}(j,i,k) = \mathbf{X}(j,k,i) = \mathbf{X}(k,i,j) = \mathbf{X}(k,j,i)$, i.e., its value is invariant to any permutation of the three indices (the concept readily generalizes to $N$-way tensors). Then the symmetric rank of an $N$-way tensor $\mathbf{X}$ over $\mathbb{C}$ is defined as the minimum $R$ such that $\mathbf{X}$ can be written as $\mathbf{X} = \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{a}_r \circ \cdots \circ \mathbf{a}_r$, where the outer product involves $N$ copies of vector $\mathbf{a}_r$, and $\mathbf{a}_r \in \mathbb{C}^I$. It has been shown that this symmetric rank equals $\lceil \binom{I+N-1}{N} / I \rceil$ almost surely, except in the defective cases $(N,I) \in \{(3,5), (4,3), (4,4), (4,5)\}$, where it is 1 higher [29]. Taking $N = 3$ as a special case, this formula gives $\lceil (I+1)(I+2)/6 \rceil$. We also remark that constraints such as nonnegativity of a factor matrix can strongly affect rank.

⁶In fact this has been verified numerically over a wide range of dimensions, with vanishingly small probability that a defective case has been overlooked, the limitations being a matter of computing power [28].
[Table I: Maximum attainable rank over $\mathbb{R}$, by tensor size. Entries not recoverable from this copy; see [14, 27].]

[Table II: Typical ranks over $\mathbb{R}$, by tensor size. Entries not recoverable from this copy; see [14, 27].]

[Table III: Typical ranks over $\mathbb{R}$, with partial symmetry vs. no symmetry, by tensor size. Entries not recoverable from this copy; see [14, 27].]
Given a particular tensor $\mathbf{X}$, determining $\text{rank}(\mathbf{X})$ is NP-hard [30]. There is a well-known example of a tensor⁷ whose rank and border rank have been bounded from above and below, but have not been pinned down yet. At this point, the reader may rightfully wonder whether this is an issue in practical applications of tensor decomposition, or merely a mathematical curiosity. The answer is not black-and-white, but rather nuanced: in most applications, one is really interested in fitting a model that has the 'essential' or 'meaningful' number of components that we usually call the (useful signal) rank, which is usually much less than the actual rank of the tensor that we observe, due to noise and other imperfections. Determining this rank is challenging, even in the matrix case. There exist heuristics and a few more disciplined approaches that can help, but, at the end of the day, the process generally involves some trial-and-error.

⁷See the insert entitled Tensors as bilinear operators.
An exception to the above is certain applications where the tensor actually models a mathematical object (e.g., a multilinear map) rather than 'data'. A good example of this is Strassen's matrix multiplication tensor – see the insert entitled Tensors as bilinear operators. A vector-valued (multiple-output) bilinear map can be represented as a third-order tensor, a vector-valued trilinear map as a fourth-order tensor, etc. When working with tensors that represent such maps, one is usually interested in exact factorization, and thus the mathematical rank of the tensor. The border rank is also of interest in this context, when the objective is to obtain a very accurate approximation (say, to within machine precision) of the given map. There are other applications (such as factorization machines, to be discussed later) where one is forced to approximate a general multilinear map in a possibly crude way, but then the number of components is determined by other means, not directly related to notions of rank.
Consider again the three matrix views of a given tensor $\mathbf{X}$ in (3), (2), (1). Looking at $\mathbf{X}_1 = (\mathbf{B} \odot \mathbf{A})\,\mathbf{C}^T$ in (1), note that if $(\mathbf{B} \odot \mathbf{A})$ is full column rank and so is $\mathbf{C}$, then $\text{rank}(\mathbf{X}_1) = F = \text{rank}(\mathbf{X})$. Hence this matrix view of $\mathbf{X}$ is rank-revealing. For this to happen it is necessary (but not sufficient) that $IJ \ge F$ and $K \ge F$, so $F$ has to be small: $F \le \min(IJ, K)$. Appealing to role symmetry of the three modes, it follows that $F \le \max(\min(IJ, K), \min(IK, J), \min(JK, I))$ is necessary to have a rank-revealing matricization of the tensor. However, we know that the (perhaps unattainable) upper bound on $F = \text{rank}(\mathbf{X})$ is $F \le \min(IJ, JK, IK)$, hence for matricization to reveal rank, it must be that the rank is really small relative to the upper bound. More generally, what holds for sure, as we have seen, is that $\text{rank}(\mathbf{X}) \ge \max(R_1, R_2, R_3)$.
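For small $F$, this rank-revealing property is easy to confirm numerically (a NumPy sketch; names and conventions are ours):

```python
import numpy as np

# When F <= min(IJ, K) and the factors are generic, the unfolding
# X1 = (B kr A) C^T is rank-revealing: rank(X1) equals the tensor rank F.
rng = np.random.default_rng(6)
I, J, K, F = 5, 6, 7, 4
A, B, C = (rng.standard_normal((n, F)) for n in (I, J, K))

def kr(P, Q):  # Khatri-Rao (column-wise Kronecker) product
    return np.einsum("pf,qf->pqf", P, Q).reshape(P.shape[0] * Q.shape[0], -1)

X1 = kr(B, A) @ C.T           # IJ x K unfolding of a rank-4 tensor
print(np.linalg.matrix_rank(X1))  # 4
```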