Tensors111The term has different meaning in Physics, however it has been widely adopted across various disciplines in recent years to refer to what was previously known as a multi-way array. (of order higher than two) are arrays indexed by three or more indices, say – a generalization of matrices, which are indexed by two indices, say for (row, column). Matrices are two-way arrays, and there are three- and higher-way arrays (or higher-order) tensors.
Tensor algebra has many similarities but also many striking differences with matrix algebra – e.g., low-rank tensor factorization is essentially unique under mild conditions; determining tensor rank is NP-hard, on the other hand, and the best low-rank approximation of a higher rank tensor may not even exist. Despite such apparent paradoxes and the learning curve needed to digest tensor algebra notation and data manipulation, tensors have already found many applications in signal processing (speech, audio, communications, radar, biomedical), machine learning (clustering, dimensionality reduction, latent factor models, subspace learning), and well beyond. Psychometrics (loosely defined as mathematical methods for the analysis of personality data) and later Chemometrics (likewise, for chemical data) have historically been two important application areas driving theoretical and algorithmic developments. Signal processing followed, in the 90’s, but the real spark that popularized tensors came when the computer science community (notably those in machine learning, data mining, computing) discovered the power of tensor decompositions, roughly a decade ago [1, 2, 3]. There are nowadays many hundreds, perhaps thousands of papers published each year on tensor-related topics. Signal processing applications include, e.g., unsupervised separation of unknown mixtures of speech signals  and code-division communication signals without knowledge of their codes ; and emitter localization for radar, passive sensing, and communication applications [6, 7]. There are many more applications of tensor techniques that are not immediately recognized as such, e.g., the analytical constant modulus algorithm [8, 9]
. Machine learning applications include face recognition, mining musical scores, and detecting cliques in social networks – see[10, 11, 12] and references therein. More recently, there has been considerable work on tensor decompositions for learning latent variable models, particularly topic models 
, and connections between orthogonal tensor decomposition and the method of moments for computing the Latent Dirichlet Allocation (LDA – a widely used topic model).
After two decades of research on tensor decompositions and applications, the senior co-authors still couldn’t point their new graduate students to a single “point of entry” to begin research in this area. This article has been designed to address this need: to provide a fairly comprehensive and deep overview of tensor decompositions that will enable someone having taken first graduate courses in matrix algebra and probability to get started doing research and/or developing related algorithms and software. While no single reference fits this bill, there are several very worthy tutorials and overviews that offer different points of view in certain aspects, and we would like to acknowledge them here. Among them, the highly-cited and clearly-written tutorial  that appeared 7 years ago in SIAM Review
is perhaps the one closest to this article. It covers the basic models and algorithms (as of that time) well, but it does not go deep into uniqueness, advanced algorithmic, or estimation-theoretic aspects. The target audience of is applied mathematics (SIAM). The recent tutorial  offers an accessible introduction, with many figures that help ease the reader into three-way thinking. It covers most of the bases and includes many motivating applications, but it also covers a lot more beyond the basics and thus stays at a high level. The reader gets a good roadmap of the area, without delving into it enough to prepare for research. Another recent tutorial on tensors is , which adopts a more abstract point of view of tensors as mappings from a linear space to another, whose coordinates transform multilinearly under a change of bases. This article is more suited for people interested in tensors as a mathematical concept, rather than how to use tensors in science and engineering. It includes a nice review of tensor rank results and a brief account of uniqueness aspects, but nothing in the way of algorithms or tensor computations. An overview of tensor techniques for large-scale numerical computations is given in [16, 17], geared towards a scientific computing audience; see  for a more accessible introduction. A gentle introduction to tensor decompositions can be found in the highly cited Chemometrics tutorial  – a bit outdated but still useful for its clarity – and the more recent book . Finally,  is an upcoming tutorial with emphasis on scalability and data fusion applications – it does not go deep into tensor rank, identifiability, decomposition under constraints, or statistical performance benchmarking.
None of the above offers a comprehensive overview that is sufficiently deep to allow one to appreciate the underlying mathematics, the rapidly expanding and diversifying toolbox of tensor decomposition algorithms, and the basic ways in which tensor decompositions are used in signal processing and machine learning – and they are quite different. Our aim in this paper is to give the reader a tour that goes ‘under the hood’ on the technical side, and, at the same time, serve as a bridge between the two areas. Whereas we cannot include detailed proofs of some of the deepest results, we do provide insightful derivations of simpler results and sketch the line of argument behind more general ones. For example, we include a one-page self-contained proof of Kruskal’s condition when one factor matrix is full column rank, which illuminates the role of Kruskal-rank in proving uniqueness. We also ‘translate’ between the signal processing (SP) and machine learning (ML) points of view. In the context of the canonical polyadic decomposition (CPD), also known as parallel factor analysis (PARAFAC), SP researchers (and Chemists) typically focus on the columns of the factor matrices , , and the associated rank-1 factors of the decomposition (where denotes the outer product, see section II-C), because they are interested in separation. ML researchers often focus on the rows of , , , because they think of them as parsimonious latent space representations. For a user item context ratings tensor, for example, a row of is a representation of the corresponding user in latent space, and likewise a row of (
) is a representation of the corresponding item (context) in the same latent space. The inner product of these three vectors is used to predict that user’s rating of the given item in the given context. This is one reason why ML researchers tend to use inner (instead of outer) product notation. SP researchers are interested in model identifiability because it guarantees separability; ML researchers are interested in identifiability to be able to interpret the dimensions of the latent space. In co-clustering applications, on the other hand, the rank-1 tensorscapture latent concepts that the analyst seeks to learn from the data (e.g., cliques of users buying certain types of items in certain contexts). SP researchers are trained to seek optimal solutions, which is conceivable for small to moderate data; they tend to use computationally heavier algorithms. ML researchers are nowadays trained to think about scalability from day one, and thus tend to choose much more lightweight algorithms to begin with. There are many differences, but also many similarities and opportunities for cross-fertilization. Being conversant in both communities allows us to bridge the ground between and help SP and ML researchers better understand each other.
The rest of this article is structured as follows. We begin with some matrix preliminaries, including matrix rank and low-rank approximation, and a review of some useful matrix products and their properties. We then move to rank and rank decomposition for tensors. We briefly review bounds on tensor rank, multilinear (mode-) ranks, and relationship between tensor rank and multilinear rank. We also explain the notions of typical, generic, and border rank, and discuss why low-rank tensor approximation may not be well-posed in general. Tensors can be viewed as data or as multi-linear operators, and while we are mostly concerned with the former viewpoint in this article, we also give a few important examples of the latter as well. Next, we provide a fairly comprehensive account of uniqueness of low-rank tensor decomposition. This is the most advantageous difference when one goes from matrices to tensors, and therefore understanding uniqueness is important in order to make the most out of the tensor toolbox. Our exposition includes two stepping-stone proofs: one based on eigendecomposition, the other bearing Kruskal’s mark (“down-converted to baseband” in terms of difficulty). The Tucker model and multilinear SVD come next, along with a discussion of their properties and connections with rank decomposition. A thorough discussion of algorithmic aspects follows, including a detailed discussion of how different types of constraints can be handled, how to exploit data sparsity, scalability, how to handle missing values, and different loss functions. In addition to basic alternating optimization strategies, a host of other solutions are reviewed, including gradient descent, line search, Gauss-Newton, alternating direction method of multipliers, and stochastic gradient approaches. The next topic is statistical performance analysis, focusing on the widely-used Cramér-Rao bound and its efficient numerical computation. This section contains novel results and derivations that are of interest well beyond our present context – e.g., can also be used to characterize estimation performance for a broad range of constrained matrix factorization problems. The final main section of the article presents motivating applications in signal processing (communication and speech signal separation, multidimensional harmonic retrieval) and machine learning (collaborative filtering, mixture and topic modeling, classification, and multilinear subspace learning). We conclude with some pointers to online resources (toolboxes, software, demos), conferences, and some historical notes.
Ii-a Rank and rank decomposition for matrices
Consider an matrix , and let the number of linearly independent columns of , i.e., the dimension of the range space of , . is the minimum such that , where is an basis of , and is and holds the corresponding coefficients. This is because if we can generate all columns of , by linearity we can generate anything in , and vice-versa. We can similarly define the number of linearly independent rows of , which is the minimum such that , where is and is . Noting that
where stands for the -th column of , we have
where and . It follows that , and , so the three definitions actually coincide – but only in the matrix (two-way tensor) case, as we will see later. Note that, per the definition above, is a rank-1 matrix that is ‘simple’ in the sense that every column (or row) is proportional to any other column (row, respectively). In this sense, rank can be thought of as a measure of complexity. Note also that , because obviously , where
is the identity matrix.
Ii-B Low-rank matrix approximation
In practice is usually full-rank, e.g., due to measurement noise, and we observe , where is low-rank and represents noise and ‘unmodeled dynamics’. If the elements of are sampled from a jointly continuous distribution, then will be full rank almost surely – for the determinant of any square submatrix of is a polynomial in the matrix entries, and a polynomial that is nonzero at one point is nonzero at every point except for a set of measure zero. In such cases, we are interested in approximating with a low-rank matrix, i.e., in
The solution is provided by the truncated SVD of , i.e., with , set , or , where denotes the matrix containing columns to of . However, this factorization is non-unique because , for any nonsingular matrix , where . In other words: the factorization of the approximation is highly non-unique (when , there is only scaling ambiguity, which is usually inconsequential). As a special case, when (noise-free) so , low-rank decomposition of is non-unique.
Ii-C Some useful products and their properties
In this section we review some useful matrix products and their properties, as they pertain to tensor computations.
Kronecker product: The Kronecker product of () and () is the matrix
The Kronecker product has many useful properties. From its definition, it follows that . For an matrix , define
i.e., the vector obtained by vertically stacking the columns of . By definition of it follows that .
Consider the product , where is , is , and is . Note that
Therefore, using and linearity of the operator
This is useful when dealing with linear least squares problems of the following form
Khatri–Rao product: Another useful product is the Khatri–Rao (column-wise Kronecker) product of two matrices with the same number of columns (see [20, p. 14] for a generalization). That is, with and , the Khatri–Rao product of and is . It is easy to see that, with being a diagonal matrix with vector on its diagonal (we will write , and , where we have implicitly defined operators and to convert one to the other), the following property holds
which is useful when dealing with linear least squares problems of the following form
It should now be clear that the Khatri–Rao product is a subset of columns from . Whereas contains the ‘interaction’ (Kronecker product) of any column of with any column of , contains the Kronecker product of any column of with only the corresponding column of .
(associative); so we may simply write as . Note though that , so the Kronecker product is non-commutative.
(note order, unlike ).
, where , stand for conjugation and Hermitian (conjugate) transposition, respectively.
(the mixed product rule). This is very useful – as a corollary, if and are square nonsingular, then it follows that , and likewise for the pseudo-inverse. More generally, if is the SVD of , and is the SVD of , then it follows from the mixed product rule that is the SVD of . It follows that
, for square , .
, for square , .
The Khatri–Rao product has the following properties, among others:
(associative); so we may simply write as . Note though that , so the Khatri–Rao product is non-commutative.
(mixed product rule).
Tensor (outer) product: The tensor product or outer product of vectors and is defined as the matrix with elements , . Note that . Introducing a third vector , we can generalize to the outer product of three vectors, which is an three-way array or third-order tensor with elements . Note that the element-wise definition of the outer product naturally generalizes to three- and higher-way cases involving more vectors, but one loses the ‘transposition’ representation that is familiar in the two-way (matrix) case.
Iii Rank and rank decomposition for tensors: CPD / PARAFAC
We know that the outer product of two vectors is a ‘simple’ rank-1 matrix – in fact we may define matrix rank as the minimum number of rank-1 matrices (outer products of two vectors) needed to synthesize a given matrix. We can express this in different ways: if and only if (iff) is the smallest integer such that for some and , or, equivalently, , .
A rank-1 third-order tensor of size is an outer product of three vectors: , , , and ; i.e., – see Fig. 1. A rank-1 -th order tensor is likewise an outer product of vectors: , , ; i.e., . In the sequel we mostly focus on third-order tensors for brevity; everything naturally generalizes to higher-order tensors, and we will occasionally comment on such generalization, where appropriate.
The rank of tensor is the minimum number of rank-1 tensors needed to produce as their sum – see Fig. 2 for a tensor of rank three. Therefore, a tensor of rank at most can be written as
where , , and . It is also customary to use , so . For brevity, we sometimes also use the notation to denote the relationship .
Let us now fix and look at the frontal slab of . Its elements can be written as
where we note that the elements of the first row of weigh the rank-1 factors (outer products of corresponding columns of and ). We will denote for brevity. Hence, for any ,
Applying the vectorization property of it now follows that
and by parallel stacking, we obtain the matrix unfolding (or, matrix view)
We see that, when cast as a matrix, a third-order tensor of rank admits factorization in two matrix factors, one of which is specially structured – being the Khatri–Rao product of two smaller matrices. One more application of the vectorization property of yields the vector
where is an vector of all 1’s. Hence, when converted to a long vector, a tensor of rank is a sum of structured vectors, each being the Khatri–Rao / Kronecker product of three vectors (in the three-way case; or more vectors in higher-way cases).
In the same vain, we may consider lateral or horizontal slabs222A warning for Matlab aficionados: due to the way that Matlab stores and handles tensors, one needs to use the ‘squeeze’ operator, i.e., , and ., e.g.,
and similarly333One needs to use the ‘squeeze’ operator here as well. , so
Iii-a Low-rank tensor approximation
We are in fact ready to get a first glimpse on how we can go about estimating , , from (possibly noisy) data . Adopting a least squares criterion, the problem is
where is the sum of squares of all elements of (the subscript in stands for Frobenius (norm), and it should not be confused with the number of factors in the rank decomposition – the difference will always be clear from context). Equivalently, we may consider
Note that the above model is nonconvex (in fact trilinear) in , , ; but fixing and , it becomes (conditionally) linear in , so that we may update
and, using the other two matrix representations of the tensor, update
until convergence. The above algorithm, widely known as Alternating Least Squares (ALS) is a popular way of computing approximate low-rank models of tensor data. We will discuss algorithmic issues in depth at a later stage, but it is important to note that ALS is very easy to program, and we encourage the reader to do so – this exercise helps a lot in terms of developing the ability to ‘think three-way’.
Iii-B Bounds on tensor rank
For an matrix , we know that , and almost surely, meaning that rank-deficient real (complex) matrices are a set of Lebesgue measure zero in . What can we say about tensors ? Before we get to this, a retrospective on the matrix case is useful. Considering where is and is , the size of such parametrization (the number of unknowns, or degrees of freedom (DoF) in the model) of is444Note that we have taken away DoF due to the scaling / counter-scaling ambiguity, i.e., we may always multiply a column of and divide the corresponding column of with any nonzero number without changing . . The number of equations in is , and equations-versus-unknowns considerations suggest that of order may be needed – and this turns out being sufficient as well.
For third-order tensors, the DoF in the low-rank parametrization is555Note that here we can scale, e.g., and at will, and counter-scale , which explains the . , whereas the number of equations is . This suggests that may be needed to describe an arbitrary tensor of size , i.e., that third-order tensor rank can potentially be as high as . In fact this turns out being sufficient as well. One way to see this is as follows: any frontal slab can always be written as , with and having at most columns. Upon defining , , and (where is an identity matrix of size , and is a vector of all 1’s of size ), we can synthesize as . Noting that and have at most columns, it follows that we need at most columns in , , . Using ‘role symmetry’ (switching the names of the ‘ways’ or ‘modes’), it follows that we in fact need at most columns in , , , and thus the rank of any three-way array is bounded above by . Another (cleaner but perhaps less intuitive) way of arriving at this result is as follows. Looking at the matrix unfolding
and noting that is and is , the issue is what is the maximum inner dimension that we need to be able to express an arbitrary matrix on the left (corresponding to an arbitrary tensor ) as a Khatri–Rao product of two , matrices, times another matrix? The answer can be seen as follows:
and thus we need at most columns per column of , which has columns – QED.
This upper bound on tensor rank is important because it spells out that tensor rank is finite, and not much larger than the equations-versus-unknowns bound that we derived earlier. On the other hand, it is also useful to have lower bounds on rank. Towards this end, concatenate the frontal slabs one next to each other
since . Note that is , and it follows that must be greater than or equal to the dimension of the column span of , i.e., the number of linearly independent columns needed to synthesize any of the columns of . By role symmetry, and upon defining
we have that . is the mode-1 or mode-A rank of , and likewise and are the mode-2 or mode-B and mode-3 or mode-C ranks of , respectively. is sometimes called the column rank, the row rank, and the fiber or tube rank of . The triple is called the multilinear rank of .
At this point it is worth noting that, for matrices we have that column rank = row rank = rank, i.e., in our current notation, for a matrix (which can be thought of as an third-order tensor) it holds that , but for nontrivial tensors , , and are in general different, with . Since , , , it follows that for matrices but can be for tensors.
Now, going back to the first way of explaining the upper bound we derived on tensor rank, it should be clear that we only need rank-1 factors to describe any given frontal slab of the tensor, and so we can describe all slabs with at most rank-1 factors; with a little more thought, it is apparent that is enough. Appealing to role symmetry, it then follows that , where . Dropping the explicit dependence on for brevity, we have
Iii-C Typical, generic, and border rank of tensors
Consider a tensor
whose elements are i.i.d., drawn from the standard normal distribution( in Matlab). The rank of over the real field, i.e., when we consider
This is very different from the matrix case, where with probability 1. To make matters more (or less) curious, the rank of the same is in fact 2 with probability 1 when we instead consider decomposition over the complex field, i.e., using . As another example , for ,
To understand this behavior, consider the case. We have two slabs, and . For to have , we must be able to express these two slabs as
for some real or complex matrices , , and , depending on whether we decompose over the real or the complex field. Now, if , then both and are nonsingular matrices, almost surely (with probability 1). It follows from the above equations that , , , and must all be nonsingular too. Denoting , , it follows that , and substituting in the second equation we obtain , i.e., we obtain the eigen-problem
It follows that for over , the matrix should have two realeigenvalues; but complex conjugate eigenvalues do arise with positive probability. When they do, we have over , but over – and it turns out that over is enough.
We see that the rank of a tensor for decomposition over
is a random variable that can take more than one value with positive probability. These values are calledtypical ranks. For decomposition over the situation is different: with probability 1, so there is only one typical rank. When there is only one typical rank (that occurs with probability 1 then) we call it generic rank.
All these differences with the usual matrix algebra may be fascinating – and they don’t end here either. Consider
has rank equal to 3, but border rank equal to 2 . It is also worth noting that contains two diverging rank-1 components that progressively cancel each other approximately, leading to ever-improving approximation of . This situation is actually encountered in practice when fitting tensors of border rank lower than their rank. Also note that the above example shows clearly that the low-rank tensor approximation problem
is ill-posed in general, for there is no minimum if we pick equal to the border rank of – the set of tensors of a given rank is not closed. There are many ways to fix this ill-posedness, e.g., by adding constraints such as element-wise non-negativity of [24, 25] in cases where is element-wise non-negative (and these constraints are physically meaningful), or orthogonality  – any application-specific constraint that prevents terms from diverging while approximately canceling each other will do. An alternative is to add norm regularization to the cost function, such as . This can be interpreted as coming from a Gaussian prior on the sought parameter matrices; yet, if not properly justified, regularization may produce artificial results and a false sense of security.
Some useful results on maximal and typical rank for decomposition over are summarized in Tables III, III, III – see [14, 27] for more results of this kind, as well as original references. Notice that, for a tensor of a given size, there is always one typical rank over , which is therefore generic. For tensors, this generic rank is the value that can be expected from the equations-versus-unknowns reasoning, except for the so-called defective cases (i) (assuming w.l.o.g. that the first dimension is the largest), (ii) the third-order case of dimension , (iii) the third-order cases of dimension , , and (iv) the fourth-order cases of dimension , , where it is 1 higher 666In fact this has been verified for , with the probability that a defective case has been overlooked less than , the limitations being a matter of computing power .. Also note that the typical rank may change when the tensor is constrained in some way; e.g., when the frontal slabs are symmetric, we have the results in Table III, so symmetry may restrict the typical rank. Also, one may be interested in symmetric or asymmetric rank decomposition (i.e., symmetric or asymmetric rank-1 factors) in this case, and therefore symmetric or regular rank. Consider, for example, a fully symmetric tensor, i.e., one such that , i.e., its value is invariant to any permutation of the three indices (the concept readily generalizes to -way tensors ). Then the symmetric rank of over is defined as the minimum such that can be written as , where the outer product involves copies of vector , and . It has been shown that this symmetric rank equals almost surely except in the defective cases , where it is 1 higher . Taking as a special case, this formula gives . We also remark that constraints such as nonnegativity of a factor matrix can strongly affect rank.
|Size||Maximum attainable rank over|
|Size||Typical ranks over|
|Size||Typical ranks,||Typical ranks,|
|partial symmetry||no symmetry|
Given a particular tensor , determining is NP-hard . There is a well-known example of a tensor777See the insert entitled Tensors as bilinear operators. whose rank (border rank) has been bounded between and ( and
, resp.), but has not been pinned down yet. At this point, the reader may rightfully wonder whether this is an issue in practical applications of tensor decomposition, or merely a mathematical curiosity? The answer is not black-and-white, but rather nuanced: In most applications, one is really interested in fitting a model that has the “essential” or “meaningful” number of components that we usually call the (useful signal) rank, which is usually much less than the actual rank of the tensor that we observe, due to noise and other imperfections. Determining this rank is challenging, even in the matrix case. There exist heuristics and a few more disciplined approaches that can help, but, at the end of the day, the process generally involves some trial-and-error.
An exception to the above is certain applications where the tensor actually models a mathematical object (e.g., a multilinear map) rather than “data”. A good example of this is Strassen’s matrix multiplication tensor – see the insert entitled Tensors as bilinear operators. A vector-valued (multiple-output) bilinear map can be represented as a third-order tensor, a vector-valued trilinear map as a fourth-order tensor, etc. When working with tensors that represent such maps, one is usually interested in exact factorization, and thus the mathematical rank of the tensor. The border rank is also of interest in this context, when the objective is to obtain a very accurate approximation (say, to within machine precision) of the given map. There are other applications (such as factorization machines, to be discussed later) where one is forced to approximate a general multilinear map in a possibly crude way, but then the number of components is determined by other means, not directly related to notions of rank.
Consider again the three matrix views of a given tensor in (3), (2), (1). Looking at in (1), note that if is full column rank and so is , then . Hence this matrix view of is rank-revealing. For this to happen it is necessary (but not sufficient) that , and , so has to be small: . Appealing to role symmetry of the three modes, it follows that is necessary to have a rank-revealing matricization of the tensor. However, we know that the (perhaps unattainable) upper bound on is , hence for matricization to reveal rank, it must be that the rank is really small relative to the upper bound. More generally, what holds for sure, as we have seen, is that .