1 Tensorial independent component analysis
In modern data analysis, an increasingly common and natural assumption is that the covariance matrices are Kronecker-structured: the random vector $x \in \mathbb{R}^{pq}$ has a covariance matrix $\Sigma = \Sigma_2 \otimes \Sigma_1$, where $\Sigma_1 \in \mathbb{R}^{p \times p}$ and $\Sigma_2 \in \mathbb{R}^{q \times q}$ are positive-definite and $\otimes$ is the Kronecker product. In order for an estimator of $\Sigma$ to be adequate, it should utilize this special structure, see e.g. Leng and Pan (2018); Roś et al. (2016); Srivastava et al. (2008); Werner et al. (2008); Wiesel (2012). In particular, if the Kronecker structure is ignored, the number of parameters is inflated from the order of $p^2 + q^2$ to the order of $p^2 q^2$. Basic properties of the Kronecker product ensure that any zero-mean random vector $x$ with the Kronecker covariance structure admits a natural representation as a $p \times q$ random matrix by means of the matrix location-scatter model,
$X = M + \Omega_1 Z \Omega_2^\top,$
where the original random vector is obtained from vectorization, $x = \mathrm{vec}(X)$, and the latent random matrix $Z$ is standardized such that $\mathrm{Cov}\{\mathrm{vec}(Z)\} = I_{pq}$. In order to find interesting structures, the uncorrelatedness assumption for the elements of $Z$ is occasionally not enough. In Virta et al. (2017) the uncorrelatedness assumption was replaced by the assumption of full statistical independence and the location-scatter model was extended to the matrix-valued independent component model,
$X = M + \Omega_1 Z \Omega_2^\top, \qquad (1)$
where the invertible $\Omega_1 \in \mathbb{R}^{p \times p}$ and $\Omega_2 \in \mathbb{R}^{q \times q}$ are unknown parameters and the latent matrix $Z$ is assumed to have mutually independent, marginally standardized components. The procedure of estimating a latent vector with independent components, using only the information provided by observed linear combinations of the components, is referred to as independent component analysis (ICA); see Comon and Jutten (2010); Hyvärinen et al. (2004); Ilmonen and Paindaveine (2011); Miettinen et al. (2015); Nordhausen and Oja (2018); Samarov and Tsybakov (2004) for different approaches under classic multivariate settings. Since the location has no effect on the estimation of the so-called mixing matrices $\Omega_1$ and $\Omega_2$, we can without loss of generality assume that $Z$ is centered. Under the additional assumption that at most one entire row of $Z$ has a multivariate normal distribution and at most one entire column of $Z$ has a multivariate normal distribution, it can be shown that the latent $Z$ is identifiable up to the order, joint scaling and signs of its rows and columns (Virta et al., 2017). This pair of assumptions is less restrictive than its counterpart in standard vector-valued ICA, which allows a maximum of one normal component in the latent vector (Comon and Jutten, 2010). On the contrary, the assumptions of the matrix model allow $Z$ to contain a large number of normal components, if the non-normal elements are suitably located.

Vectorizing Eq. (1) yields a Kronecker-structured standard independent component model
$\mathrm{vec}(X) = \mathrm{vec}(M) + (\Omega_2 \otimes \Omega_1)\, \mathrm{vec}(Z),$
where $\mathrm{vec}(\cdot)$ stacks the columns of its argument. Again, here any reasonable ICA approach should take the special form of the mixing matrix $\Omega_2 \otimes \Omega_1$ into account. To our knowledge, no structured estimation methods have yet been developed for standard vector-valued ICA in this setting. However, the problem has been considered under the matrix representation of Eq. (1) in Virta et al. (2017, 2018), where two classical ICA procedures, fourth-order blind identification (FOBI) (Cardoso, 1989) and joint diagonalization of eigenmatrices (JADE) (Cardoso and Souloumiac, 1993), are extended to TFOBI and TJADE in order to solve Eq. (1). Even though TJADE provides a definite improvement in efficiency with respect to TFOBI and the classical multivariate versions of the procedures, it is relatively expensive to compute. Consequently, the main goal of this paper is to address this issue. We provide a computationally more efficient, alternative TJADE-based estimator which retains the desirable statistical properties of TJADE. In particular, the estimator retains the consistency and the limiting distribution of TJADE.
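To make the Kronecker structure of the vectorized model concrete, the following sketch (with arbitrary illustrative dimensions and distributions, not taken from the paper) generates data from the matrix model and numerically verifies the identity $\mathrm{vec}(\Omega_1 Z \Omega_2^\top) = (\Omega_2 \otimes \Omega_1)\,\mathrm{vec}(Z)$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 4

# Hypothetical invertible mixing matrices and a latent matrix Z with
# independent, standardized (centered exponential) entries.
omega1 = rng.normal(size=(p, p))
omega2 = rng.normal(size=(q, q))
Z = rng.exponential(size=(p, q)) - 1.0

# Matrix IC model with the location taken to be zero: X = Omega_1 Z Omega_2^T.
X = omega1 @ Z @ omega2.T

# Vectorizing column-wise yields the Kronecker-structured vector model:
# vec(X) = (Omega_2 kron Omega_1) vec(Z).
lhs = X.flatten(order="F")
rhs = np.kron(omega2, omega1) @ Z.flatten(order="F")
print(np.allclose(lhs, rhs))  # True
```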
We wish to point out that the statistical models for matrix data we are about to discuss are markedly different from the ones usually encountered when discussing array-valued data in engineering contexts. The relevant engineering literature is populated mostly with different tensor decompositions, the most popular ones being the Tucker decomposition and the CP decomposition, which extend the singular value decomposition to higher-order structures in different ways; see Cichocki et al. (2015); Kolda and Bader (2009) for comprehensive reviews. A fundamental difference between the classical tensor decompositions and our current model is that the former rarely incorporate the concept of sample in the formulation. As such, tensor decompositions generally also reduce the dimension of the observation space, which is very unusual for classical statistical methods. These fundamentally different objectives also prevent any simple, meaningful comparisons, and therefore we have refrained from including comparisons to classical tensor decompositions in this work.

The rest of the paper is structured as follows. In Section 2, we review the existing methodology on tensorial independent component analysis and, in particular, TJADE, on which our proposed extension is based. In Section 3, we formulate $k$-TJADE in the case of matrix-valued observations and thoroughly establish its limiting behaviour along with the necessary assumptions. The theoretical results are extended to a general tensor-valued model in Section 4 using the technique of matricization. Section 5 explores the finite-sample performance of the proposed method using both simulations and a video data example, and in Section 6 we provide a short summary and list ideas for future work. Technical proofs are presented in Appendix A.
2 Tensorial JADE
We begin by briefly describing the theory behind TFOBI and TJADE. Everything in the following is formulated on the population level for a random matrix $X \in \mathbb{R}^{p \times q}$. In practice, one would obtain a sample $X_1, \ldots, X_n$ from the distribution of $X$, and the expected values below should be replaced by the corresponding sample means.
Assuming that the random matrix follows the model in Eq. (1), the first step is to simultaneously standardize both the rows and the columns of the matrix using the left and right covariance matrices of $X$,
$\Sigma_1(X) = \frac{1}{q}\, E\left[(X - M)(X - M)^\top\right] \quad \text{and} \quad \Sigma_2(X) = \frac{1}{p}\, E\left[(X - M)^\top (X - M)\right].$
We denote the unique symmetric inverse square root of a positive-definite matrix $S$ by $S^{-1/2}$. The standardized variable $X_{st} = \Sigma_1(X)^{-1/2} (X - M)\, \Sigma_2(X)^{-1/2}$ satisfies $X_{st} = \tau U_1 Z U_2^\top$ for some $\tau > 0$ and some orthogonal matrices $U_1 \in \mathbb{R}^{p \times p}$, $U_2 \in \mathbb{R}^{q \times q}$, see Virta et al. (2017). The unknown constant of proportionality $\tau$ is a result of the joint scaling of the model in Eq. (1) being left unfixed. After the standardization, solving the IC problem is reduced to the estimation of the orthogonal matrices $U_1$, $U_2$, a task commonly addressed in ICA using higher-order cumulants. The TFOBI and TJADE procedures also utilize higher-order cumulants. In TFOBI, a Fisher-consistent (under mild assumptions, see Section 3) estimator, $\Gamma$, for the inverse of the matrix $\Omega_1$ is constructed as $\Gamma = V^\top \Sigma_1(X)^{-1/2}$, where the columns of $V$ are the eigenvectors of the fourth-moment matrix $E\left[X_{st} X_{st}^\top X_{st} X_{st}^\top\right]$. Thus, $\Gamma$ provides a solution for the left-hand side of the IC model in Eq. (1) (Virta et al., 2017).
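The simultaneous standardization can be sketched as follows; this is a minimal illustration assuming the sample-moment conventions stated above (the exact scaling constants of the paper's implementation may differ):

```python
import numpy as np

def inv_sqrt(S):
    """Unique symmetric inverse square root of a positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def standardize(X):
    """Simultaneously standardize the rows and columns of a sample of
    matrices X with shape (n, p, q), replacing expectations by sample means."""
    n, p, q = X.shape
    X = X - X.mean(axis=0)                           # center
    S1 = np.einsum("nij,nkj->ik", X, X) / (n * q)    # left covariance
    S2 = np.einsum("nji,njk->ik", X, X) / (n * p)    # right covariance
    return np.einsum("ij,njk,kl->nil", inv_sqrt(S1), X, inv_sqrt(S2))
```

After this step, the sample left covariance matrix of the output is approximately proportional to the identity; the proportionality constant corresponding to $\tau^2$ remains unknown.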
The TJADE procedure utilizes a set of matrices, $\{B_{ij}\}_{i,j=1}^{p}$, referred to as the set of cumulant matrices, such that
$B_{ij} = E\left[X_{st} E_{ji} X_{st}^\top X_{st} X_{st}^\top\right] - \Sigma_1(X_{st}) \left(E_{ij} + E_{ji} + \delta_{ij} I_p\right) \Sigma_1(X_{st}), \qquad (2)$
where $\delta_{ij}$ is the Kronecker delta, $e_i$, $i = 1, \ldots, p$, are the standard basis vectors of $\mathbb{R}^p$ and $E_{ij} = e_i e_j^\top$. The left covariance matrix of the standardized matrix satisfies $\Sigma_1(X_{st}) = \tau^2 I_p$ and is included in Eq. (2) solely for the estimation of the constant of proportionality $\tau$. The authors in Virta et al. (2018) proved that the joint diagonalizer of the set $\{B_{ij}\}$ is, under mild assumptions (see Section 3), equal to the orthogonal matrix $U_1$ up to the order and signs of its columns.
up to the order and signs of its columns. The joint diagonalizer of the set is defined as any orthogonal matrix that minimizes(3) 
where is equal to with the diagonal elements set to zero.
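The criterion of Eq. (3) is simple to evaluate; a sketch (illustrative, not the paper's implementation):

```python
import numpy as np

def off_criterion(U, mats):
    """Sum of squared off-diagonal elements of U^T B U over a set of
    symmetric matrices B, i.e. the objective minimized in Eq. (3)."""
    total = 0.0
    for B in mats:
        M = U.T @ B @ U
        total += np.sum(M ** 2) - np.sum(np.diag(M) ** 2)
    return total
```

When the matrices share an exact common eigenbasis, that basis drives the criterion to zero; for a general, only approximately jointly diagonalizable set, Jacobi-rotation algorithms search for the minimizing orthogonal $U$.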
The joint diagonalizer $U$ defines a coordinate system in which the transformed matrices $U^\top B_{ij} U$, $i, j = 1, \ldots, p$, have off-diagonal elements as close to zero as possible. There exist several algorithms for optimizing Eq. (3), the most popular being the Jacobi rotation technique; for details, see Belouchrani et al. (1997); Illner et al. (2015). After the estimation of the joint diagonalizer $U$, an estimated inverse for the matrix $\Omega_1$ is obtained as the TJADE functional $U^\top \Sigma_1(X)^{-1/2}$.

The results of this paper are derived only for the left-hand side of the matrix-valued model. The right-hand side can be solved exactly as the left-hand side: one simply takes the transpose of $X$ and proceeds as with the left-hand side, and only the sizes of the cumulant and transformation matrices change from $p \times p$ to $q \times q$. Moreover, matricization allows us to extend the estimators beyond the matrix-valued model to arbitrary-dimensional tensor-valued IC models, see Virta et al. (2017, 2018), and to hide a considerable amount of the unpleasant notation related to tensor algebra, see Section 4. In total, it is sufficient to present the results in matrix form and only for the left-hand side of the model in Eq. (1).
When the dimension $q$ is equal to one, the TJADE procedure for the left-hand side of the model is equivalent to the standard JADE for vector-valued data. Extensive comparisons between JADE and TJADE are conducted in Virta et al. (2016, 2017, 2018), with the conclusion that the latter is uniformly superior to the former under the Kronecker-structured IC model. Moreover, the tensorial version is computationally significantly faster. Consider a tensor of $r$th order with all dimensions of size $p$. Standard JADE requires a single joint diagonalization of $p^{2r}$ matrices of size $p^r \times p^r$, whereas TJADE requires $r$ joint diagonalizations of sets of $p^2$ matrices of size $p \times p$. In essence, adding dimensions to a tensor has a multiplicative effect on the number of operations the classic vectorial methods require and merely an additive effect on the tensorial methods. However, even with its considerable advantages over JADE, running the TJADE procedure is slow for large tensors.
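The comparison above can be restated programmatically; the counts below are reconstructed from the text and should be read as indicative:

```python
def joint_diag_load(r, p):
    """Joint diagonalization load for an r-th order tensor with all
    dimensions equal to p, as a triple (number of diagonalizations,
    matrices per diagonalization, matrix side length)."""
    jade = (1, p ** (2 * r), p ** r)   # vectorized standard JADE
    tjade = (r, p ** 2, p)             # TJADE: one diagonalization per mode
    return jade, tjade
```

For example, for $r = 3$ and $p = 4$, vectorized JADE diagonalizes 4096 matrices of size $64 \times 64$, while TJADE performs three separate diagonalizations of 16 matrices of size $4 \times 4$: a multiplicative versus an additive effect of the order.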
To obtain a faster method, we approach the problem in the spirit of Miettinen et al. (2013), where a faster version of JADE, $k$-JADE, is derived. The modification can be described very succinctly: instead of diagonalizing the entire set of cumulant matrices, we diagonalize only a specific subset of them, chosen such that the desirable statistical properties of TJADE are still carried over to the extension. Since the subset of cumulant matrices can be chosen separately in each direction of the tensor, the $k$-JADE approach provides even more significant improvements in tensorial settings than in its original use in improving JADE. Note that ideas similar to those of Miettinen et al. (2013) were used already in Cardoso (1999) to formulate shifted blocks for blind separation (SHIBBS), where only those cumulant matrices of regular JADE with matching indices are diagonalized.
3 Tensorial k-JADE
In this section we propose a novel extension of the TJADE procedure. We formulate the extension, $k$-TJADE, such that it retains the following three key properties of TJADE. The first of these properties is the ability to solve the tensor independent component model, manifesting either as Fisher consistency or consistency, depending on whether we are at the population or sample level, respectively. The second property is orthogonal equivariance under arbitrary data. ICA estimators are customarily expected to have some form of equivariance, which makes the generalization of limiting properties more straightforward (Miettinen et al., 2015; Virta et al., 2017). The third desired property is the limiting distribution of TJADE, currently the most efficient known tensorial ICA method. Next, we establish these properties one by one. As mentioned in the previous section, all results derived for the left-hand side of the model also hold for the right-hand side, prompting us to consider only the former in the following.
We define matrix independent component functionals, the extension of independent component functionals (Miettinen et al., 2015) to matricial ICA.
Definition 1.
A $p \times p$ matrix-valued functional $\Gamma \equiv \Gamma(X)$ is a matrix independent component (IC) functional if

(i) $\Gamma \Omega_1 \equiv I_p$ for all $X$ that follow the matrix IC model of Eq. (1), and

(ii) $\Gamma(V X W^\top) \equiv \Gamma(X) V^\top$ for all $X$ and all orthogonal $V \in \mathbb{R}^{p \times p}$, $W \in \mathbb{R}^{q \times q}$,

where two matrices satisfy $\Gamma_1 \equiv \Gamma_2$ if $\Gamma_1 = c\, P D\, \Gamma_2$ for some $c > 0$, some diagonal matrix $D$ with diagonal elements equal to $\pm 1$ and some permutation matrix $P$.
The first condition in Definition 1 requires that a matrix IC functional must be able to solve the left-hand side of the model in Eq. (1) (Fisher consistency). The second condition essentially states that the functional cancels out any orthogonal transformations of the observed matrices (orthogonal equivariance). As a particularly useful consequence of the latter, the limiting distribution of a matrix IC functional under trivial mixing, $\Omega_1 = I_p$, $\Omega_2 = I_q$, instantly generalizes to any orthogonal mixing as well.
Let $\kappa = (\kappa_1, \ldots, \kappa_p)^\top$ be the vector of the row means of the element-wise kurtoses, $\kappa_{ij} = E(z_{ij}^4) - 3$, of the elements of $Z$.
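A sample version of the vector $\kappa$ can be computed as follows (a sketch; the shape convention and the use of excess kurtosis are assumptions made for illustration):

```python
import numpy as np

def row_kurtosis_means(Z):
    """Row means of the element-wise excess kurtoses of a sample of
    latent matrices Z with shape (n, p, q)."""
    c = Z - Z.mean(axis=0)
    kurt = (c ** 4).mean(axis=0) / (c ** 2).mean(axis=0) ** 2 - 3.0
    return kurt.mean(axis=1)
```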
Assumption 1.
At most one element of $\kappa$ equals zero.
Assumption 2 ($k$).
The multiplicities of the elements of $\kappa$ are at most $k$.
The TFOBI functional is a matrix IC functional in the sense of Definition 1 if Assumption 2 is satisfied with $k = 1$, and the TJADE functional is a matrix IC functional if Assumption 1 is satisfied. Naturally, the column-mean analogues of the assumptions are required to separate the right-hand side of the model in Eq. (1).
The same comparison as was made between the normality assumptions in Section 1 holds analogously between Assumption 1, Assumption 2 and their vectorial counterparts, which allow at most one zero-kurtosis component or only distinct kurtoses, respectively. The main implication is that in matrix ICA numerous latent components may have identical kurtoses, as long as their row means (and column means, when separating the right-hand side of the model) satisfy the necessary requirements. The assumptions also satisfy the following set of relations,
$\text{Assumption 2}(1) \Rightarrow \text{Assumption 2}(2) \Rightarrow \cdots \Rightarrow \text{Assumption 2}(p),$
where "$\Rightarrow$" means "implies". Moreover, we also have Assumption 2$(1)$ $\Rightarrow$ Assumption 1.
In order to speed up TJADE such that the properties in Definition 1 are retained, we proceed as in Miettinen et al. (2013): instead of diagonalizing the set $\{B_{ij}\}_{i,j=1}^{p}$, we diagonalize only those members of it which satisfy $|i - j| < k$, for some predefined value of the tuning parameter $k$. This discarding can be motivated in two ways. Firstly, all except the repeated-index matrices, $B_{ii}$, vanish asymptotically: every matrix $B_{ij}$, $i \neq j$, equals the zero matrix at the population level, implying that with increasing sample size all the separation information is eventually contained in the repeated-index matrices. Secondly, by assuming that the values in $\kappa$ are in decreasing order (this is guaranteed by our next modification), the $i$th row of $Z$ is the most difficult to separate from its immediate neighboring rows, and the separation information between them is contained precisely in the matrices $B_{ij}$ and $B_{ji}$ where $j$ is close to $i$.
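The retained subset is easy to enumerate; a sketch of the index selection under the $|i - j| < k$ rule described above:

```python
def ktjade_index_pairs(p, k):
    """Index pairs (i, j) of the cumulant matrices B_ij retained by
    k-TJADE: only matrices with |i - j| < k are jointly diagonalized."""
    return [(i, j) for i in range(1, p + 1)
            for j in range(1, p + 1) if abs(i - j) < k]
```

With $k = 1$ only the $p$ repeated-index matrices remain (as in SHIBBS), while $k = p$ retains all $p^2$ matrices and recovers full TJADE; e.g. for $p = 5$, the choices $k = 1, 2, 5$ keep 5, 13 and 25 matrices, respectively.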
Analogously to $k$-JADE, we use the TFOBI algorithm to obtain an initial value for the functional. This ensures that, even after the previous modification, the functional remains orthogonally equivariant. The following definition and theorem formalize our resulting novel method, called $k$-TJADE.
Definition 2.
Fix $k \in \{1, \ldots, p\}$. The $k$-TJADE functional is
$\Gamma_k = U_k^\top \Gamma_0,$
where $\Gamma_0$ is the TFOBI functional, $X_0$ is the TFOBI solution for $X$ and the orthogonal matrix $U_k$ is the joint diagonalizer of the set of matrices $\{B_{ij}(X_0) : |i - j| < k\}$.
Theorem 1.
Let $X$ follow the matrix IC model in Eq. (1) and let Assumptions 1 and 2$(k)$ hold. Then the $k$-TJADE functional is a matrix IC functional in the sense of Definition 1.
Theorem 1 provides the Fisher consistency and the orthogonal equivariance of the $k$-TJADE functional. The assumptions that Theorem 1 requires are interesting, since they provide an interpretation for the tuning parameter $k$: the parameter is the maximal number of allowed kurtosis-mean multiplicities. The values $k = 1$ and $k = p$ correspond to the extreme cases where all the kurtosis means have to be distinct (as in TFOBI) and where no assumptions are made on the multiplicities of the non-zero kurtosis means (as in TJADE). Thus, $k$-TJADE can be seen as a middle ground between TFOBI and TJADE. Like the assumptions, the methods too can be ordered according to the strictness of the assumptions they require,
$1\text{-TJADE} \succeq 2\text{-TJADE} \succeq \cdots \succeq p\text{-TJADE} \simeq \text{TJADE},$
where "$\succeq$" is read as "makes at least as many assumptions as" and "$\simeq$" as "makes the same assumptions as".
Assumptions 1 and 2 not only provide Theorem 1 but are also sufficient to guarantee the two remaining desired properties of $k$-TJADE: consistency and the same limiting distribution as that of TJADE. These asymptotic properties are formalized in the following two theorems. Remarkably, as a special case ($q = 1$), the latter also proves a previously unsolved conjecture about the limiting behavior of vectorial $k$-JADE posed in Miettinen et al. (2015).
Theorem 2.
Let Assumptions 1 and 2$(k)$ hold for some fixed $k$ and let $X_1, X_2, \ldots$ be an i.i.d. sequence from the matrix IC model in Eq. (1) with identity mixing matrices, $\Omega_1 = I_p$, $\Omega_2 = I_q$. Assume that the population quantity $X$ has finite eighth moments. Then, for all $k' \geq k$, there exists a consistent sequence of $k'$-TJADE estimators $\hat{\Gamma}_{k'}$ (indexed by the sample size $n$). That is, $\hat{\Gamma}_{k'} \rightarrow_P I_p$.

Theorem 3.
Under the assumptions of Theorem 2, we have, for all $k' \geq k$, that
$\sqrt{n}\,(\hat{\Gamma}_{k'} - I_p) = \sqrt{n}\,(\hat{\Gamma} - I_p) + o_P(1),$
where $\hat{\Gamma}$ is the TJADE estimator. The notation $o_P(1)$ refers to a sequence of random matrices that converges in probability to the zero matrix.
Note that $\hat{\Gamma} = \hat{\Gamma}_{k'}$ does not generally hold, as the latter estimator utilizes the preliminary TFOBI step while the former does not. The limiting distribution of $\sqrt{n}\,(\hat{\Gamma} - I_p)$ is shown in Virta et al. (2018) to be multivariate normal, and closed-form expressions for its limiting variances are also given therein. By the orthogonal equivariance of matrix independent component functionals (property (ii) of Definition 1), the limiting results of Theorem 3 generalize to any orthogonal mixing matrices. Note that the original $k$-JADE in Miettinen et al. (2013) is affine equivariant (equivariant under all coordinate system changes, not just orthogonal ones). The problem of achieving affine equivariance in the context of tensorial ICA is discussed in Virta et al. (2018), where it was conjectured that tensorial ICA cannot be affine equivariant.

The two limiting theorems above show that, under suitable assumptions, $k$-TJADE indeed has all the desirable properties listed at the beginning of this section. This can be summarized by saying that $k$-TJADE makes a trade-off between assumptions and computational load: at the price of added assumptions, we obtain a method with the same limiting efficiency as TJADE but with a significantly lighter computational burden. As the claim about efficiency holds only asymptotically, we conduct a simulation study (in Section 5) to compare the finite-sample efficiencies of the estimators.
4 A note on tensorial ICA
In this section, we formulate the general tensorial IC model and discuss how it can be reduced to the matricial IC model. We begin with a short review of the basic concepts of multilinear algebra.
An $r$th-order tensor $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_r}$ is an $r$-dimensional array containing a total of $p_1 \cdots p_r$ elements and has a total of $r$ modes or ways, that is, directions from which we can view it. For example, a matrix ($r = 2$) can be viewed either through its columns or through its rows. Two complementary ways of dividing a tensor into a disjoint collection of smaller tensors are called the $m$-mode vectors and the $m$-mode faces. In the former, we choose an index $m \in \{1, \ldots, r\}$, keep the values of the other indices fixed and let the $m$th index vary over its range. Each fixed combination of the other indices then yields a single $p_m$-dimensional vector, and the collection of all such vectors is called the set of $m$-mode vectors of $\mathcal{X}$. On the other hand, if we fix the value of the $m$th index and let the others vary over their ranges, we get a total of $p_m$ tensors of order $r - 1$ and size $p_1 \times \cdots \times p_{m-1} \times p_{m+1} \times \cdots \times p_r$, called the $m$-mode faces of $\mathcal{X}$. Illustrations of both $m$-mode vectors and $m$-mode faces, $m = 1, 2, 3$, in the case of a 3-dimensional tensor are shown in Figures 1 and 2.
Two algebraic operations can now be defined in terms of the mode vectors. Let $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_r}$ and $A \in \mathbb{R}^{q_m \times p_m}$ for some fixed $m$. The $m$-mode multiplication of $\mathcal{X}$ from the $m$th mode by $A$ is defined element-wise as
$(\mathcal{X} \times_m A)_{i_1 \cdots i_{m-1}\, j\, i_{m+1} \cdots i_r} = \sum_{i_m = 1}^{p_m} a_{j i_m}\, x_{i_1 \cdots i_m \cdots i_r}.$
The $m$-mode multiplication is easily understood: $\mathcal{X} \times_m A$ is obtained by multiplying each $m$-mode vector of $\mathcal{X}$ from the left by $A$ and collecting the resulting vectors back into an $r$-dimensional tensor in the same order. The mode multiplications from distinct modes are commutative, and we use the shorthand $\mathcal{X} \times_{m=1}^{r} A_m = \mathcal{X} \times_1 A_1 \times_2 \cdots \times_r A_r$. In the case of matrices, $r = 2$, the simultaneous multiplication from both modes simply gives $X \times_1 A_1 \times_2 A_2 = A_1 X A_2^\top$.
To show the connection between the matricial and tensorial IC methods, we still need the concept of $m$-mode matricization. For a fixed mode $m$, the $m$-mode matricization $\mathcal{X}_{(m)}$ of $\mathcal{X}$ is obtained by taking the $m$-mode vectors of $\mathcal{X}$ and collecting them horizontally into a wide $p_m \times (p_1 \cdots p_r / p_m)$ matrix. The arrangement order of the $m$-mode vectors has no effect on our further developments, and we choose to use the cyclical ordering as in De Lathauwer et al. (2000). In this case the relationship
$\left(\mathcal{X} \times_{m=1}^{r} A_m\right)_{(m)} = A_m\, \mathcal{X}_{(m)} \left(A_{m+1} \otimes \cdots \otimes A_r \otimes A_1 \otimes \cdots \otimes A_{m-1}\right)^\top \qquad (4)$
holds. For a more comprehensive introduction to multilinear algebra, see Cichocki et al. (2009); De Lathauwer et al. (2000).
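The two operations can be sketched in code as follows; note that the matricization below uses numpy's default ordering of the remaining modes rather than the cyclical ordering, which does not affect the single-mode identity being checked:

```python
import numpy as np

def mode_multiply(T, A, m):
    """m-mode multiplication T x_m A: every m-mode vector of T is
    multiplied from the left by A."""
    return np.moveaxis(np.tensordot(A, T, axes=(1, m)), 0, m)

def matricize(T, m):
    """m-mode matricization: the m-mode vectors of T collected as the
    columns of a wide matrix."""
    return np.moveaxis(T, m, 0).reshape(T.shape[m], -1)

rng = np.random.default_rng(4)

# Matrix case r = 2: multiplying from both modes gives A1 X A2^T.
X = rng.normal(size=(3, 4))
A1, A2 = rng.normal(size=(3, 3)), rng.normal(size=(4, 4))
print(np.allclose(mode_multiply(mode_multiply(X, A1, 0), A2, 1),
                  A1 @ X @ A2.T))  # True

# Single-mode analogue of Eq. (4): (T x_m A)_(m) = A T_(m).
T = rng.normal(size=(2, 3, 4))
A = rng.normal(size=(5, 3))
print(np.allclose(matricize(mode_multiply(T, A, 1), 1),
                  A @ matricize(T, 1)))  # True
```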
We are now sufficiently equipped to examine the connection between matricial ICA and tensorial ICA. The zero-mean $r$th-order random tensor $\mathcal{X}$ is said to follow the tensorial independent component model if
$\mathcal{X} = \mathcal{Z} \times_{m=1}^{r} \Omega_m, \qquad (5)$
where the random tensor $\mathcal{Z}$ has mutually independent, marginally standardized components and $\Omega_m \in \mathbb{R}^{p_m \times p_m}$, $m = 1, \ldots, r$, are invertible (Virta et al., 2017). We further assume that, for each mode, at most one $m$-mode face of $\mathcal{Z}$ consists solely of normally distributed components. The objective of tensorial independent component analysis is to estimate $\Omega_1, \ldots, \Omega_r$ given a random sample of observed tensors from the distribution of $\mathcal{X}$.
We fix the mode $m$ and consider the $m$-mode matricization of the model in Eq. (5). It now follows from Eq. (4) that
$\mathcal{X}_{(m)} = \Omega_m\, \mathcal{Z}_{(m)} \left(\Omega_{m+1} \otimes \cdots \otimes \Omega_r \otimes \Omega_1 \otimes \cdots \otimes \Omega_{m-1}\right)^\top. \qquad (6)$
As Kronecker products of invertible matrices are themselves invertible, a comparison to Eq. (1) now reveals that Eq. (6) is in the form of a matrix IC model, with $\Omega_m$ replacing $\Omega_1$ and the Kronecker product $\Omega_{m+1} \otimes \cdots \otimes \Omega_{m-1}$ taking the role of $\Omega_2$. In addition, $\mathcal{Z}_{(m)}$ satisfies all the assumptions of the matrix IC model. (To be precise, we only have one half of the normality assumption. However, that is sufficient for the identifiability of $\Omega_m$, our current parameter of interest.) Thus, using Eq. (6), we can estimate $\Omega_m$ exactly as in the matricial case. In the tensorial case, the kurtosis means in the vector $\kappa$ in Assumption 1 and in Assumption 2 are computed over the rows of $\mathcal{Z}_{(m)}$, or equivalently, over the $m$-mode faces of $\mathcal{Z}$. All our theoretical results, such as orthogonal equivariance (a Kronecker product of orthogonal matrices is itself orthogonal) and the asymptotic variances, hold fully under the tensorial IC model, as long as the relevant kurtosis assumptions are satisfied.
5 Simulations and examples
In this section, we illustrate the finite-sample properties of the tensor-valued procedures TFOBI, TJADE and the novel $k$-TJADE, which are implemented in the R package tensorBSS (Virta et al., 2016). For comparison, we also consider the classical vector-valued versions of these estimators, denoted by VFOBI, VJADE and $k$-VJADE, as implemented in the JADE package (Miettinen et al., 2017). In the classical procedures, the tensor-valued observations are first vectorized and the algorithms are then applied to the resulting data matrices. R code for running the examples and obtaining the used datasets is available in the supplementary material.
Next, we consider a collection of observed i.i.d. tensors $\mathcal{X}_1, \ldots, \mathcal{X}_n$ generated from the tensorial independent component model of Eq. (5). Let $\Omega_1, \ldots, \Omega_r$ be the theoretical mixing matrices and let $\hat{\Gamma}_1, \ldots, \hat{\Gamma}_r$ be the corresponding unmixing estimates produced by one of the tensor-valued procedures. We denote the Kronecker products of the matrices as $\Omega = \Omega_r \otimes \cdots \otimes \Omega_1$ and $\hat{\Gamma} = \hat{\Gamma}_r \otimes \cdots \otimes \hat{\Gamma}_1$. The vector-valued procedures produce a single unmixing estimate, denoted also by $\hat{\Gamma}$. Note that the estimates produced by the vectorial and the tensorial methods are comparable, since in both cases the matrix $\hat{\Gamma}$ estimates the inverse of the compound matrix $\Omega$.
The so-called gain matrix is defined as $\hat{G} = \hat{\Gamma} \Omega$. The unmixing is considered successful if the gain matrix is close to the identity matrix, up to the order and signs of its rows. We quantify this closeness using the performance measure called the minimum distance (MD) index (Ilmonen et al., 2010). For a gain matrix $\hat{G}$ of size $\rho \times \rho$, the MD index is formulated as
$\mathrm{MD}(\hat{G}) = \frac{1}{\sqrt{\rho - 1}} \inf_{C \in \mathcal{C}} \left\| C \hat{G} - I_\rho \right\|, \qquad (7)$
where $\|\cdot\|$ is the Frobenius norm and $\mathcal{C}$ is the set of all matrices with exactly one non-zero element in each row and column. The range of the MD index is $[0, 1]$, where the value $0$ corresponds to the case of the gain matrix being exactly a permuted and scaled identity matrix, i.e. the estimate is perfectly accurate. Additionally, the limiting distribution of the MD index can be obtained from the limiting distribution of the corresponding IC functional, see (Ilmonen et al., 2010, Theorem 1) and (Virta et al., 2017, Theorem 6).
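For small dimensions, the infimum in Eq. (7) can be computed exactly by searching over permutations, since the optimal scale for a fixed row-to-coordinate assignment has a closed form; the following sketch does this by brute force (for large dimensions one would solve the equivalent assignment problem instead):

```python
import itertools
import numpy as np

def md_index(G):
    """Minimum distance index of a square gain matrix G: the infimum of
    ||C G - I|| / sqrt(p - 1) over matrices C with exactly one non-zero
    element in each row and column (Ilmonen et al., 2010)."""
    p = G.shape[0]
    # fit[i, j]: best squared fit of a rescaled row i of G to e_j, since
    # min_c ||c * G[i] - e_j||^2 = 1 - G[i, j]^2 / ||G[i]||^2.
    fit = G ** 2 / np.sum(G ** 2, axis=1, keepdims=True)
    best = max(sum(fit[perm[j], j] for j in range(p))
               for perm in itertools.permutations(range(p)))
    return np.sqrt((p - best) / (p - 1))
```

A permuted and rescaled identity matrix attains the value 0, matching the interpretation of a perfectly accurate estimate.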
In simulation studies, where multiple iterations are performed under identical conditions, the asymptotic value of the mean of the transformed MD index $n (\rho - 1)\, \mathrm{MD}(\hat{G})^2$, with $\rho$ the dimension of the gain matrix, can be obtained using the limiting variances of the applied IC functionals, see Virta et al. (2017) for further details. The convergence towards the theoretical limiting values given by Theorem 3 can then be demonstrated by visualizing the theoretical limits alongside the averages $\frac{1}{N} \sum_{j=1}^{N} n (\rho - 1)\, \mathrm{MD}(\hat{G}_j)^2$, where $N$ is the number of iterations for the sample size $n$ and $\hat{G}_j$ is the estimated gain matrix of the $j$th iteration.
Before presenting the simulations, we briefly discuss the importance of the tuning parameter $k$, which in $k$-TJADE can be chosen separately for each mode. In the following, $(k_1, \ldots, k_r)$-TJADE is used to refer to $k$-TJADE with the value $k_m$ for the tuning parameter in the $m$th mode, $m = 1, \ldots, r$. Based on Theorem 3, the computationally cheapest but still asymptotically optimal choice for $k_m$ is the largest multiplicity of the kurtosis means in the current mode. This means that we generally have three choices for each mode: we may use the full TJADE if the mode is short; we can use $k$-TJADE for some small $k$ if the mode is long; or we can choose not to separate the mode at all if it is not expected to contain any relevant information, as might be the case, e.g., for the color dimension of a sample of images.
5.1 Finite-sample efficiency, setting 1
We begin by demonstrating that the finite-sample performance of $k$-TJADE is in line with the asymptotic results given in Theorem 3. In the first setting, we consider simulated collections of i.i.d. matrices $X_1, \ldots, X_n$ of size $3 \times 3$. The components of every $X_i$ are simulated independently, in a fixed arrangement over the matrix cells, from the chi-square (C), standard exponential (E), uniform (U) and normal (N) distributions, all scaled to have zero mean and unit variance.
In this simulation setting, none of the theoretical row or column kurtosis means are zero. However, the theoretical mean kurtoses are the same for the first two rows and for the first two columns. Thus, the preferable $k$-TJADE procedure here is $(2,2)$-TJADE, i.e., $k_1 = k_2 = 2$. Note that in this setting the requirements of TFOBI are not fulfilled.
When the observations are vectorized, the length-nine vectors contain a single normally distributed component, two chi-square-distributed elements, three elements from the exponential distribution and three elements from the uniform distribution. The vector thus contains one element (the normally distributed component) with theoretical kurtosis 0, making vectorial ICA viable. The most natural $k$-VJADE procedure here is 3-VJADE. The assumptions of VFOBI are violated.

The simulation was performed using 13 different sample sizes, with a fixed number of repetitions per sample size. To evaluate the equivariance properties and the effect of non-orthogonal mixing, we mixed the observations at each repetition using (i) identity mixing: $\Omega_1 = I_3$ and $\Omega_2 = I_3$, (ii) orthogonal mixing: $\Omega_1 = V_1$ and $\Omega_2 = V_2$, where $V_1$ and $V_2$ are random orthogonal matrices sampled at each repetition uniformly with respect to the Haar measure, and (iii) normal mixing, where $\Omega_1$ and $\Omega_2$ were filled at each repetition with random elements from the standard normal distribution.
To evaluate the effect of the mixing matrix and the limiting distributions, we present, in Figure 3, the transformed MD values for 3-VJADE, VJADE, $(2,2)$-TJADE and TJADE, together with the corresponding theoretical limit values for the cases where they are available. The smallest sample sizes are omitted from Figure 3, as the remaining sample sizes are large enough for the convergence of both the vectorial and the tensorial methods.
Figure 3 clearly shows that the vectorial methods are affine invariant: all three of their curves are identical. The tensorial methods are only orthogonally invariant, and hence the curve corresponding to the normal mixing differs from the other two. Nevertheless, the benefit of applying the tensorial methods instead of the vectorized methods is impressive. Furthermore, Figure 3 also illustrates that all the methods converge to the theoretical values and that the $k$-TJADE versions have the same limiting values as TJADE.
Next, under the same setting as above, we present the results for seven additional procedures that violate some of the required assumptions: VFOBI, TFOBI, 1-VJADE, 2-VJADE, $(1,1)$-TJADE, $(1,2)$-TJADE and $(2,1)$-TJADE. Note that TFOBI is the initial step of the $k$-TJADE procedure, and hence the comparison between $k$-TJADE and TFOBI illustrates the added benefit of the additional rotation after the TFOBI solution. Likewise, the same holds between VFOBI and $k$-VJADE. The resulting mean values of the transformed MD indices are presented in Figure 4, where methods with almost identical performance are presented using the same colors.
In Figure 4, TFOBI and VFOBI diverge at an exponential rate. Furthermore, the $k$-TJADE procedures for which $k_1$ or $k_2$ is smaller than the corresponding kurtosis-mean multiplicity do not converge to anything reasonable, even at the largest sample sizes. However, in this simulation setting, the $k$-VJADE procedure seems to allow slight deviations from the required assumptions: 2-VJADE appears to converge to an asymptotic value that is at least close to that of VJADE, see Section 6 for further discussion.
To summarize this part of the simulation study: $k$-TJADE works as expected when the theoretical conditions are met, and the convergence to the asymptotic value is relatively fast. Furthermore, even though $k$-TJADE is not affine equivariant, its performance is better under all mixing scenarios when compared to the affine equivariant vectorial counterparts.
5.2 Finite-sample efficiency, setting 2
In our second simulation, we illustrate unmixing under a tensorial setting. We consider simulated collections of i.i.d. third-order tensors $\mathcal{X}_1, \ldots, \mathcal{X}_n$. Let $\mathcal{X}^{(j)}$ denote the $j$th 3-mode face of $\mathcal{X}$. The components of every $\mathcal{X}_i$ are then simulated independently from the following distributions,
where the different distributions are denoted as in Section 5.1.
All of the mean kurtoses over the different tensor faces are non-zero. None of the theoretical mean kurtoses coincide for the 1-mode faces, whereas two of them coincide for the 2-mode faces and three of them for the 3-mode faces. Hence, the preferable $k$-TJADE here is $(1,2,3)$-TJADE. The vectorized versions of the observations contain several normal components, and thus the assumptions of the vectorial methods are not satisfied here.
The simulation was performed using 11 different sample sizes and was repeated 2000 times for each sample size. We considered the same three mixing scenarios as in Section 5.1 and generated the mixing matrices in the same way, with the distinction that here we have three mixing matrices instead of two. We performed the unmixing using VFOBI, TFOBI, 1-VJADE, 2-VJADE, VJADE, TJADE and 11 different versions of $k$-TJADE. The resulting mean values of the transformed MD index are presented in Figure 5. The orthogonal mixing is omitted from Figure 5, since the tensorial methods are orthogonally invariant and the vectorial methods are affine invariant; the performance curves under the orthogonal mixing would be identical to those under the identity mixing, similarly as in Section 5.1.
$k$-TJADE performs as expected for the values of $(k_1, k_2, k_3)$ that satisfy the assumptions, and the convergence towards the theoretical limit is considerably fast. The vectorial methods fail completely in this example. Interestingly, $k$-TJADE retains relatively good performance when the element-wise deviation between $(k_1, k_2, k_3)$ and $(1, 2, 3)$ is not too large. See Section 6 for further discussion.
5.3 Timing comparison
The results from Sections 5.1 and 5.2 illustrate that the performances of TJADE and suitably chosen versions of TJADE are very similar. Next, we quantify the significant improvement in computational speed that TJADE provides when compared to TJADE. In the timing comparison, we consider a simulated collection of i.i.d. matrices of size , , such that components of every are simulated independently from the following distributions,
where denotes the chi-squared distribution with degrees of freedom, and the width of the matrix is the varying parameter in this simulation setting. We used parameter values and the sample size . We considered the same procedures as in Section 5.1: VFOBI, TFOBI, 1-VJADE, 2-VJADE, 3-VJADE, VJADE, (1,1)-TJADE, (1,2)-TJADE, (2,1)-TJADE, (2,2)-TJADE and TJADE, and recorded the mean running times over a total of 5 iterations. The time it took R to vectorize the tensors was counted as part of the vectorized procedures; however, the vectorization time was negligible. We used two alternative stopping criteria for the methods that involve joint diagonalization, that is, for all of the methods except TFOBI and VFOBI. A single run was stopped if either the convergence tolerance of the Jacobi-rotation-based joint diagonalization fell below or the required tolerance was not reached within 100 iterations.
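The Jacobi-rotation-based joint diagonalization with these two stopping criteria can be sketched as follows. This is a minimal re-implementation in the spirit of Cardoso and Souloumiac's algorithm, not the code timed in the paper; the tolerance value and all names are our own choices.

```python
import numpy as np

def joint_diagonalize(mats, tol=1e-6, max_sweeps=100):
    """Approximately jointly diagonalize symmetric matrices by Jacobi rotations.

    Stops when the largest rotation sine in a sweep falls below `tol`,
    or after `max_sweeps` sweeps, mirroring the two stopping criteria
    described in the text. Returns an orthogonal V such that V.T @ M @ V
    is approximately diagonal for every M in `mats`.
    """
    A = np.array(mats, dtype=float)       # shape (K, p, p)
    _, p, _ = A.shape
    V = np.eye(p)
    for _ in range(max_sweeps):
        max_sin = 0.0
        for i in range(p - 1):
            for j in range(i + 1, p):
                # 2x2 subproblem: rotate the (i, j) plane to align with the
                # dominant direction of the accumulated (h1, h2) vectors
                h1 = A[:, i, i] - A[:, j, j]
                h2 = 2.0 * A[:, i, j]
                theta = 0.25 * np.arctan2(2.0 * (h1 @ h2),
                                          (h1 @ h1) - (h2 @ h2))
                c, s = np.cos(theta), np.sin(theta)
                max_sin = max(max_sin, abs(s))
                R = np.eye(p)
                R[i, i] = R[j, j] = c
                R[i, j], R[j, i] = -s, s
                A = R.T @ A @ R           # broadcasts over all K matrices
                V = V @ R
        if max_sin < tol:                 # first stopping criterion
            break
    return V

# sanity check: matrices sharing an exact joint diagonalizer
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
mats = [Q @ np.diag(rng.standard_normal(4)) @ Q.T for _ in range(5)]
V = joint_diagonalize(mats, tol=1e-10)
off = max(np.abs(V.T @ M @ V - np.diag(np.diag(V.T @ M @ V))).max()
          for M in mats)
print(off)  # close to zero
```

The cost of each sweep grows with the number and size of the cumulant matrices, which is exactly where $k$-TJADE saves time: it diagonalizes far fewer matrices than TJADE.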
The average running times in minutes, as a function of the dimension, are presented in Figure 6; methods with almost identical computing times are drawn in the same colors. Figure 6 clearly illustrates the superior computation speed of $k$-TJADE compared to either TJADE or any of the vectorized counterparts. The timing comparison was conducted on Ubuntu 16.04.4 LTS with an Intel® Xeon® CPU E3-1230 v5 @ 3.40GHz and 64GB of memory.
5.4 Video example
We applied $k$-TJADE to the WaterSurface surveillance video (Li et al., 2004), which is viewable at http://pages.cs.wisc.edu/~jiaxu/projects/gosus/supplement/ and available for download as an .Rdata file at https://wis.kuleuven.be/stat/robust/Programs/DO/dovideodatardata. The video has previously been used as an example for blind source separation (Virta and Nordhausen, 2017a, b). Each frame of the video is of size , with height , width and a three-variate color channel (RGB). The video consists of such frames in total, making our data a sample of size of random third-order tensors in . Since the data constitute a single continuous surveillance video, the observations are naturally not independent and the assumptions of tensorial independent component analysis are not fully satisfied. However, ICA is known to be robust against deviations from the independence assumption, and applying it successfully to sets of dependent data is common practice. We thus expect $k$-TJADE to successfully extract components of interest from the data.
The video shows a beach scene with little to no motion until frame 480, when a man enters the scene from the left, staying in the picture for the remainder of the video. Figure 7
shows frames 50 and 550 of the video, illustrating moments before and after the man enters the scene. Our objective with the surveillance video is to find low-dimensional components that allow us to pinpoint the most obvious change point in the video, namely, the man’s entrance. As change points are most likely captured in the data as outliers of some form, it is natural to seek the component of interest among those with high kurtosis values.
We proceed as follows: we run $k$-TJADE on the data with different choices of the tuning parameters, find the component with the highest absolute kurtosis, and plot its time course to visually assess whether it captures the change point. The component found with $k$-TJADE is shown in Figure 8, where the choice means that we do not unmix the supposedly uninformative color dimension at all. The time series instantly captures the desired time point, as the spike coincides with the first frames the man spends in the scene. The running time of the method was 39 minutes on a Windows server with an Intel® Xeon® CPU E5-2440 @ 2.40GHz and 64GB of memory. Applying $k$-TJADE with a second choice of the tuning parameters gave almost identical results, with an increased running time of one hour and 54 minutes. The original TJADE, however, proved to be very slow: its running time was over five days. Concluding, the example shows that $k$-TJADE can reliably extract information from data to which TJADE cannot be applied due to its extremely high computational cost. In several real-world applications, e.g. in crime scene investigation, waiting for days is not an option.
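The selection step, picking the component time series with the highest absolute kurtosis, can be sketched as follows (our own toy illustration; the component layout and names are assumptions):

```python
import numpy as np
from scipy.stats import kurtosis

def most_kurtotic_component(components):
    """Pick the component with the largest absolute excess kurtosis.

    `components` has shape (n_frames, n_components): each column is the
    time course of one estimated independent component.
    """
    k = kurtosis(components, axis=0, fisher=True, bias=True)
    idx = int(np.argmax(np.abs(k)))
    return idx, components[:, idx]

# toy illustration: one component carries a change-point-like spike
rng = np.random.default_rng(2)
n_frames = 600
S = rng.standard_normal((n_frames, 4))
S[480:500, 2] += 8.0            # a short burst of outlying frames
idx, series = most_kurtotic_component(S)
print(idx)  # -> 2, the component containing the burst
```

A brief burst of outliers inflates the fourth moment of exactly one component, which is why the absolute-kurtosis criterion singles out the change point.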
6 Discussion
We proposed a sped-up version of the most efficient currently studied method of tensorial independent component analysis, TJADE. Under easily interpretable additional assumptions, the extension, $k$-TJADE, is asymptotically as efficient as TJADE, while simultaneously exhibiting significantly lower computational cost. A large part of this efficiency is preserved also for samples of finite size.
An interesting future research question is to derive the theoretical behavior of $k$-TJADE when Assumption 2 is violated. Based on our simulations, even when the value of $k$ is chosen to be too small or too large, $k$-TJADE and $k$-JADE can still work. This can be seen as a safety net for the users of $k$-TJADE. The simulations suggest that the performance deteriorates the further down one goes from the optimal $k$.
Acknowledgements
Niko Lietzén gratefully acknowledges financial support from the Emil Aaltonen Foundation (grant 170156). The work of Klaus Nordhausen was supported by the CRoNoS COST Action IC1408. All authors acknowledge the computational resources provided by the Aalto Science-IT project.
Appendix A Appendix
We denote the sequence of i.i.d. observations as , such that e.g. denotes the left sample covariance matrix estimated from the sample .
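For matrix-valued observations, the left and right sample covariance matrices referred to here can be sketched as follows. This is a minimal illustration under our own scaling convention (dividing out the opposite mode's dimension); the paper's exact normalization may differ.

```python
import numpy as np

def left_right_covariances(sample):
    """Left and right sample covariance matrices of matrix observations.

    `sample` has shape (n, p1, p2): n i.i.d. p1 x p2 matrices.
    Left:  average of X X^T / p2  (a p1 x p1 matrix)
    Right: average of X^T X / p1  (a p2 x p2 matrix)
    """
    n, p1, p2 = sample.shape
    Xc = sample - sample.mean(axis=0)            # center the sample
    left = np.einsum('nij,nkj->ik', Xc, Xc) / (n * p2)
    right = np.einsum('nji,njk->ik', Xc, Xc) / (n * p1)
    return left, right

# for i.i.d. standardized entries both estimates approach the identity
rng = np.random.default_rng(3)
X = rng.standard_normal((20000, 3, 4))
L, R = left_right_covariances(X)
print(np.abs(L - np.eye(3)).max(), np.abs(R - np.eye(4)).max())  # both small
```

The pair of mode-wise covariances is exactly what the Kronecker covariance structure lets one estimate in place of the full, much larger covariance matrix of the vectorized data.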
Before the proofs of the main results, we establish three auxiliary lemmas.
Lemma 1.
The minimization problem is equivalent to the maximization problem , where ,
Proof of Lemma 1.
Note that for all ,
Thus, is equivalent to
∎
Lemma 2.
For all and any orthogonal matrix , the sample and the population cumulant matrices satisfy,
(8)  
where denotes the sample .
Proof of Lemma 2.
Both results follow by the same arguments, and we prove only the former. Since is a scalar and , the left-hand side of Eq. (8) can be written as
(9) 
where the first term can be written as
(10) 
where the scaled expected value is the first term in the definition of the matrix and has the representation
(11) 
Plugging Eq. (11) into Eq. (10) we obtain
(12) 
Furthermore, plugging Eq. (12) into Eq. (9) gives us,
(13) 
Elementwise examination reveals that,
This concludes the proof. ∎
Lemma 3.
Let be a sequence of estimators (indexed by ) such that , where denotes convergence in distribution, is some matrix-valued distribution and
where and the values , , are distinct and in a strictly decreasing order (). Let be the sequence of eigenvector matrices, where the columns are the eigenvectors of the matrices . Partition into blocks as
in a similar way to . Then
Proof of Lemma 3.
The sequence of eigenvectors satisfies the eigenequation
(14) 
where
is a sequence of diagonal matrices containing the estimated eigenvalues. Then,
(15) 
where the second equality holds since as a result of the compactness of the space of orthogonal matrices and since by Prohorov’s theorem.
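The flavor of Lemma 3 can be checked numerically: eigenvectors of a small symmetric perturbation of a diagonal matrix with grouped eigenvalues are nearly block-diagonal, while vectors within a block of equal eigenvalues may mix arbitrarily. This is our own toy example, not part of the proof.

```python
import numpy as np

# diagonal matrix with eigenvalue blocks of sizes 2, 3 and 1
rng = np.random.default_rng(4)
lam = np.array([5.0, 5.0, 2.0, 2.0, 2.0, -1.0])
E = rng.standard_normal((6, 6))
S = np.diag(lam) + 1e-4 * (E + E.T) / 2        # small symmetric perturbation
w, U = np.linalg.eigh(S)
U = U[:, np.argsort(-w)]                       # decreasing eigenvalue order
# coupling between the first block and the rest is of the order of the
# perturbation divided by the eigenvalue gap, not of order 1
off_block = np.abs(U[2:, :2]).max()
print(off_block)
```

The distinctness of the block eigenvalues is what pins down the off-block entries; nothing constrains the rotation inside each block, which is why the lemma is stated for the block structure only.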