Modern data collection capability has led to massive quantity of time series. High dimensional time series observed in tensor form are becoming more and more commonly seen in various fields such as economics, finance, engineering, environmental sciences, medical research and others. For example, Figure 1 shows the monthly import-export volume time series of four categories of products (Chemical, Food, Machinery and Electronic, and Footwear and Headwear) among six countries (US, Canada, Mexico, Germany, UK and France) from January 2001 to December 2016. At each time point, the observations can be arranged into a three-dimensional tensor, with the diagonal elements for each product category unavailable. This is part of a larger data set with 15 product categories and 22 countries which we will study in detail in Section 7.1
. Univariate time series deals with one item in the tensor (e.g. Food export series of US to Canada). Panel time series analysis focuses on the co-movement of one row (fiber) in the tensor (e.g. Food export of US to all other countries). Vector time series analysis also focuses on the co-movement of one fiber in the tensor (e.g. Export of US to Canada in all product categories).Wang et al. (2019); Chen and Chen (2019) and Chen et al. (2019) studied matrix time series. Their analysis deals with a matrix slice of the tensor (e.g. the import-export activities between all the countries in one product category). In this paper we develop a factor model for the analysis of the entire tensor time series simultaneously.
The import-export network belongs to the general class of dynamic transport (traffic) network. The focus of such a network is the volume of traffic on the links between the nodes on the network. The availability of complex and diverse network data, recorded over periods of time and in very large scales, brings new opportunities with challenges (Aggarwal and Subbian, 2014). For example, weekly activities in different forms (e.g. text messages, email, phone conversations, and personal interactions) and on different topics (politics, food, travel, photo, emotions, etc) among friends on a social network form a transport network similar to the import-export network, but as a four-dimensional tensor time series. The number of passengers flying between a group of cities with a group of airlines in different classes (economy or business) on different days of the week can be represented as a five-dimensional tensor time series. In Section 7.2 we will present a second example on taxi traffic patterns in New York city. With the city being divided into 69 zones, we study the volume of passenger pickups and drop-offs by taxis among the zones, at different hours during the day as a daily time series of a tensor.
Note that most developed statistical inference methods in network analysis are often confined to static network data such as social network (Snijders, 2006; Hanneke et al., 2010; Kolaczyk and Csárdi, 2014; Ji and Jin, 2016; Zhao et al., 2012). Of course most networks are dynamic in nature. One important challenge is to develop stochastic models/processes that capture the dynamic dependence and dynamic changes of a network.
Besides dynamic traffic networks, tensor time series are also observed in many other applications. For example, in economics, many economic indicators such as GDP, unemployment rate and inflation index are reported quarterly by many countries, forming a matrix-valued time series. Functional MRI produces a sequence of 3-dimensional brain images (forming 3-dimensional tensors) that changes with different stimulants. Temperature and salinity levels observed at a regular grid of locations and a set of different depth in the ocean form 3-dimensional tensors and are observed over time.
Such tensor systems are often very large. Thirty economic indicators from 30 countries yield total 900 individual time series. Import-export volume of 15 product categories among 20 countries makes up almost 6,000 individual time series. FMRI images often consist of hundreds of thousands of voxels observed over time.
The aim of this paper is to develop a factor model to systematically study the dynamics of tensor systems by jointly modeling the entire tensor simultaneously, while preserving the tensor structure and the time series structure. This is different from the more conventional time series analysis which deals with scalar or vector observations (Box and Jenkins, 1976; Brockwell and Davis, 1991; Shumway and Stoffer, 2002; Tsay, 2005; Tong, 1990; Fan and Yao, 2003; Härdle et al., 1997; Tsay and Chen, 2018) and multivariate time series analysis (Hannan, 1970; Lütkepohl, 1993), panel time series analysis (Baltagi, 2005; Hsiao, 2003; Geweke, 1977; Sargent and Sims, 1977) and spatial-temporal modelling (Bennett, 1979; Cressie, 1993; Stein, 1999; Stroud et al., 2001; Woolrich et al., 2004; Handcock and Wallis, 1994; Mardia et al., 1998; Wikle and Cressie, 1999; Wikle et al., 1998; Irwin et al., 2000).
We mainly focus on the cases when the tensor dimension is large. When dealing with many time series simultaneously, dimension reduction is one of main approaches to extract common information from the data without being overwhelmed by the idiosyncratic variations. One of the most powerful tools for dimension reduction in time series analysis is the dynamic factor model in which ’common’ information is summarized into a small number of factors and the co-movement of the time series is assumed to be driven by these factors and their inherited dynamic structures (Forni et al., 2000; Stock and Watson, 2012; Bai and Ng, 2008; Chamberlain, 1983; Peña and Box, 1987; Pan and Yao, 2008). We will follow this approach in our development.
The tensor factor model in this paper is similar to the matrix factor model studied in Wang et al. (2019). Specifically, we use a Tucker decomposition type of formation to relate the high-dimensional tensor observations to a low-dimensional latent tensor factor that is assumed to vary over time. Two estimation approaches, named TIPUP and TOPUP, are studied. Asymptotic properties of the estimators are investigated, which provides a comparison between the two estimation methods. The estimation procedure used in Wang et al. (2019) in the matrix setting is essentially the TOPUP procedure. We show that the convergence rate they obtained for the TOPUP can be improved. On the other hand, the TIPUP has a faster rate than the TOPUP, under a mildly more restrictive condition on the level of signal cancellation. The developed theoretical properties also cover the cases where the dimensions of the tensor factor increase with the dimension of the observed tensor time series.
The paper is organized as follows. Section 2 contains some preliminary information on the approach of factor models that we will adopt and the basic notations of tensor analysis. Section 3 introduces a general framework of factor models for large tensor time series, which is assumed to be the sum of a signal part and a noise part. The signal part has a multi-linear factor form, consisting of a low-dimensional tensor that varies over time, and a set of fixed loading matrices in a Tucker decomposition form. Section 4 discusses two general estimation procedures. Their theoretical properties are shown in Section 5. In section 6 we present some simulation studies to demonstrate the performance of the estimation procedures. Two applications are presented in Section 7 to illustrate the model and its interpretations.
2 Preliminary: dynamic factor models and foundation of tensor
In this section we briefly review the linear factor model approach to panel time series data and tensor data analysis. Both serve as a foundation of our approach to tensor time series.
Let be a set of panel time series. Dynamic factor model assumes
where is a set of unobserved latent factor time series with dimension ; The row vector , treated as unknown and deterministic, is called factor loading of the -th series. The collection of all is called the loading matrix . The idiosyncratic noise is assumed to be uncorrelated with the factors in all leads and lags. Both and are unobserved hence some further model assumptions are needed. Two different types of model assumptions are adopted in the literature. One type of models assumes that a common factor must have impact on ‘most’ (defined asymptotically) of the time series, but allows the idiosyncratic noise to have weak cross-correlations and weak autocorrelations (Geweke, 1977; Sargent and Sims, 1977; Forni et al., 2000; Stock and Watson, 2012; Bai and Ng, 2008; Stock and Watson, 2006; Bai and Ng, 2002; Hallin and Liška, 2007; Chamberlain, 1983; Chamberlain and Rothschild, 1983). Under such sets of assumptions, principle component analysis (PCA) of the sample covariance matrix is typically used to estimate the space spanned by the columns of the loading matrix, with various extensions. Another type of models assumes that the factors accommodate all dynamics, making the idiosyncratic noise ‘white’ with no autocorrelation but allowing substantial contemporary cross-correlation among the error process (Peña and Box, 1987; Pan and Yao, 2008; Lam et al., 2011; Lam and Yao, 2012; Chang et al., 2018). The estimation of the loading space is done by an eigen analysis based on the non-zero lag autocovariance matrices. In this paper we adopt the second approach in our model development.
The key feature of the factor model is that all co-movements of the data are driven by the common factor and the factor loading provides a link between the underlying factors and the -th series . This approach has three major benefits: (i) It achieves great reduction in model complexity (i.e. the number of parameters) as the autocovariance matrices are now determined by the loading matrix and the much smaller autocovariance matrix of the factor process ; (ii) The hidden dynamics (the co-movements) become transparent, leading to clearer and more insightful understanding. This is especially important when the co-movement of the time series is complex and difficult to discover without proper modeling of the full panel; (iii) The estimated factors can be used as input and instrumental variables in models in downstream data analyses, providing summarized and parsimonious information of the whole series.
In the following we briefly review tensor data analysis without involving time series (equivalently at a fixed time point), mainly for the purpose of fixing the notation in our later discussion. For more detailed information, see Kolda and Bader (2009).
A tensor is a multidimensional array, a generalization of a matrix. The order of a tensor is the number of dimensions, also known as the number of modes. Fibers of a tensor are the higher order analogue of matrix rows and columns, which can be obtained by fixing all but one of the modes. For example, a matrix is a tensor of order , and a matrix column is a mode-1 fiber and a matrix row is a mode-2 fiber.
Consider an order- tensor . Following Kolda and Bader (2009), the -mode product of with a matrix is an order- tensor of size and will be denoted by . Elementwise, . Similarly, the -mode product of an order-K tensor with a vector is an order- tensor of size and denoted by . Elementwise, . Let and . The mode-k unfolding matrix is a matrix by assembling all mode-k fibers as columns of the matrix. One may also stack a tensor into a vector. Specifically, is a vector in formed by stacking mode- fibers of in the order of modes .
are two major extensions of the matrix singular value decomposition (SVD) to tensors of higher order. Recall that the SVD of a matrixof rank has two equivalent forms: , which decomposes a matrix into a sum of rank-one matrices, and , where and are orthonormal matrices of size and spanning the column and row spaces of respectively, and is an diagonal matrix with positive singular values on its diagonal. In parallel, CP decomposes an order- tensor into a sum of rank one tensors, , where “” represents the tensor product. The vectors , are not necessarily orthogonal to each other, which differs from the matrix SVD. The Tucker decomposition boils down to orthonormal matrices containing basis vectors spanning -mode fibers of the tensor, a potentially much smaller ‘core’ tensor and the relationship
Note that the core tensor is similar to the in the middle of matrix SVD but now it is not necessarily diagonal.
3 A Tensor Factor Model
In tensor times series, the observed tensors would depend on and be denoted by as a series of order- tensors. By absorbing time, we may stack into an order- tensor , with time as the -th mode, referred to as the time-mode. We assume the following decomposition
where is the dynamic signal component and
is a white noise part. In the second expression (3), and are the corresponding signal and noise components of , respectively. We assume that the noise are uncorrelated (white) across time, following Lam and Yao (2012).
In this model, all dynamics are contained in the signal component . We assume that is in a lower-dimensional space and has certain multilinear decomposition. We further assume that any component in this multilinear decomposition that involves the time-mode is random and dynamic, and will be called a factor component (depending on its order, it will be called a scalar factor , a vector factor , a matrix factor , or a tensor factor ), which when concatenated along the time-mode forms a higher order object, such as . Any components of other than are assumed to be deterministic and will be called the loading components.
Although it is tempting to directly model with standard tensor decomposition approaches to find its lower dimensional structure, the dynamics and dependency in the time direction (auto-dependency) are important and should be treated differently. Traditional tensor decomposition using tensor SVD/PCA on ignores the special role of the time-mode and the covariance structure in the time direction, and treats the signal as deterministic (Richard and Montanari, 2014; Anandkumar et al., 2014; Hopkins et al., 2015; Sun et al., 2016). Such a direct approach often leads to inferior inference results as our preliminary results have demonstrated (Wang et al., 2019). In our approach, the component in the time direction is considered as latent and random. As a result, our model assumptions and interpretations, and their corresponding estimation procedures and theoretical properties are significantly different.
In the following we propose a specific model for tensor time series, based on a decomposition similar to Tucker decomposition. Specifically, we assume that
where is itself a tensor times series of dimension with small and are loading matrices. We assume without loss of generality in the sequel that is of rank .
Model (4) resembles a Tucker-type decomposition similar to (2) where the core tensor is the factor term and the loading matrices are constant matrices, whose column spaces are identifiable. The core tensor is usually much smaller than in dimension. It drives all the comovements of individual time series in . For matrix time series, model (4) becomes . The matrix version of Model (4) was considered in Wang et al. (2019), which also provided several model interpretations. Most of their interpretations can be extended to the tensor factor model. In this paper we consider more general model settings and more powerful estimation procedures.
As is a linear mapping from to . It can be written as a matrix acting on vectors as in
where is the Kronecker product, , , and is the tensor stacking operator as described in Section 2. While is often used to denote the Kronecker product, we shall avoid this usage as is preserved to denote the tensor product in this paper. For example, in the case of with observation , is a tensor of order four, not a matrix of dimension , as we would need to consider the model-2 unfolding of as a matrix. The Kronecker expression exhibits the same form as in the factor model for panel time series except that the loading matrix of size in the vector factor model is assumed to have a Kronecker product structure of matrices of much smaller sizes (). Hence the tensor factor model reduces the number of parameters in the loading matrices from in the stacked vector version to , a very significant dimension reduction. The dimension reduction comes from the assumption imposed on the loading matrices.
It would be tempting to assume the orthonormality of as in SVD and Tucker decomposition. However, in the high-dimensional setting, this may not be compatible in general with the assumption that is a “regular” factor series with unit order of magnitude, which we may also want to impose; The magnitude of would have to be absorbed into either or in model (4). In addition, the orthonormality of would be incompatible with the expression of the strength of the factor in terms of the norm of as in the literature (Bai and Ng, 2008; Lam and Yao, 2012; Wang et al., 2019). Thus, we shall consider general to preserve flexibility.
Let be the SVD of . The tensor time series in (4) can be then written as . In the special case where the dimensional series can be viewed as a properly normalized factor for some signal strength parameter , we may absorb , and into and write (4) as
Remark: In our theoretical development, we do not impose any specific structure for the dynamics of the relatively low-dimensional factor process , except conditions on the spectrum norm and singular values of certain matrices in the unfolding of the average of the cross-product . As , these conditions on would hold when the condition numbers of are bounded, e.g. model (5), and parallel conditions on the spectrum norm and singular values in the unfolding of hold through the consistency of the averages. For fixed , such consistency for the low-dimensional has been extensively studied in the literature with many options such as various mixing conditions.
Remark: The above tensor factor model does not assume any structure on the noises except that the noise process is white. The estimation procedures we use do not require any additional structure. But in many cases it benefits to allow specific structures for the contemporary cross-correlation of the elements of . For example, one may assume where all elements in are i.i.d. . Hence each of the can be viewed as the common covariance matrix of mode- fiber in the tensor . More efficient estimators may be constructed to utilize such a structure but is out of the scope of this paper.
4 Estimation procedures
. On the other hand, despite such inherent difficulties, many heuristic techniques are widely used and often enjoy great successes in practice.Richard and Montanari (2014) and Hopkins et al. (2015), among others, have considered a rank-one spiked tensor model as a vehicle to investigate the requirement of signal-to-noise ratio for consistent estimation under different constraints of computational resources, where for some deterministic unit vectors and all entries of are iid standard normal. As shown by Richard and Montanari (2014), in the symmetric case where , can be estimated consistently by the MLE when . Similar to the case of spiked PCA (Koltchinskii et al., 2011; Negahban and Wainwright, 2011), it can be shown that the rate achieved by the MLE is minimax among all estimators when is treated as deterministic. However, at the same time it is also unsatisfactory as the MLE of is NP hard to compute even in this simplest rank one case. Additional discussion of this and some other key differences between matrix and tensor estimations can be found in recent studies of related tensor completion problems (Barak and Moitra, 2016; Yuan and Zhang, 2016, 2017; Xia and Yuan, 2017; Zhang et al., 2019).
A commonly used heuristic to overcome this computational difficulty is tensor unfolding. In the following we proposed two estimation methods that are based on a marriage of tensor unfolding and the use of lagged cross-product, the tensor version of the autocovariance. This is due to the dynamic and random nature of the latent factor process, and the whiteness assumption on the error process.
As in all factor models, due to ambiguity, we will only estimate the linear spaces spanned by the loading matrices with an orthonormal representation of the loading spaces, or equivalently, only estimate the orthogonal projection matrix to such spaces.
The lagged cross-product operator, which we denote by , can be viewed as the -tensor
. We consider two estimation methods based on the sample version of ,
The orthogonal projection to the column space of is
It is the -th principle space of the tensor time series in (4). As for all ,
with the notation and for . Once consistent estimates are obtained for , the estimation of other aspects of can be carried out based on the low-rank projection of (6),
as if the low-rank tensor time series is observed. For the estimation of , we propose two methods, and both methods can be written in terms of the mode- matrix unfolding of as follows.
(i) TOPUP method: We define a order-5 tensor as
where is the tensor product and is the index for the 5-th mode. Let with . As is a matrix, is of dimension , so that is a matrix. Let
where PLSVD stands for the orthogonal projection to the span of the first left singular vectors of a matrix. We estimate the projection by with a proper . When is given,
This is a product of two matrices, with on the left.
We note that the left singular vectors of
are the same as the eigenvectors in the PCA of thenonnegative-definite matrix
which can be viewed as the sample version of
It follows from (10) that has a sandwich formula with on the left and on the right.
As is assumed to be of rank , its column space is identical to that of in (10) or that of in (15) as long as they are also of rank . Thus is identifiable from the population version of TOPUP. However, further identification of the lagged cross-product operator by the TOPUP would involve parameters specific to the TOPUP approach. For example, if we write where is orthonormal, the TOPUP estimator (9) is designed to estimate as the left singular matrix of . Even then, the singular vector is identifiable only up to the sign through the projections and provided a sufficiently large gap between the -th, the -th and the -th singular values of the matrix relative to the TOPUP estimation error.
For the ease of discussion, we consider for example the case of and with stationary factor where is a 3-way tensor, and its lag- () autocovariance is a 6-way tensor with dimensions and elements . For the estimation of the column space of the loading matrix , we write
The TOPUP can be then described in terms of the PCA as follows. Replacing with its sample version
and through eigenvalue decomposition, we can estimate the top-eigenvectors of , which form a representative of the estimated space spanned by . Representative sets of eigenvectors of and can be obtained similarly. This procedure uses the outer-product of all (time shifted) mode-1 fibers of the observed tensor . Then, after taking the squares, it sums over the other modes. By considering positive lags , we explicitly utilize the assumption that the noise process is white, hence avoiding having to deal with the contemporary covariance structure of , as it disappears in for all . We also note that while the PCA of in (14) is equivalent to the SVD in (9) for the estimation of , it can be computationally more efficient to perform the SVD directly in many cases.
We call this TOPUP (Time series Outer-Product Unfolding Procedure) as the tensor product in the matrix unfolding in (8) is a direct extension of the vector outer product, which is actually used in the equivalent formulation in (16). This reduces to the algorithm in Wang et al. (2019) for matrix time series.
(ii) TIPUP method: The TIPUP (Time series Inner-Product Unfolding Procedure) can be simply described as the replacement of the tensor product in (8) with the inner product:
which is treated as a matrix of dimension . The estimator is then defined as
Again TIPUP is expected to yield consistent estimates of in (7) as
where is the -tensor with elements at and , and is the inner product summing over indices other than .
We use the superscript to indicate the TIPUP counterpart of TOPUP quantities, e.g.
is the sample version of
We note that by (19) is again sandwiched between and . For and ,
As in the case of the TOPUP, for the estimation of the auto-covariance operator beyond , the TIPUP would only identify parameters specific to the approach. For example, the TIPUP estimator (18) aims to estimate with being the left singular matrix of , e.g. the eigen-matrix of in (23). This is evidently different from the projection to the rank eigen-space of in (15), in view of (16) and (23).
Remark: The differences between the TOPUP and TIPUP are two folds. First, the TOPUP for estimating the column space of uses the auto-cross-covariance between all the mode-1 fibers in and all the mode-1 fibers in , with all possible combinations of , while the TIPUP only uses the auto-cross-covariance between the mode-1 fibers in and their corresponding mode-1 fibers in , with all combinations of only. Hence the TIPUP uses less cross-covariance terms in the estimation. Second, the TOPUP ’squares’ every auto-cross-covariance matrices first (i.e. in (16)) before the summation, while the TIPUP does the summation of first, before taking the square as in (23). Because the TOPUP takes the squares first, every term in the summation of the middle part of (16) is semi-positive definite. Hence if the sum of a subset of them is full rank, then the middle part is full rank and the column space of and will be the same. On the other hand, the TIPUP takes the summation first, hence runs into the possibility that some of the auto-covariance matrices cancel out each other, making the sum not full rank. However, the summation first approach averages out more noises in the sample version while the TOPUP accumulates more noises by taking the squares first. The TOPUP also has more terms – although it amplifies the signal, it amplifies the noise as well. The detailed asymptotic convergence rates of both methods represented in Section 5 reflect the differences. In Section 6 we show a case in which some of the auto-covariance matrices cancel each other. We note that complete cancellation does not occur often and can often be avoided by using a larger in estimation, though partial cancellation can still have impact on the performance of TIPUP in finite samples.
Remark: TOPUP and TIPUP: One can construct iterative procedures based on the TOPUP and TIPUP respectively. Note that if a version of and are given with , can be estimated via the TOPUP or TIPUP using . Intuitively the performance improves since is of much lower dimension than as and . With the results of the TOPUP and TIPUP as the starting points, one can alternate the estimation of given other estimated loading matrices until convergence. They have similar flavor as tensor power methods. Numerical experiments show that the iterative procedures do indeed outperform the simple implementation of the TOPUP and TIPUP. However, their asymptotic properties require more detailed analysis and are out of the scope of this paper. The benefit of such iteration has been shown in tensor completion (Xia and Yuan, 2017) among others.
5 Theoretical Results
Here we present some results of the theoretical properties of the proposed estimation methods. Recall that the loading matrix is not orthonormal in general, and our aim is to estimate the projection in (7) to the column space of . We shall consider the theoretical properties of the estimators under the following two conditions:
Condition A: are independent Gaussian tensors conditionally on the entire process of . In addition, we assume that for some constant , we have
where is the conditional expectation given and .
Condition A, which holds with equality when has iid entries, allows the entries of to have a range of dependency structures and different covariance structures for different . Under Condition A, we develop a general theory to describe the ability of the TOPUP and TIPUP estimators to average out the noise . It guarantees consistency and provides convergence rates in the estimation of the principle space of the signal , or equivalently the projection , under proper conditions on the magnitude and certain singular value of the lagged cross-product of , allowing the ranks to grow as well as in a sequence of experiments with .
We then apply our general theory in two specific scenarios. The first scenario, also the simpler, is described in the following condition on the factor series .
Condition B: The process is weak stationary,
with fixed and fixed expectation for the lagged cross-products
, such that
converges to in probability.
In the second scenario, described in Conditions C-1 and C-2 below, conditions on the signal process are expressed in terms of certain factor strength or related quantities. For the vector factor model (1), Lam et al. (2011) showed that the convergence rate of the corresponding TOPUP estimator is , when singular and . Here singular denotes (any and all) positive singular values of , and is often referred to as the strength of the factors (Bai and Ng, 2002; Doz et al., 2011; Lam et al., 2011). It reflects the signal to noise ratio in the factor model. When , singular hence the information contained in the signal increases linearly with the dimension . In this case the factors are often said to be ’strong’ and the convergence rate is . When