 # Autoregressive Models for Matrix-Valued Time Series

In finance, economics and many other fields, observations in a matrix form are often generated over time. For example, a set of key economic indicators are regularly reported in different countries every quarter. The observations at each quarter neatly form a matrix and are observed over many consecutive quarters. Dynamic transport networks with observations generated on the edges can be formed as a matrix observed over time. Although it is natural to turn the matrix observations into a long vector, and then use the standard vector time series models for analysis, it is often the case that the columns and rows of the matrix represent different types of structures that are closely interplayed. In this paper we follow the autoregressive structure for modeling time series and propose a novel matrix autoregressive model in a bilinear form that maintains and utilizes the matrix structure to achieve a greater dimensional reduction, as well as more interpretable results. Probabilistic properties of the models are investigated. Estimation procedures with their theoretical properties are presented and demonstrated with simulated and real examples.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Multivariate time series is a classical area in time series analysis, and has been extensively studied in the literature (see Hannan, 1970; Lütkepohl, 2005; Tsay, 2014; Tiao and Box, 1981, for an overview). Recently there has been an emerging interest in modeling high dimensional time series. Roughly speaking these works fall into two major categories: (i) vector autoregressive modeling with regularization (Davis et al., 2012; Basu and Michailidis, 2015; Guo et al., 2015; Han et al., 2015, 2016; Nicholson et al., 2015; Song and Bickel, 2011; Kock and Callot, 2015; Negahban and Wainwright, 2011; Nardi and Rinaldo, 2011, among others), and (ii) statistical or dynamic factor models (Bai and Ng, 2002; Forni et al., 2005; Lam et al., 2011; Lam and Yao, 2012; Fan et al., 2013; Wang et al., 2019, among others). In most of these studies, the multiple observations at each time point are treated as a vector.

Although it has been conventional to treat multiple observations as a vector, often the inter-relationship among the time series provides certain structure. For example, Hallin and Liška (2011) studied subpanel structures in multivariate time series, and Tsai and Tsay (2010) considered group constraints among the time series. When the time series are collected under the intersections of two classifications, they naturally form matrices. Figure 1: Time series of four economic indicators from five countries.

In Figure 1, we plot four economic indicators from five countries, resulting in a matrix observed at each time point. In this example, the rows and columns correspond to different classifications (economic indicators and countries). Univariate time series analysis would deal with individual time series separately (e.g. US interest rate, or UK GDP). Panel time series analysis deals with one row at a time (e.g. interest rates of the five countries), or one column at a time (e.g. all economic indicators of US). Obviously every time series is related to all other time series in the matrix and we wish to model them jointly. It is reasonable to assume that the same economic indicator from different countries (the rows) form a strong relationship, and at the same time economic indicators from the same country (the columns) also naturally move together closely. Hence there is a strong structure in the relationship among the time series. If the matrices are concatenated into vectors, the underlying structure is lost with significant impacts on the model complexity and interpretations.

There has been a recent interest in modeling time series with matrix-valued observations. For example, Wang et al. (2019) considered a factor model for matrix time series. In this paper we propose to model the matrix-valued time series under the autoregressive framework with a bilinear form. Specifically, in this model, the conditional mean of the matrix observation at time is obtained by multiplying the previous observed matrix at time from both left and right by two autoregressive coefficient matrices. It can be extended to involve the previous observed matrices to form an order autoregressive model. If it involves previously observed matrices, we call it the matrix autoregressive model of order , with the acronym MAR(). Compared with the traditional vector autoregressive models, our approach has two advantages: (i) it keeps the original matrix structure, and its two coefficient matrices have corresponding interpretations; and (ii) it reduces the number of parameters in the model significantly.

Similar bilinear models have been used in regression settings. For instance, Wang et al. (2018) considered the regression with matrix-valued covariates for low-dimensional data. Zhao and Leng (2014) studied the bilinear regression with sparse coefficient vectors under high-dimensional setting. Zhou et al. (2013) and Raskutti et al. (2015)

mainly addressed the multi-linear regression with general tensor covariates.

Since the error term in the matrix AR model is also a matrix, its (internal) covariances form a 4-dimensional tensor. Here we also consider to exploit the matrix structure to reduce the dimensionality of this covariance tensor, by separating the row-wise and column-wise dependencies of the error matrix.

In this paper we investigate some probabilistic properties of the proposed model. Several estimators for MAR(1) model are developed, with different computing algorithms, under different assumptions on the error covariance structure. Their asymptotic properties are investigated. We also compare the efficiencies of the estimators. In addition, the finite sample performances of the estimators are demonstrated through simulation studies. The matrix time series of four economic indicators from five countries, shown in Figure 1, is analyzed in detail.

The rest of the paper is organized as follows. We introduce the autoregressive model for matrix-valued time series in Section 2, along with some of its probabilistic properties. The estimation procedures are presented in Section 3. Statistical inferences and the asymptotic properties of the estimators will be considered in Section 4. Numerical studies are carried out in Section 5. Section 6 contains a short summary. All the proofs are collected in Section 7.

## 2 Autoregressive Model for Matrix-Valued Time Series

Consider a time series of length , in which at each time , a matrix is observed. Here we use in boldface to emphasize the fact that it is a matrix. Let be the vectorization of a matrix by stacking its columns. The traditional vector autoregressive model (VAR) of order 1 is directly applicable for . That is,

 vec(Xt)=Φvec(Xt−1)+vec(Et). (1)

It is immediately seen that the roles of rows and columns are mixed in the VAR model in (1). Using the example shown in Figure 1, the VAR model in (1) fails to recognize the strong connections within the columns (same country) and within the rows (same indicator). The (large) coefficient matrix does not have any assumed structure; and the model does not fully utilize the matrix structure, or any prior knowledge of the potential relationship among the time series. The coefficient matrix is also very difficult to interpret.

To overcome the drawback of the direct VAR modeling that requires vectorization, and to take advantage of the original matrix structure, we propose the matrix autoregressive model (of order 1), denoted by MAR(1), in the form

 Xt=AXt−1B′+Et, (2)

where and are and autoregressive coefficient matrices, and is a

matrix white noise. Clearly the model can be extended to an order

model in the form

 Xt=A1Xt−1B′1+⋯+ApXt−pB′p+Et.

We will defer the interpretations of and in (2) to Section 2.1.

Kronecker products can be used to obtain convenient representations of many linear matrix equations that will arise in the subsequent discussions. In fact, using the Kronecker product, the MAR(1) model in (2) can be represented in the form of a vector autoregressive model

 vec(Xt)=(B⊗A)vec(Xt−1)+vec(Et), (3)

where denotes the matrix Kronecker product. A thorough discussion of the Kronecker product and its relationship with linear matrix equations can be found in Chapter 4 of Horn and Johnson (1994). In Proposition 6 of Section 7, we collect some basic properties of the Kronecker product. The representation (3) means that the MAR(1) model can be viewed as a special case of the classical VAR(1) model in (1), with its autoregressive coefficient matrix given by a Kronecker product. On the other hand, comparing (1) and (3), we see that the MAR(1) requires coefficients as the entries of and , while a general VAR(1) needs coefficients for . Of course the latter can be much larger when both and are large.

There is an obvious identifiability issue with the MAR(1) model in (2), regarding the two coefficient matrices and . The model remains unchanged if the two matrices and are divided and multiplied by the same nonzero constant respectively. To avoid ambiguity, we make the convention that is normalized so its Frobenius norm is one. On the other hand, the uniqueness holds for the Kronecker product .

The error matrix sequence is assumed to be a matrix white noise, i.e. there is no correlation between and as long as . But is still allowed to have concurrent correlations among its own entries. As a matrix, its covariances form a 4-dimensional tensor, which is difficult to express. In the following we will discuss it in the form of , a matrix. In its simplest form, we may assume the entries of are independent so that is a diagonal matrix, and in a general form, we may allow its entries to have arbitrary correlations. We also consider a structured covariance matrix

 Cov(vec(Et))=Σc⊗Σr, (4)

where and are and symmetric positive definite matrices. Under normality, this is equivalent to assuming , where all the entries of

are independent, and following the standard Normal distribution. Therefore,

corresponds to row-wise covariances and introduces column-wise covariances.

Remarks:   There are many possible extensions of the model. For example, the model can be extended to have multiple lag-one autoregressive terms. That is

This is still an order-1 autoregressive model, but with more parallel terms. In the stacked vector form, it corresponds to

Such a structure provides more flexibility to capture the different interactions among rows and columns of the matrix time series, though it becomes more challenging, since there is obviously a more severe identifiability issue.

In this paper we focus on MAR(1) model in all our discussions. Extensions will be investigated elsewhere.

### 2.1 Model interpretations

The MAR(1) model is not a straightforward model. A thorough discussion of the interpretations of its coefficient matrices is needed. Here we offer interpretations from three different angles.

First, in model (2), the left matrix reflects row-wise interactions, and the right matrix introduces column-wise dependence, and therefore the conditional mean in (2) combines the row-wise and column-wise interactions. It is easier to see how the coefficient matrices and reflects the row and column structures by looking at a few special cases.

To isolate the effect from the bilinear form in (2), let us assume . Then the model reduces to

 Xt=Xt−1B′+Et.

Consider the example shown in Figure 1, using columns for countries and rows for economic indicators. The conditional expectation of the first column of is given by

 USAUSADEUCAN ⎛⎜ ⎜ ⎜ ⎜⎝IntGDPProdCPI⎞⎟ ⎟ ⎟ ⎟⎠t=b11⎛⎜ ⎜ ⎜ ⎜⎝IntGDPProdCPI⎞⎟ ⎟ ⎟ ⎟⎠t−1+b12⎛⎜ ⎜ ⎜ ⎜⎝IntGDPProdCPI⎞⎟ ⎟ ⎟ ⎟⎠t−1+⋯+b1n⎛⎜ ⎜ ⎜ ⎜⎝IntGDPProdCPI⎞⎟ ⎟ ⎟ ⎟⎠t−1,

which means that at time , the conditional expectation of an economic indicator of one country is a linear combinations of the same indicator from all countries at , and this linear combination is the same for different indicators. Therefore, this model (and ) captures the column-wise interactions, i.e. interactions among the countries. However, the interactions are refrained within each indicator. There are no interactions among the indicators.

On the other hand, if we let in model (2), then a similar interpretation can be obtained, where the matrix reflects the row-wise interactions, i.e. interactions among the economic indicators within each country. There are no interactions among the countries.

Second, we can interpret the model from a row-wise and column-wise VAR model point of view. For example, if , then the model becomes

 Xt=AXt−1+Et.

In this case, each column of follows

 Xt,⋅j=AXt−1,⋅j+Et,⋅j, \ \ j=1,…n.

That is, each column of follows the same VAR(1) model of dimension . Specifically, for the first two columns in the example of Figure 1, the models are

 ~{}~{}~{}~{}USAUSAUSA ~{}~{}~{}~{}DEUDEUDEU ⎛⎜ ⎜ ⎜ ⎜⎝IntGDPProdCPI⎞⎟ ⎟ ⎟ ⎟⎠t=A⎛⎜ ⎜ ⎜ ⎜⎝IntGDPProdCPI⎞⎟ ⎟ ⎟ ⎟⎠t−1+⎛⎜ ⎜ ⎜ ⎜⎝e\_Inte\_GDPe\_Prode\_CPI⎞⎟ ⎟ ⎟ ⎟⎠t and ⎛⎜ ⎜ ⎜ ⎜⎝IntGDPProdCPI⎞⎟ ⎟ ⎟ ⎟⎠t=A⎛⎜ ⎜ ⎜ ⎜⎝IntGDPProdCPI⎞⎟ ⎟ ⎟ ⎟⎠t−1+⎛⎜ ⎜ ⎜ ⎜⎝e\_Inte\_GDPe\_Prode\_CPI⎞⎟ ⎟ ⎟ ⎟⎠t

In other words, for each country, its economic indicators follow a VAR(1) model (of its own past) of dimension ; and different countries would follow the same VAR(1) model.

If , then

 Xt=Xt−1B′+Et.

In this case, each row of (same indicator from different countries) would follow a VAR(1) model. And the coefficient matrices corresponding to different rows would be the same.

Obviously both models and are too restrictive. It is difficult to reason that Germany’s economic indicators follow the same model as the US’s, and there is no interaction between Germany and US. There are two possible ways to add flexibility. One can assume an additive interaction structure to make the model as

 Xt=AXt−1+Xt−1B′+Et,

which is essentially a special case of the multi-term model in (5) with and and . Or we can assume a one-term multiplicative interaction structure, which leads to MAR(1). Of course, one can also use

 Xt=A1Xt−1+Xt−1B′1+A2Xt−1B′2+Et,

similar to the model with main effects plus two-way interactions. In this paper we choose to work on MAR(1) in (2).

The third way to interpret MAR(1) is through a defined hierarchical structure. Let . It would be the prediction of (or the conditional mean) if . Since each column of is based on the linear combination of all columns of with no row (indicator) interaction, we can view each entry in as the globally adjusted indicator. For example, is a linear combination of the GDPs of all countries at time . Next, we consider . This would be the prediction of if the model is . It replaces each entry (indicator) in by its corresponding globally adjusted indicator in . Each entry in is a linear combination of the adjusted indicators from the same country. For example, is a linear combination of , etc. It can be viewed as a second adjustment by other indicators (within the same country). Putting everything together, we have

 Xt=Zt−1+Et=AYt−1+Et=AXt−1B′+Et.

Note that, if follows the MAR(1) in (2), each entry in follows

 xt,ij=m∑k1=1n∑k2=1aik1xt−1,k1k2bjk2+et,ij.

Hence is controlled only by -th row of and -th column of . In the example of Figure 1, the -th row of can be viewed as the coefficient corresponding to -th indicator and the -th column of as the coefficient corresponding to the -th country. Their values can be interpreted, as we will demonstrate in the real example in Section 4.

The error covariance structure in (4), which consists of all pairwise covariances , has a similar interpretation. For example, if , then , which implies that the matrix captures the concurrent dependence of the columns of shocks in . Note that each row of in this case is . Hence for all rows , and therefore captures the covariance among the (column) elements in each row. In parallel, captures the concurrent dependence among the rows of shocks in .

### 2.2 Probabilistic properties of MAR(1)

For any square matrix , we use

to denote its spectral radius, which is defined as the maximum modulus of the (complex) eigenvalues of

. Since the MAR(1) model can be represented in the form (3), we see that (2) admits a causal and stationary solution if the spectral radius of , which is the product of the spectral radii of and , is strictly less than 1. Hence we have

###### Proposition 1.

If , then the MAR(1) model (2) is stationary and causal.

The detailed proof of the proposition will be given in the appendix.

We remark that the property of being “stationary and causal” is referred to as “stable” in Lütkepohl (2005). Here we follow the terminology and definitions used in Brockwell and Davis (1991).

When the condition of Proposition 1 is fulfilled, the MAR(1) model in (2) has the following causal representation after vectorization:

 vec(Xt)=∞∑k=0(Bk⊗Ak)vec(Et−k). (6)

It follows that the autocovariance matrices of (2) is given by

 Γk:=Cov(vec(Xt),vec(Xt−k))=∞∑l=0(Bk+l⊗Ak+l)Σ(Bl⊗Al)′,k∈Z, (7)

where is the covariance matrix of . The condition guarantees that the infinite matrix series is absolutely summable.

We discuss the impulse response function of the MAR(1) model, following the definition given in Section 2.10 of Tsay (2014) regarding the vector AR models. Let be the -th column of , and be the -th entry of . Following Tsay (2014)

, the effect of a unit standard deviation change (which is

) in on the future is given by .

This type of impulse response function exhibits a special structure if has the Kronecker product form (4). Let and be the -th columns of and , respectively. Then the effect of a unit standard deviation change in on the future is given by .

Consider that case that a one-standard deviation shock occurs at series. Let , the -th element of and , the -th element of . Then the impulse response function of -th series at lag is

 fi,j(k)=fri(k)fcj(k)=(AkΣr,⋅1)[i]⋅(BkΣc,⋅1)[j].

We further let , and . (We use the boldfaced here to emphesize it is a vector of functions.) We can view as the column response function and as the row response function. Using the example shown in Figure 1, if there is a unit standard deviation shock in the interest rate of US (location in the matrix), then the lag effect on the four economic indicators for -th country is

 [f1,j(k),f2,j(k),f3,j(k),f4,j(k)]′=[fr1(k),fr2(k),fr3(k),fr4(k)]′fcj(k)=fcj(k)⋅fr(k),

which has the following interpretations. The effect of the shock on four economic indicators in the same -th country, a 4-dimensional vector , is proportional to the vector , for all countries . Hence the five (4-dimensional economic indicator) vectors corresponding to the five countries are parellel to each other and only differ by the multiplier . This form of impulse response function implies that the economies of different countries have a co-movement as responses to a shock, but the impacts on different countries are of different scales. Similarly, we have

 [fi,1(k),…,fi,5(k)]=fri(k)⋅[fc1(k),…,fc5(k)]=fri(k)⋅[fc(k)]′,1≤i≤4.

That is, the effect on five countries regarding each economic indicator, which is a 5-dimensional row vector, is proportional to , and the four vectors corresponding to four indicators only differ by lengths .

In general, for a shock occurs at location , the lag- row and column response functions are and , respectively. In fact, the response function in matrix form in this case is given by the rank-one matrix:

 \bf IRi,j(k)=fr,i(k)[fc,j(k)]′.

## 3 Estimation

### 3.1 Projection method

To estimate the coefficient matrices and , our first approach is to view the MAR(1) model in (2) as the structured VAR(1) model in (3). We first obtain the maximum likelihood estimate or the least square estimate of in (1) without the structure constraint, then we find the estimators by projecting onto the space of Kronecker products under the Frobenius norm:

 (^A1,^B1)=argminA,B∥^Φ−B⊗A∥2F. (8)

This minimization problem is called the nearest Kronecker product (NKP) problem in matrix computation (Van Loan and Pitsianis, 1993; Van Loan, 2000)

. It turns out that an explicit solution exists, which can be obtained through a singular value decomposition (SVD) of a rearranged version of

.

We note that the set of all entries in is exactly the same as the set of all entries in . The two matrices have the same set of elements, and only differ by the placement of the elements in the matrices. Define a re-arrangement operator such that

 G(B⊗A)=vec(A)vec(B)′.

It is easy to see that the operator is a linear operator such that . We also note that the Frobenius norm of a matrix only depends on the elements in the matrix, but not the arrangement, hence . Then we have

 minA,B∥^Φ−B⊗A∥2F = minA,B∥G(^Φ)−G(B⊗A)∥2F = minA,B∥G(^Φ)−vec(A)vec(B)′∥2F = minA,B∥~Φ−vec(A)vec(B)′∥2F,

where is the re-arranged . It follows that the solution of (8) can be obtained through

 vec(^A)vec(^B)′=d1u1v′1,

where is the largest singular value of , and and are the corresponding first left and right singular vectors, respectively. By converting the vectors into matrices, we obtain corresponding estimators of and , denoted by and , with the normalization that . We call them projection estimators, and will use the acronym PROJ for later references.

We illustrate the re-arrangement operation with a special case of . We first rearrange the entries of the Kronecker product :

 ⎡⎢ ⎢ ⎢ ⎢⎣b11a11b11a12b12a11b12a12b11a21b11a22b12a21b12a22b21a11b21a12b22a11b22a12b21a21b21a22b22a21b22a22⎤⎥ ⎥ ⎥ ⎥⎦⟶⎡⎢ ⎢ ⎢⎣b11a11b21a11b12a11b22a11b11a21b21a21b12a21b22a21b11a12b21a12b12a12b22a12b11a22b21a22b12a22b22a22⎤⎥ ⎥ ⎥⎦.

We then rearrange the entries of in exactly the same way:

 ^Φ=⎡⎢ ⎢ ⎢ ⎢⎣ϕ11ϕ12ϕ13ϕ14ϕ21ϕ22ϕ23ϕ24ϕ31ϕ32ϕ33ϕ34ϕ41ϕ42ϕ43ϕ44⎤⎥ ⎥ ⎥ ⎥⎦⟶⎡⎢ ⎢ ⎢ ⎢⎣ϕ11ϕ31ϕ13ϕ33ϕ21ϕ41ϕ23ϕ43ϕ12ϕ32ϕ14ϕ34ϕ22ϕ42ϕ24ϕ44⎤⎥ ⎥ ⎥ ⎥⎦=:~Φ.

By abuse of notation, we omit the hat on each individual . Now it is clear that the NKP problem (8) is equivalent to .

Note that this procedure requires the estimation of the coefficient matrix first. This task is often formidable and inaccurate with moderately large and and a finite sample size. Hence the resulting projection estimator may not be accurate. However, it can serve as the initial value for a more elaborate iterative procedure.

### 3.2 Iterated least squares

If we assume the entries of

are i.i.d. normal with mean zero and a constant variance, the maximum likelihood estimator, denoted by

and , is the solution of the least squares problem

 minA,B∑t∥Xt−AXt−1B′∥2F. (9)

We refer to this estimator as LSE for the rest of this paper. If the error covariance matrix is arbitrary, the LSE is still an intuitive and reasonable estimator. To see the connection between the two estimators PROJ and LSE, we define

 Y =[vec(X2),vec(X3),…,vec(XT)], X =[vec(X1),vec(X2),…,vec(XT−1)].

The minimization problem (9) can be rewritten as

 minA,B∥Y−(B⊗A)X∥2F. (10)

Comparing (10) and (8), we see the problem (10) can be viewed as an inverse NKP problem. Unfortunately it does not have an explicit SVD solution (Van Loan, 2000).

There is another way to understand the minimization problem (9). Define

 Y′ =[X′2,X′3,…,X′T], (11) X′A =[X′1A′,X′2A′,…,X′T−1A′].

With these notations, the least squares problem (9) is equivalent to

 minA{minB∥Y−XAB∥2F}. (12)

In other words, we aim to find the optimal , so that the projection of the columns of on the column space of is maximized.

Taking partial derivatives of (9) with respect to the entries of and respectively, we obtain

 ∑tAXt−1B′BX′t−1−∑tXtBX′t−1 =0 (13) ∑tBX′t−1A′AXt−1−∑tX′tAXt−1 =0.

The function is guaranteed to have at least one global minimum, so solutions to (13) are guaranteed to exist. On the other hand, if and solve the equations in (13), so are and , where is any nonzero constant. We should regard them as the same solution because they yield the same matrix product . Equivalently, we say that and are the same solution of (13), if

. With this convention, we argue that with probability one, the global minimum of (

9) is unique. For this purpose, we need the following condition.

(Condition: R:)  The innovations are independent and identically distributed, and absolutely continuous with respect to Lebesque measure.

If Condition (R) is fulfilled, it holds that with probability one, the solutions of (13) have full ranks, and they have no zero entries. Let us restrict our discussion to this event of probability one. Without loss of generality, we fix the first entry of at . Let us use to denote the set of entries of and :

 Z:={aij,bkl:1≤i,j≤m,(i,j)≠(1,1),1≤k,l≤n}.

The matrix equations in (13) involves individual equations, and each equation takes the form , where is a multivariate polynomial in the polynomial ring over the complex field . The collection of all solutions of (13) is thus an affine variety in the space . By computing a Groebner basis for the ideal generated by the polynomials in (13), we see that is a finite set (see for example, Theorem 6, page 251 of Cox et al., 2015). Equivalently, the equations in (13) have a finite number of solutions.

Now consider the uniqueness of the global minimum. Assume and are two nonzero matrices such that for any constant . Define and as in (11) with and respectively. We further define . The projection of on the column space of is given by , and its Frobenius norm

 trace{Y′X(c)[X(c)′X(c)]−1X(c)′Y} (14)

is a rational function of with random coefficients determined by . With probability one, this rational function takes distinct values at different local minima. Combining this fact and the preceding argument, we see that with probability one, the least squares problem (9) has an unique global minimum, and finitely many local minima.

To solve (9), we iteratively update the two matrices and , by updating one of them in the least squares (9) while holding the other one fixed, starting with some initial and . By (13) the iteration of updating given is

and similarly by (13) the iteration of updating given is

 A←(∑tXtBX′t−1)(∑tXt−1B′BX′t−1)−1.

We denote these estimators by and , with the name least squares estimators, and the acronym LSE.

The iterative least squares may converge to a local minimum. In practice, we suggest to use the PROJ estimators and as the starting values of the iterations. On the other hand, by permuting the entries of the corresponding matrices (Van Loan, 2000), the problem (9

) can be rewritten as a problem of best rank-one approximation under a linear transform, which in turn can be viewed as a generalized SVD problem. The variable projection methods discussed in

Golub and Pereyra (1973) and Kaufman (1975) may also be applicable here.

### 3.3 MLE under a structured covariance tensor

When the covariance matrix of the error matrix assumes the structure in (4), it can be utilized to improve the efficiency of the estimators. The log likelihood under normality can be written as

 −m(T−1)log|Σc|−n(T−1)log|Σr|−∑ttr(Σ−1r(Xt−AXt−1B′)Σ−1c(Xt−AXt−1B′)′).

Four matrix parameters are involved in the log likelihood function. The gradient condition at the MLE is given by

 A∑tXt−1B′Σ−1cBX′t−1−∑tXtΣ−1cBX′t−1 =0 B∑tX′t−1A′Σ−1rAXt−1−∑tX′tΣ−1rAXt−1 =0 m(T−1)Σc−∑t(Xt−AXt−1B′)′Σ−1r(Xt−AXt−1B′) =0 n(T−1)Σr−∑t(Xt−AXt−1B′)Σ−1c(Xt−AXt−1B′)′ =0.

To find the MLE, we iteratively update one, while keeping the other three fixed. These iterations are given by

 A ←(∑tXtΣ−1cBX′t−1)(∑tXt−1B′Σ−1cBX′t−1)−1 B ←(∑tX′tΣ−1rAXt−1)(∑tX′t−1A′Σ−1rAXt−1)−1 Σc ←∑tR′tΣ−1rRtm(T−1), where Rt=Xt−AXt−1B′ Σr ←∑tRtΣ−1cR′tn(T−1), where Rt=Xt−AXt−1B′

The MLE for and under the covariance structure (4) will be denoted by and , with an acronym MLEs, where the “s” stands for the special structure (4) of the error covariance matrix. Due to the unidentifiability of the pairs of and , to make sure the numerical computation is stable, after looping through and in each iteration, we renormalize so that both and .

## 4 Asymptotics and Efficiency

Due to the identifiability issue regarding and , we make the convention that , and the three estimators are rescaled so that . Since the Kronecker product is unique, we also state the asymptotic distributions of the estimated Kronecker product , in addition to that of and .

We first present the central limit theorem for the projection estimators

and . Following standard theory of multivariate ARMA models (Hannan, 1970; Dunsmuir and Hannan, 1976), the conditions of Theorem 2 guarantees that converges to a multivariate normal distribution:

 √Tvec(^Φ−B⊗A)⇒N(0,Γ−10⊗Σ),

where is the covariance matrix of , and is given in (7). Let be the rearranged version of , and be the asymptotic covariance matrix of . The matrix is obtained by rearranging the entries of , and can be expressed using permutation matrices and Kronecker products, but we omit the explicit formula here.

###### Theorem 2.

Set , , and

 V0 :=(∥β∥−1[β′1⊗(I−α1α′1)]I⊗α1), V1 :=(β1β′1)⊗I+I⊗(α1α′1)−(β1β′1)⊗(α1α′1).

Assume that

are iid with mean zero and finite second moments. Also assume the stationarity condition

, and , and are nonsingular. It holds that

 √T(vec(^A1−A)vec(^B1−B))⇒N(0,V0Ξ1V′0),

and

 √T[vec(^B1)⊗vec(^A1)−vec(B)⊗vec(A)]⇒N(0,V1Ξ1V′1).

The proof of the theorem is in Section 7.

Now we consider the least squares estimators . Let , , and be a vector in . Note that should not be confused with defined in Theorem 2, which is related to the vectorization of . Recall that is the covariance matrix of . We have the following result for and .

###### Theorem 3.

Define , and . Let