# Sliced Average Variance Estimation for Multivariate Time Series

Supervised dimension reduction for time series is challenging as there may be temporal dependence between the response y and the predictors x. Recently a time series version of sliced inverse regression, TSIR, was suggested, which applies approximate joint diagonalization of several supervised lagged covariance matrices to take the temporal nature of the data into account. In this paper we develop this concept further and propose a time series version of sliced average variance estimation, TSAVE. As both TSIR and TSAVE have their own advantages and disadvantages, we consider furthermore a hybrid version of TSIR and TSAVE. Based on examples and simulations we demonstrate and evaluate the differences between the three methods and show also that they are superior to applying their iid counterparts to the data even when lagged values of the explaining variables are included as predictors.


## 1 Introduction

Linear supervised dimension reduction has a long tradition for independent and identically distributed (iid) observations, with a rich literature reviewed for example in [1]. The idea is to find all linear combinations of a predictor vector x which are needed to model a response y even when the true functional relationship between the response and the explaining variables is not known. In a multivariate time series context with temporal dependence, the goal is similarly to model a response time series value yt at time t as a function of the previous history of a stationary multivariate predictor time series x. Popular time series dimension reduction methods, such as those based on (dynamic) factor models, reviewed for example in [2], are not supervised, and supervised dimension reduction methods are still rare in the literature. Both Xia et al. [3] and Becker and Fried [4] propose the use of standard supervised iid dimension reduction methods, simply explaining a response series value yt with a long vector of lagged predictor time series values. This may increase the dimension of the problem dramatically and at the same time reduces the effective sample size. Barbarino and Bura [5, 6] combine ideas from factor models and standard iid supervised dimension reduction methods.

Recently, Matilainen et al. [7] proposed a procedure that finds the most relevant linear combinations of the predictor series, together with their most relevant lags, when modelling the response series. The approach, called TSIR, is an extension of sliced inverse regression (SIR), introduced by Li [8], and is based on the approximate joint diagonalization of the covariance matrices of conditional expected values COV(E(xt|yt+j)), which naturally take the temporal nature of the data into account. For a p-variate explaining time series and lags j=1,…,s, this means jointly approximately diagonalizing several p×p matrices, while in approaches like [3] and [4] only two, but much larger, matrices need to be simultaneously diagonalized.

In sliced average variance estimation (SAVE; Cook and Weisberg [9, 10]) for iid observations, one considers the variation of the conditional covariance rather than the variation of the conditional expectation, in order to better detect cases of nonlinear dependence. In this paper we suggest the similar use of E[(Ip−COV(xt|yt+j))2], j=1,…,s, in a time series context. This is a time series extension of SAVE, and is called here TSAVE. As TSIR and TSAVE have their own specific drawbacks, a hybrid of TSIR and TSAVE, denoted as TSSH, is also introduced. It can be seen as a weighted combination of TSIR and TSAVE generalizing the hybrid in Zhu et al. [11] to the time series context.

One aim here is also to see whether in a time series context the number of slices and the weight coefficient of the hybrid have similar preferred values as their iid counterparts. This was not investigated in [7] for TSIR, where the number of slices was just assumed to be the same as in the iid case. We will also investigate whether these tuning parameters depend much on the underlying stochastic processes.

The structure of the paper is as follows. We first recall SIR and SAVE for iid data. Then we move to the time series context, where first TSIR is reviewed, and then in Section 3.3 TSAVE and in Section 3.4 the hybrid of TSIR and TSAVE are introduced. Section 4 then includes examples and simulation studies. In Section 4.3 we conduct a simulation study to find some guidelines on how many slices are needed in practice to estimate the matrices that we approximately jointly diagonalize. In Section 4.4 we have another simulation study to find appropriate weights of the TSIR and TSAVE parts for the method TSSH and show that the hybrid can sometimes be more efficient than these methods separately. Finally in Section 4.5 we show that TSAVE is often better than TSIR and that both methods beat their iid counterparts applied to time series, such as the method in [4].

## 2 Supervised dimension reduction for iid data

In this section we review the iid supervised dimension reduction methods SIR and SAVE and their hybrid version. We formulate the supervised dimension reduction problem as an estimation problem in a blind source separation (BSS) model for the joint distribution of the response variable y and the p-variate vector of observable explaining variables x. The BSS model then assumes that

 x=Ωz+μ, (1)

where the full rank p×p matrix Ω is called the mixing matrix and μ is the p-dimensional location vector. The latent p-vector z can be partitioned as z=(z(1)⊤,z(2)⊤)⊤ with the respective dimensions k and p−k, and

1. (I1) E(z)=0 and COV(z)=Ip, and

2. (I2) z(2) is independent of (y,z(1)).

Hence in this model z(1) carries all the information needed to model the response y, and z(2) can be considered as the noise part. Note that assumption (I2) made in this paper for both SIR and SAVE is slightly stronger than those made in the original papers, i.e.

 z(2)⊥⊥y|z(1), E(z(2)|z(1))=0 (a.s.) (for SIR) and COV(z(2)|z(1))=Ip−k (a.s.) (for SAVE).

The assumption (I2) implies all these three assumptions and is needed in [12] to build asymptotic and bootstrap tests for the true subspace dimension in the SIR methodology.

As there may be several partitions of z fulfilling (I1) and (I2), we choose the one with the smallest k. The aim is to find a k×p unmixing matrix W such that Wx=z(1) up to a pre-multiplication by a k×k orthogonal matrix. Note that the latent z(1) as stated in our model has the same indeterminacy.

A direct consequence of assuming this model is the following.

###### Result 2.1.

Let y denote the response and z have the properties as stated in (I1) and (I2). Then

 COV[E(z|y)]=(COV[E(z(1)|y)]  0
              0               0)

and

As in [13], implies that there exists an orthogonal matrix such that

 xst=COV(x)−1/2(x−E(x))=Uz. (2)

Then it is shown in [7] that, based upon Result 2.1 (the first equation) and (2), one can define an unmixing matrix in the sliced inverse regression (SIR) [8] using the following steps.

###### Definition 2.2.

The SIR functional is defined as follows.

1. Consider the standardized variable xst=COV(x)−1/2(x−E(x)).

2. Find the k×p matrix w=(w1,…,wk)⊤ with orthonormal rows which maximizes

 k∑i=1[w⊤iCOV[E(xst|y)]wi]2.
3. W=wCOV(x)−1/2.
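The steps of Definition 2.2 can be sketched as follows for sliced data. This is a minimal illustrative implementation with our own function name and quantile-based slicing scheme, not the authors' code.

```python
import numpy as np

def sir_directions(x, y, k, n_slices=10):
    """Sketch of SIR (Definition 2.2): estimate k directions from the
    leading eigenvectors of the sliced estimate of COV[E(x_st | y)]."""
    n, p = x.shape
    # Step 1: standardize x_st = COV(x)^(-1/2) (x - mean)
    mu = x.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    cov_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    x_st = (x - mu) @ cov_inv_sqrt
    # Slice the continuous y into n_slices groups of roughly equal size
    edges = np.quantile(y, np.linspace(0, 1, n_slices + 1))
    idx = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, n_slices - 1)
    # Estimate COV[E(x_st | y)] from the weighted outer products of slice means
    m = np.zeros((p, p))
    for h in range(n_slices):
        mask = idx == h
        if mask.sum() == 0:
            continue
        mh = x_st[mask].mean(axis=0)
        m += mask.mean() * np.outer(mh, mh)
    # Step 2: the maximizing rows w_i are the leading eigenvectors of m
    evals, evecs = np.linalg.eigh(m)
    w = evecs[:, ::-1][:, :k].T
    # Step 3: unmixing matrix W = w COV(x)^(-1/2)
    return w @ cov_inv_sqrt
```

For a response depending linearly on one latent component, the single estimated direction aligns with that component up to sign.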

Assume now that

 E[(Ik−COV(z(1)|y))2]=V1Λ1V⊤1

is the eigenvector-eigenvalue decomposition of E[(Ik−COV(z(1)|y))2]. Let U1 be the p×k matrix consisting of the first k columns of U. Based upon the second equation of Result 2.1 and (2),

 E[(Ip−COV(xst|y))2] = E[(Ip−UCOV(z|y)U⊤)2]
                     = UE[(Ip−COV(z|y))2]U⊤
                     = U1E[(Ik−COV(z(1)|y))2]U⊤1
                     = U1V1Λ1V⊤1U⊤1.

Write now w=V⊤1U⊤1. Then wE[(Ip−COV(xst|y))2]w⊤=Λ1 is diagonal. The sliced average variance estimation (SAVE) [9, 10] then has the following steps.

###### Definition 2.3.

The SAVE functional is defined as follows.

1. Consider the standardized variable xst=COV(x)−1/2(x−E(x)).

2. Find the k×p matrix w=(w1,…,wk)⊤ with orthonormal rows that maximizes

 k∑i=1[w⊤iE[(Ip−COV(xst|y))2]wi]2.
3. W=wCOV(x)−1/2.
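Analogously, the SAVE supervised matrix E[(Ip−COV(xst|y))2] can be sketched as below. The input is assumed to be already standardized, and the helper name and slicing scheme are our own illustrative choices.

```python
import numpy as np

def save_matrix(x_st, y, n_slices=5):
    """Sketch of the SAVE supervised matrix E[(I_p - COV(x_st | y))^2],
    with y sliced into n_slices groups; x_st is assumed standardized."""
    n, p = x_st.shape
    edges = np.quantile(y, np.linspace(0, 1, n_slices + 1))
    idx = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, n_slices - 1)
    m = np.zeros((p, p))
    for h in range(n_slices):
        xh = x_st[idx == h]
        if len(xh) < 2:
            continue
        # Weighted square of (I_p - within-slice covariance)
        d = np.eye(p) - np.cov(xh, rowvar=False)
        m += (len(xh) / n) * d @ d
    return m
```

The SAVE directions are then the leading eigenvectors of this matrix; for a purely quadratic dependence they recover the relevant component where SIR fails.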

For a random sample (y1,x1),…,(yn,xn), the population values of the two functionals can be consistently estimated if y is discrete with a finite number of values. In practice, a continuous y is then often replaced by its discretized version yH utilizing H disjoint slices S1,…,SH. One can for example define a discrete variable yH by the condition yH=h if y∈Sh, h=1,…,H. Note that the condition (I2) and the model still hold true for yH, but the dimension k and the functionals may change.

###### Remark 2.4.

Both SIR and SAVE can be seen as approaches which jointly diagonalize two matrices [14], the regular covariance matrix COV(x) and the supervised matrix COV[E(x|y)] or E[(Ip−COV(x|y))2], respectively. Hence, both methods can be solved using a generalized eigenvector-eigenvalue decomposition, where under model (1) there are at most k non-zero eigenvalues, and the functional is given by the corresponding generalized eigenvectors. Hence the functionals are unique if all non-zero eigenvalues are distinct.

###### Remark 2.5.

The estimation of the dimension of the subspace has some issues. For example the number of slices H can change the estimated value of the subspace dimension, see e.g. [15]. However, the block diagonal structures in Result 2.1 still exist for different values of H, but the block sizes may be different. Also, the method used may not find the whole subspace. In the case of SIR, for example when the response is a quadratic function of a component, SIR fails to capture it (see e.g. [9]). In general SAVE is able to capture a larger portion of the subspace than SIR, as Cook and Critchley have shown in [16]. Due to these issues, it is possible that fewer than k directions are found.

In practical data analysis with unknown k, the estimated eigenvalues and the variation of the eigenvectors have been used to estimate k; see, especially in the context of SIR, [17, 18, 12] and the references therein. The magnitude of the eigenvalues indicates the relevance of the corresponding source to model the response.

As [16] show, SAVE is in general considered more comprehensive when estimating the subspace of interest and SIR can be seen in certain situations as a special case. This increased flexibility of SAVE is however considered costly and it is usually said that SAVE needs more data than SIR [16]. This is also reflected when considering the numbers of slices used.

For SIR the slices are often chosen to contain roughly equal numbers of observations. In simulations in [8] it was shown that SIR is not very sensitive to the choice of H. The rank of COV[E(x|yH)], and thus the maximum number of non-zero eigenvalues, is H−1, which however gives the restriction k≤H−1.

SAVE, unlike SIR, is sensitive to the choice of H, as it uses higher moments and therefore needs more observations per slice than SIR; see for example [10, 16, 19]. Zhu et al. have conducted some simulations for SAVE [11]. In their settings SAVE still produced proper results when the number of observations per slice was large enough, but not anymore when the slices became too small. However, we should note that these results are based only on some specific simulation settings.

Asymptotic properties of the SAVE estimator have been investigated e.g. in [20] in order to find an estimate of the subspace dimension k. Li and Zhu [19] have examined the consistency of the SAVE estimator. SAVE can achieve consistency, and in the case where the response is discrete and takes finitely many values, SAVE can also achieve the root-n rate. In general, however, SAVE cannot achieve this rate, unlike SIR. For asymptotic properties of the SIR estimator, including consistency and asymptotic normality of the estimator, see [21].

In [16] it is argued that SAVE with sufficient data is superior to SIR, but that in practice it would be better to use both and let them complement each other to uncover the structures of interest. A hybrid method based on SIR and SAVE, using a convex combination (1−a)COV[E(x|y)]+aE[(Ip−COV(x|y))2] with 0≤a≤1, has been proposed. It is discussed first briefly in [22] and then more closely, with a discussion of the choice of the coefficient a, in [11]. In Section 3.4 we introduce a time series version of this hybrid method.
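As a sketch, such a hybrid can be formed directly from the two estimated supervised matrices (computed on standardized data), with the directions given by the leading eigenvectors of the combination. The function name and weighting convention (a = 0 for SIR, a = 1 for SAVE) are our own illustrative choices.

```python
import numpy as np

def hybrid_directions(m_sir, m_save, k, a=0.5):
    """Sketch of the SIR/SAVE hybrid: leading eigenvectors of the convex
    combination (1 - a) * M_SIR + a * M_SAVE of the supervised matrices."""
    m = (1.0 - a) * m_sir + a * m_save
    evals, evecs = np.linalg.eigh(m)
    return evecs[:, ::-1][:, :k].T  # leading k eigenvectors as rows
```

When one direction is visible only to SIR and another only to SAVE, an intermediate weight lets the combined matrix carry both.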

Also [23] have combined the strengths of SIR and SAVE in another method: as SIR is efficient in finding linear relationships, it is used to find a partial dimension reduction subspace, and SAVE is then used to find the rest.

Note also that SIR-II, mentioned already in the rejoinder of Li's SIR paper [8] and developed further in [24] and [25], is the first hybrid method of the first and second moments in supervised dimension reduction. However, it is not a combination of SIR and SAVE.

As [7] recently extended SIR to the time series framework, it is natural to extend SAVE as well, which will be done in the following sections.

## 3 Linear supervised dimension reduction for time series

### 3.1 The blind source separation model for linear supervised dimension reduction for time series

Assume that y=(yt) and x=(xt) are (weakly and jointly) stationary univariate and p-variate time series, respectively. In this paper the term time series is used both for the observed realizations and for the stochastic process producing them.

In the time series prediction problem, it is usually assumed that the response series at time t is an unspecified function of xt,xt−1,… and ϵt,ϵt−1,…, where (ϵt) is an unspecified stationary noise process independent from (xt), i.e.,

 yt=f(xt,xt−1,…;ϵt,ϵt−1,…).

As in the iid case we assume the blind source separation (BSS) model, which states that only linear combinations of xt are needed in the prediction model. In the following, if A is a matrix and b a vector of matching dimensions, Ax+b denotes the time series with the value Axt+b at time t. In the time series BSS model, we assume that

 x=Ωz+μ,

where Ω is a full rank p×p mixing matrix and μ is a p-dimensional location vector. Furthermore, just like in the iid case, the stationary p-variate source time series z can be partitioned as z=(z(1)⊤,z(2)⊤)⊤ with the dimensions k and p−k of the subseries, respectively. Dimension k is the smallest one to fulfil the conditions

1. (T1) E(zt)=0 and COV(zt)=Ip, and

2. (T2) the process z(2) is independent of (y,z(1)).

As in Section 2, it follows from (T2) that

 z(2)⊥⊥y|z(1), E(z(2)t+s|z(1)t)=0 (a.s.) and COV(z(2)t+s|z(1)t)=Ip−k (a.s.) for all s∈Z.

All the information needed to model y is therefore contained in the process z(1) and one can write

 yt=f(xt,xt−1,…;ϵt,ϵt−1,…)=f0(z(1)t,z(1)t−1,…;ϵt,ϵt−1,…) (3)

with another unspecified function f0, possibly depending on Ω and μ.

Also in this time series case the model is ill-defined in the sense that both z(1) and z(2) can be multiplied by respective orthogonal matrices and still fulfil (T1) and (T2). The goal is therefore the estimation of the unmixing matrix W such that Wx=z(1) up to orthogonal transformations. The function f0 should then be parametrized to allow all linear combinations of the elements of z(1)t,z(1)t−1,…, for example. One should also identify which lagged values contribute to the model.

In this section, the TSIR method from [7] is first reviewed and then the methods TSAVE and TSSH, a combination of TSIR and TSAVE, are introduced. Finally we recall the selection of the number of important sources and lags as suggested in [7].

### 3.2 SIR for time series

In [7] the sliced inverse regression for time series uses the matrices

 G0,j(z,y)=COV(E(zt|yt+j)),  j∈Z+

with the following important property.

###### Result 3.1.

Under (T1) and (T2)

 COV[E(zt|yt+j)]=(COV[E(z(1)t|yt+j)]  0
                  0                   0),

for all lags j∈Z+.

Based on Result 3.1, [7] then finds the time series sliced inverse regression (TSIR) estimate of the unmixing matrix with the following three steps.

###### Definition 3.2.

The TSIR functional is defined as follows.

1. Find xst=COV(xt)−1/2(xt−E(xt)).

2. Find the k×p matrix w=(w1,…,wk)⊤ with orthonormal rows which maximizes

 ∑j∈Sk∑i=1[w⊤iG0,j(xst,y)wi]2, (4)

for a chosen set of lags S⊂Z+ with |S|=s.

3. W=wCOV(xt)−1/2.
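The lagged supervised matrix G0,j used in step 2 can be sketched as follows for one lag; the function name and quantile-based slicing of the shifted response are our own illustrative choices, not the tsBSS implementation.

```python
import numpy as np

def tsir_lagged_matrix(x_st, y, lag, n_slices=10):
    """Sketch of G_{0,j}(x_st, y) = COV[E(x_st,t | y_{t+j})] for one lag j,
    with the shifted response sliced into n_slices groups."""
    xt = x_st[:-lag]          # x at time t
    yt = y[lag:]              # y at time t + j
    n, p = xt.shape
    edges = np.quantile(yt, np.linspace(0, 1, n_slices + 1))
    idx = np.clip(np.searchsorted(edges, yt, side="right") - 1, 0, n_slices - 1)
    # Weighted outer products of within-slice means of the lagged predictors
    g = np.zeros((p, p))
    for h in range(n_slices):
        mask = idx == h
        if mask.sum() == 0:
            continue
        mh = xt[mask].mean(axis=0)
        g += mask.mean() * np.outer(mh, mh)
    return g
```

When the response depends on only one component at the chosen lag, the leading eigenvector of this matrix points at that component.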

The matrix W is a k×p matrix. If the approach for time series used in [3] and [4] were applied here, the corresponding matrix would be of a much larger size, as the predictor vector is augmented with all its lagged values. This would make the method less stable when the number of time series components and the number of lags used increase.

### 3.3 SAVE for time series

To make a time series version of SAVE, a natural extension for the matrix of interest is

 G1,j(z,y)=E((Ip−COV(zt|yt+j))2),  j∈Z+

which depends on the joint distribution of zt and yt+j. We then have the following.

###### Result 3.3.

Under (T1) and (T2)

 E((Ip−COV(zt|yt+j))2)=(E((Ip−COV(z(1)t|yt+j))2)  0
                        0                         0),

for all lags j∈Z+.

The following unmixing matrix estimate is then called the time series sliced average variance estimator (TSAVE) functional:

###### Definition 3.4.

The TSAVE functional is defined as follows.

1. Find xst=COV(xt)−1/2(xt−E(xt)).

2. Find the k×p matrix w=(w1,…,wk)⊤ with orthonormal rows that maximizes

 ∑j∈Sk∑i=1[w⊤iG1,j(xst,y)wi]2, (5)

for a chosen set of lags S⊂Z+ with |S|=s.

3. W=wCOV(xt)−1/2.

TSIR and TSAVE can be seen as procedures which jointly diagonalize several matrices (in terms of the Frobenius norm), that is, the covariance matrix of xt and matrices depending on the joint distributions of xt and yt+j with lags j in S. Under the blind source separation model assumed in this paper all the matrices of interest can be jointly diagonalized. For finite data, however, this can be done only approximately, and it has to be solved using algorithms based on some objective criterion, as for example stated in the algorithmic outlines above. While many other criteria are possible and many algorithms exist in the literature, for practical purposes we will use the approach based on Jacobi rotations [26] in the search for the matrices w that maximize (4) and (5), respectively. This algorithm was recommended in [27]; for more details about joint diagonalization in BSS see for example [28, 29, 30, 13, 31, 32, 33] and the references therein. In the following we will however not distinguish between joint diagonalization and joint approximate diagonalization.
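A minimal sketch of such a Jacobi-rotation-based joint diagonalizer for symmetric matrices is given below, maximizing the sum of squared diagonal elements; it illustrates the idea but is not the exact routine of [26] or of the tsBSS/JADE packages.

```python
import numpy as np

def joint_diagonalize(mats, max_iter=100, tol=1e-10):
    """Jointly (approximately) diagonalize symmetric matrices by sweeps of
    Jacobi rotations, maximizing the sum of squared diagonal elements.
    Returns the orthogonal matrix V and the rotated matrices V^T M_k V."""
    mats = [m.copy() for m in mats]
    p = mats[0].shape[0]
    v = np.eye(p)
    for _ in range(max_iter):
        changed = False
        for i in range(p - 1):
            for j in range(i + 1, p):
                # Optimal angle: (cos 2t, sin 2t) is the leading eigenvector
                # of G = sum_k g_k g_k^T with g_k = (a_ii - a_jj, 2 a_ij)
                g = np.array([[m[i, i] - m[j, j], 2.0 * m[i, j]] for m in mats])
                big_g = g.T @ g
                vmax = np.linalg.eigh(big_g)[1][:, -1]
                if vmax[0] < 0:
                    vmax = -vmax
                theta = 0.5 * np.arctan2(vmax[1], vmax[0])
                c, s = np.cos(theta), np.sin(theta)
                if abs(s) > tol:
                    changed = True
                    rot = np.eye(p)
                    rot[i, i] = rot[j, j] = c
                    rot[i, j], rot[j, i] = -s, s
                    mats = [rot.T @ m @ rot for m in mats]
                    v = v @ rot
        if not changed:
            break
    return v, mats
```

For exactly jointly diagonalizable inputs the sweeps drive all off-diagonal elements to numerical zero.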

Given a solution W=(w1,…,wk)⊤, the maximum value of the criterion function is the sum of the quantities

 λij=(w⊤iE[(Ip−COV(xst,t|yt+j))2]wi)2,

in the sense that λij measures the contribution of the jth lag of the ith linear combination to this maximum value, i=1,…,k; j∈S. W is unique if the sums λi⋅=∑j∈Sλij, i=1,…,k, are distinct. The components are standardized and can be naturally ordered so that λ1⋅≥⋯≥λk⋅. A large value of λij indicates a strong dependence between the time series yt+j and w⊤ixt. Identifying the relevant sources and lags is however difficult due to the possible serial correlations in y, which means that λij might vanish only slowly with growing j for irrelevant lags.

As for SAVE, the unmixing matrix estimate is obtained for the sliced version yH, where yH is a discrete time series such that yHt=h if yt∈Sh, h=1,…,H. As with SAVE, it can also be concluded here that TSAVE will be more sensitive to the number of slices, since TSAVE, just like SAVE, estimates a higher order moment and therefore needs more information (see Section 4.3).

Consider the following important property of TSAVE.

###### Result 3.5.

Let x∗t=Axt+b, where A is a full rank p×p matrix and b a p-vector. TSAVE is affine equivariant in the sense that, for all such transformed time series, W(x∗)x∗t=W(x)xt up to the signs and the location of the component series.

Result 3.5 also means that W(x∗)=DW(x)A−1, where D is a diagonal matrix with diagonal elements ±1, up to the location. The proof is straightforward and hence omitted here.

To derive asymptotic properties of TSAVE, the challenge consists of deriving the joint limiting distributions of the covariance matrix estimate and the matrices G1,j for all lags j∈S, for which probably stronger assumptions on the process (xt,yt) need to be made. Therefore this is beyond the scope of this paper, and we just outline how the asymptotics could be derived given the joint distribution of these matrices and a known signal dimension k.

Write G∗j=G1,j(xst,y) for all j∈S. The maximization of (5) can then also be written as

 ∑j∈Sk∑i=1[w⊤iG∗jwi]2, (6)

where W=(w1,…,wk)⊤, and denote by M the matrix collecting the terms of the criterion evaluated at W. Now we can use the Lagrange multiplier technique to account for the orthonormality of the rows of W, which yields estimating equations for W and M. These equations lead to a fixed-point algorithm (see e.g. [26]).

Using this we can search for the limiting distributions of ^W and ^M, with known dimension k and lags in S. As the estimate is also affine equivariant, we can without loss of generality consider here the case where Ω=Ip and μ=0.

Let T be the length of the time series. Assuming that the joint limiting distribution of the estimates ^G∗j, j∈S, is known, the joint limiting distribution of √T(^W−W) and √T(^M−M) satisfies the conditions

 √T(^W−W)M′−M√T(^W−W)′=√T(^M−M)W′−W√T(^M−M)′+oP(1)

and

 √T(^W−W)W′=−W√T(^W−W)′+oP(1),

which then can be used in finding the joint limiting distribution of ^W and ^M.

Finally we obtain the limiting distribution of the TSAVE estimate. For similar derivations based on the Lagrange multiplier technique, see for example [32].

### 3.4 A hybrid of TSIR and TSAVE

As both SIR and SAVE have their advantages and drawbacks, a hybrid of SIR and SAVE using a convex combination of the two supervised matrices was proposed in [11]. As the time series versions of SIR and SAVE show similar behaviour to the original SIR and SAVE, respectively, we similarly propose here a convex combination of the TSIR and TSAVE methods. We call this time series SIR and SAVE hybrid method TSSH. In TSSH we search for a k×p matrix w=(w1,…,wk)⊤ with orthonormal rows that maximizes

 ∑j∈Sk∑i=1(w⊤i((1−a)G0,j(xst,y)+aG1,j(xst,y))wi)2,

where 0≤a≤1 and S is a set of chosen lags as before. Then a=0 gives TSIR and a=1 gives TSAVE.

Note that results similar to Results 3.1, 3.3 and 3.5 apply to TSSH as well. Also, the search for the number of latent sources and lags goes as for TSIR and TSAVE.

In addition to the issues that TSIR and TSAVE have, a proper value for the weight coefficient a needs to be found. This is discussed in Section 4.4.

Extending the method of [23] to time series is challenging, as we need to search not only for sources but also for the lags corresponding to each of the sources.

### 3.5 Identification of k and the lags of interest

In practice the number of sources, i.e. the value of k, is not known and needs to be estimated as well. Also the important lags regarding the sources need to be found. We can choose these by using the quantities λij. However, at this stage of the development of TSAVE, formal testing is not possible yet and we suggest using the same strategies as suggested for TSIR in [7].

For that purpose consider the matrix L where

 lij=λij/(∑ki=1∑sj=1λij),   i=1,…,k; j=1,…,s,

contains the scaled pseudo eigenvalues, the scaling being chosen such that the elements of L add up to 1. Note that we assume here that we use the first s lags, as currently we do not have information that would suggest using some other set of lags. Row and column sums of L will be denoted as li⋅ and l⋅j, respectively.

Assume then that W is defined such that the latent sources are ordered according to the magnitudes of the row sums, l1⋅≥⋯≥lk⋅. Then [7] suggested different strategies to find the appropriate number of sources and the lags corresponding to those sources, by trying to explain, similarly as in principal component analysis (PCA), a proportion P of the dependence between the latent sources and the response series.

The suggested strategies can be summarized as:

ALL LAGS:

keep all lags and find the smallest value k0 such that l1⋅+⋯+lk0⋅≥P.

ALL SOURCES:

keep all sources and find the smallest s0 such that l⋅1+⋯+l⋅s0≥P.

RECTANGLE:

find k0 and s0 with the smallest product k0s0 such that the elements of the upper left k0×s0 submatrix of L sum to at least P.

BIGGEST VALUES:

find the smallest number of elements of L whose sum is at least P.

While the first two strategies assume some prior knowledge about the number of lags or sources, respectively, the last two seem suitable for general use. As in the iid case, the 'real' number of sources may not be found due to the slicing (value of H) and/or the method used, and hence it is possible that fewer than k sources are found.

Natural values for P are then for example 0.5 or 0.8. How the different strategies perform will also be considered in the example and simulation section.
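Two of these strategies can be sketched directly in terms of the scaled matrix L. The function names and the exact tie-breaking are our own illustrative choices.

```python
import numpy as np

def scale_l(lam):
    """Scale a matrix of pseudo eigenvalues so its elements sum to 1."""
    return lam / lam.sum()

def biggest_values(l, P=0.8):
    """BIGGEST VALUES: smallest number of elements of L whose sum >= P."""
    flat = np.sort(l.ravel())[::-1]
    return int(np.searchsorted(np.cumsum(flat), P) + 1)

def rectangle(l, P=0.8):
    """RECTANGLE: the (k0, s0) upper-left sub-block of L with the smallest
    product k0 * s0 whose elements sum to at least P."""
    kmax, smax = l.shape
    best = None
    for k in range(1, kmax + 1):
        for s in range(1, smax + 1):
            if l[:k, :s].sum() >= P and (best is None or k * s < best[0] * best[1]):
                best = (k, s)
    return best
```

For a toy L with one dominant source at the first lag and a second weaker source, BIGGEST VALUES picks out the few large entries while RECTANGLE returns the corresponding block dimensions.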

## 4 Examples and simulations

In this section the differences between TSIR and TSAVE are first visualized. Then the simulation settings and the prediction models are presented. In Section 4.3 we search for the best values of the number of slices H for TSIR and TSAVE, and in Section 4.4 for appropriate values of the weight coefficient a for TSSH. In both cases we aim to give some guidelines on how to choose them in practice. Finally we compare TSAVE with TSIR, SIR and SAVE (applied to the time series case) in Section 4.5.

Note that TSIR, TSAVE and TSSH, together with the different selection strategies described in the previous section, are implemented in the R package tsBSS [34] and are used together with the R package JADE [33] in this section.

### 4.1 Visualization of the differences of TSIR and TSAVE

It is already well established that the regular SIR works efficiently with linear relationships, but not when the relationship is specified by a symmetric function of the predictors [8, 9]. The regular SAVE, on the other hand, works also with a symmetric function.

To consider the differences between the time series versions of both methods, consider examples where the response at time t is either a linear (model M1) or a quadratic (model M2) function of lagged values of an explaining series following an AR(1) model, plus noise. To illustrate how and where TSIR and TSAVE work, we plot the values of yt+j against the values of zt for several lags j for all the models (see Figures 1 and 2).

The larger black dots in the figures denote the sample means of zt within each slice of yt+j. The variance of these slice means is also added to each figure as text. A non-zero variance value indicates that TSIR is able to find a relationship between yt+j and zt.

The width of the dark gray bars added around the black dots corresponds to the sample value of the variance-based TSAVE quantity in each slice. When TSAVE cannot find a relationship between yt+j and zt, all these values are so close to zero that the bars are hardly visible. The means of these values are also added to each figure as text. A non-zero mean indicates that TSAVE is able to find a relationship between yt+j and zt.

From Figures 1 and 2 it can be seen that TSIR cannot find any relationship between yt+j and zt for the lags at which the response did not depend on the explaining series. However, the results are different for the lags with true dependence. As seen in the left side panels of Figure 1, yt+j and zt have a strong linear relationship. The variance of the slice means is clearly non-zero and indicates that TSIR finds the relationship. Also, a non-zero value for the mean of the slicewise variance quantities indicates that TSAVE works with linear relationships as well.

On the left side panels of Figure 2, yt+j and zt have a strong quadratic relationship. It can easily be seen that TSAVE finds the quadratic relationship. However, as the slice means are close to zero, TSIR fails to find the quadratic relationship, similarly to the regular SIR in iid regression.

Based on the figures and the values in them, it can be concluded that TSAVE finds both the linear and the quadratic relationship between the response and the explaining series with lags 1 and 3 in the models M1 and M2.

Next we illustrate how the appropriate lags and the number of latent sources are chosen in TSAVE and TSIR using the strategies mentioned in Section 3.5.

Consider a 4-variate time series z, where the first two components are AR(1) processes, the third is an ARMA(1,1) process and the fourth is an MA(1) process, with the parameter values given in Section 4.2. As both methods are affine equivariant, we have chosen Ω=Ip as the mixing matrix and therefore xt=zt for all t. The time series are standardized.

In choosing the number of sources and the lags, we use a fixed threshold value P and a fixed number of slices H. The tables are constructed as the average values of the elements of L over 100 repetitions, and we have used the same set of lags for both methods. Assume now that the response at time t depends on the predictors as follows.

 yt=z21,t−1+3z2,t−5+ϵt,

where ϵt is iid noise. Table 1 includes the L matrices for both TSAVE and TSIR based on this model. It can be seen that TSAVE finds two latent sources and five lags. Already the values of the two elements corresponding to the true source-lag pairs are clearly bigger than the others, and together they explain more than 80 % of the dependence between the response and the predictors. On the other hand, TSIR finds only one source (and five lags), which seems to explain by far the most of the dependence, but it fails to find the other source with the quadratic relationship.

### 4.2 Models and prediction

The results presented here are based on the following ARMA models, where the four component series are as follows.

 Components 1 and 2: AR(1) with ϕ=0.2 (or 0.8).
 Component 3: ARMA(1,1) with ϕ=0.3 and θ=0.4.
 Component 4: MA(1) with θ=−0.4.

Note that for the first two components two different values of ϕ are used to compare how the level of autocorrelation affects the results. The response series yt depends then on the first two components z1 and z2 in the following different ways:

 Model A: yt=2z1,t−1+3z2,t−1+ϵt
 Model B: yt=z21,t−1+3z2,t−5+ϵt
 Model C: yt=(2z1,t−1+3z2,t−1)2+ϵt
 Model D: yt=z21,t−1+3z22,t−5+ϵt
 Model E: yt=2z31,t−1+3z22,t−5+ϵt

All the models have iid normally distributed innovations ϵt. As before, Ω=Ip is used as the mixing matrix.
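A sketch of this simulation design (model B shown) could look as follows; the burn-in length, seed handling and function name are our own choices, not the paper's setup.

```python
import numpy as np

def simulate_model_b(n, phi=0.2, seed=0):
    """Simulate the four latent component series of Section 4.2 and the
    model B response y_t = z1_{t-1}^2 + 3 z2_{t-5} + eps_t (sketch)."""
    rng = np.random.default_rng(seed)
    burn = 200
    m = n + burn
    e = rng.standard_normal((m, 5))  # innovations; last column for eps
    z = np.zeros((m, 4))
    for t in range(1, m):
        z[t, 0] = phi * z[t - 1, 0] + e[t, 0]                       # AR(1)
        z[t, 1] = phi * z[t - 1, 1] + e[t, 1]                       # AR(1)
        z[t, 2] = 0.3 * z[t - 1, 2] + e[t, 2] + 0.4 * e[t - 1, 2]   # ARMA(1,1)
        z[t, 3] = e[t, 3] - 0.4 * e[t - 1, 3]                       # MA(1)
    z = z[burn:]
    eps = e[burn:, 4]
    y = np.zeros(n)
    # y_t depends on z1 at lag 1 (quadratically) and z2 at lag 5 (linearly)
    y[5:] = z[4:-1, 0] ** 2 + 3.0 * z[:-5, 1] + eps[5:]
    return z, y
```

With Ω = Ip the observed series equals the latent one, which matches the affine equivariance argument used in the paper.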

Note that we have also performed additional simulations which evaluated what happens if there are almost non-stationary components, stochastic volatility components or components with heavy-tailed innovations. The exact settings are detailed in the appendix together with the corresponding results. While there are maybe minor differences, we believe that the guidelines derived on the settings specified above suffice in practice.

For prediction we use the prediction model (3). As an approximation of the function f0 we use both simple linear regression and regression with quadratic B-splines (for model E cubic B-splines, as it includes a term of the third power). The size of the testing set is 100, i.e. we predict the last 100 values of the data. To predict the value at time point t in the testing set, we use all the preceding observations as a training set ('rolling window approach').

We estimate the accuracy of the prediction by calculating the root mean square error (RMSE) based on the one-step-ahead prediction errors of the last 100 observations (testing set). The lags used to create the matrix L are kept fixed, and the number of repetitions is 500.
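The rolling-window evaluation can be sketched as below, with a plain linear model standing in for the actual predictors; all names are illustrative and the spline variant is omitted.

```python
import numpy as np

def rolling_rmse(x, y, test_size=100):
    """One-step-ahead rolling-window evaluation sketch: for each of the
    last test_size points, refit a linear model on all earlier data and
    predict the next response; return the RMSE of those predictions."""
    n = len(y)
    errs = []
    for t in range(n - test_size, n):
        xtr, ytr = x[:t], y[:t]  # training window: everything before t
        a = np.column_stack([np.ones(t), xtr])
        coef, *_ = np.linalg.lstsq(a, ytr, rcond=None)
        pred = np.concatenate([[1.0], x[t]]) @ coef
        errs.append(y[t] - pred)
    return float(np.sqrt(np.mean(np.square(errs))))
```

Relative RMSE values between two slice choices are then simply the ratio of two such numbers.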

### 4.3 On the number of slices for TSIR and TSAVE

In [7] simulations for TSIR were conducted only with the value H=10, which is a common choice for SIR. Here we aim to go a bit deeper and provide some guidelines for choosing the value of H for TSAVE as well as for TSIR.

To find the optimal number of slices H, we conduct an experiment with the different strategies and with two different threshold values, 0.5 and 0.8, for finding the important lags and the number of sources. We use time series lengths up to 3000.

First we predict the values using all the strategies with linear and spline predictions. For several choices of H and models A–D, we calculate the RMSE values relative to the value H=10, which is commonly used in e.g. SIR and has been used for TSIR in [7]. The results are computed for the components with low and high autocorrelation as well as for the different time series lengths.

Large numbers of slices do not work that well in any of the settings for TSAVE. We conclude from this that there are then simply too few observations per slice. Thus we show here results only concerning the values 2, 5 and 10. The choices 2 and 5 are compared to 10; if such a choice is better than H=10, the corresponding relative RMSE values will generally be lower than one.

In model A the linear predictions are the most efficient, and spline predictions may add a little additional noise. In models B–D, however, the linear predictions are not very good at determining the best value of H. Only in short time series does one choice seem to be a bit better than the others, while in longer time series any possible difference is barely visible. This is expected, since the relationship between the predictors and the response is not linear. Thus the spline predictions are preferred for determining the optimal value of H.

In both the low and high dependence settings the threshold value 0.5 seems to work well for models A and C, where only one source is expected to be found, and choosing 0.8 instead has only a very small effect on the results. For models B and D it seems that 0.5 might be enough for short time series, but for longer time series, especially when the sources have high autocorrelation, using 0.8 is crucial for the second source to be found. Thus 0.8 seems to be a safe choice in general.
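One plausible reading of the threshold rule, assuming it is applied to the cumulative share of the ordered (pseudo-)eigenvalues (the exact rule is the one defined in [7]), can be sketched as:

```python
import numpy as np

def n_sources(values, threshold=0.8):
    """Smallest k such that the k largest values account for at least
    `threshold` of the total; a higher threshold keeps more sources."""
    lam = np.sort(np.abs(np.asarray(values, dtype=float)))[::-1]
    cum = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(cum, threshold, side='left') + 1)
```

With values (4, 3, 2, 1) the threshold 0.5 keeps two sources while 0.8 keeps three, illustrating why the higher threshold can be decisive for finding a weaker second source.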

When comparing the different strategies to select the number of sources and the lags, the biggest values strategy almost always produces the best results, and in the remaining few cases it is very close to the best. Figures 3–6 show, based on that strategy, the relative RMSE for models A–D and the different sample sizes. These figures clearly indicate that using only 2 or 5 slices for TSAVE is clearly better than 10 slices. Only with increasing sample size do the differences vanish, which means that then all slices contain enough observations. The same can also be observed using the other selection strategies (not shown here), although in rare cases 5 slices can then be better than 2 slices.
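A hedged sketch of what we mean by the biggest values strategy, assuming the selection works on a matrix of nonnegative contribution values with one row per source and one column per lag (the concrete matrix is defined earlier in the paper):

```python
import numpy as np

def biggest_values(L, threshold=0.8):
    """Pick (source, lag) pairs in decreasing order of their value in L
    until the picked values reach `threshold` of the total."""
    order = np.argsort(np.asarray(L), axis=None)[::-1]
    total, chosen, acc = float(np.sum(L)), [], 0.0
    for flat in order:
        i, j = np.unravel_index(flat, np.shape(L))
        chosen.append((int(i), int(j)))
        acc += L[i][j]
        if acc / total >= threshold:
            break
    return chosen
```

The distinct rows among the selected pairs give the number of sources, and the selected columns give the corresponding lags.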

For TSIR the choice H = 10 seems to be the safest when linear predictions are used. However, when using spline predictions, smaller values of H may be better choices for the shortest time series. Figure 7 includes results from spline predictions using the biggest values strategy. Note that for models B–D, TSIR does not work well; thus the evaluation of the value of H for TSIR is not considered for those models.

We could also look for the best H using the matrix employed for selecting the sources and lags. For models A and C we can check whether only one source, with at least lag 1, is found, as only one is expected. For models B and D we can first check whether one source with lag 1 and one source with lag 5 are found; if so, we can then check whether only those two sources are found.

With the biggest values strategy for model A, generally several values of H up to 10 are good choices for the shorter series, and the differences between the choices only become visible for the longest series. A similar pattern is seen in all the other models. For model C the choices up to H = 5 are generally safe, while for the longest series values up to H = 20 also seem to be good enough. As an example, Figure 8 shows the results for model C. In model D the threshold value 0.5 seems to be generally too low for efficiently finding the right number of sources, while with 0.8 the expected results are obtained.

For TSIR any value of H seems to be safe for model A. For a short time series, H = 2 seems mostly the safest choice for TSAVE, while for TSIR both small values and H = 10 work.

To conclude this section: for TSIR, H = 10 is generally a reliable choice; however, with short time series a lower H might be beneficial, depending on the prediction method used. For TSAVE the number of observations per slice is more important and depends also on the data generating process. Based on our simulations we recommend having at least 100 observations per slice; for the time series lengths considered here our preferred choice is H = 5, but H = 2 also seems good. For a shorter time series H = 2 might be the only choice, but the longer the time series, the smaller the differences between the values of H, and then also a larger H would be reliable.
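The rule of thumb of at least 100 observations per slice can be packaged as a small helper; the candidate set follows the values of H discussed above, but the function itself is only our illustration:

```python
def largest_feasible_H(n, min_per_slice=100, candidates=(2, 5, 10, 20)):
    """Largest candidate number of slices H that still leaves at least
    min_per_slice observations per slice; falls back to the smallest
    candidate when even that is infeasible."""
    feasible = [H for H in candidates if n // H >= min_per_slice]
    return max(feasible) if feasible else min(candidates)
```

For example, a series of 500 observations allows at most H = 5, in line with the preference for fewer slices in short series.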

Furthermore, the simulations suggest that the threshold value has a big influence on the number of selected components, with 0.5 being more restrictive than 0.8, which is however also very intuitive. The biggest values strategy for choosing the number of sources and the lags corresponding to them is also recommended.

### 4.4 On the TSSH method and the choice of coefficient a

In model A, the relationship between the response and the predictors is linear. Already [7] showed that TSIR works efficiently in such models. On the other hand, models B–E have symmetric parts, and as seen in Section 4.1, TSIR is unable to find the relationship in such cases (see Figure 2). This is also seen later in the simulation results of Section 4.5. From Figure 2 it can also be seen that TSAVE still works with the linear relationship, but not as efficiently as TSIR.

This preference of the different methods for different structures was the main motivation for introducing the hybrid method TSSH. Now we consider the optimal value of the weight a for model E. This model is similar to model 4 in [11], for which the hybrid of iid SIR and SAVE shows clearly better performance than SIR or SAVE separately. Therefore we can expect here that the TSIR part uncovers the asymmetric part of the dependence efficiently and TSAVE the symmetric part, and hopefully both together work even better. The question is how much weight should be given to which method, i.e. what the proper value of a is. For the iid hybrid method, [11] recommend a = 0.5 as a general rule of thumb.
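Once the two supervised matrices are available, the hybrid is simply their convex combination. The sketch below fixes one convention for the weight (a on the TSAVE part); the paper's own definition determines which part actually carries a:

```python
import numpy as np

def tssh_matrix(M_tsir, M_tsave, a=0.5):
    """Convex combination of the TSIR and TSAVE supervised matrices.
    Under this convention a = 0 gives pure TSIR and a = 1 pure TSAVE."""
    return (1.0 - a) * np.asarray(M_tsir) + a * np.asarray(M_tsave)
```

The directions are then estimated from the eigenvectors of the combined matrix, exactly as for the individual methods.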

Following our general guidelines from the previous section, the results presented below are based on using H = 10 slices for the TSIR part and H = 5 slices for the TSAVE part. To select the components, the threshold is set to 0.8 and cubic B-splines are used for the prediction. The RMSE values are then computed over a grid of values of a between 0 and 1, where the two extreme values correspond to the pure TSIR and pure TSAVE methods. We show the RMSE for a shorter time series and for length 3000 in Figures 9 and 10. It seems that the differences are not that big as long as not all, or almost all, of the weight is given to one of the two methods. The central values of a seem to be a little bit better.

To investigate this a bit further, we also compare the choices of a and of the threshold using the matrix employed for selecting the sources and lags. Figure 11 then gives the percentages of simulation runs in which the right number of sources and the appropriate lags are chosen, based on the biggest values strategy.

From Figure 11 we can conclude that values of a around 0.5 and 0.6 are reasonable choices. Considering also results from the other selection strategies (not shown here), we can in general recommend the value a = 0.5, which coincides with the recommendation for the regular SIR and SAVE hybrid.

The main feature we observed for the hybrid is that with values of a away from the extremes, the choice of the threshold is less crucial and one often comes to the same conclusion. This is quite different from what we have seen when using only TSIR or only TSAVE.

### 4.5 Comparison to other approaches

To compare the different methods, we simulate time series of fixed length using H = 5 for TSAVE and H = 10 for TSIR and the biggest values strategy, as recommended in Section 4.3. For models A–D the relative RMSE values are computed with respect to an Oracle estimator, for which the functional form of the relationship between the response and the predictors is known but the coefficients are estimated.

Figures 12–15 show the relative RMSE values of the different methods compared to the Oracle estimator. The methods here are TSAVE and TSIR as well as the original SAVE and SIR (and the Becker & Fried SIR [4]) applied with the lagged values of the predictor series as additional predictors, using the same set of lags as before.
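For the iid competitors, the predictor matrix is built by stacking lagged copies of the series; a minimal sketch, where the lag set shown is only an example:

```python
import numpy as np

def lagged_predictors(x, lags=(1, 2, 3)):
    """Stack lagged copies of an (n, p) series column-wise: row t of the
    result holds x[t - l] for each lag l, giving an
    (n - max(lags), p * len(lags)) predictor matrix."""
    x = np.asarray(x, dtype=float)
    n, m = x.shape[0], max(lags)
    return np.hstack([x[m - l : n - l] for l in lags])
```

The corresponding response vector is the series truncated to its last n - max(lags) values, so that each row of predictors is aligned with the value it is meant to explain.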

For the original SIR we have used the number of slices suggested in [8], and for the original SAVE a larger value, as the SIR choice may be too low for SAVE (see for example [19]). To choose the number of sources, we use the ordered empirical eigenvalues of the supervised matrices in the original SIR and