1 Introduction
In recent years, high-dimensional shrinkage and parameter selection techniques have become increasingly important in statistics. In many statistical areas, lasso (least absolute shrinkage and selection operator) estimation methods, as introduced by Tibshirani (1996), are very popular. In time series analysis the influence of lasso-type estimators is growing, especially as the asymptotic properties for stationary time series are usually very similar to those in the standard regression case, see e.g. Wang et al. (2007b), Nardi and Rinaldo (2011) and Yoon et al. (2013). Hence, given the lasso's shrinkage properties, it is attractive for subset selection in autoregressive models. In big data settings, it provides an efficient estimation technique, see
Hsu et al. (2008), Ren and Zhang (2010), and Ren et al. (2013) for more details. Unfortunately, almost the entire literature on penalised least squares estimation, like the lasso, deals with homoscedastic models. The case of heteroscedasticity and conditional heteroscedasticity has rarely been covered so far. Recently, Medeiros and Mendes (2012)
showed that the adaptive lasso estimator is consistent and asymptotically normal under very weak assumptions. They proved that consistency and asymptotic normality hold if the residuals are described by a weak white noise process. This includes the case of conditionally heteroscedastic ARCH- and GARCH-type residuals. Nevertheless, their classical lasso approach does not make use of the structure of the conditional heteroscedasticity within the residuals. Without going into detail, it is clear that the estimators might be improved if the structure of the conditional heteroscedasticity in the data is used. Furthermore,
Yoon et al. (2013) analysed the lasso estimator in an autoregressive regression model. Additionally, they formulated the lasso problem in a time series setting with ARCH errors. However, they did not provide a solution to the estimation problem and left this for future research. Recently, Wagener and Dette (2012) and Wagener and Dette (2013) analysed the properties of weighted lasso-type estimators in a classical heteroscedastic regression setting. They showed that their estimators are consistent and asymptotically normal. In addition, their estimators perform significantly better than their homoscedastic counterparts. Their results, conditioned on the covariates, can be used to construct a reweighted estimator that also works in time series settings.
We derive an iteratively reweighted adaptive lasso algorithm that addresses the above-mentioned problems. It enables the estimation of high-dimensional sparse time series models under conditional heteroscedasticity. We assume a regression structure which is satisfied by the majority of the important time series processes and which admits fast estimation methods. The computational complexity of the algorithm is essentially the same as that of the coordinate descent algorithm of Friedman et al. (2007). This very fast estimation method for convex penalised models can therefore be applied within the iteratively reweighted adaptive lasso algorithm.
The algorithm is based on the results of Wagener and Dette (2013), as their results can be generalised to models with conditional heteroscedasticity. Sign consistency and asymptotic normality of the proposed estimator are established. Furthermore, a general high-dimensional setting, in which the underlying process might have an infinite number of parameters, is considered. Note that all the time series results hold in a classical regression setting as well.
However, we restrict ourselves to penalised regressions as they are popular in time series settings (see e.g. Wang et al. (2007b), Nardi and Rinaldo (2011) and Yoon et al. (2013)). In general, other penalties could also be considered, e.g. the penalty. The
penalty, which yields ridge regression, is suitable for shrinkage as well, but does not allow for sparsity. However, in
penalised regression, the case is the one of greatest practical interest that still allows for sparsity. This sparsity property can be used in applications to select the required tuning parameter based on information criteria that are popular in time series analysis. The general problem is stated in Section 2. In Section 3, we motivate and provide the estimation algorithm. Subsequently, the asymptotics are discussed in Section 4. In Section 5, an application to multivariate AR-ARCH type processes is considered. This includes several extensions such as periodic AR-ARCH, AR-ARCH with structural breaks, threshold AR-ARCH and ARMA-GARCH models. Section 6 presents simulations which underline the results given above. It provides evidence that incorporating the heteroscedasticity in a high-dimensional setting is more important than in low-dimensional problems. Finally, we consider the proposed algorithm as a model for electricity market and metal price returns data. A two-dimensional AR-ARCH type model is used in both applications to hourly data, in the first one to electricity price and load data and in the second one to gold and silver price returns.
2 The considered time series model
The considered model is basically similar to the one used by Yoon et al. (2013) or Medeiros and Mendes (2012). Let be the considered causal univariate time series. We assume that it follows the linear equation
(1) 
where
is a possibly infinite vector of covariates of weakly stationary processes
, is an error process, and the parameter vector is with . The covariates can also contain lagged versions of , which allows flexible modelling of autoregressive processes. A simple example of a process that helps in understanding this paper is an invertible seasonal MA(1) process. In particular, the AR() representation of a seasonal MA(1) with seasonality is useful. It is given by , choosing with . The error process is assumed to follow a zero mean process with being uncorrelated with the covariates . Hence we require and for all . Moreover, we assume that is a weak white noise process, such that
(2) 
Here, is a positive function, is a possibly infinite vector of covariates of weakly stationary processes , and is a parameter vector. Similarly to the covariates in (1), can also include lags of or
. This allows for a large class of popular conditional variance models, like ARCH- or GARCH-type models. Choosing
leads to the very popular GARCH(1,1) process. Note that the introduced setting is more general than the conditional heteroscedastic problem stated by Yoon et al. (2013), who mentioned only ARCH errors.
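To make the seasonal MA(1) example above concrete, the following sketch simulates such a process and recovers its innovations from a truncated AR representation; the coefficient theta = 0.5 and seasonality s = 12 are hypothetical choices of ours, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, s, n = 0.5, 12, 5000      # hypothetical MA coefficient and seasonality
e = rng.standard_normal(n)       # white noise innovations
x = e.copy()
x[s:] += theta * e[:-s]          # seasonal MA(1): x_t = e_t + theta * e_{t-s}

# truncated AR(infinity) inversion: e_t = sum_{k>=0} (-theta)^k x_{t-ks}
K = 30
e_hat = x.copy()
for k in range(1, K + 1):
    e_hat[k * s:] += (-theta) ** k * x[:-k * s]

# after a burn-in, the truncated inversion recovers the innovations
assert np.max(np.abs(e_hat[K * s:] - e[K * s:])) < 1e-6
```

The geometric decay of the AR coefficients is what makes a finite truncation accurate, which is exactly the approximation property used for the infinite parameter vector later in the paper.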
For the following we assume that the time points to are observable for . Thus, we denote by
the response vector , the matrix of the covariates , the parameter vector and the corresponding errors . Furthermore let be the rows of .
Since we deal with a high-dimensional setting, we are interested in situations where the number of possible parameters increases with the sample size . Therefore, denote the restriction of to its first coordinates. Due to , it follows for that there is a positive decreasing sequence with such that holds. Thus, for sufficiently large we can approximate by arbitrarily well.
However, under the assumption of sparsity, meaning that only some of the regressors contribute significantly to the model, we can conclude that only of the parameters are nonzero. Hence, there are parameters that are exactly zero. Without loss of generality we assume that and are arranged so that the first components of are nonzero, whereas the following are zero. Obviously we have . This arrangement of the nonzero components is only used to simplify the notation; in particular, it is not required by the estimation procedure. Additionally, we introduce the naive partitioning of and , in such a manner that , and hold.
Subsequently, we focus on the estimation of , for which we will utilise a lasso-based approach for . Hence, we never obtain a direct estimate of , but we can approximate it by .
3 Estimation algorithm
The proposed algorithm is based on the classical iteratively reweighted least squares procedure. An example of its application to time series analysis can be found in Mak et al. (1997). However, similar approaches are not popular in time series modelling, as there are usually better alternatives if the number of parameters is small. In that case, we can simply perform an estimation of the joint likelihood function of (1), see e.g. Bardet et al. (2009)
. However, when facing a high-dimensional problem, it is almost impossible to maximise the nonlinear loss function with many parameters. In contrast, our algorithm can be based on the coordinate descent lasso estimation technique as suggested by
Friedman et al. (2007), which provides a feasible and fast estimation technique. Other techniques, like the LARS algorithm introduced by Efron et al. (2004), which provides the full lasso solution path, can be used as well. To motivate the proposed algorithm, we divide equation (1
) by its volatility, resp. conditional standard deviation
. Thus, we obtain
(3)
where and . Here, the noise is homoscedastic with variance . Hence, if the volatility of the process is known, we can simply apply common lasso time series techniques under homoscedasticity. Unfortunately, this is never the case in practice. The basic idea is now to replace by a suitable estimator , which allows us to perform a lasso estimate on a homoscedastic time series as in equation (3).
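The effect of this rescaling can be illustrated numerically. In the following sketch the volatility process is a hypothetical choice of ours purely for demonstration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
sigma = np.exp(rng.standard_normal(n))   # hypothetical volatility process
eps = sigma * rng.standard_normal(n)     # conditionally heteroscedastic noise
eta = eps / sigma                        # rescaling as in equation (3)

# the rescaled noise is homoscedastic with unit variance
assert abs(eta.var() - 1.0) < 0.05
```

In practice the true sigma is unknown, which is exactly why it is replaced by an estimate in the algorithm below.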
For estimating ARMA-GARCH processes, practitioners sometimes use a multi-step estimator. This estimation technique involves computing the ARMA parameters in a homoscedastic setting first and then using the resulting estimated residuals to estimate the GARCH part in a second step, see e.g. Mak et al. (1997) or Ling (2007). We will apply a similar stepwise estimation technique here.
In general, we have no a priori information about , hence we assume homoscedasticity in a first estimation step. We start with the estimation of the regression parameters , resp. , and obtain the residuals . We then use the residuals to estimate the conditional variance parameters and thus . Afterwards, we reweight model (1) by to get a homoscedastic model version, which we utilise in order to re-estimate . We can use this new estimate of to repeat this procedure. Thus, we end up with an iterative algorithm that hopefully converges in some sense to , resp. , with increasing sample size .
We use an adaptive weighted lasso estimator to estimate within each iteration step. It is given by
or in vector notation
where , are the heteroscedasticity weights, are the penalty weights and is a penalty tuning parameter. As described above, in the iteratively reweighted adaptive lasso algorithm we have the special choice for the heteroscedasticity weights within each iteration step. We require for the homoscedastic initial step.
Like Zou (2006), we consider, for the tuning parameter , the choice for some and some initial parameter estimate . With we obtain , which is the usual lasso estimator. Obviously, no initial estimator is required in this case. However, we consider the case of and the adaptive lasso approach for our practical application, as they resulted in different performances.
The selection of the tuning parameters and , as well as the choice of the initial estimate , is crucial for the application and might demand some computational cost. We discuss this issue in more detail at the end of the next section.
Subsequently, we denote as a known plug-in estimator for , which is the projection of to its first coordinates. We denote as the restriction of that corresponds to . Thus, is defined such that is restricted to and is a restriction of to its first coordinates. Similarly, let be an estimator for .
For example, if follows a GARCH(1,1) process, we obtain for all , where with and with for all . This is similarly feasible for every variance model with a finite number of parameters. However, if follows an infinitely parameterised process, e.g. an ARCH() process, and should tend to infinity as .
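For the GARCH(1,1) case, the volatility plug-in can be computed with the standard conditional-variance recursion. The following is a minimal sketch under our own parameter naming, initialised at the unconditional variance:

```python
import numpy as np

def garch11_sigma2(res, omega, alpha, beta):
    """GARCH(1,1) conditional-variance recursion:
    sigma2_t = omega + alpha * res_{t-1}^2 + beta * sigma2_{t-1},
    started at the unconditional variance omega / (1 - alpha - beta)."""
    s2 = np.empty(len(res))
    s2[0] = omega / (1.0 - alpha - beta)
    for t in range(1, len(res)):
        s2[t] = omega + alpha * res[t - 1] ** 2 + beta * s2[t - 1]
    return s2
```

As a sanity check, if the squared residuals sit exactly at the unconditional variance, the recursion stays at that fixed point.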
The estimation scheme of the described iteratively reweighted adaptive lasso algorithm is given by:
1. initialise , with and ,
2. estimate by weighted lasso:
3. estimate the conditional variance model:
4. compute new weights with
5. if the stopping criterion is not met, go back to 2,
6. otherwise, return the estimate and the volatilities
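A compact sketch of this scheme in Python follows. All function and parameter names are ours; the variance step uses a simple least-squares ARCH(1) fit purely for illustration, and any conditional variance model of the form (2) could be plugged in instead:

```python
import numpy as np

def weighted_lasso_cd(X, y, w, lam, pen_w, n_sweeps=200):
    """Coordinate-descent weighted lasso: minimise
    sum_t w_t (y_t - x_t' b)^2 + lam * sum_j pen_w_j * |b_j|."""
    n, p = X.shape
    Xw = X * w[:, None]                  # rows scaled by heterosc. weights
    beta = np.zeros(p)
    r = y.copy()                         # residuals for beta = 0
    for _ in range(n_sweeps):
        for j in range(p):
            r += X[:, j] * beta[j]       # partial residual without x_j
            rho = Xw[:, j] @ r
            z = Xw[:, j] @ X[:, j]
            beta[j] = np.sign(rho) * max(abs(rho) - lam * pen_w[j] / 2, 0) / z
            r -= X[:, j] * beta[j]
    return beta

def arch1_var(res, floor=1e-6):
    """Illustrative ARCH(1) fit by least squares on squared residuals."""
    e2 = res ** 2
    A = np.column_stack([np.ones(len(e2) - 1), e2[:-1]])
    omega, alpha = np.linalg.lstsq(A, e2[1:], rcond=None)[0]
    s2 = omega + alpha * np.concatenate([[e2.mean()], e2[:-1]])
    return np.maximum(s2, floor)         # keep variances positive

def irw_adaptive_lasso(X, y, lam, fit_var, n_outer=3):
    """Iteratively reweighted adaptive lasso: start homoscedastic,
    then alternate the lasso step and the variance step."""
    n, p = X.shape
    pen_w = np.ones(p)                   # plain lasso in the initial step
    w = np.ones(n)                       # homoscedastic initial weights
    for _ in range(n_outer):
        beta = weighted_lasso_cd(X, y, w, lam, pen_w)
        s2 = fit_var(y - X @ beta)       # conditional variance estimate
        w = 1.0 / s2                     # heteroscedasticity weights
        pen_w = 1.0 / np.maximum(np.abs(beta), 1e-8)  # adaptive weights
    return beta, s2
```

In an application, the fixed number of outer iterations would be replaced by the stopping criterion discussed below, and the penalty level would be selected by an information criterion.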
We can summarise that we have to specify the tuning parameter , the initial estimator with an initial value of , and the initial heteroscedasticity weights . To reduce the computation time it can be convenient in practice to choose (lasso) or (almost non-negative garrote).
The stopping criterion in step 5 has to be chosen as well, such that the algorithm eventually stops. A plausible stopping criterion should measure the convergence of , resp. . We suggest stopping the algorithm if for a selected vector norm and some small . Nevertheless, in our simulation study, we observed that the differences in the later steps are marginal, so that stopping at or seems reasonable in practice. This is underlined by the asymptotics of the algorithm as analysed below; it can be shown that, under certain conditions, is sufficient to obtain an optimal estimator if is large.
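Such a stopping rule amounts to a one-line check on successive estimates; the norm choice and tolerance below are our own illustrative defaults:

```python
import numpy as np

def stop_iteration(beta_new, beta_old, eps=1e-6):
    """Stop once successive parameter estimates differ by less than eps
    in a selected vector norm (Euclidean norm used here)."""
    return np.linalg.norm(np.asarray(beta_new) - np.asarray(beta_old)) < eps
```
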
4 Asymptotics of the algorithm
For the general convergence analysis it is clear that the asymptotics of the estimator will strongly depend on the (conditional) heteroscedasticity model (2) (especially the formula for ) as well as on the linked estimators and . Despite that strong dependence, we are able to prove sign consistency, as introduced by Zhao and Yu (2006), and asymptotic normality of the non-vanishing components of in a time series framework.
If we assume that the number of parameters does not depend on the sample size , then we could make use of the results from Wagener and Dette (2012) to obtain asymptotic properties, as they prove sign consistency and asymptotic normality under some conditions for the weighted adaptive lasso estimator.
The case where the number of parameters increases with is analysed equivalently in a regression framework by Wagener and Dette (2013), but only for the adaptive lasso case with . They basically achieve the same asymptotic behaviour as in the fixed case, but it is clear that the conditions are more complicated compared to those of Wagener and Dette (2012).
In the following we introduce several assumptions, which allow us to generalise the results of Wagener and Dette (2013). One crucial point is the assumption that the process can be parameterised by infinitely many parameters, so that the error term , based on the restriction of the true parameter vector , is not identical to the true error restriction . In contrast to , the term is in general correlated. This has to be taken into account in the proof concerning the asymptotic behaviour.
For the asymptotic properties we introduce a few more notations. Let and , where . Let denote the true volatility matrix and its estimate in the th iteration. Additionally, we introduce as the scaled Gramian, where is the unscaled Gramian. Furthermore, let and denote the weight matrix and the Gramian that correspond to the true matrix . The submatrices to are denoted by , , and .
Similarly to Wagener and Dette (2013), we require the following additional assumptions, which we have extended to carry out our proof:

The process is weakly stationary with zero mean for all .

The covariates are standardised so that for all and .

For the sequence of covariates of a fixed there is a positive sequence such that

For the minimum of the absolute nonzero parameters and the initial estimator there exists a constant so that

There exists a positive sequence with such that

There is a positive constant such that
for all with , and in an open neighbourhood of .

For all the estimator and are consistent for and , additionally
for some with as .

It holds for , , , , , , and that
as .


There are positive constants , and with such that

It holds for , , and that as .
Assumption 4 is standard in a time series setting. 4 is the scaling that is required in a lasso framework. 4 and 4 are usual assumptions in an adaptive lasso setting (see e.g. Zou (2006) or Huang et al. (2008)). 4 gives bounds for the weighted and unweighted Gramian. 4, 4 and 4 postulate properties required for the heteroscedasticity in the model. 4 states some convergence properties that restrict the growth behaviour within the model, especially the number of parameters and the number of relevant parameters . 4 makes a statement about the tails of the errors.
Using the assumptions above we can prove sign consistency and asymptotic normality.
Theorem 1.
The proof is given in the appendix. Note that the variance for is substantially smaller than . Hence the estimator has minimal asymptotic variance for all .
Due to the general formulation of the theorem, assumption 4 contains several assumptions on problem-characterising sequences. The convergence rate of the volatility model is relevant as well. If we have that (e.g. the variance model is asymptotically normal), then the three conditions involving are automatically satisfied by the other conditions. This reduces the relevant conditions in 4 considerably.
There is one condition in assumption 4 involving that is given through assumption 4. As it holds that , it characterises the structure of the regressors. Obviously, it holds that if contains only a finite number of nonzero parameters, so for some as . However, there are many other situations where holds. For example, if we have that is stationary. In the example above, where follows a seasonal moving average process, the process is stationary.
Furthermore, there is the option of 4 or 4 in the theorem. 4 restricts the residuals to have an exponential decay in the tail, like the normal or the Laplace distribution. However, this can be replaced by the stronger condition 4 in the theorem. In this situation, polynomially decaying tails in the residuals are possible. Here a specification of the constant in 4 is not required, as 4 directly implies 4, which means that 4 is not used in this case. More details are given in the proof.
As discussed in Wagener and Dette (2013), the assumption 4 or 4 has an impact on the maximal possible growth of the number of parameters in the estimation. There are situations where, under assumption 4, can grow at any polynomial order; even slow exponential growth is possible. In contrast, given assumption 4, this is impossible. Here Wagener and Dette (2013) argued that sign consistency is possible for rates that increase slightly faster than linearly, such as , but not for polynomial rates like for some . Wagener and Dette (2013) do not discuss this case for the asymptotic normality. In this situation, we can get an optimal rate of for the number of relevant parameters (having , , and ), when we have polynomial growth for .
The quite general formulation in 4 can be replaced by a more precise assumption when a variance model is specified. For example, if we have a finite-dimensional conditional variance model where is asymptotically normal, i.e. converges at a rate of , and is twice continuously differentiable with uniformly bounded derivatives, then 4 can be satisfied by under some regularity conditions on and or their estimated counterparts and . If, in contrast, is increasing, we will usually tend to get worse rates for .
In empirical applications, practitioners often just want to apply a lasso-type algorithm without caring much about the chosen size of and . They tend to include all available and in their model as long as it is computationally feasible. However, it is usually possible to validate the convergence assumptions in 4 at least partially. To do so, we have to estimate the model for several sample sizes and a specified growth rate for and . As we can observe the estimated values for of the model, we can get clear indications of the asymptotic convergence properties. This also helps to find the optimal tuning parameter . The tail assumption 4 can be checked using log-density plots and related tests. The moment restriction 4 on the volatilities can be validated using tail-index estimation techniques, like the Hill estimator.
Note that in the algorithm is assumed to be the same in every iteration. It is clear that if we have two different sequences and that satisfy the assumptions of the theorem, we can use them both in the algorithm. For example we can use for the first iteration and for the subsequent iterations. This might help in practice to achieve better finite sample results.
For finding the optimal tuning parameters, we suggest using common time series methods that are based on information criteria. Zou et al. (2007), Wang et al. (2007b), Zhang et al. (2010) and Nardi and Rinaldo (2011) analyse information criteria in the lasso and adaptive lasso time series framework. Possible options for these information criteria are the Akaike information criterion (AIC), the Bayesian information criterion (BIC) or a cross-validation based criterion. Here, it is worth mentioning that Kim et al. (2012) discuss the generalised information criterion (GIC) in a classical homoscedastic lasso framework where the number of parameters depends on . They establish that under some regularity conditions the GIC can be chosen so that consistent model selection is possible.
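As a sketch, a BIC-type selection over a grid of penalty values could look as follows. The exact criterion form, the `fit` interface and the degrees-of-freedom count (number of nonzero coefficients) are our own illustrative choices:

```python
import numpy as np

def bic_select(X, y, fit, lambdas):
    """Pick the penalty value minimising a BIC-type criterion
    BIC(lam) = n * log(RSS / n) + df * log(n),
    where df is the number of nonzero estimated coefficients."""
    n = len(y)
    best = (np.inf, None)
    for lam in lambdas:
        beta = fit(X, y, lam)                # any lasso-type fitting routine
        rss = np.sum((y - X @ beta) ** 2)
        df = np.count_nonzero(beta)
        bic = n * np.log(rss / n) + df * np.log(n)
        if bic < best[0]:
            best = (bic, lam)
    return best[1]
```

The same loop works with a cross-validation score or a GIC in place of the BIC expression.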
For the initial estimate that is required for the penalty weights, there are different options available. The simplest is the OLS estimator, which is available if . Other alternatives are the lasso (), elastic net or ridge regression estimators, see e.g. Zou and Hastie (2005). Remember that we require an initial estimate only in the adaptive lasso case with .
Note that Wagener and Dette (2013) described a setting with two initial estimators: one for the adaptive lasso weights, as we do, and another one for the weight matrix . The first estimator corresponds to our , whereas the second initial estimator is not required, as we can initialise the volatility weight matrix by the homoscedastic setting. A similar result was achieved by Wagener and Dette (2013), who showed that the homoscedastic estimator can be used as initial estimator in their setting.
5 Applications to ARARCH type models
In the introduction we mentioned that one of the largest fields of application might be the estimation of high-dimensional AR-ARCH type processes. Therefore, we discuss a standard multivariate AR-ARCH model in detail. Afterwards, we briefly deal with several extensions: the periodic AR-ARCH model, change point AR-ARCH models, threshold AR-ARCH models, interaction models and ARMA-GARCH models.
Let be a dimensional multivariate process and .
5.1 AR-ARCH model
The multivariate AR model is given by
(4) 
for , where are nonzero autoregressive coefficients, are the index sets of the corresponding relevant lags and is the error term. The error processes follow the same conditional variance structure as in (2), so where and is i.i.d. with and .
Now, we define the representation (4) that matches the general representation (1) by
for , where the parameter vector and the corresponding regressor matrix . Note that this definition of is only well defined if all for are finite; if one index set is infinite, we have to consider another enumeration, but everything holds in the same way.
Furthermore, we assume that follows an ARCH-type model. In detail, we consider a multivariate power-ARCH process, which slightly generalises the common multivariate ARCH process. Recently, Francq and Zakoïan (2013) discussed the estimation of such power-ARCH() processes and showed applications to finance. It is given by
(5) 
with as index set and as power of the corresponding . The parameters satisfy the positivity restriction, so and . Moreover we require that the ’s absolute moment exists. Obviously, we have
where and . As for , is only well defined if all for are finite; otherwise we have to consider another enumeration. The case leads to the well-known ARCH process, which turns into a multivariate ARCH() if .
For estimating the ARCH part parameters we will make use of a recursion that holds for the residuals. This is given by
(6) 
where , and with . Here, is a weak white noise process with . The fitted values of equation (6) are proportional to the up to the constant . As is the ’s absolute moment of , it holds that , if . If and
follows a normal distribution
it is . If exhibits e.g. a standardised t-distribution, we will observe larger first absolute moments . Clearly, the true index sets and are unknown in practice. Thus, we fix some index sets and for the estimation that can depend on the underlying sample size . If the true index sets and are finite, then the choices and are obvious. If and are infinite, and should be chosen so that they are monotonically increasing in the sense that and with and . The sizes of and are directly related to the sizes of the estimated parameters for and for . It holds that and