# Fitting ARMA Time Series Models without Identification: A Proximal Approach

Fitting autoregressive moving average (ARMA) time series models requires model identification before parameter estimation. Model identification involves determining the order for the autoregressive and moving average components which is generally performed by visual inspection of the autocorrelation and partial autocorrelation functions, or by other offline methods. In many of today's big data regime applications of time series models, however, there is a need to model one or multiple streams of data in an iterative fashion. Hence, the offline model identification step is significantly prohibitive. In this work, we regularize the objective of the optimization behind the ARMA parameter estimation problem with a nonsmooth hierarchical sparsity inducing penalty based on two path graphs that allows incorporating the identification into the estimation step. A proximal block coordinate descent algorithm is then proposed to solve the underlying optimization problem. The resulting model satisfies the required stationarity and invertibility conditions for ARMA models. Numerical results supporting the proposed method are presented.

## Authors

• 4 publications
• 4 publications
03/13/2021

### Statistical inference for ARTFIMA time series with stable innovations

Autoregressive tempered fractionally integrated moving average with stab...
04/12/2018

### Model identification for ARMA time series through convolutional neural networks

In this paper, we use convolutional neural networks to address the probl...
08/13/2020

### Sensitivity Analysis of Error-Contaminated Time Series Data under Autoregressive Models with Application of COVID-19 Data

Autoregressive (AR) models are useful tools in time series analysis. Inf...
06/20/2018

### Beta seasonal autoregressive moving average models

In this paper we introduce the class of beta seasonal autoregressive mov...
10/25/2017

### Free deterministic equivalent Z-scores of compound Wishart models: A goodness of fit test of 2DARMA models

We introduce a new method to qualify the goodness of fit parameter estim...
05/03/2016

### Temporal Clustering of Time Series via Threshold Autoregressive Models: Application to Commodity Prices

This study aimed to find temporal clusters for several commodity prices ...
03/04/2020

### Adaptive exponential power distribution with moving estimator for nonstationary time series

While standard estimation assumes that all datapoints are from probabili...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

ARIMA time series models have a multitude of applications, e.g., in epidemiological surveillance [35], water resource management [32], transportation systems [8], drought forecasting [21], stock price forecasting [1], business planning [13], and power systems [15]

, to name a few. Even emergence of deep neural networks and their customized architectures for time series modeling, e.g., Recurrent Neural Nets (RNN) and Long Short-Term Memory (LSTM) has not decreased the popularity of ARIMA models

[25].

Fitting ARMA(p,q) time series models requires a two-step process: 1. Model identification, 2. Parameter estimation. The model identification step determines the order of the autoregressive (AR) component (p) and moving average (MA) component (q). Next, given the underlying ARMA model, the parameters are estimated by solving an optimization problem for the maximum likelihood or least square estimates [10, 17]. We should note that ARMA models are to model stationary processes; however, there exists a more general class of ARIMA models for homogenous nonstationary processes (which are stationary in the mean). Such processes become stationary after times differencing; hence, the corresponding ARIMA(p,d,q) model includes differencing of order . The results of this paper are mainly for stationary processes with potential extension to the homogenous nonstatioanry processes.

Model identification is based primarily on visual inspection of the sample autocorrelation function (ACF) and partial autocorrelation (PACF) plots. For the AR(p) process, the sample ACF follows an exponential decay and the sample PACF cuts off after lag , while for the MA(q) process, the sample ACF cuts off after lag and the sample PACF decays exponentially [17]

. When the process involves both AR and MA components, it is more difficult to identify the correct orders. Next, model parameters are estimated by minimizing a loss function (e.g., negative loglikelihood or least square).

Box et al.  [10] stepped even further and recommended an iterative

approach between model identification and parameter estimation which involves inspection of the the residuals from the fitted model to make sure that they are indeed white noise.

In many of today’s applications, ARMA models should be fitted to many times series of data where

is very large. For instance, the data could be the demand for more than thousands of product over time which are not necessarily correlated; hence, fitting Vector ARMA models are unnecessary, and separate modeling is more parsimonious. In such scenarios, model identifications become a significant bottleneck in the fitting process. This work is about a novel approach for fitting ARMA time series models that allows automating the fitting procedure by eliminating an explicit identification step. Indeed, with the aid of a single tuning parameter, the proposed algorithms allows data to identify the appropriate model.

### 1.1 Contributions

The contributions of this work are as follows:

• We develop a novel approach to fit ARMA time series models that identifies the model only by tuning a single continuous parameter (). The main idea behind this approach is to merge the model identification into the parameter estimation step by introducing a hierarchical sparsity inducing penalty into the optimization problem. The sparsity inducing penalty preserves the hierarchical structure of the nonzero parameter, e.g., it will not set the first AR parameter to zero while the second AR parameter is nonzero.

• We propose an efficient proximal block coordinate descent (BCD) algorithm to solve the underlying nonsmooth and nonconvex optimization problem to a stationary point – see Algorithm 2. The proximal map of the nonsmooth hierarchical sparsity inducing penalty is shown to be separable on the AR and MA components.

• The proposed approach will automate the ARMA time series modeling without a need for offline inspection for model identification, and allows fitting ARMA time series models to large number of time series data.

### 1.2 Notations

Lowercase boldface letters denote vectors, and uppercase greek letters denote sets, except for which denotes the back-shift operator. The set of all real and complex numbers are denoted by and , respectively. Given a set , denotes its cardinality and denotes its complement. Given and , is a vector with its elements selected from over the index set .

## 2 Problem definition

We consider a stationary ARMA(p,q) time series process with a zero mean as

 yt=ϕ1yt−1+ϕ2yt−2+⋯+ϕpyt−p−θ1ϵt−1−θ2ϵt−2−⋯−θqϵt−q+ϵt, (1)

where are the parameters of the AR component, and are the parameters of the MA component, and

is a white noise with zero mean and variance

. The process (1) can also be written as

 Ppϕ(B)yt=Pqθ(B)ϵt, (2)

where is the back-shift operator, i.e., , and

 Pdα(z)≜1−α1z−α2z2−⋯−αdzd, (3)

is a polynomial of degree with the parameter . The process (2) is stationary if the AR component is stationary which is the case if all roots of the polynomial are outside the unit circle; furthermore, the process is invertible if the the MA component is invertible which is the case if all roots of the polynomial are outside the unit circle [17]. Requiring the two polynomials to have roots outside of the unit circle in the space translates to some constraints on and , i.e. and , where is defined as

 (4)

We should note that there is also another (maybe more common) representation for based on the monic polynomial

 ¯Pdα(z)≜zd+α1zd−1+⋯+αd−1z+αd, (5)

of degree , where it can be shown that

 Xdα={α∈Rd: ∀z∈C, ¯Pdα(z)=0⇒|z|<1}. (6)

Note that the new representation requires roots of the polynomial to be inside the unit circle. For an arbitrary , geometrical complexity of makes projection onto this set very difficult [16]. Indeed, [16] discussed that is open, bounded, and not necessarily convex – see also [28, 9]. To deal with the openness of , it is common to approximate it with a closed set from inside – see (12). However, projection onto this set or its approximation may not be unique due to their potential nonconvexities. A method for projection onto the set was developed in [28]. While their scheme is easy to implement, the convergence of this iterative method is slower than steepest descent – see also [16]. To conclude, imposing stationarity and invertibility of the model is performed by projecting and onto (inner approximate of) and , respectively, which my not be unique.

The above discussion is for an ARMA model that is already identifies, i.e., and are known. For a model that is not identified, we also need

 if ϕℓ=0 then ϕℓ′=0, ∀ℓ<ℓ′,⟺if ϕℓ′≠0 then ϕℓ≠0, ∀ℓ<ℓ′,if θℓ=0 then θℓ′=0, ∀ℓ<ℓ′,⟺if θℓ′≠0 % then θℓ≠0, ∀ℓ<ℓ′, (7)

i.e., the sparsity of and follow hierarchical structures.

Before discussing how these sparsity structures are enforced, we will briefly discuss the loss function for fitting ARMA models. Provided an identified model, i.e., and are known, fitting ARMA models are generally performed by finding the conditional maximum likelihood or conditional least-square estimates, which are close to each other assuming that in (1

) follow a Normal distribution and the size of the time series

is reasonably large. The conditional least-square estimate (for an identified model) requires solving

 (8)

where is the model prediction for using the data , and is called conditional since it depends on the initial values for and initial values for . Note that in the absence of MA terms (i.e., ), the objective function of (8) is convex in the parameters of the AR model . However, if then the objective function of (8) is nonconvex, and optimization routines are not guaranteed to converge to the global optimum [20, 10, 4, 18]. To sum up, in its most general case, problem (8) involves nonconvex minimization over a nonconvex set and, hence, it is difficult to solve.

This paper is concerned with an optimization problem solution of which preserves the hierarchical sparsity structure discussed above. In the next section, we propose a method that allows learning and within the parameter estimation step.

## 3 Proposed method

Before discussing the proposed method, we should briefly discuss the notion of hierarchical sparsity. Let be a Directed Acyclic Graph (DAG) where is the set of graph nodes and

be the set of ordered pair of nodes denoting edges where each pair denotes an edge from the node in the first element to the node in the second element. Each

is an index set of the parameters of the model such that and where is the number of parameters. DAG shows the sparsity structures of interest in the parent/child relationship. Assuming one variable per node, the variable in a child node can only be nonzero if the variable in the parent node is nonzero. For instance, given a parameter , the left plot in Figure 1 requires if and if .

For a DAG that contains more than one variable per node (e.g. the right plot in Figure 1), two different hierarchies can be considered: 1. Strong hierarchy: the parameters in the child node can only be nonzero if all of the parameters in its parent node(s) are nonzero. 2. Weak hierarchy: the parameters in the child node can be nonzero if at least one of the parameters in its parent node(s) is nonzero [7]. For more information about hierarchical sparsity structures refer to [36, 23, 24, 2, 33].

### 3.1 Hierarchical sparsity for ARMA models

In this work, we want to include the model identification of ARMA models into the parameter estimation step. We assume the knowledge about some upper bounds on the true and , i.e., and , respectively. Considering , the estimated parameters should satisfy the condition (7). To do so, we define two path graphs as shown in Figure 2. Since this DAG consists of two path graphs and there is only one variable in each node, weak and strong hierarchies are equivalent.

Enforcing the sparsity structure shown in Figure 2 exactly

requires introducing binary variables into the optimization problem (

8) and solving a Mixed Integer Program (MIP). For instance, to model the parent/child hierarchy between and , one need to introduce a binary variable and two constraints as and for some reasonably small and large parameters and , respectively. Provided that the the underlying optimization problem is already very difficult to solve, introducing binary variable makes the problem even more challenging. Hence, despite the significant recent advances in MIP algorithms (see e.g., [26, 6, 27, 5]), we use a nonsmooth but convex regularizer that induces hierarchical sparsity structures of interest.

### 3.2 Latent Overlapping Group (LOG) Lasso

The hierarchical sparsity structure shown in Figures 2 is induced by regularizing the objective function in (8) by the LOG penalty – see [22]. Let denote all of the parameters of the ARMA model. The LOG penalty function is defined as

 ΩLOG(β)=infν(g), g∈G⎧⎨⎩∑g∈Gwg∥ν(g)∥2  s.t. ∑g∈Gν(g)=β, ν(g)gc=0⎫⎬⎭ (9)

where is the set of groups (of the nodes of the DAG which is discussed next), is itself a set, is a latent vector indexed by , and is the weight for group . The groups should follow an ascending structure, i.e., for each node there is a group that contains that node and all of its ascendants. For the DAG in Figure 2, the groups are

 G={{1},{1,2},⋯,{1,⋯,¯p},{¯p+1},{¯p+1,¯p+2},⋯,{¯p+1,⋯,¯p+¯d}},

where given the definition of , we assumed the indices of the AR parameter first, and those of the MA parameter next. These groups are shown with the red dotted rectangles in Figure 2. Given that -norm induces block sparsity, the LOG penalty tries to find block sparse combinations of the latent variables the sum of which is equal to  [22, 33]. For instance, for an ARMA model with and , , the objective of the infimum is (where for simplicity ) and the constraints are

 ⎡⎢ ⎢ ⎢ ⎢⎣ν{1}1000⎤⎥ ⎥ ⎥ ⎥⎦+⎡⎢ ⎢ ⎢ ⎢ ⎢⎣ν{1,2}1ν{1,2}200⎤⎥ ⎥ ⎥ ⎥ ⎥⎦+⎡⎢ ⎢ ⎢ ⎢⎣00ν{3}30⎤⎥ ⎥ ⎥ ⎥⎦+⎡⎢ ⎢ ⎢ ⎢ ⎢⎣00ν{3,4}3ν{3,4}4⎤⎥ ⎥ ⎥ ⎥ ⎥⎦=⎡⎢ ⎢ ⎢ ⎢⎣ϕ1ϕ2θ1θ2⎤⎥ ⎥ ⎥ ⎥⎦.

### 3.3 The proposed Hierarchically Sparse learning problem

The proposed Hierarchically Sparse (HS) learning problem is

 minϕ,θL(ϕ,θ)+λΩLOG(ϕ,θ)s.t.ϕ∈Xpϕ,θ∈Xqθ, (HS-ARMA)

where is a tuning parameter, and are defined based on (4), and is defined in (9). controls the tradeoff between the loss and penalty functions and, hence, allows model identification and parameter estimation simultaneously. Obviously increasing result in sparser models where these nested models satisfy the hierarchical sparsity structure shown in Figure 2. As discussed in Section 3.1, and are some upper bounds on the true and known a priori.

Given the convex nonsmooth function , we propose to solve (HS-ARMA) using a proximal method [29, 3, 30]. Similar to gradient methods which requires iterative evaluation of the gradient, proximal methods require iterative evaluation of the proximal operator. Proximal operator of the at is defined as

 proxλΩLOG(b)≜argminβ∈R(¯p+¯q){λΩLOG(β)+12∥β−b∥22}. (10)

In [34], authors developed a two-block alternating direction method of multiplier (ADMM) with a sharing scheme [11] to solve (10) – see Algorithm 1. The proposed algorithm can be parallelized over all groups in in the update of the first-block; furthermore, it converges linearly – see [34] for more details.

Let and be the LOG penalties for the pure AR, i.e, ARMA, and pure MA, i.e, ARMA, models, respectively. In Lemma 3.1 below, we show that the proximal operator of is separable over and .

###### Lemma 3.1

The proximal operator of the LOG penalty defined over the ARMA DAG is separable, i.e., .

###### Proof 3.1

With a slight abuse of notation, let be the set of groups for such that (the top path graph in Figure 2). Similarly, let be the set of groups for such that . Given that the objective of the infimum in the definition of for the ARMA DAG is separable in and , we have . Hence, the result follows from the separable sum property of the proximal operator.

Indeed, in Algorithm 2, the proximal operator of LOG is not evaluated in one step while the algorithm evaluates and sequentially in a Gauss-Seidel manner.

The algorithm to solve problem (HS-ARMA) is a two-block proximal block coordinate descent (BCD) with projection, shown in Algorithm 2.

From (1), since where and , we have

 ∇ϕL(ϕ,θ) =−T∑t=max{¯p,¯q}(yt−ϕ⊤yt−1t−p−θ⊤ϵt−1t−q)yt−1t−p, (11a) ∇θL(ϕ,θ) =−T∑t=max{¯p,¯q}(yt−ϕ⊤yt−1t−p−θ⊤ϵt−1t−q)ϵt−1t−q. (11b)

The gradient updates are passed to proximal operators as arguments which is indeed proximal gradient steps [3, 30]. Note that the solution of the proximal operators are sparse vectors that conform to the hierarchical sparsity of Figure 2.

The solutions of the proximal gradient steps for the AR and MA components, i.e., and are not necessarily stationary or invertible, respectively. The stationarity and invertibility of AR and MA are regained by projection on and where and are the order of AR and MA components from the proximal steps. For projection, we use the second definition of in (6). Since is an open set, following [16], we find its approximation with a closed set from inside as

 ~Xdα(δ)≜{α∈Rd: ∀z∈C, ¯Pdα(z)=0⇒−1+δ≤z≤1−δ}, (12)

where determines the approximation gap. Euclidean projection on and sets guarantee stationarity and invertibility of and , respectively. Note that these projections do not change sparsity of the parameters.

Finally, since in the objective of (HS-ARMA) is calculated based on ARMA while the iterates and are feasible with respect to and respectively, we need to show . This is established in Lemma 3.2 below.

###### Lemma 3.2

For any , we have .

###### Proof 3.2

Proof follows from the definition of in (4), and that if then .

Therefore, is a sequence of nested sets as . However, the reverse is not true, i.e., is not sufficient for , which can be shown by counter examples.

### 3.4 A note on the optimization problem (Hs-Arma)

Problem (HS-ARMA) requires nonconvex and nonsmooth optimization over a nonconvex set. To be specific, if the loss function is convex in ; otherwise, is nonconvex in both and . Indeed, when the objective function is a polynomial function of degree . The penalty is a nonsmooth but jointly convex function in its arguments. Finally, and are open nonconvex sets and their approximations and (defined in (12)) are closed but still nonconvex.

To Deal with nonconvexities of and , one may try to approximate them with some inscribed convex sets which requires generalizations of the potato peeling problem [19] and the algorithm in [14] to non-polygon geometries – see also [12]. Note that optimization over the convex hulls of these sets may result in nonstationary or noninvertible solutions.

Under some convex approximations of the sets and , the problem under investigation is a nonconvex nonsmooth optimization over a convex set. For such a setting, algorithms are settled with finding solutions that satisfy some necessary optimality conditions, e.g., stationary solutions which are those that lack a feasible descent direction. To the best of our knowledge, the only study that provides a method that converges to stationary points in this setting is [31], which involves iterative minimization of a consistent majorizer of the objective function over the feasible set.

## 4 Numerical Studies

### 4.1 Data generation process

To generate a statioanry and an invertible ARMA() model, we first generate numbers uniformly at random on for all parameters. The samples are then rejected if the stationary and invertibility conditions, based on (6), are not satisfied. Given an accepted sampled parameter , a realization of the time series with length is simulated with a zero mean and variance equal to one.

### 4.2 Model identification and parameter estimation accuracy

To evaluate the estimation error of the proposed method, we simulate realizations of ARMA models with orders such that and following our discussion in Section 4.1. The tuning parameter of the penalty is set as with and in its definition is set to . The estimation error is calculated as , where are the true and are the estimated parameters based on Algorithm 2. Table 1

reports the mean and standard deviation of the estimation errors for different

values.

To provide a better understanding of the quality of parameter estimates and how they conform to the induced sparsity structure in Figure 2, we conduct another study. First, we sampled one realization from 10 different ARMA(3,2) models. Then, with and , the HS-ARMA parameter estimates are calculated using Algorithm 2 and reported along with the true parameters in Table 2, where is the simulation index. Simple tuning of allows the method to correctly identify the true orders and the estimated parameters conform to the underlying sparsity structure. Furthermore, the estimation errors are reasonably small. We also compared the estimation errors with pre-identified models where their parameters are estimated using a package – see Figure 3. The mean of the HS-ARMA estimation error lies between those of the correctly and incorrectly identified (by one order in the AR component) models. For some samples with around 2 or 3, the error of HS-ARMA is very close to the correctly identified ARMA model.

### 4.3 Prediction performance

We also compare the prediction performance of the HS-ARMA with those of correctly and incorrectly identified models using 10 realizations of one ARMA(3,2) model. For each realization, the estimated parameters with are used to forecast the process for the next 20 time points. Note that is omitted because the fitted parameters were too sparse. Figure 4 illustrates Root Mean Square Error (RMSE) for these methods.

For some , the RMSE of HS-ARMA is smaller than that of the correctly identified ARMA model. Furthermore, all HS-ARMA predictions for different values have significantly lower RMSE compared to the incorrectly identified model.

## 5 Concluding remarks

This work presents a new learning framework that allows model identification and parameter estimation for ARMA time series models simultaneously. To do so, we introduced a hierarchical sparsity inducing penalty, namely the Latent Overlapping Group (LOG) lasso, in the objective of the parameter estimation problem. While the addition of a nonsmooth (but convex) function to the objective of an already difficult nonconvex optimization seems restrictive, we propose a proximal block coordinate descent algorithm that can solve the problem to a potential stationary point efficiently. Numerical simulation studies confirm capabilties of the proposed learning framework to identify the true model and estimate its parameters with reasonably high accuracy.

We believe that this study sheds some light on the hard optimization problem behind the parameter estimation of ARMA time series models (see our brief discussion in Section 3.4). Furthermore, we hope it motivates future studies to look into convergence analysis of the proposed proximal BCD or other algorithms for such problem structures. Finally, the proposed framework can be extended to fit Vector ARMA (VARMA) models where the underlying path graphs would contain multiple variables per nodes (see e.g. the right plot in Figure 1), which we also leave for future studies.

## References

• Adebiyi et al.  [2014] Adebiyi, Ayodele Ariyo, Adewumi, Aderemi Oluyinka, & Ayo, Charles Korede. 2014. Comparison of ARIMA and artificial neural networks models for stock price prediction. Journal of Applied Mathematics, 2014.
• Bach et al.  [2012] Bach, Francis, Jenatton, Rodolphe, Mairal, Julien, Obozinski, Guillaume, et al. . 2012. Structured sparsity through convex optimization. Statistical Science, 27(4), 450–468.
• Beck & Teboulle [2009] Beck, Amir, & Teboulle, Marc. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1), 183–202.
• Benidir & Picinbono [1990] Benidir, Messaoud, & Picinbono, B. 1990. Nonconvexity of the stability domain of digital filters. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(8), 1459–1460.
• Bertsimas & Van Parys [2017] Bertsimas, Dimitris, & Van Parys, Bart. 2017. Sparse high-dimensional regression: Exact scalable algorithms and phase transitions. arXiv preprint arXiv:1709.10029.
• Bertsimas et al.  [2016] Bertsimas, Dimitris, King, Angela, Mazumder, Rahul, et al. . 2016. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2), 813–852.
• Bien et al.  [2013] Bien, Jacob, Taylor, Jonathan, & Tibshirani, Robert. 2013. A lasso for hierarchical interactions. Annals of statistics, 41(3), 1111.
• Billings & Yang [2006] Billings, Daniel, & Yang, Jiann-Shiou. 2006. Application of the ARIMA models to urban roadway travel time prediction-a case study. Pages 2529–2534 of: 2006 IEEE International Conference on Systems, Man and Cybernetics, vol. 3. IEEE.
• Blondel et al.  [2012] Blondel, Vincent D, Gurbuzbalaban, Mert, Megretski, Alexandre, & Overton, Michael L. 2012. Explicit solutions for root optimization of a polynomial family with one affine constraint. IEEE transactions on automatic control, 57(12), 3078–3089.
• Box et al.  [2015] Box, George EP, Jenkins, Gwilym M, Reinsel, Gregory C, & Ljung, Greta M. 2015. Time series analysis: forecasting and control. John Wiley & Sons.
• Boyd et al.  [2011] Boyd, Stephen, Parikh, Neal, Chu, Eric, Peleato, Borja, & Eckstein, Jonathan. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers.

Foundations and Trends® in Machine Learning

, 3(1), 1–122.
• Cabello et al.  [2017] Cabello, Sergio, Cibulka, Josef, Kyncl, Jan, Saumell, Maria, & Valtr, Pavel. 2017. Peeling potatoes near-optimally in near-linear time. SIAM Journal on Computing, 46(5), 1574–1602.
• Calheiros et al.  [2014] Calheiros, Rodrigo N, Masoumi, Enayat, Ranjan, Rajiv, & Buyya, Rajkumar. 2014. Workload prediction using ARIMA model and its impact on cloud applications? QoS. IEEE Transactions on Cloud Computing, 3(4), 449–458.
• Chang & Yap [1986] Chang, Jyun-Sheng, & Yap, Chee-Keng. 1986. A polynomial solution for the potato-peeling problem. Discrete & Computational Geometry, 1(2), 155–182.
• Chen et al.  [2009] Chen, Peiyuan, Pedersen, Troels, Bak-Jensen, Birgitte, & Chen, Zhe. 2009. ARIMA-based time series model of stochastic wind power generation. IEEE transactions on power systems, 25(2), 667–676.
• Combettes & Trussell [1992] Combettes, Patrick L, & Trussell, H Joel. 1992. Best stable and invertible approximations for ARMA systems. IEEE Transactions on signal processing, 40(12), 3066–3069.
• Del Castillo [2002] Del Castillo, Enrique. 2002. Statistical process adjustment for quality control. Vol. 369. Wiley-Interscience.
• Georgiou & Lindquist [2008] Georgiou, Tryphon T, & Lindquist, Anders. 2008. A convex optimization approach to ARMA modeling. IEEE transactions on automatic control, 53(5), 1108–1119.
• Goodman [1981] Goodman, Jacob E. 1981. On the largest convex polygon contained in a non-convex n-gon, or how to peel a potato. Geometriae Dedicata, 11(1), 99–106.
• Hamilton [1994] Hamilton, James D. 1994. Time series analysis. Vol. 2. Princeton New Jersey.
• Han et al.  [2010] Han, Ping, Wang, Peng Xin, Zhang, Shu Yu, et al. . 2010. Drought forecasting based on the remote sensing data using ARIMA models. Mathematical and computer modelling, 51(11-12), 1398–1403.
• Jacob et al.  [2009] Jacob, Laurent, Obozinski, Guillaume, & Vert, Jean-Philippe. 2009. Group lasso with overlap and graph lasso. Pages 433–440 of: Proceedings of the 26th annual international conference on machine learning. ACM.
• Jenatton et al.  [2011a] Jenatton, Rodolphe, Mairal, Julien, Obozinski, Guillaume, & Bach, Francis. 2011a. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12(Jul), 2297–2334.
• Jenatton et al.  [2011b] Jenatton, Rodolphe, Audibert, Jean-Yves, & Bach, Francis. 2011b. Structured variable selection with sparsity-inducing norms. Journal of Machine Learning Research, 12(Oct), 2777–2824.
• Makridakis et al.  [2018] Makridakis, Spyros, Spiliotis, Evangelos, & Assimakopoulos, Vassilios. 2018. Statistical and Machine Learning forecasting methods: Concerns and ways forward. PloS one, 13(3).
• Manzour et al.  [2019] Manzour, Hasan, Küçükyavuz, Simge, & Shojaie, Ali. 2019. Integer Programming for Learning Directed Acyclic Graphs from Continuous Data. arXiv preprint arXiv:1904.10574.
• Mazumder & Radchenko [2017] Mazumder, Rahul, & Radchenko, Peter. 2017. TheDiscrete Dantzig Selector: Estimating Sparse Linear Models via Mixed Integer Linear Optimization. IEEE Transactions on Information Theory, 63(5), 3053–3075.
• Moses & Liu [1991] Moses, Randolph L, & Liu, Duixian. 1991. Determining the closest stable polynomial to an unstable one. IEEE Transactions on signal processing, 39(4), 901–906.
• Nesterov [2013] Nesterov, Yu. 2013. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1), 125–161.
• Parikh et al.  [2014] Parikh, Neal, Boyd, Stephen, et al. . 2014. Proximal algorithms. Foundations and Trends® in Optimization, 1(3), 127–239.
• Razaviyayn et al.  [2013] Razaviyayn, Meisam, Hong, Mingyi, & Luo, Zhi-Quan. 2013. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2), 1126–1153.
• Wang et al.  [2015] Wang, Wen-chuan, Chau, Kwok-wing, Xu, Dong-mei, & Chen, Xiao-Yun. 2015. Improving forecasting accuracy of annual runoff time series using ARIMA based on EEMD decomposition. Water Resources Management, 29(8), 2655–2675.
• Yan et al.  [2017] Yan, Xiaohan, Bien, Jacob, et al. . 2017. Hierarchical Sparse Modeling: A Choice of Two Group Lasso Formulations. Statistical Science, 32(4), 531–560.
• Zhang et al.  [2020] Zhang, Dewei, Liu, Yin, & Tajbakhsh, Sam Davanloo. 2020. A first-order optimization algorithm for statistical learning with hierarchical sparsity structure.
• Zhang et al.  [2014] Zhang, Xingyu, Zhang, Tao, Young, Alistair A, & Li, Xiaosong. 2014. Applications and comparisons of four time series models in epidemiological surveillance data. PLoS One, 9(2).
• Zhao et al.  [2009] Zhao, Peng, Rocha, Guilherme, & Yu, Bin. 2009. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 3468–3497.