Time series data are important nowadays in economics, financial mathematics, weather forecasting, and signal processing. In finance, high-frequency stock market data are presented as time series; in weather forecasting, rainfall data are likewise presented as time series. Selecting a well-fitting time series model for given data is essential: it helps financial investors control risk and social scientists analyse population growth. A reliable method for selecting a well-fitting time series model is therefore a significant contribution to the time series community, which is why this topic is popular in academia and has been researched over the long term.
The Minimum Message Length estimator (MML87), introduced by Wallace and Freeman [wallace1987estimation], is the key inductive inference method in this paper; it belongs to the information-theoretic criteria for model selection. MML87 is an extended version of Minimum Message Length (MML), which was introduced by Wallace and Boulton in 1968 [wallace1968information]. MML87 is widely applicable in different areas and has been successful in solving many time series problems; the results of Fitzgibbon and of Sak show that MML87 outperforms alternatives in the Autoregressive (AR) and Moving-Average (MA) models [fitzgibbon2004minimum, sak2005minimum, zheng2021].
The purpose of this paper is to derive and investigate the use of the MML87 methodology for the Autoregressive Moving-Average (ARMA) model. The ARMA model, introduced by Box and Jenkins [box1976time], is one of the most popular time series models because it models both the lag (i.e., past) values of a time series and random factors such as strikes or error residuals. The effects of lagged values and lagged residuals on the forecast value gradually decrease over subsequent periods [chatfield2004cross]. The MML87 in this paper is based on the unconditional log-likelihood of the ARMA model. This paper also derives the Fisher information matrix based on the likelihood function, which is the key to making MML87 calculable in the ARMA time series case. MML87 is empirically compared with AIC, AICc, BIC, and HQ in terms of mean squared error (MSE) over the forecasting window. This paper also compares the number of times each information-theoretic criterion selects the model with the lower mean squared error; the results show that MML87 outperforms the other information-theoretic criteria for ARMA forecasting. This is because MML87 selects the model minimising the overall message length, which balances the goodness of fit of the model against its complexity [wallace1987estimation, wallace1999minimum]; we define the message length in section 4.
2 Time Series Background
2.1 ARMA Model
This section reviews the theory of the Autoregressive Moving-Average (ARMA) model due to Box and Jenkins [box2015time]. Consider a time series generated from an ARMA(p, q) model:

y_t = \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \epsilon_t + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q}    (1)

Rearranging equation 1, the white noise term \epsilon_t becomes:

\epsilon_t = y_t - \sum_{i=1}^{p} \phi_i y_{t-i} - \sum_{j=1}^{q} \theta_j \epsilon_{t-j}    (2)

Denote by B the backshift operator, where B^k y_t = y_{t-k} for k \ge 1. Rearranging equation 2:

\phi(B) y_t = \theta(B) \epsilon_t    (3)

where \phi(B) = 1 - \phi_1 B - \dots - \phi_p B^p and \theta(B) = 1 + \theta_1 B + \dots + \theta_q B^q.

The white noise \epsilon_t can then be rearranged as a function of the parameter vector \Theta:

\epsilon_t = \theta(B)^{-1} \phi(B) y_t    (4)

where \Theta = (\phi_1, \dots, \phi_p, \theta_1, \dots, \theta_q, \sigma^2) is a 1 x (p + q + 1) vector.
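As a concrete illustration of the recursion in equation 1, the sketch below simulates a Gaussian ARMA(p, q) process in plain Python. This is a minimal sketch, not code from the paper; the coefficient values and the `burn` length are illustrative assumptions.

```python
import random

def simulate_arma(phi, theta, n, sigma=1.0, seed=0, burn=200):
    """Simulate n observations from a Gaussian ARMA(p, q) process:
    y_t = sum_i phi_i * y_{t-i} + eps_t + sum_j theta_j * eps_{t-j}."""
    rng = random.Random(seed)
    p, q = len(phi), len(theta)
    y, eps = [], []
    for t in range(n + burn):
        e = rng.gauss(0.0, sigma)
        val = e
        for i in range(1, p + 1):       # autoregressive part
            if t - i >= 0:
                val += phi[i - 1] * y[t - i]
        for j in range(1, q + 1):       # moving-average part
            if t - j >= 0:
                val += theta[j - 1] * eps[t - j]
        y.append(val)
        eps.append(e)
    return y[burn:]  # discard burn-in so the series is near-stationary

series = simulate_arma(phi=[0.5], theta=[0.3], n=100)
```

Discarding the burn-in is one common way to approximate a draw from the stationary distribution of the process.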
Assuming the data are generated from a stationary Gaussian ARMA(p, q) process, the observations follow a multivariate Gaussian distribution with zero mean. Based on the unconditional log-likelihood function, the likelihood of the N observations y = (y_1, \dots, y_N)' is:

\log f(y \mid \Theta) = -\frac{N}{2} \log(2\pi) - \frac{1}{2} \log |\Sigma| - \frac{1}{2} y' \Sigma^{-1} y    (5)

where \Sigma is the N x N autocovariance matrix and |\Sigma| is its determinant, appearing in the joint density of y. \Sigma is the N x N theoretical covariance matrix of data satisfying an ARMA(p, q) model [fitzgibbon2004minimum, sak2005minimum, anderson1976inverse].
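To make the covariance matrix concrete, the sketch below computes the theoretical autocovariances of an ARMA(1, 1) process from the standard closed-form expressions; the covariance matrix is then the Toeplitz matrix whose (i, j) entry is \gamma_{|i-j|}. The helper name and parameter values are illustrative assumptions, not from the paper.

```python
def arma11_autocovariances(phi, theta, sigma2, max_lag):
    """Theoretical autocovariances gamma_0..gamma_max_lag of a stationary
    ARMA(1, 1) process y_t = phi*y_{t-1} + eps_t + theta*eps_{t-1}."""
    assert abs(phi) < 1, "stationarity requires |phi| < 1"
    g0 = sigma2 * (1 + 2 * phi * theta + theta ** 2) / (1 - phi ** 2)
    g1 = sigma2 * (1 + phi * theta) * (phi + theta) / (1 - phi ** 2)
    gammas = [g0, g1]
    for _ in range(2, max_lag + 1):
        gammas.append(phi * gammas[-1])  # gamma_k = phi * gamma_{k-1} for k >= 2
    return gammas

gammas = arma11_autocovariances(phi=0.5, theta=0.3, sigma2=1.0, max_lag=4)
```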
2.2 Information Theoretic Criterion
There are several existing model selection techniques among the information-theoretic criteria for the ARMA time series model, including AIC (Akaike's Information Criterion), AICc (corrected AIC), BIC (Bayesian Information Criterion), and HQ (Hannan-Quinn). The formulas to calculate these information-theoretic criteria in ARMA models are given as:

AIC = -2 \log L + 2k
AICc = -2 \log L + 2k + \frac{2k(k+1)}{N - k - 1}
BIC = -2 \log L + k \log N
HQ = -2 \log L + 2k \log \log N

where k = p + q + 1 if an intercept is included and k = p + q if it is not, N is the number of observations of the given time series, and L is the likelihood of the data.
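The four criteria above differ only in their complexity penalty, which the sketch below makes explicit; the function name is illustrative, not from the paper.

```python
import math

def information_criteria(loglik, k, n):
    """AIC, AICc, BIC, and HQ from a maximised log-likelihood `loglik`,
    number of free parameters `k`, and sample size `n`."""
    aic = -2.0 * loglik + 2.0 * k
    aicc = aic + 2.0 * k * (k + 1) / (n - k - 1)   # small-sample correction
    bic = -2.0 * loglik + k * math.log(n)          # heavier penalty for large n
    hq = -2.0 * loglik + 2.0 * k * math.log(math.log(n))
    return {"AIC": aic, "AICc": aicc, "BIC": bic, "HQ": hq}

crit = information_criteria(loglik=-150.0, k=3, n=100)
```

In each case the model minimising the criterion over the candidate set is selected.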
3 Fisher Information Matrix
In this section, we calculate the Fisher information matrix required by MML87. It is calculated as the negative of the expected value of the second derivatives of the log-likelihood:

F(\Theta) = -E\left[ \frac{\partial^2 \log f(y \mid \Theta)}{\partial \Theta \, \partial \Theta'} \right]    (7)

where f(y \mid \Theta) is the probability density function of the data y given the parameters \Theta.
Meanwhile, the first term of equation 8 converges to 0 in probability as N \to \infty, so we can ignore it, because the expression is dominated by the second term of equation 8 for large values of N.
Based on the Gaussian MLE obtained by maximising equation 5:

\hat{\sigma}^2 = \frac{S(\hat{\Theta})}{N}

It is important to note that the sum-of-squares function S(\Theta) does not depend on \sigma^2. The exact least squares estimates provide approximations very close to the results of the Gaussian MLE [yao2006gaussianii, yao2006gaussiani], where the mean of the white noise is 0 [miller1995exact, wincek1986exact, shephard1997relationship]. Box and Jenkins (1976) and Box, Jenkins, and Reinsel (2015, Section 7.1.2-4, pp. 210-213) [ljung1979likelihood, box2015time, box1976time] point out that the parameters can be obtained by minimising the unconditional sum-of-squares function S(\Theta) = \sum_t E[\epsilon_t \mid y, \Theta]^2.
The Fisher information matrix 7 can therefore be expressed using only the second term of equation 8 inside the expectation, by applying the product rule. Applying the product rule again, this term can be rearranged [vahid1999partial], so the Fisher information matrix becomes:
Meanwhile, the Fisher information matrix can be written in a general format in terms of \phi and \theta. Based on \epsilon_t from equation 4 and its lagged values [klein1995computation], the (i, j) element of each block of the Fisher information matrix is:
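The expectation in equation 7 is rarely tractable by hand, but the negative Hessian it is built from can be checked numerically. The sketch below (an illustrative check, not the paper's derivation) compares a central finite-difference second derivative of an AR(1) conditional log-likelihood with its analytic value -\sum_t y_{t-1}^2 / \sigma^2; the toy data and helper names are assumptions.

```python
import math

def ar1_cond_loglik(phi, y, sigma2=1.0):
    """Conditional log-likelihood of an AR(1) model, treating y[0] as fixed."""
    n = len(y) - 1
    rss = sum((y[t] - phi * y[t - 1]) ** 2 for t in range(1, len(y)))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation to the second derivative."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

# The log-likelihood is quadratic in phi, so the observed information is
# exactly sum(y_{t-1}^2)/sigma2 and the two values should agree closely.
y = [0.0, 0.4, -0.1, 0.3, 0.5, -0.2, 0.1]
numeric = second_derivative(lambda p: ar1_cond_loglik(p, y), 0.5)
analytic = -sum(v * v for v in y[:-1])
```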
4 Minimum Message Length
Minimum Message Length (MML) is an inductive inference method for model selection, founded on coding theory; its underlying concept is data compression. Given a sequence of time series data and a set of candidate ARMA models, MML assumes that the data are encoded into a two-part message transmitted from a sender to a receiver [wallace1968information, wong2019minimum]:
- the ARMA time series model;
- the time series data fitted to the ARMA model in the first message component;
where \Theta is the parameter space of the ARMA model and \Theta^* is a countable subset of the parameter space containing all possible MML parameter estimates [wong2019minimum].
The first point above encodes the model structure and parameters and transmits them from sender to receiver; this is called the assertion. The second point encodes the observed data using the model specified in the assertion; this is called the detail. MML minimises the total message length transmitted from sender to receiver, containing both the data and the model, given as [wallace1968information]:

I(H \,\&\, y) = I(H) + I(y \mid H)
where I(\cdot) represents the message length transmitted from sender to receiver. The assertion represents the complexity of the model, and the detail represents how well the model fits the data. MML therefore minimises the trade-off between the complexity and the goodness of fit of the model, which is why MML has less chance of overfitting the data.
MML87 is an extended version of Minimum Message Length (MML) used for model selection over a set of several continuous parameters [wallace1999minimum]:

I_{87}(\Theta, y) = -\log\left( \frac{h(\Theta)}{\sqrt{\kappa_k^{\,k} \, |F(\Theta)|}} \right) - \log f(y \mid \Theta) + \frac{k}{2}    (14)

Rearranging equation 14 gives:

I_{87}(\Theta, y) = -\log h(\Theta) + \frac{1}{2} \log |F(\Theta)| + \frac{k}{2} \log \kappa_k - \log f(y \mid \Theta) + \frac{k}{2}    (15)

where \epsilon measures the accuracy of the data, h(\Theta) is the Bayesian prior distribution over the parameter set (including the prior on the number of parameters), f(y \mid \Theta) is the standard statistical likelihood function, F(\Theta) is the Fisher information matrix of the parameter set, and \kappa_k is the lattice constant, which accounts for the expected error in the log-likelihood function of the ARMA model in equation 5 due to quantisation of the k-dimensional parameter space; it is bounded above by \frac{1}{12} and below by \frac{1}{2\pi e}. For example, \kappa_1 = \frac{1}{12} \approx 0.0833, \kappa_2 = \frac{5}{36\sqrt{3}} \approx 0.0802, \kappa_3 \approx 0.0785, and \kappa_k \to \frac{1}{2\pi e} \approx 0.0586 as k \to \infty. We assume the exact likelihood function is used and that the data come from a stationary process [fitzgibbon2004minimum, wallace1999minimum, ward2008review].
The message length of the stationary model is calculated by substituting equation 12 and the likelihood function into equation 15, along with the priors [fitzgibbon2004minimum, sak2005minimum, wallace1999minimum].
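As an illustration of the MML87 expression, the sketch below evaluates the message length for a toy one-parameter model: a Gaussian mean with known unit variance, a uniform prior over an assumed range [-R, R], and Fisher information F(\mu) = N. The prior range and data are illustrative assumptions, not values from the paper.

```python
import math

KAPPA_1 = 1.0 / 12.0  # lattice constant for k = 1

def mml87_length(log_prior, log_fisher_det, loglik, k, kappa):
    """MML87 message length:
    -log h(theta) + 0.5*log|F| + (k/2)*log kappa_k - log f(y|theta) + k/2."""
    return (-log_prior + 0.5 * log_fisher_det + 0.5 * k * math.log(kappa)
            - loglik + 0.5 * k)

# Toy model: y_i ~ N(mu, 1), uniform prior for mu on [-R, R], F(mu) = n.
y = [0.8, 1.2, 1.1, 0.9, 1.0]
n = len(y)
mu = sum(y) / n                      # MML estimate coincides with the MLE here
R = 10.0
loglik = sum(-0.5 * math.log(2 * math.pi) - 0.5 * (v - mu) ** 2 for v in y)
length = mml87_length(log_prior=-math.log(2 * R),
                      log_fisher_det=math.log(n),
                      loglik=loglik, k=1, kappa=KAPPA_1)
```

The first three terms of the returned value play the role of the assertion (model complexity) and the -loglik term the detail (fit to data), making the two-part trade-off explicit.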
5 Bayesian Priors
In MML87, the priors are chosen to be ignorant prior to observing the data. The MML87 for the ARMA model in this paper uses the exact likelihood function and assumes the data come from a stationary process, so a uniform prior is placed on the parameters over the stationarity region, following the results of [fitzgibbon2004minimum, sak2005minimum]:
where the hypervolume of the stationarity region is based on the results of [barndorff1973parametrization, piccolo1982size]. The hypervolume of the invertibility region can be calculated recursively by:
6 Empirical Comparison
6.1 Simulated data ARMA(1, 1)
We use simulated data to compare the performance of MML87 with AIC, AICc, BIC, and HQ. In this case, a uniform distribution is used to generate 5 different parameter values for \phi and 2 different parameter values for \theta. For each combination of \phi and \theta, we generate in-sample and out-of-sample datasets of different sizes; in total there are 1,000 different datasets across the combinations of \phi and \theta. For each dataset, we use each combination of lag orders from 1 to 5 for p and from 0 to 5 for q to construct the ARMA models, which provides 30 different models for comparison on one dataset. Each of MML87, AIC, AICc, BIC, and HQ is then used to select the best of the 30 models, and the loss function is calculated; the loss function used in the empirical comparison is the MSPE:

MSPE = \frac{1}{T} \sum_{t=N+1}^{N+T} (y_t - \hat{y}_t)^2
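The selection-and-scoring loop can be sketched as follows: fit each candidate model on the in-sample window, pick the one minimising a criterion, then score it by one-step-ahead MSPE on the out-of-sample window. This is a simplified sketch using AR-only candidates, AIC only, and an expanding refit window (the paper's experiments use full ARMA(p, q) candidates, five criteria, and a rolling window); the helper names and toy data are illustrative assumptions.

```python
import math
import random

def solve(a, v):
    """Solve the small linear system a x = v by Gauss-Jordan elimination."""
    n = len(v)
    m = [row[:] + [v[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * z for x, z in zip(m[r], m[c])]
    return [m[i][-1] / m[i][i] for i in range(n)]

def fit_ar(y, p):
    """Conditional least-squares fit of an AR(p) model via normal equations."""
    n = len(y) - p
    xtx = [[sum(y[t - i] * y[t - j] for t in range(p, len(y)))
            for j in range(1, p + 1)] for i in range(1, p + 1)]
    xty = [sum(y[t - i] * y[t] for t in range(p, len(y)))
           for i in range(1, p + 1)]
    b = solve(xtx, xty)
    rss = sum((y[t] - sum(b[i] * y[t - i - 1] for i in range(p))) ** 2
              for t in range(p, len(y)))
    loglik = -0.5 * n * (math.log(2 * math.pi * rss / n) + 1)  # Gaussian, MLE sigma2
    return b, loglik

def select_and_score(y, n_in, orders=(1, 2, 3)):
    """Select the AR order minimising AIC on the in-sample window, then
    score it with a one-step-ahead MSPE over the remaining observations."""
    aics = [(-2 * fit_ar(y[:n_in], p)[1] + 2 * (p + 1), p) for p in orders]
    p = min(aics)[1]
    errs = []
    for t in range(n_in, len(y)):
        b, _ = fit_ar(y[:t], p)  # refit on the expanding window
        pred = sum(b[i] * y[t - i - 1] for i in range(p))
        errs.append((y[t] - pred) ** 2)
    return p, sum(errs) / len(errs)

random.seed(1)
y = [0.0]
for _ in range(109):                        # toy AR(1) data, phi = 0.6
    y.append(0.6 * y[-1] + random.gauss(0.0, 1.0))
order, mspe = select_and_score(y, n_in=100)
```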
Table 1 uses the in-sample size N = 100, and the forecast method used in this case is a rolling forecast with a window size of T = 10. For each dataset, we take the prediction errors of the models chosen by the minimum values of AIC, AICc, BIC, HQ, and MML87, because the minimum of each information criterion identifies the model selected out of the 30 candidates. We then compare the prediction errors of the models selected by AIC, AICc, BIC, HQ, and MML87: the criterion whose selected model has the smallest prediction error is the more efficient one. For each simulated parameter combination of \phi and \theta, there are 100 datasets as mentioned above, and we count the number of times that AIC, AICc, BIC, HQ, and MML87 each select the model with the minimum prediction error.
For example, the first row of Table 1 indicates that, using one particular parameter combination of \phi and \theta to generate 100 datasets, each containing N = 100 plus T = 10 data points, MML87 selects the model with the minimum prediction error 45 times out of 100. The counts can sum to more than 100 because two or more information criteria may select the same minimum-prediction-error model. According to Table 1, MML87 selects the most minimum-prediction-error models across the different parameter combinations, which demonstrates that MML87 outperforms AIC, AICc, BIC, and HQ in model selection for the ARMA time series model.
Tables 8 to 12 show the comparison of the number of times each information-theoretic criterion selects the minimum-prediction-error model for N = 50, 150, 200, 300, and 500. The results suggest that MML87 selects the lower-error model more often than the other information criteria.
Table 2 shows the average MSPE over 100 datasets for the different simulated parameters, with in-sample sizes N = 50 and 100 and out-of-sample size T = 10. Each row gives the average MSPE of the model selected by each information criterion. A lower forecast error indicates a better model and is highlighted in bold; MML87 outperforms the other information criteria in the majority of cases.
Table 3 examines the average prediction error over an out-of-sample window of T = 10 for different in-sample sizes N. It uses in-sample sizes of N = 50, 100, 150, 200, 300, and 500 to compare the average MSPE of the models selected by the different information-theoretic criteria. MML87 outperforms AIC, AICc, BIC, and HQ for N = 50, 100, 150, and 300, partly because AIC and BIC tend to overfit the data at smaller in-sample sizes.
| N = 50 | 1.796509 | 1.886955 | 1.834129 | 1.801342 | 1.826132 |
| N = 100 | 1.714055 | 1.792131 | 1.771764 | 1.716704 | 1.724539 |
| N = 150 | 1.748152 | 1.810529 | 1.809941 | 1.749774 | 1.776628 |
| N = 200 | 1.771909 | 1.833165 | 1.821335 | 1.764313 | 1.789190 |
| N = 300 | 1.703449 | 1.741882 | 1.737454 | 1.711380 | 1.712863 |
| N = 500 | 1.719375 | 1.736897 | 1.735968 | 1.717373 | 1.718875 |
Table 4 compares the average MSPE over out-of-sample sizes T = 10, 30, 50, and 100. MML87 outperforms the other information-theoretic criteria for T = 10, 30, and 50.
| T = 10 | 1.714055 | 1.792131 | 1.771764 | 1.716704 | 1.724539 |
| T = 30 | 1.793344 | 1.823455 | 1.810748 | 1.804022 | 1.801981 |
| T = 50 | 1.901877 | 1.925961 | 1.922792 | 1.914095 | 1.909096 |
| T = 100 | 1.898179 | 1.925083 | 1.917025 | 1.897318 | 1.903784 |
6.2 Simulated data ARMA(p, q)
This subsection also uses a uniform distribution to simulate datasets with different numbers and values of the \phi and \theta parameters. It compares the MSPE of the five information-theoretic criteria across forecast windows of T = 1, 3, 5, 10, 30, 50, 70, 100, 130, and 150 with N = 100. For each forecast window, 100 datasets are generated, and the results report the average MSPE. Table 5 shows the results for the simulated ARMA(p, q) data.
Figure 1 shows the average MSPE comparison between MML87, AIC, AICc, BIC, and HQ for data simulated from an ARMA(2, 1) model, and Figure 2 shows the comparison for data simulated from an ARMA(3, 3) model. MML87 outperforms the other information criteria because its mean prediction error is the smallest across the different forecast window sizes. Figures 2 to 8 in section 8.2 show the corresponding diagrams for data generated from ARMA(2, 2) to ARMA(5, 2) parameters.
6.3 Actual data
In this section, we use real financial data with time series characteristics to compare the performance of AIC, AICc, BIC, HQ, and MML87. To capture both systematic and unsystematic risk, the data are collected from market-portfolio stock indices across different countries, including the ASX200, Dow Jones Composite Average, FTSE100, Russell, Nasdaq 100, Nasdaq Composite, and NYSE Composite, plus two individual stocks, AAPL and GS. The time horizon of the selected financial data is from 2020-09-08 to 2021-09-07. Table 6 shows the number of times each information criterion selects the lower-error model. For each asset, the data are separated into 8 segments with N = 30 and T = 10, using combinations of lag orders 1 to 5 in the AR part and 0 to 5 in the MA part. The results suggest that MML87 selects a larger number of lower-error models than the other information criteria for 8 of the 10 assets; the best information criterion for each asset is shown in bold.
| Dow Jones Composite Average | 4 | 3 | 4 | 3 | 4 |
| Dow Jones Industrial Average | 4 | 4 | 6 | 6 | 3 |
Because the selected portfolios have different data scales, Table 7 reports the average log prediction error for the ten asset portfolios in order to avoid misclassifying the best model selection criterion. The bold entries are the minimum errors for each asset portfolio. The results show that the models selected by MML87 beat those selected by AIC, AICc, BIC, and HQ, achieving a lower average log prediction error.
7 Conclusion
We have investigated the Autoregressive Moving-Average model under the MML87 information criterion based on the Wallace and Freeman (1987) approximation, using maximum likelihood estimation in the ARMA modelling and the unconditional likelihood function in the calculation of the Fisher information matrix. The results show that MML87 outperforms, but does not dominate, the other information criteria on both simulated data and actual financial data when measured by MSPE. On average, MML87 is a very good model selection technique for time series data.