Minimum Message Length Autoregressive Moving Average Model Order Selection

10/07/2021
by Zheng Fang, et al.

This paper derives a Minimum Message Length (MML) criterion for order selection in the Autoregressive Moving Average (ARMA) time series model. The performance of MML87 on the ARMA model is compared with other well-known model selection criteria: the Akaike Information Criterion (AIC), corrected AIC (AICc), Bayesian Information Criterion (BIC), and Hannan-Quinn (HQ). The experimental results show that MML87 outperforms the other model selection criteria: it most often selects the model with the lower prediction error, and the models selected by MML87 have a lower mean squared error across different in-sample and out-of-sample sizes.



1 Introduction

Time series data is important nowadays in economics, financial mathematics, weather forecasting, and signal processing. In finance, high-frequency data from the stock market is presented as time series; in weather forecasting, rainfall records are presented as time series as well. Selecting a well-fitting time series model for given data is essential: it helps financial investors to control risk and social scientists to analyse population growth. A good model selection procedure contributes significantly to the time series community, which is why this topic is popular in academia and has been researched over the long term.

The Minimum Message Length estimator (MML87), introduced by Wallace and Freeman [wallace1987estimation], is the key inductive inference method in this paper; it belongs to the family of information-theoretic model selection criteria. MML87 is an extended version of the Minimum Message Length (MML) principle introduced by Wallace and Boulton in 1968 [wallace1968information]. MML87 is widely applicable in different areas and has been used successfully on many time series problems: results from Fitzgibbon and Sak show that MML87 outperforms standard criteria on the Autoregressive (AR) and Moving-Average (MA) models [fitzgibbon2004minimum, sak2005minimum], and results from Z. Fang show that MML87 outperforms them on a hybrid of ARMA with the LSTM (Long Short-Term Memory) neural network [zheng2021].

The purpose of this paper is to derive and investigate the use of the MML87 methodology for the Autoregressive Moving-Average (ARMA) model. The ARMA model, introduced by Box and Jenkins in 1976 [box1976time], is one of the most popular time series models because it models both the effect of lagged (i.e., past) values and the effect of random factors, such as strikes, through the error residuals. The effects of lagged values and lagged residuals on the forecast value gradually decrease over subsequent periods [chatfield2004cross]. The MML87 criterion in this paper is based on the unconditional log-likelihood of the ARMA model. This paper also derives the Fisher information matrix of this likelihood, which is the key quantity that makes MML87 computable in the ARMA time series case. MML87 is empirically compared with AIC, AICc, BIC, and HQ in terms of the mean squared error (MSE) over the forecasting window. This paper also compares the number of times that each information-theoretic criterion selects the model with the lower mean squared error; the results show that MML87 outperforms the other information-theoretic criteria for ARMA forecasting. This is because MML87 selects the model that minimises the overall message length, balancing the goodness of fit of the model against its complexity [wallace1987estimation, wallace1999minimum]; we define the message length in Section 4.

2 Time Series Background

2.1 ARMA Model

This section reviews the theory of the Autoregressive Moving Average (ARMA) model due to Box and Jenkins (1970) [box2015time]. Consider time series data generated from an ARMA(p, q) model:

(1)   x_t = \phi_1 x_{t-1} + \dots + \phi_p x_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}

where \varepsilon_t \sim N(0, \sigma^2) is Gaussian white noise.

Rearranging equation (1), the white noise term becomes:

(2)   \varepsilon_t = x_t - \phi_1 x_{t-1} - \dots - \phi_p x_{t-p} - \theta_1 \varepsilon_{t-1} - \dots - \theta_q \varepsilon_{t-q}

Denote by B the backshift operator, where B^k x_t = x_{t-k} for k \ge 1; rearranging equation (2):

(3)   \phi(B)\, x_t = \theta(B)\, \varepsilon_t

where \phi(B) = 1 - \phi_1 B - \dots - \phi_p B^p and \theta(B) = 1 + \theta_1 B + \dots + \theta_q B^q.

The white noise \varepsilon_t can be rearranged as:

(4)   \varepsilon_t(\boldsymbol{\theta}) = \theta(B)^{-1} \phi(B)\, x_t

where \varepsilon_t is a function of the parameter vector \boldsymbol{\theta} = (\phi_1, \dots, \phi_p, \theta_1, \dots, \theta_q, \sigma^2), a 1 x (p + q + 1) vector.

Assuming the data is generated from a stationary Gaussian ARMA(p, q) process, \mathbf{x} = (x_1, \dots, x_N)' follows a multivariate Gaussian distribution with zero mean, and the unconditional log-likelihood of the N observations is:

(5)   \log f(\mathbf{x} \mid \boldsymbol{\theta}) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2}\log|\Sigma| - \frac{1}{2\sigma^2}\,\mathbf{x}'\Sigma^{-1}\mathbf{x}

where \sigma^2 \Sigma is the N x N autocovariance matrix, and |\Sigma| is the determinant appearing in the joint density of \mathbf{x}. Here \Sigma is the N x N theoretical covariance matrix (scaled by \sigma^{-2}) of data satisfying an ARMA(p, q) model [fitzgibbon2004minimum, sak2005minimum, anderson1976inverse].
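As a concrete illustration, the following is a minimal numerical sketch of equation (5), assuming the MA(infinity) representation is truncated at m terms to build the autocovariance matrix; the function names and the truncation length are our own choices, not part of the paper.

```python
import numpy as np
from scipy.linalg import toeplitz

def arma_autocov(phi, theta, sigma2, nlags, m=2000):
    """Autocovariances gamma(0), ..., gamma(nlags-1) of a stationary
    ARMA(p, q), via the truncated MA(infinity) weights psi_j."""
    p, q = len(phi), len(theta)
    psi = np.zeros(m)
    psi[0] = 1.0
    for j in range(1, m):
        psi[j] = (theta[j - 1] if j <= q else 0.0) + sum(
            phi[i] * psi[j - 1 - i] for i in range(min(p, j)))
    # gamma(k) = sigma^2 * sum_j psi_j * psi_{j+k}
    return np.array([sigma2 * psi[: m - k] @ psi[k:] for k in range(nlags)])

def arma_loglik(x, phi, theta, sigma2):
    """Unconditional Gaussian log-likelihood of equation (5), with
    Gamma = sigma^2 * Sigma the N x N Toeplitz autocovariance matrix."""
    N = len(x)
    Gamma = toeplitz(arma_autocov(phi, theta, sigma2, N))
    _, logdet = np.linalg.slogdet(Gamma)
    quad = x @ np.linalg.solve(Gamma, x)
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + quad)
```

For example, arma_loglik(x, [0.5], [0.3], 1.0) evaluates an ARMA(1, 1) candidate. The direct N x N solve costs O(N^3); innovations-algorithm implementations are preferred for long series.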

2.2 Information Theoretic Criterion

There are several existing model selection techniques based on information-theoretic criteria for the ARMA time series model, including AIC (Akaike's Information Criterion), AICc (corrected AIC), BIC (Bayesian Information Criterion), and HQ (Hannan-Quinn). With m = p + q + k + 1 free parameters, these criteria for ARMA models are computed as:

  • AIC = -2 \log L + 2m

  • AICc = AIC + \frac{2m(m+1)}{n - m - 1}

  • BIC = -2 \log L + m \log n

  • HQ = -2 \log L + 2m \log \log n

where k = 1 if the intercept c \neq 0 and k = 0 if the intercept c = 0, n is the number of observations of the given time series, and L is the likelihood of the data.
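A sketch of these four formulas in code, assuming m = p + q + k + 1 parameters as above (the function name is ours):

```python
import numpy as np

def information_criteria(loglik, n, p, q, intercept=True):
    """AIC, AICc, BIC and HQ for an ARMA(p, q) fit on n observations;
    m counts the AR and MA coefficients, the optional intercept (k),
    and the innovation variance."""
    k = 1 if intercept else 0
    m = p + q + k + 1
    aic = -2.0 * loglik + 2.0 * m
    return {
        "AIC": aic,
        "AICc": aic + 2.0 * m * (m + 1) / (n - m - 1),
        "BIC": -2.0 * loglik + m * np.log(n),
        "HQ": -2.0 * loglik + 2.0 * m * np.log(np.log(n)),
    }
```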

3 Fisher Information Matrix

In this section, we calculate the Fisher information matrix required by MML87. It is the negative expected value of the second derivatives of the log-likelihood:

(6)   F(\boldsymbol{\theta})_{ij} = -\mathbb{E}\left[ \frac{\partial^2 \log f(\mathbf{x} \mid \boldsymbol{\theta})}{\partial \theta_i \, \partial \theta_j} \right]

where f(\mathbf{x} \mid \boldsymbol{\theta}) is the probability density function of the data \mathbf{x} given the parameters \boldsymbol{\theta}.

The Fisher information matrix for the log-likelihood of the ARMA time series model is obtained by combining equations (5) and (6):

(7)

From equations (5) and (7), the Fisher information matrix is given as:

(8)

In the meantime, the term of equation (8) arising from the derivatives of \log|\Sigma| converges to 0 in probability as N \to \infty, so we can ignore it: the expression is dominated by the second term of equation (8) for large values of N.

Maximising equation (5) gives the Gaussian MLE of the innovation variance, \hat{\sigma}^2 = \mathbf{x}'\Sigma^{-1}\mathbf{x}/N. It is important to note that \Sigma does not depend on \sigma^2. The exact least squares estimates provide approximations very close to the Gaussian MLE results [yao2006gaussianii, yao2006gaussiani].
Let

(9)

and let \varepsilon_t denote the white noise expressed as a function of the data and parameters according to equation (4). Rearranging equations (4) and (9) as a function of the parameters gives:

(10)

where the mean of the white noise is 0 [miller1995exact, wincek1986exact, shephard1997relationship]. Box and Jenkins (1976) and Box, Jenkins, and Reinsel (2015, Section 7.1.2-4, pp. 210-213) [ljung1979likelihood, box2015time, box1976time] point out that the parameters can be obtained by minimising the unconditional sum-of-squares function S(\Phi, \Theta) = \sum_t \varepsilon_t^2.
Using the product rule, the Fisher information matrix (7) can be expressed through only the second term of equation (8) inside the expectation; applying the product rule a second time rearranges the remaining cross-derivative term [vahid1999partial].

So the Fisher information matrix can be rearranged as:

(11)   F(\boldsymbol{\theta})_{ij} = \frac{1}{\sigma^2} \sum_{t=1}^{N} \mathbb{E}\left[ \frac{\partial \varepsilon_t}{\partial \theta_i} \, \frac{\partial \varepsilon_t}{\partial \theta_j} \right]

In the meantime, the Fisher information matrix can be written in a general format in terms of \phi(B) and \theta(B). Based on \varepsilon_t from equation (4) and its lags [klein1995computation], the (i, j) element of each block of the Fisher information matrix is:

(12)
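Since the derivatives of \varepsilon_t rarely need closed form in practice, a finite-difference sketch of the dominant term of equation (11) can look as follows; the zero pre-sample initialisation (conditional sum of squares) and the step size h are simplifying assumptions of ours, not the paper's exact-likelihood construction, and the expectation is replaced by the sum over one observed realisation. This sketch covers the (\Phi, \Theta) block only; the \sigma^2 entry would be handled separately.

```python
import numpy as np

def css_residuals(x, phi, theta):
    """eps_t from the recursion of equation (2), with pre-sample
    values of x and eps set to zero (conditional sum of squares)."""
    p, q = len(phi), len(theta)
    eps = np.zeros(len(x))
    for t in range(len(x)):
        ar = sum(phi[i] * x[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma = sum(theta[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        eps[t] = x[t] - ar - ma
    return eps

def fisher_information_fd(x, phi, theta, sigma2, h=1e-5):
    """Approximate F_ij = (1/sigma^2) * sum_t (d eps_t / d beta_i) *
    (d eps_t / d beta_j) over beta = (phi, theta), as in equation (11)."""
    beta = np.concatenate([phi, theta])
    p, d = len(phi), len(beta)
    grads = np.zeros((d, len(x)))
    for i in range(d):
        bp, bm = beta.copy(), beta.copy()
        bp[i] += h
        bm[i] -= h
        grads[i] = (css_residuals(x, bp[:p], bp[p:])
                    - css_residuals(x, bm[:p], bm[p:])) / (2.0 * h)
    return grads @ grads.T / sigma2
```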

4 Minimum Message Length

Minimum Message Length (MML) is an inductive inference method for model selection, founded on coding theory; its underlying concept is data compression. Assume there is a sequence of time series data and a set of candidate ARMA models. MML encodes the data into a two-part message transmitted from a sender to a receiver [wallace1968information, wong2019minimum]:

  • the ARMA time series model;

  • the time series data fitted with the ARMA model specified in the first message component;

where \Theta is the parameter space of the ARMA model and \Theta^* is a countable subset of \Theta containing all possible MML parameter estimates [wong2019minimum].

The first part encodes the model structure and parameters and is transmitted from sender to receiver; it is called the assertion. The second part encodes the observed data using the model specified in the assertion; it is called the detail. MML minimises the total message length, containing both model and data, transmitted from sender to receiver [wallace1968information]:

(13)   I(\boldsymbol{\theta}, \mathbf{x}) = I(\boldsymbol{\theta}) + I(\mathbf{x} \mid \boldsymbol{\theta})

where I(\boldsymbol{\theta}, \mathbf{x}) is the total message length. The assertion represents the complexity of the model, and the detail represents how well the model fits the data. MML therefore minimises the trade-off between complexity and goodness of fit, which is why MML is less prone to overfitting the data.

MML87 is an extended version of Minimum Message Length (MML) used for model selection over sets of continuous parameters [wallace1999minimum]:

(14)   I(\boldsymbol{\theta}, \mathbf{x}) = -\log\left( \frac{\alpha_d \, h(\boldsymbol{\theta})}{\sqrt{\kappa_d^{\,d}\, |F(\boldsymbol{\theta})|}} \right) - \log\left( f(\mathbf{x} \mid \boldsymbol{\theta})\, \varepsilon^N \right) + \frac{d}{2}

Rearranging equation (14) gives:

(15)   I(\boldsymbol{\theta}, \mathbf{x}) = -\log \alpha_d - \log h(\boldsymbol{\theta}) + \frac{1}{2}\log|F(\boldsymbol{\theta})| - \log f(\mathbf{x} \mid \boldsymbol{\theta}) - N \log \varepsilon + \frac{d}{2}\log \kappa_d + \frac{d}{2}

where \varepsilon measures the accuracy of the data, \alpha_d is the prior on the number of parameters, h(\boldsymbol{\theta}) is the Bayesian prior distribution over the parameter set, f(\mathbf{x} \mid \boldsymbol{\theta}) is the standard statistical likelihood function, F(\boldsymbol{\theta}) is the Fisher information matrix of the parameter set, d is the number of parameters, and \kappa_d is the lattice constant, which accounts for the expected error in the log-likelihood of the ARMA model in equation (5) due to quantisation of the d-dimensional parameter space; it is bounded above by \kappa_1 = 1/12 and below by 1/(2\pi e).

For example, \kappa_1 = 1/12 \approx 0.0833, \kappa_2 = 5/(36\sqrt{3}) \approx 0.0802, \kappa_3 \approx 0.0785, and \kappa_d \to 1/(2\pi e) \approx 0.0586 as d \to \infty. We assume the exact likelihood function is used and that the data comes from a stationary process [fitzgibbon2004minimum, wallace1999minimum, ward2008review].

The message length of each stationary candidate model is calculated by substituting the Fisher information of equation (12) and the likelihood function of equation (5) into equation (15), along with the priors of Section 5 [fitzgibbon2004minimum, sak2005minimum, wallace1999minimum].
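Putting the pieces together, a hedged sketch of equation (15) as a scoring function; the handling of \alpha_d, \varepsilon, and of \kappa_d beyond d = 3 is our own simplification of the approximations discussed above.

```python
import numpy as np

def mml87_length(loglik, fisher_det, log_prior, d, n,
                 log_eps=0.0, log_alpha=0.0):
    """Two-part MML87 message length of equation (15), in nits.
    log_prior = log h(theta), log_alpha = log alpha_d, log_eps = log of
    the data measurement accuracy. Terms identical across candidate
    models may be dropped when only ranking models."""
    # kappa_1..kappa_3, then the asymptotic bound 1/(2*pi*e) for d > 3.
    kappa = {1: 1.0 / 12.0,
             2: 5.0 / (36.0 * np.sqrt(3.0)),
             3: 0.0785}.get(d, 1.0 / (2.0 * np.pi * np.e))
    assertion = -log_alpha - log_prior + 0.5 * np.log(fisher_det)
    detail = -loglik - n * log_eps
    return assertion + detail + 0.5 * d * (np.log(kappa) + 1.0)
```

In a model search, the candidate minimising mml87_length(...) is selected, exactly as the minimum of AIC or BIC is selected in Section 2.2.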

5 Bayesian Priors

In MML87, the priors must be ignorant prior to the observation of the data. The MML87 criterion for the ARMA model in this report uses the exact likelihood function and expects the data to come from a stationary process, so we place a uniform prior on the coefficients over the stationarity region, with the hypervolume of the stationarity region based on the results of [barndorff1973parametrization, piccolo1982size]; the invertibility region hypervolume can be calculated recursively in the same way [fitzgibbon2004minimum, sak2005minimum].
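A small sketch of the support of this uniform prior, assuming the usual root conditions for stationarity and invertibility; rejection sampling with this test over a bounding box also gives a Monte Carlo estimate of the hypervolumes when the recursion is not at hand.

```python
import numpy as np

def in_prior_support(phi, theta):
    """True when phi(z) = 1 - phi_1 z - ... - phi_p z^p (stationarity)
    and theta(z) = 1 + theta_1 z + ... + theta_q z^q (invertibility)
    both have all roots strictly outside the unit circle."""
    phi, theta = np.asarray(phi, float), np.asarray(theta, float)
    ok = True
    if phi.size:
        # np.roots takes coefficients from highest degree down to the constant
        ok &= np.all(np.abs(np.roots(np.r_[-phi[::-1], 1.0])) > 1.0)
    if theta.size:
        ok &= np.all(np.abs(np.roots(np.r_[theta[::-1], 1.0])) > 1.0)
    return bool(ok)
```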

6 Empirical Comparison

6.1 Simulated data ARMA(1, 1)

We use simulated data to compare the performance of MML87 with AIC, AICc, BIC, and HQ. In this case, we use the uniform distribution to generate 5 different parameter values for \phi and 2 different parameter values for \theta. For each combination of \phi and \theta, we generate datasets with different in-sample and out-of-sample sizes, 1,000 datasets in total per combination. For each dataset, we fit every combination of lag orders p from 1 to 5 and q from 0 to 5, giving 30 candidate ARMA models per dataset. Each of MML87, AIC, AICc, BIC, and HQ then selects one best model out of the 30, and we calculate the loss function; the loss function used in the empirical comparison is the mean squared prediction error (MSPE):

(16)   \mathrm{MSPE} = \frac{1}{T} \sum_{t=N+1}^{N+T} (x_t - \hat{x}_t)^2
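A minimal sketch of one replication of this experiment, assuming statsmodels' ARIMA as a stand-in for the exact-likelihood fits in the paper; only AIC and BIC are read off directly here, while AICc, HQ, and MML87 would be computed from fit.llf with the formulas of Sections 2.2 and 4, and the illustrative parameter values are ours.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.arima_process import ArmaProcess

N, T = 100, 10
# One ARMA(1, 1) dataset with phi_1 = 0.5, theta_1 = 0.3 (illustrative values).
proc = ArmaProcess(ar=[1.0, -0.5], ma=[1.0, 0.3])
x = proc.generate_sample(N + T)

best = {}
for p in range(1, 6):          # 30 candidates: p in 1..5, q in 0..5
    for q in range(0, 6):
        fit = ARIMA(x[:N], order=(p, 0, q)).fit()
        for name, score in (("AIC", fit.aic), ("BIC", fit.bic)):
            if name not in best or score < best[name][0]:
                best[name] = (score, (p, q), fit)

# Rolling one-step forecasts over the out-of-sample window, then MSPE (eq. 16).
for name, (_, order, fit) in best.items():
    res = fit.apply(x)         # reuse the fitted parameters on the full series
    pred = res.predict(start=N, end=N + T - 1)
    print(name, order, round(float(np.mean((x[N:] - pred) ** 2)), 4))
```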

Table 1 uses an in-sample size of N = 100; the forecast method is a rolling forecast with a window size of T = 10. For each dataset, we record the prediction error of the model that minimises each of AIC, AICc, BIC, HQ, and MML87, since the criterion-minimising model is the one selected out of the 30 candidates. We then compare the prediction errors of the models selected by AIC, AICc, BIC, HQ, and MML87: the criterion whose selected model has the smallest prediction error is the more efficient one. For each combination of simulated parameters \phi and \theta, there are 100 datasets as mentioned above, and we count the number of times that AIC, AICc, BIC, HQ, and MML87 each select the model with the minimum prediction error.

MML87 AIC AICc BIC HQ
45 25 34 41 40
69 62 60 62 68
42 44 40 40 41
75 64 68 72 72
46 36 40 44 38
69 51 57 69 69
42 32 37 42 38
64 51 50 66 57
46 41 39 56 47
63 69 66 72 68
Table 1: Number of times each information-theoretic criterion selects the minimum forecast error (N = 100, T = 10)

For example, the first row of Table 1 indicates that, using one particular combination of parameter values \phi and \theta to generate 100 datasets (each with N = 100 plus T = 10 data points), MML87 selected the minimum-prediction-error model 45 times out of 100. The row sums can exceed 100 because two or more criteria may select the same minimum-prediction-error model. According to Table 1, MML87 selects the most minimum-prediction-error models across the different parameter combinations, which demonstrates that MML87 outperforms AIC, AICc, BIC, and HQ in model selection for the ARMA time series model.

Tables 8 to 12 show the same comparison, counting the number of times each information-theoretic criterion selects the minimum-prediction-error model, for N = 50, 150, 200, 300, and 500. The results suggest that MML87 selects the lower-error model more often than the other information criteria.

Table 2 shows the average MSPE over 100 datasets for the different simulated parameters, with in-sample size N = 100 and out-of-sample size T = 10. Each row reports the average MSPE of the model selected by each information criterion. A lower forecast error indicates a better model and is highlighted in bold text; MML87 outperforms the other information criteria in the majority of cases.

MML87               AIC                 AICc                BIC                 HQ
1.424699 (0.90860)  1.521890 (0.99921)  1.499588 (1.02887)  1.428254 (0.90914)  1.462676 (0.99725)
4.397397 (3.48977)  4.541932 (3.64444)  4.515208 (3.60842)  4.417018 (3.49456)  4.335861 (3.43196)
1.192377 (0.63315)  1.237391 (0.66865)  1.220480 (0.66442)  1.199164 (0.66119)  1.197432 (0.66071)
1.400034 (0.63992)  1.427151 (0.63984)  1.413369 (0.64121)  1.389479 (0.64031)  1.398019 (0.62572)
1.238596 (0.63055)  1.283056 (0.64058)  1.268670 (0.63659)  1.404347 (0.61788)  1.252059 (0.62601)
1.430728 (0.68179)  1.534319 (0.73082)  1.504423 (0.72636)  1.426555 (0.68741)  1.437177 (0.68837)
1.286106 (0.69430)  1.338531 (0.70770)  1.311318 (0.68228)  1.276376 (0.67433)  1.287715 (0.68372)
1.467448 (0.75440)  1.547590 (0.84311)  1.537011 (0.83362)  1.472064 (0.75156)  1.488488 (0.78266)
2.261384 (1.87727)  2.417009 (1.97325)  2.380226 (1.95017)  2.301430 (1.93637)  2.338255 (1.97315)
1.041781 (0.47548)  1.072443 (0.48750)  1.067345 (0.48744)  1.044731 (0.47907)  1.047712 (0.47553)
Table 2: Average MSPE (standard deviation in parentheses) of the model selected by each information-theoretic criterion (N = 100, T = 10)

Table 3 tests the average prediction error over an out-of-sample window of T = 10 for different in-sample sizes N. It uses in-sample sizes of N = 50, 100, 150, 200, 300, and 500 to compare the average MSPE of the models selected by the different information-theoretic criteria. MML87 outperforms AIC, AICc, BIC, and HQ for N = 50, 100, 150, and 300, because AIC and BIC tend to overfit the data at smaller in-sample sizes.

MML87 AIC AICc BIC HQ
N = 50 1.796509 1.886955 1.834129 1.801342 1.826132
N = 100 1.714055 1.792131 1.771764 1.716704 1.724539
N = 150 1.748152 1.810529 1.809941 1.749774 1.776628
N = 200 1.771909 1.833165 1.821335 1.764313 1.789190
N = 300 1.703449 1.741882 1.737454 1.711380 1.712863
N = 500 1.719375 1.736897 1.735968 1.717373 1.718875
Table 3: Average MSPE for different in-sample sizes N (T = 10)

Table 4 compares the average MSPE for out-of-sample sizes T = 10, 30, 50, and 100. MML87 outperforms the other information-theoretic criteria for T = 10, 30, and 50.

T MML87 AIC AICc BIC HQ
T = 10 1.714055 1.792131 1.771764 1.716704 1.724539
T = 30 1.793344 1.823455 1.810748 1.804022 1.801981
T = 50 1.901877 1.925961 1.922792 1.914095 1.909096
T = 100 1.898179 1.925083 1.917025 1.897318 1.903784
Table 4: Average MSPE for different out-of-sample sizes T

6.2 Simulated data ARMA(p, q)

This subsection also uses the uniform distribution to simulate datasets with different numbers and values of the \phi and \theta parameters. It compares the MSPE of the five information-theoretic criteria over forecast windows T = 1, 3, 5, 10, 30, 50, 70, 100, 130, and 150 with N = 100. For each forecast window, we generate 100 datasets and report the average MSPE. Table 5 shows the results for data simulated from ARMA(2, 1) to ARMA(5, 2).

MML87 AIC AICc BIC HQ
ARMA(2, 1) 1.919648 1.983033 1.958249 1.922295 1.949766
ARMA(2, 2) 2.051329 2.118476 2.104302 2.053353 2.081951
ARMA(3, 1) 1.487463 1.527326 1.521105 1.490115 1.500657
ARMA(3, 2) 1.712706 1.738684 1.721419 1.699448 1.712409
ARMA(4, 1) 1.961224 2.034625 2.018761 1.978715 1.994742
ARMA(4, 2) 1.520506 1.567478 1.556770 1.521140 1.535147
ARMA(5, 1) 2.015772 2.084429 2.055148 2.022353 2.034267
ARMA(5, 2) 1.598522 1.647486 1.637956 1.598533 1.620931
Table 5: Average MSPE for different ARMA(p, q) orders

Figure 1 shows the average MSPE comparison between MML87, AIC, AICc, BIC, and HQ for simulated data generated from ARMA(2, 1) parameters, and Figure 2 shows the comparison for simulated data generated from ARMA(2, 2) parameters. MML87 outperforms the other information criteria because its mean prediction error is the smallest across the different forecast window sizes. Figures 2 to 8 in Section 8.2 show the corresponding diagrams for data generated from ARMA(2, 2) to ARMA(5, 2) parameters.

Figure 1: Average MSPE in ARMA(2, 1) Simulated Data

6.3 Actual data

In this section, we use real financial data with time series characteristics to compare the performance of AIC, AICc, BIC, HQ, and MML87. To cover both systematic and unsystematic risk, the data is collected from stock indices of market portfolios across different countries, including the ASX200, Dow Jones Composite Average, FTSE100, Russell, Nasdaq 100, Nasdaq Composite, and NYSE Composite, plus two individual stocks, AAPL and GS. The time horizon of the selected financial data is 2020-09-08 to 2021-09-07. Table 6 shows the number of times each criterion selected the lower-error model. Each asset series is separated into 8 segments with N = 30 and T = 10, and we fit every combination of lag orders 1 to 5 in the AR part and 0 to 5 in the MA part. The results suggest that MML87 selects the lower-error model more often than the other information criteria on 8 of the 10 assets; the best criterion for each asset is shown in bold text.

Asset MML87 AIC AICc BIC HQ
Dow Jones Composite Average 4 3 4 3 4
SP500 5 2 5 5 4
ASX200 7 3 5 5 4
Russell 2 5 5 5 5
Dow Jones Industrial Average 4 4 6 6 3
FTSE100 5 5 4 4 5
Nasdaq100 7 3 2 3 3
Nasdaq Composite 5 3 2 3 2
AAPL 5 5 4 5 3
GS 6 4 4 4 4
Table 6: Number of times each information-theoretic criterion selects the minimum forecast error on financial data

Because the selected portfolio values have different data scales, we use the log prediction error to avoid misranking the model selection criteria. Table 7 reports the average log prediction error for the ten asset portfolios; the bold entries are the minimum error for each asset. The results show that the models selected by MML87 beat those selected by AIC, AICc, BIC, and HQ, having a lower average log prediction error.

Asset       MML87                AIC                  AICc                 BIC                  HQ
DJ Com      9.339222 (1.48159)   9.431943 (1.48159)   9.266359 (1.50159)   9.350298 (1.43139)   9.266359 (1.39585)
SP500       8.170703 (0.95529)   8.467639 (0.95529)   8.175019 (1.19616)   8.175019 (0.94133)   8.209827 (0.89824)
ASX200      8.668513 (0.94712)   9.073504 (0.94712)   8.807688 (1.04215)   8.807688 (0.91327)   8.807688 (0.91327)
Russell     7.556093 (0.97889)   7.464931 (0.97889)   7.362797 (1.27475)   7.362797 (1.13518)   7.362797 (1.13518)
Dow Jones   12.124174 (1.28175)  12.279021 (1.28175)  11.930101 (1.49217)  11.930101 (1.39267)  12.066458 (1.45273)
FTSE100     9.610056 (0.92881)   9.732419 (0.92881)   9.740016 (0.87461)   9.740016 (0.88376)   9.719694 (0.90703)
Nasdaq 100  11.247209 (0.88016)  11.297879 (0.88016)  11.30298 (0.73106)   11.253671 (0.76663)  11.300069 (0.73563)
Nasdaq Com  11.172626 (0.97813)  11.494775 (0.97813)  11.261841 (0.90478)  11.201055 (0.9338)   11.37422 (0.91929)
AAPL        3.082 (1.21437)      2.967387 (1.21437)   3.053346 (1.22936)   2.871599 (1.08716)   3.132657 (1.10378)
GS          4.714533 (0.82053)   4.931766 (0.82053)   4.776378 (0.68177)   4.776378 (0.89269)   4.786212 (0.88672)
Table 7: Average log prediction error (standard deviation in parentheses) of the model selected by each information-theoretic criterion on financial data

7 Conclusion

We have investigated the Autoregressive Moving-Average model under the MML87 information criterion based on the Wallace and Freeman (1987) approximation, using the maximum likelihood estimate for ARMA modelling and the unconditional likelihood function in the calculation of the Fisher information matrix. The results show that MML87 outperforms, but does not dominate, the other information criteria on both simulated data and actual financial data when evaluated by MSPE. On average, MML87 is a very good model selection technique for time series data.

8 Appendix

8.1 Tables

MML87 AIC AICc BIC HQ
48 35 42 48 42
61 65 66 65 67
33 39 41 54 48
66 48 57 66 58
40 40 47 53 52
66 58 63 63 64
30 47 50 57 54
69 61 59 69 60
53 41 39 36 40
49 47 54 64 57
Table 8: Number of times each information-theoretic criterion selects the minimum forecast error (N = 50, T = 10)
MML87 AIC AICc BIC HQ
48 40 43 41 40
75 64 65 67 67
43 37 34 43 39
66 62 61 71 68
48 37 41 31 40
74 60 63 73 73
41 46 50 43 43
71 60 60 77 73
45 46 44 49 47
67 57 58 66 61
Table 9: Number of times each information-theoretic criterion selects the minimum forecast error (N = 150, T = 10)
MML87 AIC AICc BIC HQ
47 33 37 36 36
65 57 54 64 57
55 51 49 44 40
68 57 59 69 59
42 47 50 47 50
68 59 59 71 62
52 53 49 49 49
71 62 62 68 63
51 51 49 53 55
39 55 52 43 50
Table 10: Number of times each information-theoretic criterion selects the minimum forecast error (N = 200, T = 10)
MML87 AIC AICc BIC HQ
30 44 45 45 43
75 49 52 67 65
56 50 49 51 53
64 69 70 67 68
32 49 51 29 34
65 61 60 62 59
55 46 48 55 59
72 61 63 72 69
61 48 49 44 44
50 44 44 59 55
Table 11: Number of times each information-theoretic criterion selects the minimum forecast error (N = 300, T = 10)
MML87 AIC AICc BIC HQ
47 42 41 49 38
74 71 73 73 74
42 48 50 44 46
73 62 60 73 73
45 44 41 39 40
67 60 60 65 69
38 47 47 50 45
70 51 51 70 70
54 57 55 50 51
65 58 60 65 60
Table 12: Number of times each information-theoretic criterion selects the minimum forecast error (N = 500, T = 10)

8.2 Figures

Figure 2: Average MSPE in ARMA(2, 2) Simulated Data
Figure 3: Average MSPE in ARMA(3, 1) Simulated Data
Figure 4: Average MSPE in ARMA(3, 2) Simulated Data
Figure 5: Average MSPE in ARMA(4, 1) Simulated Data
Figure 6: Average MSPE in ARMA(4, 2) Simulated Data
Figure 7: Average MSPE in ARMA(5, 1) Simulated Data
Figure 8: Average MSPE in ARMA(5, 2) Simulated Data
