1 Introduction
Time series data are ubiquitous nowadays in economics, financial mathematics, weather forecasting, and signal processing. In finance, high-frequency stock-market data are presented as time series; in weather forecasting, rainfall records are likewise time series. Selecting a well-fitting time series model for given data is essential: it helps financial investors to control risk and social scientists to analyse population growth. Because good model selection contributes substantially to the time series community, this topic has long been popular and actively researched in academia.
The Minimum Message Length estimator (MML87), introduced by Wallace and Freeman [wallace1987estimation], is the key inductive inference method in this paper; it belongs to the information-theoretic criteria for model selection. MML87 is an extended version of the Minimum Message Length (MML) principle introduced by Wallace and Boulton in 1968 [wallace1968information]. MML87 is widely applicable in different areas and has been successful on many time series problems: results of Fitzgibbon and Sak show that MML87 outperforms alternatives for the autoregressive (AR) and moving-average (MA) models [fitzgibbon2004minimum, sak2005minimum], and results of Z. Fang show that MML87 outperforms alternatives for a hybrid of ARMA with an LSTM (long short-term memory) neural network [zheng2021].

The purpose of this paper is to derive and investigate the use of the MML87 methodology for the autoregressive moving-average (ARMA) model. The ARMA model, introduced by Box and Jenkins in 1976 [box1976time], is one of the most popular time series models because it models both the lagged (i.e., past) values of the series and random factors such as strikes or error residuals; the effects of lagged values and lagged residuals on the forecast value decrease gradually over subsequent periods [chatfield2004cross]. The MML87 criterion in this paper is based on the unconditional log-likelihood of the ARMA model. This paper also derives the Fisher information matrix of that likelihood function, which is the key to making MML87 computable in the ARMA time series case. MML87 is compared empirically with AIC, AICc, BIC, and HQ in terms of mean squared error (MSE) over the forecasting window, and we also compare the number of times each information-theoretic criterion selects the model with the lower mean squared error. The results show that MML87 outperforms the other information-theoretic criteria for ARMA forecasting. This is because MML87 selects the model that minimises the overall message length, trading off goodness of fit against model complexity [wallace1987estimation, wallace1999minimum]; we define the message length in Section 4.
2 Time Series Background
2.1 ARMA Model
This section reviews the theory of autoregressive moving-average (ARMA) modelling due to Box and Jenkins (1970) [box2015time]. Consider a time series y_1, ..., y_N generated from an ARMA(p, q) model:
(1) y_t = φ_1 y_{t-1} + ... + φ_p y_{t-p} + ε_t + θ_1 ε_{t-1} + ... + θ_q ε_{t-q}

where ε_t ~ N(0, σ²) is Gaussian white noise.

Rearranging equation (1), the white-noise term becomes:

(2) ε_t = y_t − φ_1 y_{t-1} − ... − φ_p y_{t-p} − θ_1 ε_{t-1} − ... − θ_q ε_{t-q}

Denote by B the backshift operator, where B^k y_t = y_{t-k} for k ≥ 1; equation (2) can then be written as:

(3) φ(B) y_t = θ(B) ε_t

where φ(B) = 1 − φ_1 B − ... − φ_p B^p and θ(B) = 1 + θ_1 B + ... + θ_q B^q. The white noise ε_t can be rearranged as:

(4) ε_t = θ(B)^{-1} φ(B) y_t

where ε_t is a function of the parameter vector Θ = (φ_1, ..., φ_p, θ_1, ..., θ_q, σ²), a 1 × (p + q + 1) vector.
Assuming the data y = (y_1, ..., y_N)′ are generated from a stationary Gaussian ARMA(p, q) process, y follows a multivariate Gaussian distribution with zero mean. The unconditional log-likelihood of the N observations is:

(5) ln f(y | Θ) = −(N/2) ln(2π) − (1/2) ln |Σ| − (1/2) y′ Σ^{-1} y

where Σ is the N × N autocovariance matrix and |Σ|, its determinant, appears in the joint density of y; Σ is the theoretical covariance matrix of data satisfying an ARMA(p, q) model [fitzgibbon2004minimum, sak2005minimum, anderson1976inverse].
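As an illustration, the exact likelihood of equation (5) can be evaluated directly for the ARMA(1, 1) case, whose autocovariances have a simple closed form: γ_0 = σ²(1 + 2φθ + θ²)/(1 − φ²), γ_1 = σ²(1 + φθ)(φ + θ)/(1 − φ²), and γ_k = φ γ_{k-1} for k ≥ 2. The sketch below is illustrative only (the function names are ours, not the paper's):

```python
import numpy as np
from scipy.linalg import toeplitz

def arma11_autocov(phi, theta, sigma2, n):
    """Theoretical autocovariances gamma_0 .. gamma_{n-1} of a stationary ARMA(1,1)."""
    g = np.empty(n)
    g[0] = sigma2 * (1 + 2 * phi * theta + theta**2) / (1 - phi**2)
    if n > 1:
        g[1] = sigma2 * (1 + phi * theta) * (phi + theta) / (1 - phi**2)
    for k in range(2, n):
        g[k] = phi * g[k - 1]  # gamma_k = phi * gamma_{k-1} for k >= 2
    return g

def unconditional_loglik(y, phi, theta, sigma2):
    """Exact Gaussian log-likelihood of y under a zero-mean ARMA(1,1), as in eq. (5)."""
    n = len(y)
    Sigma = toeplitz(arma11_autocov(phi, theta, sigma2, n))  # N x N autocovariance matrix
    sign, logdet = np.linalg.slogdet(Sigma)
    quad = y @ np.linalg.solve(Sigma, y)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)
```

For higher-order ARMA(p, q) the autocovariances would have to be computed from the general Yule-Walker-type relations, but the likelihood evaluation itself is unchanged.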
2.2 Information Theoretic Criterion
Several existing model selection techniques based on information-theoretic criteria are available for ARMA time series models, including AIC (Akaike's Information Criterion), AICc (corrected AIC), BIC (Bayesian Information Criterion), and HQ (Hannan-Quinn). For ARMA models they are computed as:

AIC = −2 ln L + 2k

AICc = −2 ln L + 2k + 2k(k + 1)/(N − k − 1)

BIC = −2 ln L + k ln N

HQ = −2 ln L + 2k ln(ln N)

where k = p + q + 2 if an intercept is included and k = p + q + 1 otherwise, N is the number of observations of the given time series, and L is the likelihood of the data.
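These formulas are straightforward to compute once the maximised log-likelihood is available. The helper below is a hypothetical sketch (the parameter-count convention follows the statement above and is an assumption, not code from the paper):

```python
import numpy as np

def info_criteria(loglik, p, q, n, intercept=False):
    """AIC, AICc, BIC, and HQ for an ARMA(p, q) fit with maximised log-likelihood
    `loglik` on n observations. k = p + q + 1 (counting sigma^2), plus 1 for an
    intercept -- a convention, as stated in the text."""
    k = p + q + 1 + (1 if intercept else 0)
    aic = -2 * loglik + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)   # small-sample correction
    bic = -2 * loglik + k * np.log(n)
    hq = -2 * loglik + 2 * k * np.log(np.log(n))
    return {"AIC": aic, "AICc": aicc, "BIC": bic, "HQ": hq}
```

Each criterion is minimised over the candidate models; only the penalty term on k differs between them.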
3 Fisher Information Matrix
In this section, we calculate the Fisher information matrix required by MML87. It is the negative expected value of the matrix of second derivatives of the log-likelihood:

(6) I(Θ) = −E[ ∂² ln f(y | Θ) / ∂Θ ∂Θ′ ]

where f(y | Θ) is the probability density function of y given the parameters Θ. The Fisher information matrix for the log-likelihood of the ARMA time series model combines equations (5) and (6):

(7) I(Θ) = −E[ ∂²/∂Θ ∂Θ′ ( −(N/2) ln(2π) − (1/2) ln |Σ| − (1/2) y′ Σ^{-1} y ) ]

From equations (5) and (7), the Fisher information matrix is given as:

(8) I(Θ) = (1/2) ∂² ln |Σ| / ∂Θ ∂Θ′ + (1/2) E[ ∂² (y′ Σ^{-1} y) / ∂Θ ∂Θ′ ]
In the meantime, the first term of equation (8), involving ln |Σ|, converges to 0 in probability relative to the second term as N → ∞, so we can ignore it: the expression is dominated by the second term of equation (8) for large values of N. The Gaussian MLE is obtained by maximising equation (5). It is important to note that, writing Σ = σ² V, the matrix V does not depend on σ². The exact least-squares estimates provide approximations very close to the results of the Gaussian MLE [yao2006gaussianii, yao2006gaussiani].
Let

(9) ε = (ε_1, ..., ε_N)′

where ε_t is the white-noise function of Θ, with ε_t ~ N(0, σ²), according to equation (4). Rearranging equations (4) and (9), ε_t as a function of Θ is given by:

(10) ε_t = y_t − φ_1 y_{t-1} − ... − φ_p y_{t-p} − θ_1 ε_{t-1} − ... − θ_q ε_{t-q}

where the mean of the white noise is 0 [miller1995exact, wincek1986exact, shephard1997relationship]. Box and Jenkins (1976) and Box, Jenkins, and Reinsel (2015, Section 7.1.2, pp. 210-213) [ljung1979likelihood, box2015time, box1976time] point out that the parameters can be obtained by minimising the unconditional sum-of-squares function S(Θ) = ∑_t ε_t².
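The residual recursion of equation (10) and the sum of squares S(Θ) can be sketched as follows. For simplicity, pre-sample values are set to zero rather than back-forecast, which is a simplifying assumption relative to the exact least-squares procedure of Box and Jenkins:

```python
import numpy as np

def residuals(y, phi, theta):
    """Recover the white-noise sequence e_t from eq. (10):
    e_t = y_t - sum_i phi_i * y_{t-i} - sum_j theta_j * e_{t-j},
    with pre-sample y and e set to zero (a simplification)."""
    p, q = len(phi), len(theta)
    e = np.zeros(len(y))
    for t in range(len(y)):
        ar = sum(phi[i] * y[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma = sum(theta[j] * e[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        e[t] = y[t] - ar - ma
    return e

def sum_of_squares(y, phi, theta):
    """Sum-of-squares function S(Theta) = sum_t e_t^2 to be minimised."""
    e = residuals(y, phi, theta)
    return float(e @ e)
```

Minimising `sum_of_squares` over (φ, θ) with any numerical optimiser then approximates the least-squares estimates discussed above.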
The Fisher information matrix (7) can thus be expressed using only the second term of equation (8). Inside the expectation, the quadratic form is approximated by ∑_t ε_t²/σ², and applying the product rule, then applying it again to drop the terms involving second derivatives of ε_t, whose expectations vanish [vahid1999partial], the Fisher information matrix can be rearranged as:

(11) I(Θ) = (1/σ²) ∑_{t=1}^{N} E[ (∂ε_t/∂Θ)(∂ε_t/∂Θ)′ ]
In the meantime, the Fisher information matrix can be written in a general form in terms of φ(B) and θ(B). From equation (4), ∂ε_t/∂φ_i = −θ(B)^{-1} y_{t-i} and ∂ε_t/∂θ_j = −θ(B)^{-1} ε_{t-j} [klein1995computation]. Consequently the (i, j) element of each block of the Fisher information matrix is:

(12) [I(Θ)]_{ij} = (N/σ²) E[ (∂ε_t/∂Θ_i)(∂ε_t/∂Θ_j) ]
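A sample analogue of equation (11) can be computed for the ARMA(1, 1) case by differentiating the residual recursion with respect to φ and θ. The sketch below replaces the expectation with the observed sum and is illustrative only:

```python
import numpy as np

def fisher_info_arma11(y, phi, theta, sigma2):
    """Sample analogue of eq. (11) for ARMA(1,1): (1/sigma^2) * sum_t d_t d_t',
    where d_t = (de_t/dphi, de_t/dtheta) comes from differentiating
    e_t = y_t - phi*y_{t-1} - theta*e_{t-1}:
        de_t/dphi   = -y_{t-1} - theta * de_{t-1}/dphi
        de_t/dtheta = -e_{t-1} - theta * de_{t-1}/dtheta
    Pre-sample values are set to zero (a simplification)."""
    n = len(y)
    e = np.zeros(n); dphi = np.zeros(n); dtheta = np.zeros(n)
    info = np.zeros((2, 2))
    for t in range(n):
        ylag = y[t - 1] if t > 0 else 0.0
        elag = e[t - 1] if t > 0 else 0.0
        e[t] = y[t] - phi * ylag - theta * elag
        dphi[t] = -ylag - theta * (dphi[t - 1] if t > 0 else 0.0)
        dtheta[t] = -elag - theta * (dtheta[t - 1] if t > 0 else 0.0)
        d = np.array([dphi[t], dtheta[t]])
        info += np.outer(d, d)
    return info / sigma2
```

The resulting 2 × 2 matrix is symmetric and positive semi-definite by construction, as a Fisher information matrix must be.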
4 Minimum Message Length
Minimum Message Length (MML) is an inductive inference method for model selection, founded on coding theory; its underlying concept is data compression. Given a sequence of time series data and a set of candidate ARMA models, MML assumes the data are encoded into a two-part message transmitted from a sender to a receiver [wallace1968information, wong2019minimum]:

- the ARMA time series model θ ∈ Θ*; and

- the time series data, encoded using the ARMA model stated in the first message component;

where Θ is the parameter space of the ARMA model and Θ* is a countable subset of the parameter space containing all possible MML parameter estimates [wong2019minimum].
The first part of the message encodes the model structure and parameters and is transmitted from sender to receiver; it is called the assertion. The second part encodes the observed data using the model specified in the assertion; it is called the detail. MML minimises the total length of the message, containing both model and data, transmitted from sender to receiver [wallace1968information]:

(13) MsgLen(θ, y) = MsgLen(θ) + MsgLen(y | θ)

where MsgLen(·) denotes the length of the message transmitted from sender to receiver. The assertion represents the complexity of the model, and the detail represents how well the model fits the data. MML therefore minimises the trade-off between the complexity and the goodness of fit of the model, which is why MML has fewer chances of overfitting the data.
MML87 is an extended version of Minimum Message Length (MML) used for model selection over a set of d continuous parameters [wallace1999minimum]:

(14) MsgLen(θ, y) = −ln [ h(d) h(θ) f(y | θ) ε^N / √|I(θ)| ] + (d/2) ln κ_d + d/2

Rearranging equation (14):

(15) MsgLen(θ, y) = −ln h(d) − ln h(θ) + (1/2) ln |I(θ)| − ln f(y | θ) − N ln ε + (d/2) ln κ_d + d/2

where ε measures the accuracy of the data, h(d) is the prior on the number of parameters, h(θ) is the Bayesian prior distribution over the parameter set, f(y | θ) is the standard statistical likelihood function, I(θ) is the Fisher information matrix of the parameter set, and κ_d is the lattice constant, which accounts for the expected error in the log-likelihood function of the ARMA model in equation (5) due to quantisation of the d-dimensional parameter space; it is bounded above by 1/12 and below by 1/(2πe).
For example, κ_1 = 1/12 ≈ 0.0833, κ_2 = 5/(36√3) ≈ 0.0802, κ_3 = 19/(192 · 2^{1/3}) ≈ 0.0785, and κ_d → 1/(2πe) ≈ 0.0586 as d → ∞. We assume that the exact likelihood function is used and that the data come from a stationary process [fitzgibbon2004minimum, wallace1999minimum, ward2008review].
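A minimal sketch of the message length of equation (15), assuming the negative log-likelihood, log prior, and log determinant of the Fisher information have already been computed. The constant terms involving the measurement accuracy ε and the prior on d are omitted here, since they are common to all candidate models of the same dimension; this is our simplification, not the paper's implementation:

```python
import numpy as np

def mml87_message_length(neg_loglik, log_prior, log_det_fisher, d):
    """Wallace-Freeman (1987) message length, eq. (15), in nats.
    Uses kappa_1 = 1/12 for d = 1 and the asymptotic value 1/(2*pi*e)
    for d > 1 (the known lower bound on the lattice constant)."""
    kappa_d = 1.0 / 12.0 if d == 1 else 1.0 / (2 * np.pi * np.e)
    assertion = -log_prior + 0.5 * log_det_fisher + 0.5 * d * np.log(kappa_d)
    detail = neg_loglik + 0.5 * d
    return assertion + detail
```

The model minimising this quantity over the 30 ARMA candidates is the MML87 selection; note that the Fisher determinant penalises models whose parameters must be stated precisely, in addition to the usual likelihood term.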
The message length of a stationary model is calculated by substituting the Fisher information of equation (12) and the likelihood function of equation (5) into equation (15), along with the prior [fitzgibbon2004minimum, sak2005minimum, wallace1999minimum].
5 Bayesian Priors
MML87 uses a prior that is ignorant of (i.e., specified before observing) the data. The MML87 construction for the ARMA model in this report uses the exact likelihood function and expects the data to come from a stationary process, so a uniform prior is placed on the coefficients over the stationarity region, following earlier results [fitzgibbon2004minimum, sak2005minimum]:

h(φ) = 1 / V_p

where V_p is the hypervolume of the stationarity region, based on the results of [barndorff1973parametrization, piccolo1982size]. We can recursively calculate the hypervolume of the invertibility region; in the first-order case it is

V_1 = 2

the length of the interval (−1, 1).
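The stationarity-region hypervolume can also be checked numerically. For the AR(2) case the exact region is the triangle {φ_2 > −1, φ_1 + φ_2 < 1, φ_2 − φ_1 < 1} of area 4, which a Monte Carlo sketch recovers by sampling coefficients and testing the characteristic roots:

```python
import numpy as np

def ar2_stationarity_volume(n_samples=200_000, seed=0):
    """Monte Carlo estimate of the hypervolume of the AR(2) stationarity region.
    y_t = phi1*y_{t-1} + phi2*y_{t-2} + e_t is stationary iff both roots of
    z^2 - phi1*z - phi2 = 0 lie strictly inside the unit circle. Samples are
    drawn from the bounding box [-2, 2] x [-1, 1] of area 8."""
    rng = np.random.default_rng(seed)
    phi1 = rng.uniform(-2.0, 2.0, n_samples)
    phi2 = rng.uniform(-1.0, 1.0, n_samples)
    disc = np.sqrt(phi1.astype(complex) ** 2 + 4.0 * phi2)  # complex sqrt
    r1 = (phi1 + disc) / 2.0
    r2 = (phi1 - disc) / 2.0
    inside = (np.abs(r1) < 1.0) & (np.abs(r2) < 1.0)
    return 8.0 * inside.mean()  # box area times acceptance fraction
```

The same rejection test generalises to higher orders via the companion-matrix eigenvalues, although the analytic recursion is of course far cheaper.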
6 Empirical Comparison
6.1 Simulated data ARMA(1, 1)
We use simulated data to compare the performance of MML87 with AIC, AICc, BIC, and HQ. A uniform distribution is used to generate 5 different parameter values for φ and 2 different parameter values for θ. For each combination of φ and θ, we generate in-sample and out-of-sample datasets of different sizes; across all combinations of φ and θ we have 1,000 datasets in total. For each dataset, we fit ARMA models with every combination of lag orders from 1 to 5 for p and from 0 to 5 for q, giving 30 different candidate models per dataset. Each of MML87, AIC, AICc, BIC, and HQ then selects one best model out of the 30, and we calculate the loss function, which in this empirical comparison is the mean squared prediction error (MSPE):

(16) MSPE = (1/T) ∑_{t=N+1}^{N+T} (y_t − ŷ_t)²
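The loss of equation (16) is a small helper in code (a reference sketch, not the paper's script):

```python
import numpy as np

def mspe(actual, forecast):
    """Mean squared prediction error over the out-of-sample window, eq. (16)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean((actual - forecast) ** 2))
```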
Table 1 uses in-sample size N = 100, and the forecast method used in this case is a rolling forecast with a window size of T = 10. For each dataset, we take the prediction errors of the models that minimise AIC, AICc, BIC, HQ, and MML87, since the minimiser of each information criterion is the model it selects out of the 30 candidates. Next, we compare the prediction errors of the models selected by AIC, AICc, BIC, HQ, and MML87: the criterion whose selected model has the smallest prediction error is the more efficient one. For each simulated parameter combination of φ and θ there are 100 datasets, as mentioned above, and we count the number of times each of AIC, AICc, BIC, HQ, and MML87 selects the model with the minimum prediction error.
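The win-counting comparison described above can be sketched as follows; `count_wins` is a hypothetical helper name, and ties are credited to every tied criterion, which is why the counts in Table 1 can sum past the number of datasets:

```python
import numpy as np

def count_wins(errors_by_criterion):
    """Count, over datasets, how often each criterion's selected model attains
    the minimum prediction error. `errors_by_criterion` maps a criterion name
    to the list of prediction errors of the model it selected, one per dataset.
    Ties are credited to every tied criterion."""
    names = list(errors_by_criterion)
    errs = np.array([errors_by_criterion[n] for n in names])  # criteria x datasets
    best = errs.min(axis=0)                                   # per-dataset minimum
    return {n: int(np.sum(np.isclose(errs[i], best))) for i, n in enumerate(names)}
```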
MML87  AIC  AICc  BIC  HQ  

45  25  34  41  40  
69  62  60  62  68  
42  44  40  40  41  
75  64  68  72  72  
46  36  40  44  38  
69  51  57  69  69  
42  32  37  42  38  
64  51  50  66  57  
46  41  39  56  47  
63  69  66  72  68 
For example, the first row of Table 1 indicates that, for that particular parameter combination of φ and θ, 100 datasets were generated, each with N = 100 plus T = 10 data points. MML87 selected the model with the minimum prediction error in 45 of the 100 datasets. The counts in a row can sum to more than 100 because two or more information criteria may select the same minimum-prediction-error model. According to Table 1, MML87 selects the minimum-prediction-error model most often across the different parameter combinations, which demonstrates that MML87 outperforms AIC, AICc, BIC, and HQ in model selection for the ARMA time series model.
Tables 8 to 12 show the corresponding counts of how often each information-theoretic criterion selects the minimum-prediction-error model for N = 50, 150, 200, 300, and 500. The results suggest that MML87 selects a lower-error model more often than the other information criteria.
Table 2 shows the average MSPE (with standard deviation in parentheses) over 100 datasets for the different simulated parameters, with in-sample sizes N = 50 and 100 and out-of-sample size T = 10. Each row reports the average MSPE of the model selected by each information criterion. A lower forecast error indicates a better model and is highlighted in bold; MML87 outperforms the other information criteria in the majority of cases.
MML87  AIC  AICc  BIC  HQ  

1.424699 (0.90860)  1.521890 (0.99921)  1.499588 (1.02887)  1.428254 (0.90914)  1.462676 (0.99725)
4.397397 (3.48977)  4.541932 (3.64444)  4.515208 (3.60842)  4.417018 (3.49456)  4.335861 (3.43196)
1.192377 (0.63315)  1.237391 (0.66865)  1.220480 (0.66442)  1.199164 (0.66119)  1.197432 (0.66071)
1.400034 (0.63992)  1.427151 (0.63984)  1.413369 (0.64121)  1.389479 (0.64031)  1.398019 (0.62572)
1.238596 (0.63055)  1.283056 (0.64058)  1.268670 (0.63659)  1.404347 (0.61788)  1.252059 (0.62601)
1.430728 (0.68179)  1.534319 (0.73082)  1.504423 (0.72636)  1.426555 (0.68741)  1.437177 (0.68837)
1.286106 (0.69430)  1.338531 (0.70770)  1.311318 (0.68228)  1.276376 (0.67433)  1.287715 (0.68372)
1.467448 (0.75440)  1.547590 (0.84311)  1.537011 (0.83362)  1.472064 (0.75156)  1.488488 (0.78266)
2.261384 (1.87727)  2.417009 (1.97325)  2.380226 (1.95017)  2.301430 (1.93637)  2.338255 (1.97315)
1.041781 (0.47548)  1.072443 (0.48750)  1.067345 (0.48744)  1.044731 (0.47907)  1.047712 (0.47553)
Table 3 examines the average prediction error over the out-of-sample window T = 10 for different in-sample sizes N. It uses in-sample sizes of N = 50, 100, 150, 200, 300, and 500 to compare the average MSPE of the models selected by the different information-theoretic criteria. MML87 outperforms AIC, AICc, BIC, and HQ for N = 50, 100, 150, and 300, because AIC and BIC tend to overfit the data at smaller in-sample sizes.
MML87  AIC  AICc  BIC  HQ  
N = 50  1.796509  1.886955  1.834129  1.801342  1.826132 
N = 100  1.714055  1.792131  1.771764  1.716704  1.724539 
N = 150  1.748152  1.810529  1.809941  1.749774  1.776628 
N = 200  1.771909  1.833165  1.821335  1.764313  1.789190 
N = 300  1.703449  1.741882  1.737454  1.711380  1.712863 
N = 500  1.719375  1.736897  1.735968  1.717373  1.718875 
Table 4 compares the average MSPE for out-of-sample sizes T = 10, 30, 50, and 100. MML87 outperforms the other information-theoretic criteria for T = 10, 30, and 50.
Criterion  MML87  AIC  AICc  BIC  HQ
T = 10  1.714055  1.792131  1.771764  1.716704  1.724539 
T = 30  1.793344  1.823455  1.810748  1.804022  1.801981 
T = 50  1.901877  1.925961  1.922792  1.914095  1.909096 
T = 100  1.898179  1.925083  1.917025  1.897318  1.903784 
6.2 Simulated data ARMA(p, q)
This subsection also uses a uniform distribution to simulate datasets with different numbers and values of the AR and MA parameters. It compares the MSPE of the five information-theoretic criteria for forecast windows T = 1, 3, 5, 10, 30, 50, 70, 100, 130, and 150 with N = 100. For each forecast window we generate 100 datasets, and the results report the average MSPE. Table 5 shows the results for data simulated from ARMA(2, 1) to ARMA(5, 2).
MML87  AIC  AICc  BIC  HQ  
ARMA(2, 1)  1.919648  1.983033  1.958249  1.922295  1.949766 
ARMA(2, 2)  2.051329  2.118476  2.104302  2.053353  2.081951 
ARMA(3, 1)  1.487463  1.527326  1.521105  1.490115  1.500657 
ARMA(3, 2)  1.712706  1.738684  1.721419  1.699448  1.712409 
ARMA(4, 1)  1.961224  2.034625  2.018761  1.978715  1.994742 
ARMA(4, 2)  1.520506  1.567478  1.556770  1.521140  1.535147 
ARMA(5, 1)  2.015772  2.084429  2.055148  2.022353  2.034267 
ARMA(5, 2)  1.598522  1.647486  1.637956  1.598533  1.620931 
Figure 1 shows the average MSPE comparison between MML87, AIC, AICc, BIC, and HQ for simulated data generated from ARMA(2, 1) parameters, and Figure 2 shows the comparison for simulated data generated from ARMA(3, 3) parameters. MML87 outperforms the other information criteria: its mean prediction error is the smallest across the different forecast window sizes. Figures 2 to 8 in Section 8.2 show the corresponding diagrams for data generated from ARMA(2, 2) to ARMA(5, 2) parameters.
6.3 Actual data
In this section, we use real financial data with time series characteristics to compare the performance of AIC, AICc, BIC, HQ, and MML87. To account for both systematic and unsystematic risk, the data are collected from stock indices of market portfolios across different countries, including the ASX200, Dow Jones Composite Average, FTSE100, Russell, Nasdaq 100, Nasdaq Composite, and NYSE Composite, as well as two individual stocks, AAPL and GS. The time horizon of the selected financial data is 2020-09-08 to 2021-09-07. Table 6 shows the number of times each information criterion selected the lower-error model. For each asset, the data are separated into 8 segments with N = 30 and T = 10, and we use every combination of lag orders 1 to 5 in AR and 0 to 5 in MA. The results suggest that MML87 selects a larger number of lower-error models than the other information criteria for 8 assets out of 10; the best information criterion is shown in bold.
Criterion  MML87  AIC  AICc  BIC  HQ

Dow Jones Composite Average  4  3  4  3  4 
SP500  5  2  5  5  4 
ASX200  7  3  5  5  4 
Russell  2  5  5  5  5 
Dow Jones Industrial Average  4  4  6  6  3 
FTSE100  5  5  4  4  5 
Nasdaq100  7  3  2  3  3 
Nasdaq Composite  5  3  2  3  2 
AAPL  5  5  4  5  3 
GS  6  4  4  4  4 
Because the selected portfolios have different data scales, we report log prediction errors to avoid misclassifying the best model selection criterion. Table 7 provides the average log prediction error (with standard deviation in parentheses) for the ten asset portfolios; the minimum error for each portfolio is in bold. The results show that the models selected by MML87 beat those selected by AIC, AICc, BIC, and HQ, having a lower average log prediction error.
MML87  AIC  AICc  BIC  HQ  

DJ Com  9.339222 (1.48159)  9.431943 (1.48159)  9.266359 (1.50159)  9.350298 (1.43139)  9.266359 (1.39585)
SP500  8.170703 (0.95529)  8.467639 (0.95529)  8.175019 (1.19616)  8.175019 (0.94133)  8.209827 (0.89824)
ASX200  8.668513 (0.94712)  9.073504 (0.94712)  8.807688 (1.04215)  8.807688 (0.91327)  8.807688 (0.91327)
Russell  7.556093 (0.97889)  7.464931 (0.97889)  7.362797 (1.27475)  7.362797 (1.13518)  7.362797 (1.13518)
Dow Jones  12.124174 (1.28175)  12.279021 (1.28175)  11.930101 (1.49217)  11.930101 (1.39267)  12.066458 (1.45273)
FTSE100  9.610056 (0.92881)  9.732419 (0.92881)  9.740016 (0.87461)  9.740016 (0.88376)  9.719694 (0.90703)
Nasdaq 100  11.247209 (0.88016)  11.297879 (0.88016)  11.30298 (0.73106)  11.253671 (0.76663)  11.300069 (0.73563)
Nasdaq Com  11.172626 (0.97813)  11.494775 (0.97813)  11.261841 (0.90478)  11.201055 (0.9338)  11.37422 (0.91929)
AAPL  3.082 (1.21437)  2.967387 (1.21437)  3.053346 (1.22936)  2.871599 (1.08716)  3.132657 (1.10378)
GS  4.714533 (0.82053)  4.931766 (0.82053)  4.776378 (0.68177)  4.776378 (0.89269)  4.786212 (0.88672)
7 Conclusion
We have investigated the autoregressive moving-average model under the MML87 information criterion based on the Wallace and Freeman (1987) approximation, using maximum likelihood estimation in the ARMA modelling and the unconditional likelihood function in the calculation of the Fisher information matrix. The results show that MML87 outperforms, but does not dominate, the other information criteria on both simulated data and actual financial data when evaluated by MSPE. On average, MML87 is a very good model selection technique for time series data.
8 Appendix
8.1 Tables
MML87  AIC  AICc  BIC  HQ  

48  35  42  48  42  
61  65  66  65  67  
33  39  41  54  48  
66  48  57  66  58  
40  40  47  53  52  
66  58  63  63  64  
30  47  50  57  54  
69  61  59  69  60  
53  41  39  36  40  
49  47  54  64  57 
MML87  AIC  AICc  BIC  HQ  

48  40  43  41  40  
75  64  65  67  67  
43  37  34  43  39  
66  62  61  71  68  
48  37  41  31  40  
74  60  63  73  73  
41  46  50  43  43  
71  60  60  77  73  
45  46  44  49  47  
67  57  58  66  61 
MML87  AIC  AICc  BIC  HQ  

47  33  37  36  36  
65  57  54  64  57  
55  51  49  44  40  
68  57  59  69  59  
42  47  50  47  50  
68  59  59  71  62  
52  53  49  49  49  
71  62  62  68  63  
51  51  49  53  55  
39  55  52  43  50 
MML87  AIC  AICc  BIC  HQ  

30  44  45  45  43  
75  49  52  67  65  
56  50  49  51  53  
64  69  70  67  68  
32  49  51  29  34  
65  61  60  62  59  
55  46  48  55  59  
72  61  63  72  69  
61  48  49  44  44  
50  44  44  59  55 
MML87  AIC  AICc  BIC  HQ  

47  42  41  49  38  
74  71  73  73  74  
42  48  50  44  46  
73  62  60  73  73  
45  44  41  39  40  
67  60  60  65  69  
38  47  47  50  45  
70  51  51  70  70  
54  57  55  50  51  
65  58  60  65  60 
8.2 Figures