Integer-valued autoregressive () models appeared for the first time in McKenzie (1985) and Al-Osh and Alzaid (1987). Over the time, they showed to be a very useful tool for describing the integer-valued data. One may, for example, apply models to describe the monthly number of rainy days, crime cases, newborn individuals of one species. It becomes clear that such data may be found in any area. For that reason, numerous models have been proposed and studied in the literature. models are based on so called thinning operator
, which to a given integer-valued random variableassigns the sum of independent identically distributed random variables. The distribution of the auxiliary random variables determines the type of the thinning operator. Some of the models with different thinning operators may be found in Aly and Bouzar (1994), Latour (1998), Zheng et al. (2006, 2007) and Ristić et al. (2009). Another variety of models arise from considering different marginal distributions, see for example McKenzie (1986), Al-Osh and Aly (1992), Alzaid and Al-Osh (1993) and Bakouch and Ristić (2010).
We focus on the recent random environment models that appeared for the first time in Nastić et al. (2016) and are flexible towards the environment conditions changes. The behavior of these models is ruled by a Markov chain , called random environment process. The elements of the random environment process are also called (random) environment states. To apply a random environment model, one must first estimate the environment states. In Nastić et al. (2016) this was done using clustering methods, in particular K-means introduced by Hartigan and Wong (1979). However, using K-means for this purpose induces a certain loss of information, as we discuss below, and may lead to poor performance of the model. In order to estimate environment states as accurate as possible, we propose a new random environment estimation (abbrev. RENES) method. To avoid the confusion, it is important to emphasize that we do not introduce a new method for estimating the parameters of random environment models, but only a new method for estimating . However, the estimators of random environment models parameters
are defined under the assumption that is known in advance, meaning that a different approach for estimating will for sure imply the difference in the parameter estimates.
We need to provide more details on the random environment models. We begin with the first order random environment model with geometric marginals () introduced in Nastić et al. (2016). The marginal distribution of the time series in moment is determined by the realization of the random environment process recorded in the same moment — for this reason we write . Moreover, the distribution of is geometric with expectation . The recursive relation that defines model is given by
where, denotes the negative binomial thinning operator, that to each integer-valued random variable assigns the sum of
independent random variables having geometric distribution with mean value. In order to measure the goodness of fit of such defined model, corresponding environment states for all observations must be estimated. This is where the K-means clustering method took place. The predefined number of clusters was chosen to be the number of environment states registered in the observed phenomenon. Each cluster was assigned to one state and each sample element was assigned to a state depending on the cluster it felt into. In particular, if two different process elements and belong to the same cluster, then it was assumed that . In that way, the sequence of random environment states was fully determined. However, this approach shows a serious shortcomings, which are consequence of the fact that only data point value was taken into account in K-means clustering method. Once the K-means is performed, graphed representation of the database will be divided by horizontal lines into strips, as shown in the right-hand panel of Figure 1. Each strip corresponds to one cluster. This entails that all high values in the database must be located in the same cluster. Similar to this, all low values must be located in the same cluster. This is, however, not the case with the data simulated from process. The left-hand panel of Figure 1 shows the data simulated from model. As we can see, it is possible that high data values appear also in the environment conditions different than those assumed for the high data values, so the data is no longer divided by a horizontal strip. K-means totally rules out this possibility.
As an example, consider the number of the new COVID-19 cases per day. As it is known, the weather conditions significantly affected the spread rate, so we use model with different environment states: summer and winter. However, there are some other circumstances undetected or not measurable, that can affect the number of new cases per day, for example public demonstrations, unallowed gatherings of people during vacations or emergence of the new virus strain. In these situations, it would be ideal for clustering method to recognize specific circumstances and keep high values (detected in summertime) in ’summer’ cluster. However, standard K-means is incapable to do so. Observing only numerical value of the process realization, K-means might recognize high summertime values as winter occasions, and locate those realizations in the wrong cluster. The same holds for all K-means adaptations familiar so far. Obviously, an improved random environment estimation method is needed in order to solve such problem.
In years that followed, few more sophisticated models appeared. Nastić et al. (2017) defined random environment models of higher order. Beside the marginal distribution parameter, authors assumed here that the order of the model is also determined by the environment state in particular moment. Another step ahead was made by Laketa et al. (2018). Beside all the assumptions mentioned above, authors additionally assumed that the thinning parameter value in moment depends on the environment state in the same moment. According to Laketa et al. (2018), is called a generalized random environment model of higher order with geometric marginals and negative binomial thinning operator () if its element at moment is determined by the recursive relation
Sets , , contain model parameter values — is the mean of the marginal geometric distribution of , is the thinning parameter value and represents the maximal value that order may take for a fixed state . There are two different models, depending on the way sequence is defined. One of them, is constructed so that for each it holds , while for the other one, we have if and otherwise. Here represents the number of predecessors of that are mutually equal.
Beside the shortcomings mentioned above, we detected another difficulty when applying K-means to a generalized random environment process of higher order, as it was done in Laketa et al. (2018). To explain the difficulty, consider the simplest case with environment states and suppose similarity between mean values within states, . In other words, the observations inside states are not that much different and are accumulated around parallel horizontal lines that are close to each other. In this situation, it is reasonable to expect the existence of a strip in which points from both environment states will be mixed. The border between states won’t be a straight line, but a wavy and jagged line. Taking into account the fact that K-means method separates clusters by straight horizontal lines, it becomes obvious that some improvements are necessary.
In this paper we introduce a new RENES method for estimation of , that will eliminate disadvantages mentioned above. The idea of RENES method is to transform the data sample that corresponds to the generalized random environment model of higher order before applying clustering. As previously mentioned, all the parameter values , and carry the information about . To prevent the information loss, the main goal is to form a three-dimensional sequence, based on real-life data realizations, that mimics the behavior of . Finally, the K-means algorithm will be applied on such obtained three-dimensional data sequence. It is possible to apply RENES method to any other generalized random environment model of higher order (to those that have different marginal distribution and thinning operator), but we focus on models, as it was the case in Laketa et al. (2018). In order to confirm the efficiency of RENES, corresponding simulated data sequences are created. Observing the simulations, we can examine whether changes in the number of states and parameter values affect the efficiency of RENES method.
The structure of this paper is as follows. In Section , a construction of the new random environment estimation (RENES) method, which overcomes problems mentioned above, is presented. Section provides description of simulations and their properties. Cases with and different environment states are described. An extensive simulation study for newly proposed RENES method is given in Section . Results of applying the RENES method to the real-life data are given in Section .
2 Construction of the new Renes method
Consider a given sample of size from model. In order to construct the method, the main idea is to determine certain kind of pre-estimators , and of parameter sequences , and based only on the realized sample, without knowing the random environment sequence . Such obtained three-dimensional sequence is supposed to mimic the behavior of the model parameters over time. Then, clustering the three-dimensional data would produce better estimation of than clustering of the starting sequence , since the information loss is prevented.
As mentioned before, our goal is not to define new estimators of model parameters, but to improve the estimation of . Given sequence of so-called ’pre-estimators’ is just a a helpful tool to estimate , and it doesn’t represent any kind of alternative estimates of model parameters. Model parameters have already been successfully estimated in Laketa et al. (2018), and we rely on those results in evaluating our RENES method. For an illustration, see Figure 2.
Although it is possible to implement clustering of three-dimensional data , method can be improved even more, by considering trimmed (truncated) means. To that purpose, for a given sequence
and vectorlet us define a function
Elements of are decreasing , so that represents a trimmed mean affected the most by the current value . The effect of the neighboring elements of decreases when moving away from . If for some , it will make sense to use instead of when estimating the environment state , because in this case all the elements carry information about . As mentioned in Laketa et al. (2018),
model shows bad performances in case when environment states are changing rapidly. Its application makes sense only if the probability of remaining in the same state is big enough. Hence, forsmall enough, one may assume that neighboring elements of are equal with high probability in all the situations of practical importance. Thus, it sounds reasonable to replace in clustering procedure with another three-dimensional data sequence , for some vectors , and . Theoretically speaking, lengths of these vectors do not have to be equal. The upper limit of the vector’s length might be discussed as well. Higher values of give better pre-estimates, provided all of observations correspond to the same state. Otherwise, pre-estimates might be even worsened. To reconcile these two opposing facts, we are not going to discuss vectors , , of length higher than .
If we want all the coordinates of to have equal impact on the clustering, it is necessary to scale them. Thus, we define a function that to a given element of a sequence assigns properly scaled (normed) value of given with
By introducing three more parameters , it becomes possible to control the level of impact each coordinate has on the clustering procedure. Finally, by using standard K-means we cluster the three-dimensional data vector
The only left is to define starting pre-estimators , in a reasonable way, by looking at the construction of models. Bearing in mind the fact that parameters represent means within clusters, it would be reasonable to set for any that
Taking into account the fact that the partial auto-correlation function is used to determine the order of the time series, estimation of the sequence will take place as follows. If is the maximal order allowed for particular element in the state , than we have
where is the partial auto-correlation function at lag and . The function (6) works well if all elements of the sequence involved in correspond to the same state . However, this requirement is not demanding, since the application of model is reasonable only in the case when the probability of remaining in the same state is higher than the probability of changing the state.
To predict the thinning parameter value in moment , we use the known property of the negative binomial thinning operator, that . This motivates us, by looking at (2), to define
for any , where and for . Here represents the positive part of . Since such obtained thinning parameter value might be greater than , we finally have
To apply new RENES method, one should choose the values of parameters , , , , , and (called in sequel RENES method parameters). We will try to make an optimal choice based on and simulations. It is important here to distinguish RENES method parameters from the model parameters. In Section 3 we give details about our choice of model parameters, while in Section 4 we give results of the simulation study with such choice of model parameters and discuss how to choose RENES method parameters.
3 Simulation study—the choice of model parameters
We simulated time series of length . In the sequel we consider all sequences of the same length, so instead of we write shortly . Properties of simulations are such that they make difficult to apply standard K-means.
The case with different environment states is presented within the section. On the other hand, the case with environment states can be found in Appendix A. For each of the cases, two different combinations of model parameters are observed. Further, each combination of parameters will generate two different replications of the corresponding time series. One of them will be used to obtain the values of . With the help of such obtained RENES method parameters, the other replication will be reconstructed in order to evaluate efficiency of the new RENES method. Furthermore, both versions of the model, and will be analyzed simultaneously. For more information about these models, see Laketa et al. (2018).
The random environment process, being a Markov chain, has parameters — a vector containing initial probabilities of being in certain state and — transition probability matrix that in the intersection of -th row and -th column contains the probability for any . Another remark about the notation is that we write , and as vectors, even though they are introduced as sets. We do so to eliminate the ambiguity, preserving the order of the states.
In order to create simulations, the following combinations of parameters are given.
First of all, we are going to create time series with similar means within states, while other model parameters will differ significantly. Surrounding like this would make K-means useless. Hence, we choose . On the contrary to that, thinning parameters, as well as maximal orders within states, should differ significantly. Hence, we choose and . Regarding the choice of one of them is chosen to be very small, while the other is chosen to be close to its upper limit. Furthermore, probabilities corresponding to the simulation are chosen to be
Probabilities corresponding to the simulation are located in last rows of these matrices. An initial state is nearly fair, due to the value of its distribution . In order to have long arrays of elements corresponding to the same state within the simulated time series, transition probabilities outside the main diagonal are significantly smaller than those located on the main diagonal. Thus, transition probability matrix is of the form
The other combination of parameters is characterized by a great similarity between thinning parameters. Beside that, the mean values will be similar enough to to make it difficult to use the standard K-means method. Orders of the model will be the only values on the basis of which it is possible to determine the environment states of realizations. That will be a good test for our new approach. To that purpose, we have that , and . Further, in the case of ,
Last rows of these matrices contain probabilities corresponding to the simulation. An initial state is fair, since . Finally, the transition probability matrix provides long arrays of elements corresponding to the same state, that is,
4 Simulation study — results and the choice of Renes method parameters
In this section we seek for optimal RENES method parameters, based on simulations with environment states. The choice of corresponding model parameters is given in the previous section. In order to improve the readability of the manuscript, only one parameters combination will be discussed in detail, with fully exposed procedure of obtaining , , , , and . As for the second combination, the procedure will be omitted and only final results will be provided. Corresponding discussion regarding optimal RENES method parameters in the case of simulations with environment states is provided in Appendix B.
Using the first combination of the model parameters, corresponding and simulations were created, two replications of each. The first replication of each pair was used to provide the parameters of RENES method. The procedure starts with the determination of using (5). In order to improve , vector has to be provided. For small enough, we have already assumed that all correspond to the same state. Thus, all can have similar contribution to . In other words, we can choose coordinates of the vector to be as equal as possible. Due to the fact that it multiplies the middle realization , the value of may eventually be a bit higher. Figure 3 shows sequences of pre-estimates , obtained for various selections of . There we show only the first elements, to increase readability of the plot.
As Figure 3 shows, the usage of results in much more accurate pre-estimates of the sequence in both cases. Using this technique, we managed to trim peaks that deviate significantly form the real mean values. Obviously, the best result is obtained for in the case of simulation. Speaking of simulation, similar results are obtained in the case of and . The second option is selected.
We determine in two steps. The first step provides a determination of the parameter , given in (6). In order to obtain the optimal value of , let us denote by the root mean square of differences between correct orders and corresponding estimated order values obtained by (6). The error is calculated for various choices of and the results are presented in Table 1. The smallest value of will reveal the optimal value of parameter . Having a brief look at Table 1, we conclude that, in the case of simulation, the smallest value of is obtained for . Similarly, the smallest value in the case of simulation is obtained for .
The second step involves determination of the corresponding vector for fixed optimal value of . Similarly as for , we assume that all correspond to the same state. Thus, are all assumed to have similar contribution to . Hence, coordinates of are chosen to be as similar as possible. Sequences obtained for various selections of are shown in Figure 4 and compared to the exact order sequence .
In order to interpret Figure 4, one fact needs to be clarified. Namely, the goal is to choose order pre-estimate which provides the highest probability of placing corresponding observations in correct clusters. In other words, the best pre-estimate of is not necessarily the one that most often matches the exact order value, but the one that is close enough in most of the cases. In this respect, the best result in both cases ( and ) is obtained for , ie. for . Although this pre-estimate struggle to reach maximal orders, in most of the cases it stays close enough to the correct order values and do not make large mistakes.
Finally, having calculated and , we are able to calculate using (7). Following the same reasons as before, coordinates of are assumed to be as similar as possible. Regarding the length of , several options are tested. Sequences obtained for various selections of are given in Figure 5 and compared to the real sequence .
According to the figure, vectors and provide more accurate pre-estimates then or . Namely, sequences obtained for and do not show sudden and sharp ups and downs so often. The most of their values stay in a strip between and , which is an expected behavior for a fine sequence of pre-estimates. It is hard to choose the better one, but it seems that the plot line obtained for stays a bit closer to the real parameter values. This conclusion holds for both, and models, so the same is chosen in both cases.
To summarize, optimal values of , and , involved in RENES method are provided in Table 2. Note that in both cases all three vectors are chosen to be of the same length , which is not surprising. Recall that depends on the probabilities of staying in the same state. If we chose smaller diagonal values of , the optimal length would certainly be smaller.
Now, one can provide -dimensional sequences , and after that , , . It is left to determine parameters and given in (4). To that purpose, a modification of the procedure used to determine is applied. More precise, for each , , , the clustering of the three dimensional data sequence
is performed. In that way, a thousand different estimates of the environment state sequence are provided. To select the best one, estimates thus obtained are compared with the sequence of exact states. The highest number of exactly estimated states will reveal the best combination of parameters , , . In the case of simulation, the best result in random environment estimation is obtained for , , , having estimated states equal to corresponding exact states. On the other hand, result obtained by standard K-means managed to have exactly estimated states. A comparative overview of exact states, states obtained by standard K-means and states obtained by usage of RENES method is provided by Figure 6. And yet again, only first states are given in each figure.
Beside the higher number of the exactly estimated states, two more improvements are achieved by usage of RENES method. According to the figures, the RENES method produces much longer data series that correspond to the same state. Bearing in mind the fact that random environment models show poor performances when environment states are changing rapidly, mentioned improvement seems very convenient. Further, RENES method doesn’t make a crisp data division using a horizontal line, as K-means does. The possibility of obtaining a high data value in the environment conditions different from those assumed for the high data values is not ruled out this time. In other words, RENES method allows data elements with high values to belong to the cluster with predominantly low values, and vice versa. This property makes RENES method more suitable for clustering the data where, beside one detected predominant environment condition, some hidden circumstances also have an impact on the time series realizations.
Similar holds in the case of simulated time series. For each , , , the clustering of three dimensional data sequence
is performed. The best result is obtained for , , , having estimated states equal to the corresponding exact states. On contrary to that, the standard K-means method managed to have exactly estimated states. And yet again, comparative overview of the exact states, states obtained by standard K-means method and states obtained by new RENES method is provided by Figure 7. The plots undoubtedly show dominance of RENES method comparing to the standard K-means. Achievements mentioned in the case of simulation, also hold here.
Unused replications are suitable here to check the efficiency of our RENES method. First of all, these replications were observed as real-life data sequences. Further, environment state estimation via K-means method and via RENES method took place. The same holds for both, and simulations. Having results of both random environment estimation methods given above, unknown model parameters were estimated for each clustering result by usage of conditional maximum likelihood () procedure.
A data sequences reconstruction by corresponding or model may happen now for each clustering result. of differences between simulated data and their reconstructions will represent the measure of the fitting quality. Results of the modeling obtained after applying standard K-means and RENES method are provided in Table 3. Dominance of RENES method is noticeable. values obtained after applying standard K-means method are unexpectedly high ( in the case of model and in the case of model). This confirms the hypothesis given in the introduction that K-means is not a useful tool for clustering the data corresponding to the process with similar means within states. On the other hand, values obtained after applying RENES method are much more acceptable ( in the case of model and in the case of model).