1 Introduction
Relational data contain information about the relationships between objects and may be obtained from, for example, point-of-sale (POS) systems and social networking service (SNS) communities. These types of data are often represented as a matrix in which each element indicates the relationship between the row and column elements (e.g., the presence or absence of a relationship or its strength). Biclustering is one way to cluster elements along the row and column dimensions of a matrix simultaneously [14, 6]. By applying biclustering to relational data, we can ascertain the latent structure of the relationships. Indeed, many probabilistic models using biclustering of relational data have been proposed [9, 15, 8].
Another situation commonly found in practice is when relationships change over time. For example, in the case of simultaneous purchases of products, the relationships between products may change according to events such as the season. In an SNS community, likewise, human relationships change through events such as career moves. In this way, the relationship between objects often changes through an event. In this paper, we call relational data obtained at multiple time points "time-varying relational data," as in [7].
A naive approach to biclustering time-varying relational data is to apply biclustering to the data at each time step separately. However, in practice, this leads to unstable clustering results [22], and it becomes difficult to interpret clusters across time points since the cluster structures are inconsistent over time. By contrast, clustering using a time-varying structure [20, 5] can help detect changes in relationships due to specific events.
Several biclustering methods exist for time-varying relational data [22, 20, 5]. In particular, the dynamic infinite relational model (dIRM) [7], which extends HMM-based methods [18, 3] to relational data, is especially useful since it can automatically determine the number of clusters from the data. The dIRM can handle relational data in which each element is either a zero or a one (essentially, whether a relationship exists). However, in practice, we often obtain relational count data. For example, if each element of a data matrix represents how many times two products (corresponding to the row and column of the matrix) are purchased simultaneously, these data can be considered relational count data between the products. Since count data contain more information than binary data, it is preferable to use a model that can handle relational count data when we have count information instead of only zeros and ones.
Based on the foregoing, in this paper we propose two new methods for the biclustering of count data under time changes. First, we extend the Poisson infinite relational model (PIRM) [21], a biclustering method for relational count data that does not consider relationships that change over time, to a model that can handle time changes. We call this first model the dynamic Poisson infinite relational model (dPIRM). Second, we propose an extension of the dPIRM that can handle data containing many zeros, which we call the dynamic zero-inflated Poisson infinite relational model (dZIPIRM). The latter extension is especially important for count data, as zero-inflated count data are often obtained in practice, especially when the interval between observation times is short [1, 11, 4].
The remainder of the paper is organized as follows. In Section 2, we begin by considering a biclustering model for time-series count data without zero-inflated characteristics. Next, we propose a biclustering model for zero-inflated time-varying count data. In Section 3, we derive the posterior distributions of the relevant parameters and clarify their properties. In Section 4, we illustrate how our proposed models handle numerically simulated data. In Section 5, we show how the two proposed biclustering methods may be used with real data. Finally, Section 6 concludes.
2 Biclustering for time-varying count data
In this section, we propose two biclustering methods. We first introduce one for time-varying count data and then introduce the zero-inflated Poisson (ZIP) model [10] to extend the first method to handle zero-inflated count data.
2.1 The dPIRM
We denote the data by $X = \{x_{t,i,j}\}$ ($t = 1, \dots, T$; $i = 1, \dots, N_1$; $j = 1, \dots, N_2$), where $N_1$ represents the number of row objects, $N_2$ represents the number of column objects, and $T$ represents the time horizon. We can assume that $N_1 \ge N_2$ without loss of generality. In addition, since relational count data are not necessarily limited to an undirected graph, we allow asymmetric relations (i.e., $x_{t,i,j}$ need not equal $x_{t,j,i}$).
We propose a biclustering model of timevarying count data under the following three assumptions:
(A1) Cluster structures remain consistent over time.

(A2) Objects can move between clusters at each time step.

(A3) The strength of the relationship between clusters can be expressed from zero to infinity.
We describe the dPIRM as follows:

(1) $\beta \sim \mathrm{Stick}(\gamma)$

(2) $\pi_k \sim \mathrm{DP}\left(\alpha + \kappa,\ \frac{\alpha \beta + \kappa \delta_k}{\alpha + \kappa}\right)$

(3) $z_{t,i} \sim \mathrm{Multinomial}(\pi_{z_{t-1,i}})$

(4) $\theta_{k,l} \sim \mathrm{Gamma}(a_0, b_0)$

(5) $x_{t,i,j} \sim \mathrm{Poisson}(\theta_{z_{t,i}, z_{t,j}})$

Here, $t = 1, \dots, T$; $i = 1, \dots, N_1$; $j = 1, \dots, N_2$; and $k, l = 1, 2, \dots$.
In Equation (1), the term "Stick" denotes a division formula known as the stick-breaking process [17]. This process can be described as $\beta_k = \beta'_k \prod_{l=1}^{k-1} (1 - \beta'_l)$, where $\beta'_k \sim \mathrm{Beta}(1, \gamma)$ and $\gamma$ is a shape parameter. $\beta = (\beta_1, \beta_2, \dots)$ represents the time-averaged membership of the clusters (A1), and the $\beta_k$ are guaranteed to satisfy $\sum_{k=1}^{\infty} \beta_k = 1$.
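For illustration, the stick-breaking draw can be sketched with a finite truncation (a minimal sketch; the function name and truncation level are our choices, not the paper's):

```python
import numpy as np

def stick_breaking(gamma, num_sticks, seed=0):
    """Truncated stick-breaking draw: beta'_k ~ Beta(1, gamma) and
    beta_k = beta'_k * prod_{l<k} (1 - beta'_l)."""
    rng = np.random.default_rng(seed)
    raw = rng.beta(1.0, gamma, size=num_sticks)                 # beta'_k
    stick_left = np.concatenate([[1.0], np.cumprod(1.0 - raw)[:-1]])
    return raw * stick_left                                     # beta_k

beta = stick_breaking(gamma=2.0, num_sticks=50)
# the truncated weights are non-negative and sum to at most one
```

Because the process is truncated, the weights sum to slightly less than one; the remainder is the mass left on the unbroken stick.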
DP in Equation (2) denotes the Dirichlet process; $\alpha$ plays the role of a concentration parameter in the Dirichlet distribution. Thanks to the DP, the number of clusters is automatically determined. $\pi_{k,l}$ in $\pi_k = (\pi_{k,1}, \pi_{k,2}, \dots)$ represents the transition probability that an object belonging to cluster $k$ at time $t-1$ moves to cluster $l$ at time $t$. $\pi_k$ is guaranteed to satisfy $\pi_{k,l} \ge 0$ and $\sum_l \pi_{k,l} = 1$. $\delta_k$ is an indicator function taking the value of one when it belongs to cluster $k$ and zero otherwise. $\kappa$ adjusts how likely it is for objects to stay in the same cluster at each time step. In Equation (3), the cluster assignment $z_{t,i}$ is generated from a multinomial distribution using $\pi_{z_{t-1,i}}$ to incorporate the time structure. This enables objects to move between clusters over time, following (A2).

In Equation (4), $\theta_{k,l}$ represents the strength of the relationship between the clusters indicated by the $k$th row and $l$th column. $a_0$ and $b_0$ represent the shape and scale parameters of the Gamma distribution, respectively. By generating $\theta_{k,l}$ from a Gamma distribution, it is possible to represent relationship strengths from zero to infinity, following (A3). In Equation (5), count data are generated with the parameter $\theta_{z_{t,i}, z_{t,j}}$.

Next, we consider the relationship between the existing models (i.e., the dIRM and PIRM) and the dPIRM. For the dIRM, the Gamma distribution for the prior of $\theta_{k,l}$ in Equation (4) of the dPIRM is replaced with a Beta distribution, while the Poisson distribution in Equation (5) is replaced with the Bernoulli distribution. For the PIRM, in contrast, the parameter $\beta$ in Equation (1) of the dPIRM is used only as a parameter of a multinomial distribution instead of $\pi_{z_{t-1,i}}$ in Equation (3); moreover, the PIRM framework does not consider the time structure of Equation (2).

Using the dPIRM, we can handle time-varying relational count data. However, in empirical applications, zero-inflated count data are often encountered, especially when the interval between time steps is short. In such cases, the approximation accuracy of the model is significantly lower, as the dPIRM cannot handle zero-inflated count data. For this reason, we propose a model to address this issue in the next subsection.
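The generative story of Equations (1)–(5) can be forward-sampled with a finite truncation of the DP. The sketch below is illustrative only: the Dirichlet construction is a finite stand-in for the sticky DP prior, and all names and the truncation level $K$ are our choices:

```python
import numpy as np

def sample_dpirm(N, T, K, alpha, kappa, gamma, a0, b0, seed=0):
    """Forward-sample counts from a truncated dPIRM-style generative process."""
    rng = np.random.default_rng(seed)
    # (1) global cluster weights via truncated stick-breaking, renormalized
    raw = rng.beta(1.0, gamma, size=K)
    beta = raw * np.concatenate([[1.0], np.cumprod(1.0 - raw)[:-1]])
    beta = beta / beta.sum()
    # (2) sticky transition rows: finite Dirichlet stand-in for the DP
    pi = np.stack([rng.dirichlet(alpha * beta + kappa * np.eye(K)[k]) for k in range(K)])
    # (3) cluster assignments follow a Markov chain over time
    z = np.empty((T, N), dtype=int)
    z[0] = rng.choice(K, size=N, p=beta)
    for t in range(1, T):
        for i in range(N):
            z[t, i] = rng.choice(K, p=pi[z[t - 1, i]])
    # (4) block strengths from a Gamma prior, then (5) Poisson counts
    theta = rng.gamma(shape=a0, scale=b0, size=(K, K))
    x = rng.poisson(theta[z[:, :, None], z[:, None, :]])        # shape (T, N, N)
    return x, z, theta

x, z, theta = sample_dpirm(N=10, T=4, K=3, alpha=1.0, kappa=2.0, gamma=1.0, a0=2.0, b0=1.0)
```

A larger `kappa` makes the diagonal of each transition row heavier, so objects tend to stay in their cluster across time steps.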
2.2 The dZIPIRM
In this section, we extend the dPIRM to handle zero-inflated count data by proposing the dZIPIRM. In addition to Assumptions (A1)–(A3), this extension further assumes the following:
(A4) Zero-inflated data are expressed by assuming a mixture distribution of count data and zero data.
Given time-varying count data $X$, the dZIPIRM is defined as follows:
(6) $\beta \sim \mathrm{Stick}(\gamma)$

(7) $\pi_k \sim \mathrm{DP}\left(\alpha + \kappa,\ \frac{\alpha \beta + \kappa \delta_k}{\alpha + \kappa}\right)$

(8) $z_{t,i} \sim \mathrm{Multinomial}(\pi_{z_{t-1,i}})$

(9) $\psi \sim \mathrm{Beta}(c_0, d_0)$

(10) $g_{t,i,j} \sim \mathrm{Bernoulli}(\psi)$

(11) $\theta_{k,l} \sim \mathrm{Gamma}(a_0, b_0)$

(12) $x_{t,i,j} \mid g_{t,i,j} \sim \mathrm{Poisson}(\theta_{z_{t,i}, z_{t,j}})^{g_{t,i,j}}\, \mathrm{Degenerate}_0(x_{t,i,j})^{1 - g_{t,i,j}}$
We assume that Equations (6), (7), (8), and (11) follow the same generative processes as in the dPIRM. Equations (9), (10), and (12) represent the extension of the dPIRM to zero-inflated count data. Before explaining the model in detail, we first describe how we handle zero-inflated data with this model.
First, we introduce the ZIP, which is often used to model zero-inflated data. Under the Poisson distribution, we assume that zeros and nonzeros are obtained from the same generative process. Under the ZIP, in contrast, count data are generated from a mixture of a Poisson distribution with weight $\psi$ and a Degenerate distribution for $0$ with weight $1 - \psi$. The Degenerate distribution for $0$ denotes the distribution with all its mass at $x = 0$. Letting $x$ denote a count, $\theta$ the Poisson parameter, and $\psi$ the mixture weight, the probability function can be written as follows:
(13) $f(x \mid \psi, \theta) = \psi\, \frac{\theta^{x} e^{-\theta}}{x!} + (1 - \psi)\, I(x = 0)$
$I(\cdot)$ is an indicator function.
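For concreteness, Equation (13) can be implemented and checked directly (a sketch; the function name is ours):

```python
import math

def zip_pmf(x, psi, theta):
    """ZIP pmf of Equation (13): Poisson(theta) with weight psi,
    plus a point mass at zero with weight 1 - psi."""
    poisson = math.exp(-theta) * theta ** x / math.factorial(x)
    return psi * poisson + (1.0 - psi) * (1.0 if x == 0 else 0.0)

p0 = zip_pmf(0, psi=0.4, theta=3.0)   # extra mass piles up at zero
total = sum(zip_pmf(x, psi=0.4, theta=3.0) for x in range(100))
```

The probabilities over a wide support sum to one, and the probability of a zero exceeds what a plain Poisson with the same rate would give.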
A naive approach to extending the dPIRM to zero-inflated data is to simply assume that the data follow the ZIP. However, this assumption causes a problem when estimating the parameters. Specifically, to estimate the parameters using Gibbs sampling, a technique often used in Bayesian models, we need to derive the full conditional distributions. In general, however, it tends to be difficult to derive full conditional distributions explicitly when the probability is represented as a sum, as in Equation (13) [2]. Conversely, if we can derive the full conditional distributions, we can easily interpret the parameter properties. Therefore, to ensure tractability and interpretability, we introduce the latent variable $g_{t,i,j}$ [16, 12]. Letting $g = \{g_{t,i,j}\}$, we consider a model with the following distributions:
(14) $g_{t,i,j} \sim \mathrm{Bernoulli}(\psi)$

(15) $x_{t,i,j} \mid g_{t,i,j} \sim \mathrm{Poisson}(x_{t,i,j}; \theta)^{g_{t,i,j}}\, \mathrm{Degenerate}_0(x_{t,i,j})^{1 - g_{t,i,j}}$
In Equation (15), the value of $g_{t,i,j}$ generated by Equation (14) determines whether the data follow the Poisson distribution or the Degenerate distribution for $0$. Thus, given $\psi$ and $\theta$, the joint probability of $x_{t,i,j}$ and $g_{t,i,j}$ is
(16) $f(x, g \mid \psi, \theta) = \left\{\psi\, \frac{\theta^{x} e^{-\theta}}{x!}\right\}^{g} \left\{(1 - \psi)\, I(x = 0)\right\}^{1 - g}$
In this way, the probability can be represented in the form of a product, and the joint probability of the parameters and data is also expressed as a product. This solves the problem of the distribution in Equation (13) being expressed as a sum.

Furthermore, when we sum out the possible realized values of $g$ in Equation (16), we recover the ZIP as follows:

$\sum_{g \in \{0, 1\}} f(x, g \mid \psi, \theta) = \psi\, \frac{\theta^{x} e^{-\theta}}{x!} + (1 - \psi)\, I(x = 0).$

The resulting formula is equivalent to the ZIP in Equation (13). In other words, by assuming Equations (14) and (15), we obtain a model with the properties of the ZIP while being able to derive an explicit posterior distribution. For this reason, the distributions in Equations (14) and (15) are assumed in Equations (10) and (12).
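The product form of Equation (16) and its marginalization back to the ZIP can be verified numerically (a sketch; all names are ours):

```python
import math

def poisson_pmf(x, theta):
    return math.exp(-theta) * theta ** x / math.factorial(x)

def joint(x, g, psi, theta):
    """Joint probability of (x, g): a product, as in Equation (16)."""
    return (psi * poisson_pmf(x, theta)) ** g * ((1.0 - psi) * (x == 0)) ** (1 - g)

def zip_marginal(x, psi, theta):
    """ZIP pmf, as in Equation (13)."""
    return psi * poisson_pmf(x, theta) + (1.0 - psi) * (x == 0)

# summing the joint over g in {0, 1} recovers the ZIP pmf exactly
ok = all(
    abs(joint(x, 0, 0.3, 2.0) + joint(x, 1, 0.3, 2.0) - zip_marginal(x, 0.3, 2.0)) < 1e-12
    for x in range(10)
)
```

Note also that the joint probability of a positive count with $g = 0$ is zero, which is the degeneracy the latent variable encodes.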
Since Equations (6), (7), (8), and (11) are equivalent to the dPIRM, we explain Equations (9), (10), and (12) in detail. In Equation (9), we use the Beta distribution for the prior distribution of $\psi$. Since the Beta distribution is the conjugate prior of the Bernoulli distribution, we can analytically derive the posterior distribution, with $c_0$ and $d_0$ as the shape parameters. In Equations (10) and (12), $g_{t,i,j}$ determines whether $x_{t,i,j}$ follows the Degenerate distribution for $0$ or the Poisson distribution, following (A4). This indicates that in this model, zero data generated from the Degenerate distribution ($g_{t,i,j} = 0$) are not taken into account for estimation. In other words, when estimating the parameters in the dZIPIRM, we can remove the effect of zeros that follow the Degenerate distribution. This intuitive interpretation will be justified by Theorem 1 in Section 3.2.

Figure 1 illustrates the parameter relations in the PIRM, dPIRM, and dZIPIRM. The circle nodes denote the variables, square nodes the hyperparameters, and shaded nodes the observations. The figure illustrates the differences between the data generation processes considered in each model, showing that the dZIPIRM has a structure in which $x_{t,i,j}$ depends on $g_{t,i,j}$.

3 Sampling parameters
In this section, we derive the full conditional distribution of the parameter $\theta_{k,l}$ of the dPIRM as well as the parameters $\theta_{k,l}$, $\psi$, and $g_{t,i,j}$ of the dZIPIRM. We then explain their properties. Deriving the full conditional distributions allows us to understand the properties of the parameters when the other parameters and data are fixed. Moreover, by comparing $\theta_{k,l}$ for the dZIPIRM and dPIRM, we can appreciate how the models differ.
We use beam sampling [19, 7] to sample the parameters, as it is faster and more efficient than Gibbs sampling when there is strong serial correlation between the assignments $z_{t,i}$ and $z_{t+1,i}$; indeed, the convergence of Gibbs sampling slows significantly in this case.
3.1 Derivation of the full conditional distribution
In this section, we derive the full conditional distribution of the parameter $\theta_{k,l}$ in the dPIRM and of $\theta_{k,l}$, $\psi$, and $g_{t,i,j}$ in the dZIPIRM. Since the full conditional distributions of the other parameters in the dPIRM and dZIPIRM are the same as in the dIRM, we refer the reader to [7] for their derivation. In the derivation of the full conditional distribution, it is sufficient to consider the proportionality relation between the parameters. To simplify the notation, we define the set of block indices $D_{k,l} = \{(t, i, j) \mid z_{t,i} = k,\ z_{t,j} = l\}$.
First, we explain the parameter $\theta_{k,l}$ of the dPIRM, which satisfies the following proportionality relation:

$p(\theta_{k,l} \mid X, Z) \propto \mathrm{Gamma}(\theta_{k,l}; a_0, b_0) \prod_{(t,i,j) \in D_{k,l}} \mathrm{Poisson}(x_{t,i,j}; \theta_{k,l}).$

Therefore, the full conditional distribution of $\theta_{k,l}$ in the dPIRM becomes

(17) $\theta_{k,l} \mid X, Z \sim \mathrm{Gamma}\left(a_0 + \hat{x}_{k,l},\ \frac{b_0}{n_{k,l} b_0 + 1}\right),$

where $\hat{x}_{k,l} = \sum_{(t,i,j) \in D_{k,l}} x_{t,i,j}$ is the total count in block $(k, l)$ and $n_{k,l} = |D_{k,l}|$ is the number of entries in the block.
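For illustration, a draw from this Gamma full conditional can be coded directly. The sketch assumes the Gamma shape/scale parameterization; all function and variable names are ours:

```python
import numpy as np

def sample_theta_dpirm(x_block, a0, b0, seed=None):
    """Draw theta_{k,l} from its Gamma full conditional:
    shape a0 + sum of block counts, scale b0 / (block size * b0 + 1)."""
    rng = np.random.default_rng(seed)
    s = x_block.sum()                # total count in block (k, l)
    n = x_block.size                 # number of entries in the block
    return rng.gamma(shape=a0 + s, scale=b0 / (n * b0 + 1.0))

rng = np.random.default_rng(0)
block = rng.poisson(5.0, size=2000)                   # synthetic block with true rate 5
draws = [sample_theta_dpirm(block, a0=1.0, b0=1.0, seed=r) for r in range(200)]
```

With a large block, the posterior concentrates near the block's average count, as the interpretation of Equation (22) below suggests.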
Next, we derive the full conditional distributions of $\theta_{k,l}$, $\psi$, and $g_{t,i,j}$ in the dZIPIRM. Unlike in the dPIRM, the derivation of these full conditional distributions is not straightforward. However, thanks to the formulation of the dZIPIRM in Section 2, setting conjugate prior distributions for the parameters allows us to derive their full conditional distributions analytically. First, $\theta_{k,l}$ in the dZIPIRM satisfies the proportionality relation

$p(\theta_{k,l} \mid X, Z, g) \propto \mathrm{Gamma}(\theta_{k,l}; a_0, b_0) \prod_{(t,i,j) \in D_{k,l}} \mathrm{Poisson}(x_{t,i,j}; \theta_{k,l})^{g_{t,i,j}}.$

Following a similar reasoning as in Equation (17), the full conditional distribution of $\theta_{k,l}$ in the dZIPIRM may be expressed as follows:

(18) $\theta_{k,l} \mid X, Z, g \sim \mathrm{Gamma}\left(a_0 + \sum_{(t,i,j) \in D_{k,l}} g_{t,i,j} x_{t,i,j},\ \frac{b_0}{m_{k,l} b_0 + 1}\right),$

where $m_{k,l} = \sum_{(t,i,j) \in D_{k,l}} g_{t,i,j}$ is the number of entries in the block flagged as Poisson-generated.
This expression may be further simplified as follows. From Equation (12), $g_{t,i,j} = 0$ implies $x_{t,i,j} = 0$, so that $g_{t,i,j} x_{t,i,j} = x_{t,i,j}$. Using this, we obtain

(19) $\sum_{(t,i,j) \in D_{k,l}} g_{t,i,j} x_{t,i,j} = \hat{x}_{k,l},$

which we prove in the Appendix. Using Equation (19), Equation (18) may be rewritten as follows:

(20) $\theta_{k,l} \mid X, Z, g \sim \mathrm{Gamma}\left(a_0 + \hat{x}_{k,l},\ \frac{b_0}{m_{k,l} b_0 + 1}\right).$
This expression is useful to compare the properties of the dPIRM and dZIPIRM, as explained in Section 3.2.
Next, for the parameter $\psi$ of the dZIPIRM, we use the proportionality relation

$p(\psi \mid g) \propto \mathrm{Beta}(\psi; c_0, d_0) \prod_{t,i,j} \mathrm{Bernoulli}(g_{t,i,j}; \psi).$

Therefore, the full conditional distribution of $\psi$ becomes the following Beta distribution:

(21) $\psi \mid g \sim \mathrm{Beta}\left(c_0 + \sum_{t,i,j} g_{t,i,j},\ d_0 + \sum_{t,i,j} (1 - g_{t,i,j})\right)$
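A draw from this Beta full conditional is a one-liner (a sketch; names ours):

```python
import numpy as np

def sample_psi(g, c0, d0, seed=None):
    """Draw psi from its Beta full conditional: the prior shapes plus the
    counts of ones and zeros among the latent indicators g."""
    rng = np.random.default_rng(seed)
    g = np.asarray(g)
    return rng.beta(c0 + g.sum(), d0 + (g == 0).sum())

g_toy = np.array([1, 1, 0, 1, 0, 0, 0, 1])            # toy indicator values
psi_draws = np.array([sample_psi(g_toy, 1.0, 1.0, seed=r) for r in range(500)])
```

With four ones and four zeros and a uniform Beta(1, 1) prior, the draws center around one half.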
Finally, we consider the full conditional distribution of the latent variable $g_{t,i,j}$, for which we use the proportionality relation

$p(g_{t,i,j} \mid x_{t,i,j}, \psi, \theta) \propto \left\{\psi\, \mathrm{Poisson}(x_{t,i,j}; \theta_{z_{t,i}, z_{t,j}})\right\}^{g_{t,i,j}} \left\{(1 - \psi)\, I(x_{t,i,j} = 0)\right\}^{1 - g_{t,i,j}}.$

The full conditional distribution of $g_{t,i,j}$ becomes

$g_{t,i,j} \mid x_{t,i,j}, \psi, \theta \sim \mathrm{Bernoulli}\left(\frac{\psi\, \mathrm{Poisson}(x_{t,i,j}; \theta_{z_{t,i}, z_{t,j}})}{\psi\, \mathrm{Poisson}(x_{t,i,j}; \theta_{z_{t,i}, z_{t,j}}) + (1 - \psi)\, I(x_{t,i,j} = 0)}\right).$

In particular, $g_{t,i,j} = 1$ with probability one whenever $x_{t,i,j} > 0$.
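The resulting update for the latent indicator can be sketched as follows (names ours); a positive count forces the indicator to one, while a zero count is split probabilistically between the two components:

```python
import math
import random

def sample_g(x, psi, theta):
    """Full-conditional draw of g: a positive count can only come from the
    Poisson component, so g = 1; for x = 0 the Poisson and Degenerate
    components compete with weights psi * exp(-theta) and 1 - psi."""
    if x > 0:
        return 1
    p1 = psi * math.exp(-theta)
    return 1 if random.random() < p1 / (p1 + (1.0 - psi)) else 0

random.seed(0)
frac_one = sum(sample_g(0, psi=0.5, theta=2.0) for _ in range(2000)) / 2000
# with psi = 0.5 and theta = 2, P(g = 1 | x = 0) is roughly 0.12
```

The larger the block rate, the less plausible a Poisson-generated zero becomes, and the more zeros are attributed to the Degenerate component.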
3.2 Interpreting the properties of the model using the full conditional distribution
In this section, we clarify and compare the properties of the two models by studying the full conditional distributions of the parameters derived in Section 3.1. First, we compare $\theta_{k,l}$ in the dPIRM and dZIPIRM to explain the properties of each model. The expected value and variance of $\theta_{k,l}$ in the dPIRM from Equation (17), denoted by $\theta^{\mathrm{P}}_{k,l}$, are expressed as follows:

(22) $E[\theta^{\mathrm{P}}_{k,l}] = \frac{(a_0 + \hat{x}_{k,l})\, b_0}{n_{k,l} b_0 + 1}, \qquad V[\theta^{\mathrm{P}}_{k,l}] = \frac{(a_0 + \hat{x}_{k,l})\, b_0^2}{(n_{k,l} b_0 + 1)^2}.$

$\hat{x}_{k,l}$ and $n_{k,l}$ represent the total value and number of the $x_{t,i,j}$ belonging to the $k$th row and $l$th column cluster, respectively. Ignoring $a_0$ and $b_0$, we can interpret Equation (22) as indicating that the expected value of $\theta^{\mathrm{P}}_{k,l}$ is the average value of the $x_{t,i,j}$ belonging to the $k$th row and $l$th column cluster through time. Note that $a_0$ and $b_0$ are hyperparameters that adjust this average value and must be determined in advance. Further, from Equation (22), we can see how $a_0$ and $b_0$ affect the estimate. Specifically, the estimate becomes larger when the prior mean $a_0 b_0$ exceeds the cluster average $\hat{x}_{k,l} / n_{k,l}$, while it becomes smaller when $a_0 b_0$ falls below that average. In addition, the larger the values of $a_0$ and $1/b_0$, the larger the amount of prior information. Thus, by deriving the explicit relationship between the parameters and hyperparameters, users can gain intuition for how the hyperparameters affect the estimation, which helps them select hyperparameters.
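The effect of the hyperparameters on the posterior mean can be checked numerically (a sketch assuming the Gamma shape/scale form; all names and values are ours):

```python
def posterior_mean_theta(x_hat, n, a0, b0):
    """Posterior mean (a0 + x_hat) * b0 / (n * b0 + 1) under the
    Gamma shape/scale form of the full conditional."""
    return (a0 + x_hat) * b0 / (n * b0 + 1.0)

x_hat, n = 300, 100                                         # block total and size: average 3.0
weak = posterior_mean_theta(x_hat, n, a0=1.0, b0=100.0)     # diffuse prior: near the average
pull_up = posterior_mean_theta(x_hat, n, a0=10.0, b0=1.0)   # prior mean a0 * b0 = 10 > 3
pull_down = posterior_mean_theta(x_hat, n, a0=0.1, b0=1.0)  # prior mean a0 * b0 = 0.1 < 3
```

A prior mean above the cluster average pulls the estimate up, a prior mean below it pulls the estimate down, and a diffuse prior leaves the estimate near the cluster average.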
Next, we consider $\theta_{k,l}$ in Equation (20) of the dZIPIRM, denoted by $\theta^{\mathrm{Z}}_{k,l}$. As before, the expected value and variance of $\theta^{\mathrm{Z}}_{k,l}$ are derived to be

(23) $E[\theta^{\mathrm{Z}}_{k,l}] = \frac{(a_0 + \hat{x}_{k,l})\, b_0}{m_{k,l} b_0 + 1}, \qquad V[\theta^{\mathrm{Z}}_{k,l}] = \frac{(a_0 + \hat{x}_{k,l})\, b_0^2}{(m_{k,l} b_0 + 1)^2}.$

The numerator in Equation (23) is the same as in Equation (22). $m_{k,l}$ represents the number of the $x_{t,i,j}$ generated from the Poisson distribution ($g_{t,i,j} = 1$) that belong to the $k$th row and $l$th column cluster. Ignoring $a_0$ and $b_0$, Equation (23) indicates the average value of the $x_{t,i,j}$ considered to be generated from the Poisson distribution ($g_{t,i,j} = 1$) in the $k$th row and $l$th column cluster over time.
From Equations (22) and (23), when we compare $\theta_{k,l}$ in the dPIRM and dZIPIRM, the following theorem is derived.

Theorem 1.

(24) $E[\theta^{\mathrm{Z}}_{k,l}] = c_{k,l}\, E[\theta^{\mathrm{P}}_{k,l}]$

(25) $V[\theta^{\mathrm{Z}}_{k,l}] = c_{k,l}^2\, V[\theta^{\mathrm{P}}_{k,l}]$

where

$c_{k,l} = \frac{n_{k,l} b_0 + 1}{m_{k,l} b_0 + 1},$

and then $c_{k,l} \ge 1$.

The proof of Theorem 1 is presented in the Appendix.
From Equation (24) we obtain an intuitive interpretation of $c_{k,l}$. Ignoring the constant one in the numerator and denominator, $1/c_{k,l} \approx m_{k,l} / n_{k,l}$ represents the proportion of data that belong to the $k$th row and $l$th column cluster and that are generated from a Poisson distribution ($g_{t,i,j} = 1$). That is, in the dZIPIRM, the more data in a cluster generated from the Degenerate distribution, the larger $E[\theta^{\mathrm{Z}}_{k,l}]$ tends to be compared with $E[\theta^{\mathrm{P}}_{k,l}]$. Conversely, the more data generated from the Poisson distribution, the closer $E[\theta^{\mathrm{Z}}_{k,l}]$ tends to be to $E[\theta^{\mathrm{P}}_{k,l}]$. In short, under the estimation framework using the full conditional distribution, the estimated $\theta^{\mathrm{Z}}_{k,l}$ tends to be larger than $\theta^{\mathrm{P}}_{k,l}$ because in the dZIPIRM, the zero data considered to be generated from the Degenerate distribution are not taken into account when estimating $\theta_{k,l}$. This theoretical result is consistent with the model interpretation in Section 2.
From Equation (25), we may also say that the dispersion of $\theta^{\mathrm{Z}}_{k,l}$ tends to be larger than that of $\theta^{\mathrm{P}}_{k,l}$. However, as shown in the simulation study in Section 4, this does not necessarily affect the clustering result.
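The relationship in Theorem 1 can be checked numerically under the Gamma shape/scale form assumed here (a sketch; names and values are ours):

```python
def c_factor(n, m, b0):
    """Ratio c_{k,l} = (n * b0 + 1) / (m * b0 + 1); at least 1 whenever m <= n."""
    return (n * b0 + 1.0) / (m * b0 + 1.0)

def gamma_mean_var(s, n_eff, a0, b0):
    """Mean and variance of Gamma(a0 + s, b0 / (n_eff * b0 + 1))."""
    scale = b0 / (n_eff * b0 + 1.0)
    return (a0 + s) * scale, (a0 + s) * scale ** 2

s, n, m = 120, 60, 40                # m <= n: only 40 of 60 block entries flagged Poisson
a0, b0 = 1.0, 2.0
mean_p, var_p = gamma_mean_var(s, n, a0, b0)   # dPIRM-style: all n entries
mean_z, var_z = gamma_mean_var(s, m, a0, b0)   # dZIPIRM-style: the m flagged entries
c = c_factor(n, m, b0)
```

The mean ratio equals $c_{k,l}$ and the variance ratio equals $c_{k,l}^2$, since the two full conditionals share the same shape and differ only in scale.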
4 Simulation Study
In this section, we conduct a simulation to evaluate the fit to the data and clustering accuracy of the dPIRM and dZIPIRM.
4.1 Data generation and setting
We prepared several synthetic datasets. To create these datasets, we first stipulated the number of clusters, the number of time steps, and the number of objects. Next, we generated the block parameters $\theta_{k,l}$ from a discrete uniform distribution and then the counts $x_{t,i,j}$ from $\mathrm{Poisson}(\theta_{k,l})$, as shown in Figure 2. When the time changes from $t$ to $t+1$, we randomly move objects to other clusters with movement ratio $r \in \{0.1, 0.2, 0.3\}$. Finally, we randomly convert entries of the data matrix into zero with zero ratio $s \in \{0.3, 0.5, 0.7\}$. We thus generate data for all nine ($3 \times 3$) settings.

To estimate the parameters, we need to prespecify the hyperparameters. In Section 3.2, we showed the effect of the hyperparameters on the expected value. Since there is no prior information in this simulation, we set all the hyperparameters so as to reduce their influence on the estimation as much as possible.
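The data-generation steps above can be sketched as follows (an illustrative sketch; the sizes and the uniform range for the block rates are our placeholders, not the paper's exact values):

```python
import numpy as np

def make_synthetic(n=16, t_steps=4, k=4, move_ratio=0.2, zero_ratio=0.5, seed=0):
    """Block-structured Poisson counts with random cluster moves between
    time steps, followed by random zeroing of entries."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(1.0, 10.0, size=(k, k))      # block rates (placeholder range)
    z = np.repeat(np.arange(k), n // k)              # balanced initial clusters
    data, labels = [], []
    for _ in range(t_steps):
        x = rng.poisson(theta[z[:, None], z[None, :]])
        x[rng.random(x.shape) < zero_ratio] = 0      # convert entries to zero
        data.append(x)
        labels.append(z.copy())
        movers = rng.random(n) < move_ratio          # move a fraction of objects
        z = np.where(movers, rng.integers(0, k, size=n), z)
    return np.stack(data), np.stack(labels)

x_sim, z_sim = make_synthetic()
```

Varying `move_ratio` and `zero_ratio` over a grid reproduces the nine-setting design described above.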
We compare the PIRM, dPIRM, and dZIPIRM in this scenario by sampling the parameters 300 times for each method. We use two measures for the evaluation. The first is the Rand index (RI), which computes the similarity between the true and estimated clustering results [13, 7]. The RI equals one when two clustering results match and zero when they do not match. As with the computation of the RI in the dIRM, we compute the RI between the true assignment and the estimated assignment at each time step and then average the RI over the $T$ time points. For the cluster allocation of the PIRM, where one cluster assignment is obtained at each time step (and thus $T$ cluster assignments in total), we calculate the RI between the PIRM assignment and the true assignment at each time point and report the highest of these RIs. The second measure is the log-likelihood, for which a larger value indicates that the model fits the data better [7].
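The time-averaged RI computation described above can be sketched as follows (a plain pairwise implementation; names ours):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index: the fraction of object pairs on which two clusterings agree
    (both together or both apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def time_averaged_ri(true_z, est_z):
    """Average the per-time-step RI, as done for the dPIRM and dZIPIRM."""
    scores = [rand_index(t, e) for t, e in zip(true_z, est_z)]
    return sum(scores) / len(scores)

true_z = [[0, 0, 1, 1], [0, 1, 1, 1]]
est_z = [[2, 2, 3, 3], [2, 3, 3, 3]]        # same partitions under different labels
score = time_averaged_ri(true_z, est_z)
```

Because the RI compares pair co-membership, it is invariant to how the clusters are labeled, which is what makes it suitable for comparing estimated and true partitions.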
We calculate the mean and standard deviation of the two measures above over the 50 datasets generated in each of the nine settings described earlier.
4.2 Results
Table 1: RI, mean (standard deviation)

r    s    PIRM         dPIRM        dZIPIRM
0.1  0.3  0.79 (0.16)  0.89 (0.14)  0.91 (0.13)
0.1  0.5  0.77 (0.16)  0.94 (0.10)  0.93 (0.10)
0.1  0.7  0.76 (0.18)  0.90 (0.14)  0.94 (0.11)
0.2  0.3  0.77 (0.16)  0.87 (0.15)  0.96 (0.07)
0.2  0.5  0.76 (0.16)  0.94 (0.10)  0.93 (0.11)
0.2  0.7  0.76 (0.16)  0.92 (0.12)  0.96 (0.08)
0.3  0.3  0.73 (0.16)  0.89 (0.14)  0.88 (0.13)
0.3  0.5  0.73 (0.17)  0.89 (0.14)  0.90 (0.13)
0.3  0.7  0.74 (0.17)  0.87 (0.15)  0.90 (0.14)
Table 2: Log-likelihood, mean (standard deviation)

r    s    PIRM          dPIRM         dZIPIRM
0.1  0.3  -2.28 (0.38)  -2.04 (0.28)  -2.01 (0.25)
0.1  0.5  -2.23 (0.40)  -2.02 (0.27)  -2.02 (0.27)
0.1  0.7  -2.26 (0.32)  -2.10 (0.25)  -2.09 (0.25)
0.2  0.3  -2.24 (0.33)  -2.04 (0.26)  -2.00 (0.25)
0.2  0.5  -2.20 (0.33)  -2.02 (0.23)  -2.04 (0.25)
0.2  0.7  -2.20 (0.30)  -2.04 (0.27)  -2.03 (0.29)
0.3  0.3  -2.14 (0.25)  -2.12 (0.22)  -2.11 (0.22)
0.3  0.5  -2.22 (0.25)  -2.09 (0.27)  -2.08 (0.28)
0.3  0.7  -2.18 (0.26)  -2.02 (0.26)  -2.03 (0.25)
Tables 1 and 2 show the results for the RI and log-likelihood, respectively. Here, $r$ represents the rate at which objects move between clusters at each time step and $s$ represents the ratio of entries converted into zeros.
As shown in Table 1, the RIs of the dPIRM and dZIPIRM are higher than that of the PIRM in every setting. This finding means that models incorporating time structures, such as the dPIRM and dZIPIRM, cluster the data better than a model that clusters the data at each time step separately (the PIRM). When comparing the dPIRM and dZIPIRM, the dZIPIRM has a higher RI than the dPIRM in several settings. Furthermore, the dZIPIRM often yields a lower standard deviation than the dPIRM. This finding means that the dZIPIRM can perform stable clustering with accurate estimates even when many zero values exist in the data.
Table 2 shows that the dPIRM and dZIPIRM yield higher log-likelihood values than the PIRM in every setting, just as with the RI. This means that we can better describe the assumed scenario by employing models incorporating time structures, such as the dPIRM and dZIPIRM. When $r = 0.1$, the dZIPIRM yields log-likelihood values at least as high as those of the dPIRM for every zero ratio. This finding shows that the dZIPIRM fits the data well when considering zero-inflated data for objects unlikely to move at each time step. Further, the dZIPIRM has higher log-likelihood values than the dPIRM for many patterns, as with the RI, while the standard deviations are similar between the dPIRM and dZIPIRM, unlike the RI.
These results suggest that both the dPIRM and the dZIPIRM are superior to the PIRM in terms of the RI and log-likelihood for this simulation setting. In addition, the dZIPIRM provides accurate and stable clustering allocations as well as a good data fit compared with the dPIRM when time structures and zero-inflated data are assumed.
5 Application
In this section, we illustrate how the dPIRM and dZIPIRM may be applied to POS data to identify a group of products likely to be purchased simultaneously.
5.1 Data and setting
We analyze the POS data of a fashion brand in Japan, provided by Sothink Co., Ltd. These data provide information for two store types: shop-in-shop and franchise. Since fashion brand products in Japan are sensitive to the influence of the season, it is also important to consider marketing strategy changes depending on the season.
The data cover 78,807 customers and 12,937 products. Product categories include short pants, small items, and short-sleeved T-shirts. We view product categories as objects in the model and count simultaneous purchases by product category.
We formulate a matrix showing the number of simultaneous purchases of 40 types of objects every month from January to June 2016 ($N = 40$, $T = 6$). That is, we create an $N \times N$ relational count data matrix for each of the $T$ time steps. The zero-element ratio for the entire matrix is 0.52. We analyze the resulting dataset using the dPIRM and dZIPIRM. Since there is no prior hyperparameter information, we set the hyperparameters as in the simulation study.
5.2 Results
In this section, we first compare the log-likelihood values of the PIRM, dPIRM, and dZIPIRM to assess the fit of each model.
Table 3: Log-likelihood for the empirical data

                PIRM   dPIRM  dZIPIRM
Log-likelihood  -6.88  -4.84  -3.84
Table 3 shows that the dZIPIRM has the highest log-likelihood, meaning that it provides the best fit to the data. Hence, assuming a ZIP and a time-varying structure improves the overall fit.
Table 4: Objects belonging to each cluster in the dPIRM

January  February  March  April  May  June
cluster 1  small item  small item  small item  small item  small item  small item
tie  tie  tie  tie  tie  tie
suit  suit  suit  suit  suit  suit
casual shirt  casual shirt  casual shirt  casual shirt  casual shirt  casual shirt
jeans  jeans  jeans  jeans  jeans
slacks  slacks  slacks
short-sleeve T-shirt  short-sleeve T-shirt  short-sleeve T-shirt
short pants
cluster 2  short pants  short pants
long-sleeve T-shirt
cluster 3  slacks  slacks
short-sleeve T-shirt  short-sleeve T-shirt
cluster 4  short pants  short pants  long-sleeve T-shirt  long-sleeve T-shirt
cluster 5  jeans  slacks  short pants
short-sleeve T-shirt
cluster 6  long-sleeve T-shirt  long-sleeve T-shirt  long-sleeve T-shirt
Next, we show the results for the dPIRM. The number of clusters in the dPIRM is estimated to be six. Figure 3 shows the estimated $\theta_{k,l}$ corresponding to each cluster combination, while Figure 4 shows the total number of objects in each cluster for the six-month sample period. Table 4 shows the objects belonging to the corresponding clusters. Since the parameter $\theta_{k,l}$ represents the strength of the relationship, if the estimated $\theta_{k,l}$ is large, the objects in cluster $k$ and cluster $l$ are more likely to be purchased simultaneously.

Figure 3 shows that the objects in Cluster 1 are more likely to be purchased simultaneously with other objects in Cluster 1 (i.e., $\theta_{1,1}$ is high). Further, the objects in Cluster 3 are more likely to be purchased simultaneously with those in Clusters 1 and 5 (i.e., $\theta_{3,1}$ and $\theta_{3,5}$ are high), while they are unlikely to be purchased with other items in Cluster 3 (i.e., $\theta_{3,3}$ is low). The items in Cluster 5 are most likely to be bought with items in Cluster 1, followed in order by those in Clusters 5 and 3 (i.e., $\theta_{5,1} > \theta_{5,5} > \theta_{5,3}$).
From the number of objects belonging to each cluster in Figure 4, we see that the number in Cluster 1 is much larger than that for the other clusters. Therefore, the objects in Cluster 1 are more likely to be purchased simultaneously with several other types of objects. As shown in Table 4, there are two types of objects in Cluster 1: objects that remain in Cluster 1 for a long period and items that move from or to Cluster 1 depending on the season.
Examples of objects that remain in Cluster 1 for a long period include small items such as handkerchiefs and necklaces, which may be interpreted as objects likely to be purchased along with other items throughout the year. We may also interpret items such as ties as being more likely to be bought with a suit throughout the year.
Meanwhile, items that move from or to Cluster 1 depending on the season, such as slacks and short-sleeved T-shirts, belong to Clusters 3 and 5 at the beginning of the year but move to Cluster 1 from April onwards. April is the turning point of the season in Japan, when spring starts and temperatures rise. This means that slacks and short-sleeved T-shirts are not purchased much in the winter, whereas in the spring they are more likely to be bought along with other objects.
Table 5: Objects belonging to each cluster in the dZIPIRM

January  February  March  April  May  June
cluster 1  small item  small item  small item  small item  small item  small item
tie  tie  tie  tie  tie  tie
suit  suit  suit  suit  suit  suit
casual shirt  casual shirt  casual shirt  casual shirt
jeans  jeans  jeans  jeans
slacks  slacks  slacks
short-sleeve T-shirt  short-sleeve T-shirt  short-sleeve T-shirt
short pants  short pants
cluster 2  casual shirt  short-sleeve T-shirt  short-sleeve T-shirt  short-sleeve T-shirt
jeans  short pants
cluster 4  casual shirt
jeans
cluster 7  slacks
cluster 9  slacks
cluster 10  short-sleeve T-shirt  short-sleeve T-shirt
cluster 11  short-sleeve T-shirt  short pants
cluster 12  slacks  long-sleeve T-shirt  short pants
cluster 13  short pants  short pants  short pants  long-sleeve T-shirt  long-sleeve T-shirt  long-sleeve T-shirt
long-sleeve T-shirt  long-sleeve T-shirt
We now show the results for the dZIPIRM. The estimated number of clusters is 13. Figures 5 and 6 and Table 5 for the dZIPIRM correspond to Figures 3 and 4 and Table 4 for the dPIRM, respectively. Figure 5 shows that the number of clusters in the dZIPIRM is larger than that in the dPIRM because of the increase in cluster expressiveness, which in turn is a consequence of the additional flexibility provided by the assumption that there are many zeros.
Beyond the estimated number of clusters, the results for the dZIPIRM are similar to those for the dPIRM. For example, Cluster 1 of the dZIPIRM is similar to Cluster 1 of the dPIRM (e.g., always including small items, ties, and suits), while some objects move from cluster to cluster depending on the season. However, using the dZIPIRM, we may interpret the result in more detail. For example, the dPIRM could not distinguish between the behavior of slacks and short-sleeved T-shirts, whereas the dZIPIRM can interpret the difference between these objects. Specifically, in January, the category "slacks" belongs to Cluster 12, while the category "short-sleeved T-shirts" belongs to Cluster 10. Products in Cluster 12 are unlikely to be purchased with objects in any cluster. Meanwhile, objects in Cluster 10 are likely to be purchased with objects in Cluster 2 (e.g., casual shirts and jeans). Thus, by assuming a zero-inflated setting using the dZIPIRM, we obtain a result that is easier to interpret.
From the above analysis, we may conclude that since the dPIRM yields fewer clusters, only rough trends may be modeled. However, as the dZIPIRM yields more clusters, a more detailed interpretation of the clusters may be obtained. As Table 3 shows, the dZIPIRM fits the data better than the dPIRM. Despite this, as is typically the case with empirical data, it is difficult to objectively validate which method should be preferred. Nevertheless, this empirical application shows that both the dPIRM and the dZIPIRM allow us to see how customer buying behavior changes with the seasons. In addition, by using the dZIPIRM, we obtain a clustering result that provides a more detailed interpretation of the clusters when the data contain many zeros.
6 Conclusion and Discussion
We proposed two biclustering methods for count data that consider changes over time. The first method, the dPIRM, enabled the biclustering of time-varying count data by using the time structure and distribution information for these data. The second method, the dZIPIRM, extended the dPIRM to express zero-inflated real data. To derive the posterior distributions and clarify the properties of the parameters, zero-inflated data were formulated by using latent variables.
We derived the explicit posterior distributions of the parameters and outlined their properties under the estimation framework using full conditional distributions. Specifically, by comparing the expected value and variance of $\theta_{k,l}$ for both the dPIRM and the dZIPIRM, we showed how the specific model assumptions affect the estimation results. We also explained the theoretical relationship between the two models. Although Theorem 1 is specifically derived to compare our two proposed models, the dPIRM and dZIPIRM, it could also be applied to other ZIP-related models.
By incorporating a time structure, we further showed that both the dPIRM and the dZIPIRM outperform the PIRM in terms of the RI and log-likelihood values in a simulation study. We also observed that the dZIPIRM yields more stable and better clustering allocations as well as a better data fit than the dPIRM.
For the empirical data, the dZIPIRM, which assumes the properties of the ZIP and a timevarying structure, resulted in the best data fit. Moreover, the clustering result of the dZIPIRM is similar to that of the dPIRM. However, using the dZIPIRM allows us to interpret the cluster structure in more detail, as this model relies on more flexible assumptions than the dPIRM.
We consider two directions for future work. First, although the row and column objects in the dPIRM and dZIPIRM may differ, for comprehensibility, we analyzed the data using the same objects for the rows and columns. Future work will therefore focus on real data in situations where the row and column objects differ.
Future research will also study how the choice of hyperparameters affects the estimation results in a simulation setting (i.e., how the estimation result changes depending on the choice of hyperparameters). In Section 3.2, we described the relationship between the hyperparameters and parameters, and we believe further work in this area is needed to provide users with better guidance on hyperparameter choice.
Appendix
In this Appendix, we give the proofs of Equation (19) and Theorem 1. First, Equation (19) is derived as follows.
Proof.
∎
Next, the proof of Theorem 1 is as follows.
References
 [1] Angers, J. F., Biswas, A., 2003. A Bayesian analysis of zero-inflated generalized Poisson model. Computational Statistics and Data Analysis, 42, 37–46.
 [2] Bhattacharya, A., Clarke, B. S., Datta, G., 2008. A Bayesian test for excess zeros in a zero-inflated power series distribution. IMS Collections, 1, 89–104.
 [3] Fox, E. B., Sudderth, E. B., Jordan, M. I., Willsky, A. S., 2008. An HDP-HMM for systems with state persistence. In Proceedings of the 25th International Conference on Machine Learning, 312–319.
 [4] Fox, J. P., 2013. Multivariate zero-inflated modeling with latent predictors: modeling feedback behavior. Computational Statistics and Data Analysis, 68, 361–374.
 [5] Fu, W., Song, L., Xing, E. P., 2009. Dynamic mixed membership blockmodel for evolving networks. In Proceedings of the 26th Annual International Conference on Machine Learning, 329–336.
 [6] Hartigan, J. A., 1972. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337), 123–129.
 [7] Ishiguro, K., Iwata, T., Ueda, N., Tenenbaum, J. B., 2010. Dynamic infinite relational model for timevarying relational data analysis. In Proceedings of Advances in Neural Information Processing Systems, 919–927.
 [8] Ishiguro, K., Ueda, N., Sawada, H., 2012. Subset infinite relational models. In Proceedings of Artificial Intelligence and Statistics, 547–555.
 [9] Kemp, C., Tenenbaum, J. B., Griffiths, T. L., Yamada, T., Ueda, N., 2006. Learning systems of concepts with an infinite relational model. In Proceedings of the 21st National Conference on Artificial Intelligence, 381–388.
 [10] Lambert, D., 1992. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34(1), 1–14.
 [11] Lee, K., Joo, Y., Song, J. J., Harper, D. W., 2011. Analysis of zero-inflated clustered count data: a marginalized model approach. Computational Statistics and Data Analysis, 55, 824–837.
 [12] Liu, Y., Tian, G. L., 2015. Type I multivariate zero-inflated Poisson distribution with applications. Computational Statistics and Data Analysis, 83, 200–222.
 [13] Rand, W. M., 1971. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66, 846–850.
 [14] Mirkin, B., 1996. Mathematical Classification and Clustering. Kluwer Academic Publishers, Norwell, USA.
 [15] Nowicki, K., Snijders, T. A. B., 2001. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455), 1077–1087.
 [16] Olteanu, M., Ridgway, J., 2012. Hidden Markov models for time series of counts with excess zeros. In Proceedings of the 20th European Symposium on Artificial Neural Networks, 133–138.
 [17] Sethuraman, J., 1994. A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639–650.
 [18] Teh, Y. W., Jordan, M. I., Beal, M. J., Blei, D. M., 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 1566–1581.
 [19] Van Gael, J., Saatci, Y., Teh, Y. W., Ghahramani, Z., 2008. Beam sampling for the infinite hidden Markov model. In Proceedings of the 25th International Conference on Machine Learning, 1088–1095.
 [20] Wang, E., Liu, D., Sliva, J., Carin, L., Dunson, D. B., 2010. Joint analysis of time-evolving binary matrices and associated documents. In Proceedings of Advances in Neural Information Processing Systems, 2370–2378.
 [21] Wu, X., Li, H., 2017. Topic mover’s distance based document classification. 2017 IEEE 17th International Conference on Communication Technology (ICCT), 1998–2002.
 [22] Yang, T., Chi, Y., Zhu, S., Gong, Y., Jin, R., 2009. A Bayesian approach toward finding communities and their evolutions in dynamic social networks. In Proceedings of the 2009 SIAM International Conference on Data Mining, 990–1001.