 # Bi-clustering for time-varying relational count data analysis

Relational count data are often obtained from sources such as simultaneous purchase in online shops and social networking service information. Bi-clustering such relational count data reveals the latent structure of the relationship between objects such as household items or people. When relational count data observed at multiple time points are available, it is worthwhile incorporating the time structure into the bi-clustering result to understand how objects move between the cluster over time. In this paper, we propose two bi-clustering methods for analyzing time-varying relational count data. The first model, the dynamic Poisson infinite relational model (dPIRM), handles time-varying relational count data. In the second model, which we call the dynamic zero-inflated Poisson infinite relational model, we further extend the dPIRM so that it can handle zero-inflated data. Proposing both two models is important as zero-inflated data are often encountered, especially when the time intervals are short. In addition, by explicitly deriving the relevant full conditional distributions, we describe the features of the estimated parameters and, in turn, the relationship between the two models. We show the effectiveness of both models through a simulation study and a real data example.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Relational data contain information about the relationship between objects and may be obtained from, for example, point-of-sale (POS) information and the social networking service (SNS) community. These types of data are often represented as a matrix in which each element indicates the relationship between the row and column elements (e.g., the presence or absence of a relationship or its strength). Bi-clustering is one way to cluster elements along the row and column dimensions of a matrix simultaneously [14, 6]. By applying bi-clustering to relational data, we can ascertain the latent structure of the relationship. Indeed, many probabilistic models using bi-clustering of relational data have been proposed [9, 15, 8].

Another situation commonly found in practice is when relationships change over time. For example, in the case of simultaneous purchases of products, the relationships between them may change according to events such as the season. In the SNS community, on the contrary, human relationships change through events such as career moves. In this way, the relationship between objects often changes through an event. In this paper, we call relational data obtained at multiple time points “time-varying relational data,” as in .

A naive approach to bi-clustering time-varying relational data is to apply bi-clustering to the data at each time step separately. However, in practice, this leads to unstable clustering results , and it becomes difficult to interpret clusters across time points since cluster structures are inconsistent over time. On the contrary, clustering using a time-varying structure [20, 5] can help detect changes in relationships due to specific events.

Several bi-clustering methods exist for time-varying relational data [22, 20, 5]. First, the dynamic infinite relational model (dIRM) , which extends HMM-based method [18, 3] to relationship data, is especially useful since it can automatically determine the number of clusters from the data. The dIRM method can handle relational data containing either a zero or a one (essentially whether a relationship exists). However, in practice, we often obtain relational count data. For example, if each element of a data matrix represents how many times two products (corresponding to the row and column of the matrix) are simultaneously purchased, this data can be considered to be relational count data between the products. Since count data contain more information than binary data, it is preferable to use a model that can handle relational count data when we have count information instead of only zeros and ones.

Based on the foregoing, in this paper, we propose two new methods for the bi-clustering of count data under time changes. First, we extend the Poisson infinite relational model (PIRM) , a bi-clustering method for relational count data that does not consider bi-clustering relationships that change over time, to be able to handle time changes to a model. We call this first model the dynamic Poisson infinite relational model (dPIRM). Second, we propose a dPIRM that can handle data containing many zeros, which we call the dynamic zero-inflated Poisson infinite relational model (dZIPIRM). The latter extension is especially important for count data, as zero-inflated count data are often obtained in practice, especially when the interval between observation times is short [1, 11, 4].

The remainder of the paper is organized as follows. In Section 2, we begin by considering a bi-clustering model of time series count data without zero-inflated characteristics. Next, we propose a bi-clustering model for zero-inflated time-varying count data. In Section 3, we derive the posterior distributions of the relevant parameters and clarify their properties. In Section 4, we illustrate how our proposed models can handle numerically simulated data. In Section 5, we show how the two proposed bi-clustering methods may be used with real data. Finally, Section 6 concludes.

## 2 Bi-clustering for time-varying count data

In this section, we propose two bi-clustering methods. We first introduce one for time-varying count data and then the zero-inflated Poisson model (ZIP)  to extend the first method to be able to handle zero-inflated count data.

### 2.1 The dPIRM

We denote the data by , where represents the number of row objects, represents the number of column objects, and represents the time horizon. We can assume that without loss of generality. In addition, since relational count data are not necessarily limited to an undirected graph, we allow asymmetric relations (i.e., for all ).

We propose a bi-clustering model of time-varying count data under the following three assumptions:

(A1)

Cluster structures remain consistent over time.

(A2)

Objects can move between clusters at each time step.

(A3)

The strength of the relationship between clusters can be expressed from zero to infinity.

We describe this dPIRM as follows:

 β|γ ∼Stick(γ) (1) πtk|α0,κ,β ∼DP(α0+κ,α0β+κδkα0+κ) (2) zti|z(t−1)i,Πt ∼Multinomial(πtz(t−1)i) (3) λkℓ|a,b ∼Gamma(a,b) (4) xtij|Zt,{λkℓ} ∼Poisson(λztiztj). (5)

Here, , .

In Equation (1), the term “Stick” denotes a division formula known as the stick-breaking process . This process can be described as , where . is a shape parameter. represents time-averaged membership of the clusters (A1), and the are guaranteed to satisfy .

DP in Equation (2) represents the Dirichlet process. plays the role of a concentration parameter in the Dirichlet distribution. Thanks to the DP, the number of clusters is automatically determined. in , , ,

represents the transition probability that an object belonging to cluster

at time moves to cluster at time . is guaranteed to satisfy and . is an indicator function taking the value of one when it belongs to cluster and zero otherwise. adjusts how likely it is for objects to stay in the same cluster at each time step. In Equation (3), cluster assignment is generated with a multinomial distribution using to incorporate the time structure. This enables objects to move between clusters over time, following (A2).

In Equation (4), represents the strength of the relationship between the cluster indicated by the th row and th column. and

represent the shape and scale parameters of the Gamma distribution, respectively. By generating

from a Gamma distribution, it is possible to represent relationships from zero to infinity, following (A3). In Equation (5), count data are generated with the parameter .

Next, we consider the relationship between existing models (i.e., the dIRM and PIRM) and the dPIRM. For the dIRM, the Gamma distribution for the prior of in Equation (4

) of the dPIRM is replaced with a Beta distribution, while the Poisson distribution in Equation (

5

) is replaced with the Bernoulli distribution. On the contrary, for the PIRM, the parameter

in Equation (1) in the dPIRM is only used as a parameter for a multinomial distribution instead of in Equation (3). Moreover, the PIRM framework does not consider the time structure as in Equation (2).

Using the dPIRM, we can handle time-varying relational count data. However, in empirical applications, zero-inflated count data are often encountered, especially when the interval between time steps is short. In such cases, the approximation accuracy of the model is significantly low, as the dPIRM cannot handle zero-inflated count data. For this reason, we propose a model to address this issue in the next subsection.

### 2.2 The dZIPIRM

In this section, we extend the dPIRM to handle zero-inflated count data by proposing the dZIPIRM. In addition to Assumptions (A1)–(A3), this extension further assumes the following:

(A4)

Zero-inflated data are expressed by assuming mixture distributions of count data and zero data.

Given time-varying count data , the dZIPIRM model is defined as follows:

 β|γ ∼Stick(γ) (6) πtk|α0,κ,β ∼DP(α0+κ,α0β+κδkα0+κ) (7) zti|z(t−1)i,Πt ∼Multinomial(πtz(t−1)i) (8) wtij|c,d ∼Beta(c,d) (9) rtij|wtij ∼Bernoulli(wtij) (10) λkℓ|a,b ∼Gamma(a,b) (11) xtij|Zt,{λkℓ},rtij ∼{Degenerate(xtij=0)(rtij=0)Poisson(λztiztj)(rtij=1). (12)

We assume that Equations (6), (7), (8), and (11) follow the same generator processes as the dPIRM. Equations (9), (10), and (12) represent the extension of the dPIRM to zero-inflated count data. Before explaining the model in detail, we first describe how we handle zero-inflated data with this model.

First, we introduce the ZIP, which is often used to model zero-inflated data. In the Poisson distribution, we assume that zeros and nonzeros are obtained from the same generator process. On the contrary, in the ZIP, count data are generated from, mixture of a Poisson distribution with weight and a Degenerate distribution for with weight . The Degenerate distribution for denotes a distribution of . Letting , , and , the probability function can be written as follows:

 P(x|λ,w)=(1−w)I(x=0)+wλxe−λx!. (13)

is an indicator function.

A naive approach to extending the dPIRM for zero-inflated data is to simply assume that data follows the ZIP. However, this assumption causes a problem when estimating the parameters. Specifically, to estimate the parameters using Gibbs sampling, a technique often used in Bayesian models, we need to derive the full conditional distribution. In general, however, it tends to be difficult to derive full conditional distributions explicitly when the probability is represented as a sum, as in Equation (13) . However, if we can derive the full conditional distribution, we can easily interpret the parameter properties. Therefore, to ensure tractability and interpretability, we introduce the latent variable [16, 12]. Letting , we consider a model by setting the distribution as follows:

 r|w ∼Bernoulli(w) (14) x|λ,r ∼{Degenerate(x=0)(r=0)Poisson(λ)(r=1). (15)

In Equation (15), the value of generated by Equation (14) determines whether the data follow the Poisson distribution or the Degenerate distribution for . Thus, given and , the joint probability of and is

 P(x,r|λ,w) =P(x|λ,r)P(r|w) =(λxe−λx!)rI(x=0)1−rwr(1−w)1−r ={wλxe−λx!}r{(1−w)I(x=0)}1−r. (16)

In this way, the probability can be represented in the form of a product, and the joint probability of the parameters and data (i.e.,

) is also expressed as a product. This solves the problem of the joint distribution in Equation (

13) being expressed as a sum.

Furthermore, when we sum out all the possible realized values of in Equation (16), we obtain the formula on the left-hand side of the ZIP as follows:

 ∑rP(x,r|λ,w)=(1−w)I(x=0)+wλxe−λx!.

The resulting formula is equivalent to the ZIP in Equation (13). In other words, by assuming Equations (14) and (15), we obtain a model with the property of the ZIP while deriving an explicit posterior distribution. For this reason, the distributions in Equations (14) and (15) are assumed to hold in Equations (10) and (12).

Since Equations (6), (7), (8), and (11) are equivalent to the dPIRM, we explain Equations (9), (10), and (12) in detail. In Equation (9), we use the Beta distribution for the prior distribution of

. Since the Beta distribution is the conjugate prior distribution of the Bernoulli distribution, we can analytically derive the posterior distribution, with

and as the shape parameters. In Equations (10) and (12), determines whether follows the Degenerate distribution for or the Poisson distribution, following (A4). This indicates that in this model, zero data, which is generated from the Degenerate distribution (), is not taken into account for estimation. In other words, when estimating the parameters in the dZIPIRM, we can remove the effect of zeros that follow the Degenerate distribution. This intuitive interpretation will be justified by Theorem 1 in Section 3.2.

Figure 1

illustrates the parameter relations in the PIRM, dPIRM, and dZIPIRM. The circle nodes denote the variables, square nodes are the hyperparameters, and shaded nodes indicate the observations. The figure illustrates the differences between the data generation processes considered in each model, showing that the dZIPIRM has a structure in which

depends on .

## 3 Sampling parameters

In this section, we derive the full conditional distribution of the parameter of the dPIRM as well as the parameters , , and of the dZIPIRM. We then explain their properties. Deriving the full conditional distributions allows us to understand the properties of the parameters when the other parameters and data are fixed. Moreover, by comparing for the dZIPIRM and dPIRM, we can appreciate how the models differ.

We use beam sampling [19, 7] to sample the parameters, as it is faster and more efficient than Gibbs sampling when there is strong series correlation between and . Indeed, convergence is significantly slowed in this case for Gibbs sampling.

### 3.1 Derivation of the full conditional distribution

In this section, we derive the full conditional distribution of the parameters in the dPIRM and , , and in the dZIPIRM. Since the full conditional distribution of the other parameters in the dPIRM and dZIPIRM are the same as in the dIRM, we refer the reader to  for the derivation of those distributions. In the derivation of the full conditional distribution, it is sufficient to consider the proportionality relation between the parameters for the calculation. To simplify the notation, we define the following set, .

First, we explain the parameters of the dPIRM. in the dPIRM is represented by the following proportionality relation:

 h(λ|x,Z) ∝λa−1kℓe−bλkℓ∏t∏i;zti=k∏j;ztj=ℓe−λztiztjλxtijztiztj ∝λ(a+∑t,i,j∈Ckℓxtij)−1kℓe−[b+∑t,i,jI{t,i,j∈Ckℓ}]λkℓ.

Therefore, the full conditional distribution of in the dPIRM becomes

 λkℓ ∼Gamma⎛⎝a+∑t,i,j∈Ckℓxtij,b+∑tijI{t,i,j∈Ckℓ}⎞⎠. (17)

Next, we derive the full conditional distributions of , , and in the dZIPIRM. Unlike the dPIRM, the derivation of these full conditional distributions is not straightforward. However, thanks to the formulation of the dZIPIRM in Section 2, setting conjugate prior distributions for the prior distribution of the parameters allows us to analytically derive the full conditional distribution of the parameters in the dZIPIRM. First, in the dZIPIRM is represented by the following proportionality relation:

 h(λ|w,r,x,Z) ∝λa−1kℓe−bλkℓ∏t∏i;zti=k∏j;ztj=ℓe−λztiztjrtijλrtijxtijztiztj ∝λ(a+∑t,i,j∈Ckℓrtijxtij)−1kℓe−[b+∑t,i,j∈Ckℓrtij]λkℓ.

Following a similar reasoning as in Equation (17), the full conditional distribution of in the dZIPIRM may be expressed as follows:

 λkℓ ∼Gamma⎛⎝a+∑t,i,j∈Ckℓrtijxtij,b+∑t,i,j∈Ckℓrtij⎞⎠. (18)

This expression may be further simplified as follows. From Equation (12):

 xtij=0⟹{rtij=1(Poisson)rtij=0(Degenerate),xtij>0⟹rtij=1(Poisson).

Using this, we obtain

 ∑t,i,j∈Ckℓrtijxtij=∑t,i,j∈Ckℓxtij, (19)

which we prove in Appendix. Using Equation (19), Equation (18) may be rewritten as follows:

 λkℓ ∼Gamma⎛⎝a+∑t,i,j∈Ckℓxtij,b+∑t,i,j∈Ckℓrtij⎞⎠. (20)

This expression is useful to compare the properties of the dPIRM and dZIPIRM, as explained in Section 3.2.

Next, for the parameter of the dZIPIRM, we use the proportionality relation:

 h(w|λ,r,x) ∝∏t∏i∏jwrtijtij(1−wtij)1−rtijwc−1tij(1−wtij)d−1 ∝∏t∏iwc+rtij−1tij(1−wtij)d+1−rtij−1.

Therefore, the full conditional distribution of becomes the Beta distribution as follows:

 wtij ∼Beta(c+rtij,d+(1−rtij)). (21)

Finally, we consider the full conditional distribution of the latent variable , for which we consider the proportionality relation:

 h(r|x,λ,w) ∝∏t∏i∏j(e−λztiztj(λztiztj)xtij)rtij ×(I(xtij=0))(1−rtij)wrtijtij(1−wtij)1−rtij ∝∏t∏i∏j{wtije−λztiztj(λztiztj)xtij}rtij ×{(1−wtij)I(xtij=0)}(1−rtij).

The full conditional distribution of becomes

 rtij ∼Bernoulli⎛⎝wtije−λztiztj(λztiztj)xtijwtije−λztiztj(λztiztj)xtij+(1−wtij)I(xtij=0)⎞⎠.

### 3.2 Interpreting the properties of the model using the full conditional distribution

In this section, we clarify and compare the properties of the two models by studying the full conditional distributions of the parameters derived in Section 3.1. First, we compare

in the dPIRM and dZIPIRM to explain the properties of each model. The expected value and variance of

in the dPIRM from Equation (17), denoted by , is expressed as follows:

 (22)

and represent the total value and number of belonging to the th row and th column cluster, respectively. Ignoring and , we can interpret Equation (22) as indicating that the expected value of is the average value of belonging to the th row and th column cluster through time. Note that and are hyperparameters that adjust the average value and must be determined in advance. Further, from Equation (22), we can see how and affect the estimated . Specifically, becomes larger when , while becomes smaller when . In addition, the larger the values of and , the larger the amount of prior information. Thus, by deriving explicit relationship between parameters and hyperparameters, users can have intuition for how hyperparameters affect the estimation, and then it helps them select hyperparameters.

Next, we consider in Equation (20) of the dZIPIRM, denoted by . As before, the expected value and variance of are derived to be

 E[λ(dZIP)kℓ]=a+∑t,i,j∈Ckℓxtijb+∑t,i,j∈Ckℓrtij,V[λ(dZIP)kℓ]=a+∑t,i,j∈Ckℓxtij(b+∑t,i,j∈Ckℓrtij)2. (23)

The numerator in Equation (23) is the same as in Equation (22). represents the number of generated from the Poisson distribution (), which belongs to the th row and th column cluster. Ignoring and , Equation (23) indicates the average value of considered to be generated from the Poisson distribution (), which belongs to the th row and th column cluster over time.

From Equations (22) and (23), when we compare in the dPIRM and dZIPIRM, the following theorem is derived.

###### Theorem 1.
 E[λ(dP)kℓ] =αkℓE[λ(dZIP)kℓ], (24) V[λ(dP)kℓ] =α2kℓV[λ(dZIP)kℓ], (25)

where

 αkℓ=b+∑t,i,j∈Ckℓrtijb+∑t,i,jI{t,i,j∈Ckℓ},

and then .

The proof of Theorem 1 is presented in Appendix.

From Equation (24) we obtain an intuitive interpretation of . Ignoring , represents the proportion of data that belongs to the th row and th column cluster and that are generated from a Poisson distribution (). That is, in the dZIPIRM, the more data in a cluster generated from the Degenerate distribution, the larger tends to be compared with . Conversely, the more data generated from the Poisson distribution, the closer tends to be to . In short, under the estimation framework using the full conditional distribution, the estimated tends to become larger than because in the dZIPIRM, the zero data, which are considered to be generated from the Degenerate distribution, are not taken into account to estimate . This theoretical result is consistent with the model interpretation in Section 2.

From Equation (25), we may also say that the dispersion of tends to be larger than that of . However, as shown in the simulation study in Section 4, this result does not necessarily affect the clustering result.

Next, we consider in Equation (21). Again, the expected value and variance of is expressed as

 E[wtij]=c+rtijc+d+1,V[wtij]=(c+rtij)(d+(1−rtij))(c+d+1)2(c+d+2). (26)

From Equation (26), we can see how the hyperparameters and affect the estimation. Especially, if and are set to be equal and small, the estimation of is less affected by the hyperparameters and .

## 4 Simulation Study

In this section, we conduct a simulation to evaluate the fit to the data and clustering accuracy of the dPIRM and dZIPIRM.

### 4.1 Data generation and setting

We prepared several synthetic datasets. To create these datasets, we first stipulated that the number of clusters is , the number of time steps is , and the number of objects is . Next, we generated

from the discrete uniform distribution

(, ), and then from (, ), as shown in Figure 2. When the time changes from to , we randomly move objects to other clusters in three patterns of movement ratio , where . Finally, we randomly convert the numerical value of the data into zero with a zero ratio , where . We thus generate data for all nine () settings.

To estimate the parameters, we need to prespecify the hyperparameters. In Section 3.2, we showed the effect of the hyperparameters on the expected value. In this simulation, since there is no prior information, we reduce the influence of the hyperparameters as much as possible. Therefore, to reduce the influence of hyperparameters, we set . Similarly, we set as well as .

We compare the PIRM, dPIRM, and dZIPIRM in this scenario by sampling the parameters 300 times for each method. We use two measurements for the evaluation. One is the Rand index (RI), which computes the similarity between the true and estimated clustering results [13, 7]. The RI is coded one (zero) when two clustering results match (do not match). As with the computation of the RI in the dIRM, we compute the RI between the assignment of correct answers and the estimated for each time step, and then average the RI for time points. For the evaluation of cluster allocation in PIRM, where one cluster assignment is obtained in each time step (and thus T cluster assignments in total), we calculate RIs between the cluster assignment obtained by PIRM and true cluster assignment at each time point, and pick up the highest RI result of all RI results. The second measurement is log-likelihood, for which a large value indicates that the model fits the data well .

We calculate the mean and standard deviation for the two measurements above for the data generated for 50 data points in each of the nine settings described earlier.

### 4.2 Results

Tables 1 and 2 show the results for the RI and log-likelihood, respectively. represents the rate at which the object moved between clusters at each time step and represents the ratio of objects converted into zeros.

As shown in Table 1, the RIs of the dPIRM and dZIPIRM are higher than that of the PIRM in every setting. This finding means that models incorporating time structures such as the dPIRM and dZIPIRM can cluster data better than those that cluster data at each time step, the PIRM. When comparing the dPIRM and dZIPIRM, the dZIPIRM has a higher RI than the dPIRM in several settings. Furthermore, the dZIPIRM often yields a lower standard deviation than the dPIRM. This finding means that the dZIPIRM can perform stable clustering with accurate estimates even when many zero values exist in the data.

Table 2 shows that the dPIRM and dZIPIRM yield higher log-likelihood values than the PIRM in any setting, just as with the RI. This means that we can better describe the assumed scenario by employing models incorporating time structures such as the dPIRM and dZIPIRM. When , the dZIPIRM yields high log-likelihood values for every zero ratio. This finding shows that the dZIPIRM fits the data well when considering zero-inflated data for objects unlikely to move at each time step. Further, the dZIPIRM has higher log-likelihood values than the dPIRM for many patterns, like the RI, while the standard deviation is similar between the dPIRM and dZIPIRM, unlike the RI.

These results suggest that both the dPIRM and the dZIPIRM are superior to the PIRM in terms of RI and log-likelihood for this simulation setting. In addition, the dZIPIRM provides an accurate and stable clustering allocation as well as a good data fit compared with the dPIRM when time structures and zero-inflated data are assumed.

## 5 Application

In this section, we illustrate how the dPIRM and dZIPIRM may be applied to POS data to identify a group of products likely to be purchased simultaneously.

### 5.1 Data and setting

We analyze the POS data of a fashion brand in Japan, provided by Sothink Co., Ltd. These data provide information for two store types: shop-in-shop and franchise. Since fashion brand products are sensitive to the influence of the season in Japan, it is also important to consider marketing strategy changes depending on the season.

The data cover 78,807 customers and 12,937 products. Product categories include short pants, small items, and short-sleeved T-shirts. We view product categories as objects in the model and count simultaneous purchases by product category.

We formulate a matrix showing the number of simultaneous purchases of 40 types of objects every month from January to June 2016 (, ). That is, we create an relational count data matrix for time steps. The zero-element ratio for the entire matrix is 0.52. We analyze the resulting dataset using the dPIRM and dZIPIRM. Since there is no prior hyperparameter information, we reuse the simulation approach.

### 5.2 Results

In this section, we first compare the log-likelihood values of the PIRM, dPIRM, and dZIPIRM to assess the fit of each model.

Table 3 shows that the dZIPIRM has the highest log-likelihood, meaning that it provides the better fit to the data. Hence, assuming a ZIP and a time-varying structure improves the overall fit. Figure 3: Heatmap of the estimated λkℓ in the dPIRM Figure 4: Total number of objects belonging to the clusters in the dPIRM

Next, we show the results for the dPIRM. The number of clusters in the dPIRM is estimated to be six. Figure 3 shows the estimated corresponding to each cluster combination, while Figure 4 shows the total number of objects in each cluster for the six-month sample period. Table 4 shows the objects belonging to the corresponding clusters. Since the parameter represents the strength of the relationship, if the value of the estimated is large, the objects in Cluster and Cluster are more likely to be purchased simultaneously.

Figure 3 shows that the objects in Cluster 1 are more likely to be purchased with those in Cluster 1 simultaneously (i.e., is high). Further, the objects in Cluster 3 are more likely to be purchased simultaneously with those in Cluster 1 and Cluster 5 (i.e., and are high), while they are unlikely to be purchased with other items in Cluster 3 (i.e., is low). The items in Cluster 5 are more likely to be bought with items in Cluster 1, followed in order of those in Clusters 5 and 3 (i.e., ).

From the number of objects belonging to each cluster in Figure 4, we see that the number in Cluster 1 is much larger than that for the other clusters. Therefore, the objects in Cluster 1 are more likely to be purchased simultaneously with several other types of objects. As shown in Table 4, there are two types of objects in Cluster 1: objects that remain in Cluster 1 for a long period and items that move from or to Cluster 1 depending on the season.

Examples of objects that remain in Cluster 1 for a long period include small items such as handkerchiefs and necklaces, which may be interpreted as objects likely to be purchased along with other items throughout the year. We may also interpret items such as ties as being more likely to be bought with a suit throughout the year.

Meanwhile, items that move from or to Cluster 1 depending on the season, such as slacks and short-sleeved T-shirts, belong to Clusters 3 and 5 at the beginning of the year, but move to Cluster 1 from April onwards. This date is the turning point of the season in Japan, when spring starts and temperatures rise. This means that slacks and short-sleeved T-shirts are not purchased much in the winter, whereas in the spring they are more likely to be bought along with other objects. Figure 5: Heatmap of the estimated λkℓ in the dZIPIRM Figure 6: Total number of objects belonging to the clusters in the dZIPIRM

We now show the results for the dZIPIRM. The estimated number of clusters is 13. Figures 5, 6 and Table 5 for the dZIPIRM follow Figures 3, 4 and Table 4 for the dPIRM. Figure 5 shows that the number of clusters in the dZIPIRM is larger than that in the dPIRM because of the increase in cluster expressiveness, which in turn is a consequence of the additional flexibility provided by the assumption that there are many zeros.

Beyond the estimated number of clusters, the results for the dZIPIRM are similar as those for the dPIRM. For example, Cluster 1 of the dZIPIRM is similar to Cluster 1 of the dPIRM (e.g., always including small items, ties, and suits), while some objects move from cluster to cluster depending on the season. However, using the dZIPIRM, we may interpret the result in more detail. For example, the dPIRM could not distinguish between the behavior of slacks and short-sleeved T-shirts, whereas the dZIPIRM can interpret the difference between these objects. Specifically, in January, the category “slacks” belongs to Cluster 12, while the category “short-sleeved T-shirts” belongs to Cluster 10. Products in Cluster 12 are unlikely to be purchased with objects in any cluster. Meanwhile, objects in Cluster 10 are likely to be purchased with objects in Cluster 2 (e.g., casual shirts and jeans). Thus, by assuming a zero-inflated setting using the dZIPIRM, we obtain a result that is easier to interpret.

From the above analysis, we may conclude that since the dPIRM yields fewer clusters, only rough trends may be modeled. However, as the dZIPIRM yields more clusters, a more detailed interpretation of the clusters may be obtained. As Table 3 shows, the dZIPIRM fits the data better than the dPIRM. Despite this, as is typically the case with empirical data, it is difficult to objectively validate which method should be preferred. Nevertheless, this empirical application shows that both the dPIRM and the dZIPIRM allow us to see how customer buying behavior changes with the seasons. In addition, by using the dZIPIRM, we obtain a clustering result that provides a more detailed interpretation of the clusters when the data contain many zeros.

## 6 Conclusion and Discussion

We proposed two bi-clustering methods for count data considering changes over time. The first method, the dPIRM, enabled the bi-clustering of time-varying count data by using the time structure and distribution information for these data. The second method, the dZIPIRM, extended the dPIRM to express zero-inflated real data. To derive the posterior distribution and clarify the properties of the parameters, zero-inflated data were formulated by using latent variables.

We derived the explicit posterior distributions of the parameters and outlined their properties under the estimation framework using a full conditional distribution. Specifically, by comparing the expected value and variance of for both the dPIRM and the dZIPIRM, we showed that how the specific model assumptions affect the estimation result. We also explained the theoretical relationship between the two models. Although this theorem is specifically derived to compare our two proposed model, dPIRM and dZIPIRM, this could be also applied for other ZIP-related model.

By incorporating a time structure, we further showed that both the dPIRM and the dZIPIRM outperform the PIRM in terms of the RI and log-likelihood values in a simulation study. We also observed that the dZIPIRM yields more stable and better clustering allocations as well as a better data fit than the dPIRM.

For the empirical data, the dZIPIRM, which assumes the properties of the ZIP and a time-varying structure, resulted in the best data fit. Moreover, the clustering result of the dZIPIRM is similar to that of the dPIRM. However, using the dZIPIRM allows us to interpret the cluster structure in more detail, as this model relies on more flexible assumptions than the dPIRM.

We consider two future works. At first, although the row and column objects in the dPIRM and dZIPIRM may differ, for comprehensibility, we analyzed the data using the same object for the rows and columns. Therefore, future work will focus on real data in situations where the objects of the rows and columns are different.

Future research will also study how the choice of hyperparameters affects the estimation results in a simulation setting (i.e., how the estimation result changes depending on the choice of hyperparameters). In Section 3.2, we referred to the relationship between the hyperparameters and parameters, and we believe further work on this area is needed to provide users with better guidance about hyperparameter choice.

## Appendix

In this Appendix, we show proofs for Equation (19) and Theorem 1. First, Equation (19) can be derived as follows.

###### Proof.
 ∑t,i,j∈Ckℓrtijxtij =∑t,i,j∈Ckℓ;xtij=0rtijxtij+∑t,i,j∈Ckℓ;xtij>0rtijxtij =∑t,i,j∈Ckℓ;xtij>0rtijxtij =∑t,i,j∈Ckℓ;xtij>0xtij(xtij>0⟹rtij=1) =∑t,i,j∈Ckℓ;xtij=0xtij+∑t,i,j∈Ckℓ;xtij>0xtij =∑t,i,j∈Ckℓxtij

Next, proof of Theorem 1 is derived as follows.

###### Proof.
 ∑t,i,j∈Ckℓrtij=∑t,i,jI{t,i,j∈Ckℓ}−∑t,i,j∈Ckℓ(1−rtij) (27)

Here, we derive Eq. (24)

 E[λ(dP)kℓ] =a+∑t,i,j∈Ckℓxtijb+∑t,i,jI{t,i,j∈Ckℓ} =a+∑t,i,j∈Ckℓxtijb+∑t,i,j∈Ckℓrtij +(a+∑t,i,j∈Ckℓxtijb+∑t,i,jI{t,i,j∈Ckℓ}−a+∑t,i,j∈Ckℓxtijb+∑t,i,j∈Ckℓrtij) =E[λ(dZIP)kℓ]−[⎛⎝a+∑t,i,j∈Ckℓxtij⎞⎠ (byEq.(???)) =(1−∑t,i,j∈Ckℓ(1−rtij)b+∑t,i,jI{t,i,j∈Ckℓ})E[λ(dZIP)kℓ] =(b+∑t,i,j∈Ckℓrtijb+∑t,i,jI{t,i,j∈Ckℓ})E[λ(dZIP)kℓ]

Here, we derive Eq. (25)

 V[λ(dP)kℓ] =a+∑t,i,j∈Ckℓxtij(b+∑t,i,jI{t,i,j∈Ckℓ})2 =a+∑t,i,j∈Ckℓxtij(b+∑t,i,j∈Ckℓrtij)2 +⎛⎜ ⎜⎝a+∑t,i,j∈Ckℓxtij(b+∑t,i,jI{t,i,j∈Ckℓ})2−a+∑t,i,j∈Ckℓxtij(b+∑t,i,j∈Ckℓrtij)2⎞⎟ ⎟⎠ =V[λ(dZIP)kℓ]−[⎛⎝a+∑t,i,j∈Ckℓxtij⎞⎠ ×(b+∑t,i,jI{t,i,j∈Ckℓ})2−(b+∑t,i,j∈Ckℓrtij)2(b+∑t,i,jI{t,i,j∈Ckℓ})2(b+∑t,i,j∈Ckℓrtij)2] =V[λ(dZIP)kℓ] −[(2b+∑t,i,jI{t,i,j∈Ckℓ}+∑t,i,j∈Ckℓrtij)(b+∑t,i,jI{t,i,j∈Ckℓ})2 ×∑t,i,j∈Ckℓ(1−rtij)V[λ(dZIP)kℓ]] (byEq.(???)) =(b+∑t,i,j∈Ckℓrtijb+∑t,i,jI{t,i,j∈Ckℓ})2V[λ(dZIP)kℓ]
 0 ≤∑t,i,j∈Ckℓrtij≤∑t,i,jI{t,i,j∈Ckℓ}(Eq.(???)) 0 ≤b+∑t,i,j∈Ckℓrtijb+∑t,i,jI{t,i,j∈Ckℓ}≤1

## References

•  Angers, J. F., Biswas, A., 2003. A Bayesian analysis of zero-inflated generalized Poisson model. Computational Statistics and Data Analysis, 42, 37–46.
•  Bhattacharya, A., Clarke, B. S., Datta, G., 2008. A Bayesian test for excess zeros in a zero-inflated power series distribution. IMS Collections, 1, 89–104.
•  Fox, E. B., Sudderth, E. B., Jordan, M. I., Willsky, A. S., 2008. An HDP-HMM for systems with state persistence. In

Proceedings of the 25th International Conference on Machine Learning

, 312–319.
•  Fox, J. P., 2013. Multivariate zero-inflated modeling with latent predictors: modeling feedback behavior. Computational Statistics and Data Analysis, 68, 361–374.
•  Fu, W., Song, L., Xing, E. P., 2009. Dynamic mixed membership blockmodel for evolving networks. In Proceedings of the 26th Annual International Conference on Machine Learning, 329–336.
•  Hartigan, J. A., 1972. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337), 123–129.
•  Ishiguro, K., Iwata, T., Ueda, N., Tenenbaum, J. B., 2010. Dynamic infinite relational model for time-varying relational data analysis. In Proceedings of Advances in Neural Information Processing Systems, 919–927.
•  Ishiguro, K., Ueda, N., Sawada, H., 2012. Subset infinite relational models. In

Proceedings of Artificial Intelligence and Statistics

, 547–555.
•  Kemp, C., Tenenbaum, J. B., Griffiths, T. L., Yamada, T., Ueda, N., 2006. Learning systems of concepts with an infinite relational model. In Proceedings of the 21st National Conference on Artificial Intelligence, 381–388.
•  Lambert, D., 1992. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34(1), 1–14.
•  Lee, K., Joo, Y., Song, J. J., Harper, D. W., 2011. Analysis of zero-inflated clustered count data: a marginalized model approach. Computational Statistics and Data Analysis, 55, 824–837.
•  Liu, Y., Tian, G. L., 2015. Type I multivariate zero-inflated Poisson distribution with applications. Computational Statistics and Data Analysis, 83, 200–222.
•  Rand, W. M., 1971. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66, 846–850.
•  Boris, M., 1996. Mathematical Classification and Clustering. Kluwer academic publishers, Norwell, USA.
•  Nowicki, K., Snijders, T. A. B., 2001. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455), 1077–1087.
• 

Olteanu, M., Ridgway, J., 2012. Hidden markov models for time series of counts with excess zeros. In

Proceedings of 20th European Symposium on Artificial Neural Networks

, 133–138.
•  Sethuraman, J., 1994. A constructive definition of Dirichlet piors. Statistica Sinica, 4, 639–650.
•  Teh, Y. W., Jordan, M. I., Beal, M. J., Blei, D. M., 2007. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 1566–1581.
•  Van Gael, J., Saatci, Y., Teh, Y. W., Ghahramani, Z., 2008. Beam sampling for the infinite hidden Markov model. In Proceedings of the 25th International Conference on Machine Learning, 1088–1095.
•  Wang, E., Liu, D., Sliva, J., Carin, L., Dunson, D. B., 2010. Joint analysis of time-evolving binary matrices and associated documents. In Proceedings of Advances in Neural Information Processing Systems, 2370–2378.
•  Wu, X., Li, H., 2017. Topic mover’s distance based document classification. 2017 IEEE 17th International Conference on Communication Technology (ICCT), 1998–2002.
•  Yang, T., Chi, Y., Zhu, S., Gong, Y., Jin, R., 2009. A Bayesian approach toward finding communities and their evolutions in dynamic social networks. In Proceedings of the 2009 SIAM International Conference on Data Mining, 990–1001.