    # A model selection approach for clustering a multinomial sequence with non-negative factorization

We consider a problem of clustering a sequence of multinomial observations by way of a model selection criterion. We propose a form of a penalty term for the model selection procedure. Our approach subsumes both the conventional AIC and BIC criteria but also extends the conventional criteria in a way that it can be applicable also to a sequence of sparse multinomial observations, where even within a same cluster, the number of multinomial trials may be different for different observations. In addition, as a preliminary estimation step to maximum likelihood estimation, and more generally, to maximum L_q estimation, we propose to use reduced rank projection in combination with non-negative factorization. We motivate our approach by showing that our model selection criterion and preliminary estimation step yield consistent estimates under simplifying assumptions. We also illustrate our approach through numerical experiments using real and simulated data.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We consider a problem of clustering a sequence of multinomial observations. To be specific, consider a sequence

of independent multinomial random vectors taking values in

for some , where . For each , each is a multinomial random vector such that the number of trials is

and the success probability vector is

. To simplify our notation, we write . We allow the value of to depend on the value of and similarly, we allow the value of to depend on the value of . Moreover, to formulate our clustering problem, we assume that there is a finite collection of -dimensional probability vectors such that . Since , it follows that for each , there exists such that . For each , we let provided that , and to simplify our notation, we may also write to mean .

We assume that the value of , and are unknown, but the value of is observed and forms the basis for statistical inference. Let , and let be the set of all possible values for . Note that can be represented with an element in and can be represented with an element in , i.e., a non-negative matrix, where each column sums to . Since the value of is assumed to be unknown, from a parameter estimation point of view, one also must consider the set for all as a potential set to which the true parameter belongs. Then, we let

 Θ=∪TK=1Θ(K). (1)

Henceforth, we write and for the parameter that generates the data . The estimates of , and are denoted by , and respectively, and we now use the letters , , for a generic value that can take respectively.

In this paper, we propose to take , and to be solutions to the optimization problems specified in (4) and (5). To this end, the rest of this paper is organized as follows.

In Section 2.1, we present the overall description of our approach, specifically introducing (4) and (5). In Section 2.2, we present a preliminary estimation technique, which can be used prior to performing a numerical search for the solution to (4). In Section 2.3, we specify the penalty term for our model selection criteria in (5).

In Section 3.1, we motivate our choice of the penalty term in (9) through Theorem 1 and 2. In Section 3.2, we motivate, in Theorem 3, our usage of the reduced rank projection step within our estimation steps.

In Section 4

, we compare our model selection criterion with the conventional AIC via a Monte Carlo simulation experiment. We also study, through our approach, a two-sample test problem for comparing two graphs. This is done using simulated data sets, as well as a real data set involving the chemical and electrical connectivity structure of neurons of a C. elegan. Lastly, we apply our technique to a problem of determining the model dimension associated with the so-called Swimmer dataset which is well known to the non-negative factorization community (c.f.

)

## 2 Background materials

### 2.1 General framework

To begin, we represent the sequence

as an integer-valued random matrix

so that the element in the th column of is . With slight abuse of notation, we denote the th row and the th column of by . Since the sample value of is known, it follows that the sample values of are known. Then, it follows that, denoting by the matrix whose th column is given by , we have

 Eθ[Xdiag(N1,…,NT)−1]=P, (2)

where is a diagonal matrix such that its th diagonal is and the expectation is taken with respect to the probability measure specified by . Moreover, in general, it can be seen that can be factored as a product of two column stochastic non-negative matrices, namely, and . Specifically,

 P=WH, (3)

where for each , the th column of is and for each , the th column of is the basis vector of such that its -th entry is if and only if .

In light of (2) and (3), when is observed without noise, i.e., , and given that the value of is known, application of a non-negative factorization algorithm can recover and from . However, in general, is random. Specifically, we have that for each ,

 fX(X1,X2,…,XT|θ)=T∏t=1((NtXt)d∏i=1PXitit),

where for simplicity, we write

 (NtXt)=(NtX1t,X2t,…,Xdt).

Alternatively, we may also write, by grouping according to the value of , that

 fX(X1,X2,…,XT|θ)=(K∏k=1d∏i=1Q∑t∈k(κ)Xitik)T∏t=1(NtXt).

For simplicity, we may write

 N=T∑t=1Nt.

Then, for each , let be an maximum estimate of with the restriction that the value of must be an element of , for some value of (c.f. ). Specifically, for each and , we denote by , an maximum estimate of given , and we have

 ˆθ(K;q)∈argmaxθ∈Θ(K)T∑t=1d∑i=1Xi(t)⎛⎜⎝Q1−qi,κ(t)−11−q⎞⎟⎠. (4)

Note that by taking to in limit, then we see that

 limq→1T∑t=1d∑i=1Xi(t)⎛⎜⎝Q1−qi,κ(t)−11−q⎞⎟⎠ = T∑t=1d∑i=1Xi(t)log(Qi,κ(t)),

and as such, reduces to a maximum likelihood estimate. When the value of is clear from context, we suppress from and write instead.

Then, we let be the smallest values of that minimizes the value of the following expression:

 Δ(K):=T∑t=1Dt(Xt,ˆQˆκ(t))+penalty(K), (5)

where is assumed to be chosen a priori, , and

 Dt(Xt,ˆQˆκ(t)) :=DKL(˜Pt||ˆQˆκ(t))−d∑i=1˜Pitlog(˜Pit) =−d∑i=1˜Pitlog(ˆQiˆκ(t)),

denoting by

from . For reference, we let

 D(ˆQ,ˆκ):=D(ˆQ,ˆκ;X):=T∑t=1Dt(Xt,ˆQˆκ(t)).

### 2.2 Preliminary estimation prior to MLqE

For our model selection problem, for each , an estimator of as function of must minimize the size of error while also allowing for non-negative factorization, i.e., where and are and non-negative matrices. Directly computing an to achieve this can be done numerically with varying degree of complexity, but in all cases, starting the search for near can be beneficial. To achieve this approximately for initializing our search algorithm for numerical experiments, we propose a multi-step procedure in which and are used together. For more details on (and also on USVT, a related approach), we direct the reader to  (and ), and for , to ,  and .

Iteratively searching for a solution to the estimation problem in (4) can be computationally expensive. An approximate solution, which can be used as the initial point of the search, can be obtained in four steps, which are listed in Algorithm 1 collectively for convenience.

First we take a reduced rank projection of at the rank

. Specifically, we first compute the singular value decomposition of

with the diagonal of being sorted in a decreasing order, e.g. . Then, we take

 ˆX=OptSpace(X;K):=ˆUˆΣˆV⊤, (6)

where is the upper block of , and and are the first columns of and respectively.

Because need not be non-negative, we then reset the negative entries of to zero. However, the resetting the negative entries of to zero can change the rank of .

To correct this, after computing , we further perform non-negative factorization, which means we minimize by running over all possible pairs of matrix and matrix (c.f. ).

In the last step, we define by letting, for each , if and only if for all , where a tie, if any exists, is resolved by uniformly choosing among the tied indices. Then, estimate .

The aforementioned steps for obtaining and are collectively denoted as in the listing of Algorithm 1.

Upon obtaining the initial value of , one can perform a numerical iterative search for , for example, using a variational method, an EM algorithm, an MCMC method, or a brute force iterative search. For an interested reader, in Section D, we outline an objective function to be maximized for an MCMC approach. In all cases, it is known that a good initialization of the chosen algorithm can improve its rate of convergence as well as allowing the algorithm to avoid a local stationary point. On the other hand, the number of clusters to be estimated still needs to be supplied, for each of these algorithms.

### 2.3 Model selection criterion

For many application, the following standard model selection criteria are often used:

 ΔAIC(K)=−log(fX(X|ˆθ(K)))+penaltyAIC(K), (7) ΔBIC(K)=−log(fX(X|ˆθ(K)))+penaltyBIC(K), (8)

where

, ,

and is chosen to be an MLE. Their derivation is based on analysis of an appropriate expected discrepancy (c.f. ) for a Gaussian regression model. In this section, we also follow this general approach, catering to our model.

Our model-based information criterion is obtained by appropriately penalizing the weighted log-likelihood of the multinomial model as specified in (5). Specifically, we consider, for each and ,

 penalty(K;s,γ):=γK∑k=1ˆZ(K)k−1(ˆN(K)k)s, (9)

where is the estimate of assuming that , is the matrix such that , , . In words, is the number of “successes” from the th cluster specified by . Also, counts the number of non-zero entries from the th cluster’s success probability vector , as specified by .

We detail our motivation for (9) by way of Theorem 1 and 2. Intuitively, as increases, the term in (9) is expected to increase for a larger value of especially when and for for many values of . In other words, when some columns of are “overly” similar to each other, the penalty term becomes more prominent (c.f. Table IV).

In Section E, we reduce (9) to and in (7) and (8) respectively, under some simplifying assumptions. However, for clustering a sequence of sparse multinomial data, the penalty terms in (7) and (8) that are appropriate for classical normal regression problems, can over-penalize, especially when the probability vectors contain many zeros (c.f. Figure 3).

## 3 Theoretical results

The main theoretical results of this paper are twofold. First, we motivate a particular choice of the form of the penalty term in (9

) through an asymptotic analysis of

, under simplifying assumptions that , and . Following , we use the superscripted as a mnemonic for “merging”, and use the superscripted as a mnemonic for “splitting”. Second, we motivate the reduced rank projection approach for initializing the numerical search of the maximum likelihood estimate .

### 3.1 Asymptotic derivation of the penalty term

In this section, we motivate a specific choice for the penalty term, , by computing the asymptotic form of the expected weighted discrepancy of while taking the value of to along some sequence of index . Let where is associated with some through (2) and (3), the expectation is taken with respect to , whence , and .

###### Theorem 1.

Suppose that

1. for each , we have such that for each , the number of trials is the same for all ,

2. for each , there exists such that for each

 ¯¯¯λk=limℓN(ℓ)t/ℓ.

Then,

 limℓ→∞ℓ(E[φ(ˆP)]−φ(P∗))=12K∗∑k=1¯¯¯¯Z∗k−1¯¯¯nk¯¯¯λk,

where and

 ¯¯¯¯Z∗k:=¯¯¯¯Z∗k(Q∗):=d∑i=11{W∗i,k>0}. (10)

Theorem 1 suggests for the value of the pair in (9). More importantly, we note that the non-zero entries do not contribute to the value of in (10).

For the rest of this section, we further study, through Theorem 2, the question of for what values of and , we can expect to see that chosen according to (5) with the penalty term specified by (9), estimate the true value with high probability.

Specifically, denoting by , in Theorem 2, we study the case in which choosing will lead to a model selection criterion that tends

(i)

not to under-estimate the value of when the alternative model is one that the st and the th clusters are “merged” into one,

(ii)

not to over-estimate the value of , when the alternative model is one that for some and , and , i.e., the th cluster is “split” into two.

###### Theorem 2.

Let and , where we let

1. be such that for all that and for all that or ,

2. be such that for and is a probability vector,

3. be such that for all that and for all that , with being surjective,

4. be such that for and then let .

Suppose that for each , and that

 DKL(θ∗||θ∗,m)≥minθ∈Θ(K∗−1)DKL(θ∗||θ)>0, (11) DKL(θ∗||θ∗,s)=minθ∈Θ(K∗+1)DKL(θ∗||θ). (12)

Then,

 limN→∞P[Δ∗,m(K∗−1)>Δ∗(K∗)]=1, (13) limN→∞P[Δ∗,s(K∗+1)>Δ∗(K∗)]=1, (14)

where

 Δ∗(K∗):=T∑t=1Dt(Xt,Q∗κ∗(t))+penalty∗(K∗), Δ∗,m(K∗−1):=T∑t=1Dt(Xt,Q∗,mκ∗,m(t))+penalty∗,m(K∗−1), Δ∗,s(K∗+1):=T∑t=1Dt(Xt,Q∗,sκ∗,s(t))+penalty∗,s(K∗+1),

with

 penalty∗(K∗):=log(N)K∗∑k=1¯¯¯¯Zk(Q∗)−1(∑t∈k(κ∗)Nt)1/2, penalty∗,m(K∗−1):=log(N)K∗−1∑k=1¯¯¯¯Zk(Q∗,m)−1(∑t∈k(κ∗,m)Nt)1/2, penalty∗,s(K∗+1)=log(N)K∗+1∑k=1¯¯¯¯Zk(Q∗,s)−1(∑t∈k(κ∗,s)Nt)1/2.

In other words, in Theorem 2, assuming that is such that its takes a form of and its takes a form of , it follows that as , with high probability, , suggesting . Similarly, in Theorem 2, assuming that is such that its takes a form of and its takes a form of , and that , it follows that as , with high probability, , suggesting . Also, (5) with the penalty term specified by (9) with yields the specific form of , and in Theorem 2.

While proven under simplifying assumptions, through Theorem 1 and 2, we propose to choose the value of to be for consistence estimation of .

On the other hand, as discussed in Section E, under some simplifying assumption, can be associated with the conventional AIC, and similarly, can be associated with the conventional BIC. For and , following the proof of Theorem 2, one can show results similar to (13) while the probability in (14) is positive but can be strictly less than . For our numerical experiments in Section 4, to be comparable to the conventional AIC, we take and we give a preference to a smaller value for if a near-tie occurs.

### 3.2 Reduced rank projection as a smoothing routine

The motivation behind using a reduced rank projection step is to remove random variation. As discussed in , when there is no missing entries in , algorithm is equivalent to performing reduced-rank projection (or equivalently, singular value thresholding at a fixed rank). Specifically, it can be seen from [3, Theorem 4.4], that provided that a random matrix is assumed to be bounded appropriately and that its entries

are independent random variables, using

yields a consistent estimate of under various conditions.

To give a more precise statement of our contribution on the topic, we begin by introducing some notation. Given a constant , for each and , let

 Yi,t:=Xi,t∧C:=min{Xi,t,C}.

Then, we let be the result of a single iteration of the singular value threholding of using the (true) value of of the matrix . We will suppress, in our notation, the dependence of , , and on for simplicity.

Truncating each at yields an estimate that is biased due to truncation while its effect may diminish as the value of increases. We present an asymptotic result in which is allowed to grow as a function of and under several simplifying assumptions.

Our first simplifying assumption is that the mean matrix has a “finite” block structure, or equivalently, a “finite” checker-board type pattern. Specifically, we assume that partition the index set , where the value of is constant and does not depend on , and , Next, we assume also that for each , there exists a pair such that for each , and . We assume that the values of are constant and do not depend on , and .

We suppress in our notation, the dependence of , , and on and for simplicity. Also, means that the pair is indexed by , so that .

###### Theorem 3.

Suppose that are independent Poisson random variables, and that the rank of is .

If and ,

then

 limT∧d→∞MSE(ˆY;X)=0,

where .

Note that generally, the rank of and for some cases, it is also possible to have the rank of . In Theorem 3, to simplify our analysis, we have assumed that the rank of is . Next, to consider Theorem 3 with respect to [3, Theorem 4.4], we note that the result in [3, Theorem 4.4] applies when the errors are independent while the entries of are correlated. Specifically, as a corollary to Theorem 3, we also have that, under the hypothesis of Theorem 3,

 limT∧d→∞1dTE[∥˜P−P∗∥2F]=0 (15)

where , since given the value of ,

 E[∥˜P−P∗∥2F|N]≤1minTt=1N2tE[∥ˆX−E[X]∥2F|N],

where . In this manner, in addition to giving a motivation to reduced rank projection as a smoothing routine, Theorem 3 can be of interest on its own.

## 4 Numerical results

### 4.1 Simulation experiments

#### 4.1.1 Simple experiment

In this section, through a simple numerical experiment, we reiterate our last observation made in Section 2.3. Consider a sequence of multinomial random vectors taking values in , where their (common) number of multinomial trials is . Specifically, the first success probability vector is proportional to the vector and the second probability vector is proportional to the vector , where for both cases, the number of entries with its value being is and the number of entries with its value being is . In other words, the value of is , whence in this case, our model selection procedure seeks to reach the minimum value of with . As shown in Table I, the value of is minimized at while the value of is minimized at .

More generally, in Table II, we allow the value of to grow, while keeping the first success probability vector to be a scalar multiple of the vector and keeping the second success probability vector to be a scalar multiple of , where for both cases, the number of entries with its value being is fixed at and the number of entries with its entries being and are the same or differ only by . A general pattern Table II is that for all values of , in comparison to , the conventional AIC, i.e. , performs poorly, and we attribute this to the fact that over-penalizes in comparison to .

#### 4.1.2 Comparison to rank determination strategies

We now present numerical results for comparing our approach to two conventional rank determination methods. Specifically, we denote our first baseline algorithm with (pamk o dist) and the second with (mclust o pca), where o denotes composition. These competing algorithms are often used in practice for choosing the rank of a (random) matrix. In comparison, we denote our model selection procedure by (aic o nmf). For (pamk o dist), one first computes the distance/dissimilarity matrix using pair-wise Euclidean/Frobenius distances between the columns of , and perform partition around medoids for clustering (c.f. ) together with “Silhouette” criterion (c.f. ) for deciding the number of clusters. For (mclust o pca), one first uses an “elbow-finding” algorithm to estimate the rank of the data matrix (c.f. ), say, by , and then, use a clustering algorithm (c.f. ) to cluster columns of into clusters.

The result of our experiment is summarized in Figure 1, which illustrates that in all cases, (aic o nmf) either outperforms or nearly on par with the two baseline algorithms.

To explain our result, we now specify the set-up for our Monte Carlo experiment. Our experiment is motivated by the real data experiment studied in Section 4.2, where a problem of comparing two graphs representing electrical and chemical neuron pathways of C. elegan is studied.

Specifically, we consider random graphs on vertices such that each has a block-structured pattern, i.e., a checker-board like pattern (c.f. Figure 3). For each , we take to be a (weighted) graph on vertices, where each is a Poisson random variable. To parameterize the block structures, we set the number of vertices , where .

The value of equals the number of rows in each block. Then, given a value for the intensity parameter , for each and , we let , where and are such that and . We take

 B(1):=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝0.10.0450.0150.190.0010.0450.050.0350.140.030.0150.0350.080.1050.040.190.140.1050.290.130.0010.030.040.130.09⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠, B(2):=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝0.190.140.290.1050.130.0010.030.130.040.090.0150.0350.1050.080.040.0450.050.140.0350.030.10.0450.190.0150.001⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠.

In Figure 1, the horizontal axis specifies the number of nodes after aggregation. For the level of aggregation (or equivalently, vertex-contraction), if the number of nodes after vertex-contraction is (i.e. the far right side of Figure 1), the original graph is reduced to a graph with vertices. Aggregation of edge weights is only done within the same block. Then, we take to be the non-negative matrix such that its th column is the vectorized version of the aggregation of . Our problem is then to estimate the number of clusters using data , where the correct value for is . In this particular case, is a rank- matrix, and as such, our problem can also be thought to be a problem of estimating the rank of after observing , whence (pamk o dist) and (mclust o pca) are applicable.

In summary, there are two parameters that we have varied, specifically, the level of intensity and the level of aggregation. The level of intensity is changed by the value of , which is distinguished in Figure 1 by the shape of points. Note that a bigger value for means more chance for each entry of taking a large integer value.

Then, as the performance index, we use the adjusted Rand index (ARI) values (c.f. ). In general, ARI takes a value in . The cases in which the value of ARI is close to is ideal, indicating that clustering is consistent with the truth, and the cases in which the value of ARI is less than are the cases in which its performance is worse than randomly assigned clusters. Then, to compare three algorithms, we compare the values of ARI given each fixed value of . Figure 1: Comparison of three approaches through ARI for the model selection performance. In all cases, our procedure either outperforms or nearly on par with the two baseline algorithms.

#### 4.1.3 Comparison to a non-parametric two-sample test procedure for random graphs

In this section, we consider a sequence of undirected loop-less (unweighted) random graphs such that is an element of a set of distinct matrices whose entries are probabilities. Then, we consider a problem of clustering graphs into finite number of groups from the data . This is an abstraction of a problem in neuroscience, where each can represents a copy of neuron-to-neuron interaction pattern, where each portraits a different mode of connectivity between neurons. For instance, in Section 4.2, the modes are the chemical and the electrical pathways.

Since each is undirected and loop-less, its adjacency matrix can be embedded as a vector in an element in