# An Online Expectation-Maximisation Algorithm for Nonnegative Matrix Factorisation Models

In this paper we formulate the nonnegative matrix factorisation (NMF) problem as a maximum likelihood estimation problem for hidden Markov models and propose online expectation-maximisation (EM) algorithms to estimate the NMF and the other unknown static parameters. We also propose a sequential Monte Carlo approximation of our online EM algorithm. We show the performance of the proposed method with two numerical examples.

## Authors

• 7 publications
• 8 publications
• 9 publications
• ### A State-Space Approach to Dynamic Nonnegative Matrix Factorization

Nonnegative matrix factorization (NMF) has been actively investigated an...
08/31/2017 ∙ by Nasser Mohammadiha, et al. ∙ 0

• ### Nonnegative HMM for Babble Noise Derived from Speech HMM: Application to Speech Enhancement

Deriving a good model for multitalker babble noise can facilitate differ...
09/16/2017 ∙ by Nasser Mohammadiha, et al. ∙ 0

• ### Divergence-Based Motivation for Online EM and Combining Hidden Variable Models

Expectation-Maximization (EM) is the fallback method for parameter estim...
02/11/2019 ∙ by Ehsan Amid, et al. ∙ 0

• ### Regularized Maximum Likelihood Estimation and Feature Selection in Mixtures-of-Experts Models

Mixture of Experts (MoE) are successful models for modeling heterogeneou...
10/29/2018 ∙ by Faicel Chamroukhi, et al. ∙ 6

• ### Deterministic Approximate EM Algorithm; Application to the Riemann Approximation EM and the Tempered EM

The Expectation Maximisation (EM) algorithm is widely used to optimise n...
03/23/2020 ∙ by Thomas Lartigue, et al. ∙ 0

• ### The Mixture of Markov Jump Processes: Monte Carlo Method and the EM Estimation

This paper discusses tractable development and statistical estimation of...
12/19/2018 ∙ by H. Frydman, et al. ∙ 0

• ### A Variational Expectation-Maximisation Algorithm for Learning Jump Markov Linear Systems

Jump Markov linear systems (JMLS) are a useful class which can be used t...
04/18/2020 ∙ by Mark P. Balenzuela, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

With the advancement of sensor and storage technologies, and with the cost of data acquisition dropping significantly, we are able to collect and record vast amounts of raw data. Arguably, the grand challenge facing computation in the 21st century is the effective handling of such large data sets to extract meaningful information for scientific, financial, political or technological purposes (Donoho, 2000). Unfortunately, classical batch processing methods are unable to deal with very large data sets due to memory restrictions and slow computational time.

One key approach for the analysis of large datasets is based on the matrix and tensor factorisation paradigm. Given an observed dataset

, where is a matrix of a certain dimension and each element of it corresponds to an observed data point, the matrix factorisation problem is the computation of matrix factors and such that is approximated by the matrix product , i.e.,

 Y≈BX

(Later we will make our notation and inferential goals more precise.) Indeed, many standard statistical methods such as clustering, independent components analysis (ICA), nonnegative matrix factorisation (NMF), latent semantic indexing (LSI), collaborative filtering can be expressed and understood as matrix factorisation problems

(Lee and Seung, 1999; Singh and Gordon, 2008; Koren et al., 2009).

Matrix factorisation models also have well understood probabilistic/statistical interpretations as probabilistic generative models and many standard algorithms mentioned above can also be derived as maximum likelihood or maximum a-posteriori parameter estimation procedures (Fevotte and Cemgil, 2009; Salakhutdinov and Mnih, 2008; Cemgil, 2009). The advantage of this interpretation is that it enables one to incorporate domain specific prior knowledge in a principled and consistent way. This can be achieved by building hierarchical statistical models to fit the specifics of the application at hand. Moreover, the probabilistic/statistical approach also provides a natural framework for sequential processing which is desirable for developing online algorithms that pass over each data point only once. While the development of effective online algorithms for matrix factorisation are of interest on their own, the algorithmic ideas can be generalised to more structured models such as tensor factorisations (e.g. see (Kolda and Bader, 2009)).

In this paper our primary interest is estimation of (rather than and ), which often is the main objective in NMF problems. We formulate the NMF problem as a maximum likelihood estimation (MLE) problem for hidden Markov models (HMMs). The advantage of doing so is that the asymptotic properties of MLE for HMM’s has been studied in the past by many authors and these results may be adapted to the NMF framework. We propose a sequential Monte Carlo (SMC) based online EM algorithm (Cappé, 2009; Del Moral et al., 2009) for the NMF problem. SMC introduces a layer of bias which decreases as the number of particles in the SMC approximation is increased.

In the literature, several online algorithms have been proposed for online computation of matrix factorisations. Mairal et al. (2010) propose an online optimisation algorithm, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples. A proof of convergence is presented for the Gaussian case. There are similar formulations applied to other matrix factorisation formulations, notably NMF (Lefevre et al., 2011) and Latent Dirichlet Allocation (Hoffman et al., 2010), as well as alternative views for NMF which are based on incremental subspace learning (Bucak and Gunsel, 2009). Although the empirical results of these methods suggest good performance, their asymptotic properties have not been established.

### 1.1 Notation

Let be a matrix. The ’th element of is . If (or ) is 1, then (or ). The ’th row of is . If and are both matrices, denotes element-by-element multiplication, i.e., ; (or ) means element-by-element division, in a similar way. () is a matrix of ’s (’s), where is abbreviated to . and

are the sets of nonnegative integers and real numbers. Random variables will be defined by using capital letters, such as

, etc., and their realisations will be corresponding small case letters (, etc.). The indicator function if , otherwise it is ; also, for a set , if , otherwise it is .

## 2 The Statistical Model for NMF

Consider the following HMM comprised of the latent processes and the observation process . The process is a Markov process of

nonnegative vectors with an initial density

and the transition density for

 (1)

where is a finite dimensional parameter which parametrizes the law of the Markov process. is a matrix of nonnegative integers, and its elements are independent conditioned on as follows:

 Zt|(Xt=xt)∼M∏m=1K∏k=1PO(zt(m,k);B(m,k)xt(k))

where is an nonnegative matrix. Here

denotes the Poisson distribution on

with intensity parameter

 PO(v;λ)=exp(vlogλ−λ−logv!),

The observation vector is conditioned on in a deterministic way

 Yt(m)=K∑k=1Zt(m,k),m=1,…,M.

This results in the conditional density of given , denoted by , being a multivariate Poisson density

 Yt|(Xt=xt)∼gB(yt|xt)=M∏m=1PO(yt(m);B(m,⋅)xt). (2)

Hence the likelihood of given can analytically be evaluated. Moreover, the conditional posterior distribution of given and has a factorized closed form expression:

 Zt|(Yt=yt,Xt=xt) ∼ πB(zt|yt,xt) (3) = M∏m=1M(zt(m,⋅);yt(m),ρt,m)

where and denotes a multinomial distribution defined by

 M(v;α,ρ)=Iα(K∑k=1vk)α!K∏k=1ρvkkvk!,

where is a realisation of the vector valued random variable , , and . It is a standard result that the marginal mean of the ’th component is .

Let denote all the parameters of the HMM. We can write the joint density of given as

 pθ(x1:t,z1:t, y1:t)=μψ(x1)gB(y1|x1)πB(z1|y1,x1) (4) ×t∏i=2fψ(xi|xi−1)gB(yi|xi)πB(zi|xi,yi).

From (4), we observe that the joint density of

 pθ(x1:t,y1:t)=μψ(x1)gB(y1|x1)t∏i=2fψ(xi|xi−1)gB(yi|xi)

defines the law of another HMM comprised of the latent process , with initial and transitional densities and , and the observation process with the observation density . Finally, the likelihood of data is given by

 pθ(y1:T)=Eψ[T∏t=1gB(yt|Xt)]. (5)

In this paper, we treat as unknown and seek for the MLE solution for it, which satisfies

 θ∗=argmaxθ∈Θpθ(y1:T). (6)

### 2.1 Relation to the classical NMF

In the classical NMF formulation (Lee and Seung, 1999, 2000), given a nonnegative matrix , we want to factorize it to and nonnegative matrices and such that the difference between and is minimised according to a divergence

 (B∗,X∗)=argminB,XD(Y||BX). (7)

One particular choice for is the generalised Kullback-Leibler (KL) divergence which is written as

 D(Y||U)=M∑m=1T∑t=1Y(m,t)logY(m,t)U(m,t)−Y(m,t)+U(m,t)

Noticing the similarity between the generalised KL divergence and the Poisson distribution, (Lee and Seung, 1999) showed that the minimisation problem can be formulated in a MLE sense. More explicitly, the solution to

 (B∗,X∗)=argmaxB,Xℓ(y1,…,yT|B,X), ℓ(y1,…,yT|B,X)=T∏t=1gB(yt|Xt) (8)

is the same as the solution to (7). In our formulation of the NMF problem,

is not a static parameter but it is a random matrix whose columns constitute a Markov process. Therefore, the formulation for MLE in our case changes to maximising the expected value of the likelihood in (

2.1) over the parameter with respect to (w.r.t.) the law of

 (B∗,ψ∗)=argmax(B,ψ)∈ΘEψ[ℓ(y1,…,yT|B,X)]. (9)

It is obvious that (6) and (9) are equivalent. We will see in Section 3 that the introduction of the additional process is necessary to perform MLE using the EM algorithm (see Lee and Seung (2000) for its first use for the problem stated in (7)).

## 3 EM algorithms for NMF

Our objective is to estimate the unknown given . The EM algorithm can be used to find the MLE for . We first introduce the batch EM algorithm and then explain how an online EM version can be obtained.

### 3.1 Batch EM

With the EM algorithm, given the observation sequence we increase the likelihood in (5) iteratively until we reach a maximal point on the surface of the likelihood. The algorithm is as follows:

Choose for initialisation. At iteration

• E-step:

Calculate the intermediate function which is the expectation of the log joint distribution of

with respect to the law of given .

 Q(θ(j);θ)=Eθ(j)[logpθ(X1:T,Z1:T,Y1:T)|Y1:T=y1:T)]
• M-step: The new estimate is the maximiser of the intermediate function

 θ(j+1)=argmaxθQ(θ(j);θ)

With a slight modification of the update rules found in Cemgil (2009, Section 2), one can show that for NMF models the update rule for reduces to calculating the expectations

 ˆS1,T = Eθ(j)[T∑t=1Xt∣∣ ∣∣Y1:T=y1:T], ˆS2,T = Eθ(j)[T∑t=1Zt∣∣ ∣∣Y1:T=y1:T]

and updating the parameter estimate for as

 B(j+1)=ˆS2,T/(1M[ˆS1,T]T).

Moreover, if the transition density belongs to an exponential family, the update rule for becomes calculating the expectation of a vector valued function

 ˆS3,T=Eθ(j)[T∑t=1s3,t(Xt−1,Xt)∣∣ ∣∣Y1:T=y1:T]

and updating the estimate for using a maximisation rule

 Λ:RJ→Ψ,ψ(j+1)=Λ(ˆS3,T).

Note that and

depend on the NMF model, particularly to the probability laws in (

1

) defining the Markov chain for

. Therefore, we have to find the mean estimates of the following sufficient statistics at time .

 S1,t(x1:t)=t∑i=1xi,S2,t(z1:t)=t∑i=1zi, S3,t(x1:t)=t∑i=1s3,t(xt−1,xt). (10)

Writing the sufficient statistics in additive forms as in (3.1) enables us to use a forward recursion to find the expectations of the sufficient statistics in an online manner. This leads to an online version of the EM algorithm as we shall see in the following section.

### 3.2 Online EM

To explain the methodology in a general sense, assume that we want to calculate the expectations of sufficient statistics of the additive form

 St(x1:t,z1:t)=t∑i=1si(xi−1,zi−1,xi,zi) (11)

w.r.t. the posterior density for a given parameter value . Letting for simplicity, we define the intermediate function

 Tt(ut)=∫St(u1:t)pθ(u1:t−1|y1:t−1,ut)du1:t−1.

One can show that we have the forward recursion (Del Moral et al., 2009; Cappé, 2011)

 Tt(ut) = ∫(Tt−1(ut−1)+st(ut−1,ut)) (12) ×pθ(ut−1|y1:t−1,ut)dut−1

with the convention . Hence, can be computed online, so are the estimates

 ˆSt=∫Tt(ut)pθ(ut|y1:t)dut.

We can decompose the backward transition density and the filtering density as

 pθ(xt−1,zt−1|y1:t−1,xt,zt) =πB(zt−1|xt−1,yt−1) ×pθ(xt−1|xt,y1:t−1), (13) pθ(xt,zt|y1:t) =πB(zt|xt,yt)pθ(xt|y1:t) (14)

where is defined in (3). From (3.1) we know that the required sufficient statistics are additive in the required form; therefore, the recursion in (12) is possible for the NMF model. The recursion for depends on the choice of the transition density ; however the recursions for and are the same for any model regardless of the choice of . For this reason, we shall have a detailed look at (12) for the first two sufficient statistics and .

For , notice from (3.2) that, does not depend on . Moreover, the sufficient statistic is not a function of . Therefore, in (3.2) integrates out, and is a function of only. Hence we will write it as . To sum up, we have the recursion

 T1,t(xt)=xt+∫T1,t−1(xt−1)pθ(xt−1|xt,y1:t−1)dxt−1.

For , we claim that where is a nonnegative matrix valued function depending on but not , and the recursion for is expressed as

 Ct(xt) = ∫(Ct−1(xt−1)+B⊙(yt−1xTt−1)(Bxt−1)1TK) ×pθ(xt−1|xt,y1:t−1)dxt−1

This claim can be verified by induction. Start with . Since , we immediately see that where . For general , assume that . Using (3.2),

 T2,t(xt,zt) =zt+∫(zt−1+Ct−1(xt−1))πB(zt−1|xt−1,yt−1) ×pθ(xt−1|xt,y1:t−1)dxt−1dzt−1

Now, observe that the ’th element of the integral is . So, we can write the integral as

 ∫zt−1πB(zt−1|xt−1,yt−1)dzt−1=B⊙(yt−1xTt−1)(Bxt−1)1TK

So we are done. Using a similar derivation and substituting (14) into (3.2), we can show that

 ˆS2,t = ∫(Ct(xt)+B⊙(ytxTt)(Bxt)1TK)pθ(xt|y1:t)dxt.

The online EM algorithm is a variation over the batch EM where the parameter is re-estimated each time a new observation is received. In this approach running averages of the sufficient statistics are computed (Elliott et al., 2002; Mongillo and Deneve, 2008; Cappé, 2009, 2011), (Kantas et al., 2009, Section 3.2.). Specifically, let , called the step-size sequence, be a positive decreasing sequence satisfying and . A common choice is for . Let be the initial guess of before having made any observations and at time , let be the sequence of parameter estimates of the online EM algorithm computed sequentially based on . Letting again to show for the general case, when is received, online EM computes

 Tγ,t(ut) =∫((1−γt)Tγ,t−1(ut−1)+γtst(ut−1,ut)) ×pθ1:t(ut−1|y1:t−1,ut)dut−1, (15) St =∫Tγ,t(ut)pθ1:t(ut|y1:t)dut (16)

and then applies the maximisation rule using the estimates . The subscript on the densities and indicates that these laws are being computed sequentially using the parameter at time , . (See Algorithm 3.3 for details.) In practice, the maximisation step is not executed until a burn-in time for added stability of the estimators as discussed in Cappé (2009).

The online EM algorithm can be implemented exactly for a linear Gaussian state-space model (Elliott et al., 2002) and for finite state-space HMM’s. (Mongillo and Deneve, 2008; Cappé, 2011). An exact implementation is not possible for NMF models in general, therefore we now investigate SMC implementations of the online EM algorithm.

### 3.3 SMC implementation of the online EM algorithm

Recall that is also a HMM with the initial and transition densities and in (1), and the observation density in (2). Since the conditional density has a close form expression, it is sufficient to have a particle approximation to only . This approximation can be performed in an online manner using a SMC approach. Suppose that we have the particle approximation to at time with particles

 pNθ(dx1:t|y1:t)=N∑i=1w(i)tδx(i)1:t(dx1:t),N∑i=1w(i)t=1, (17)

where is the ’th path particle with weight and is the dirac measure concentrated at . The particle approximation of the filter at time can be obtained from by marginalization

 pNθ(dxt|y1:t)=N∑i=1w(i)tδx(i)t(dxt).

At time , for each we draw from a proposal density with a possible implicit dependency on . We then update the weights according to the recursive rule:

 w(i)t+1∝w(i)tfψ(x(i)t+1|x(i)t)gB(yt+1|x(i)t+1)qθ(x(i)t+1|x(i)t).

To avoid weight degeneracy, at each time one can resample from (17) to obtain a new collection of particles with weights , and then proceed to the time . Alternatively, this resampling operation can be done according to a criterion which measures the weight degeneracy (Doucet et al., 2000). The SMC online EM algorithm for NMF models executing (3.2) and (16) based on the SMC approximation of in (17) is presented Algorithm 3.3. SMC online EM algorithm for NMF models

• E-step: If t = 1, initialise ; sample , and set , , , , . If ,

• For , sample and compute

 ˜T(i)1,t=(1−γt)T(i)1,t−1+γt˜x(i)t,
 ˜T(i)3,t=(1−γt)T(i)3,t−1+γts3,t(x(i)t−1,˜x(i)t)
 ˜C(i)t=(1−γt)C(i)t−1+(1−γt)γt−1Bt⊙(yt−1x(i)Tt−1)(Btx(i)t−1)1TK,
 ˜w(i)t∝w(i)t−1fψt(˜x(i)t|x(i)t−1)gBt(yt|˜x(i)t)qθt(˜x(i)t|x(i)t−1).
• Resample from particles for according to the weights to get for each with weight .

• M-step: If , set . Else, calculate using the particles before resampling

 S1,t=N∑i=1˜T1(i)t˜w(i)t,
 S2,t=N∑i=1⎛⎜ ⎜⎝˜C(i)t+γtBt⊙(yt˜x(i)Tt)(Bt˜x(i)t)1TK⎞⎟ ⎟⎠˜w(i)t
 S3,t=N∑i=1˜T3(i)t˜w(i)t,

update the parameter , , .

Algorithm 3.3 is a special application of the SMC online EM algorithm proposed in Cappé (2009) for a general state-space HMM, and it only requires computations per time step. Alternatively, one can implement an SMC approximation to the online EM algorithm, see Del Moral et al. (2009) for its merits and demerits over the current implementation. The is made possible by plugging the following SMC approximation to into (12)

 pNθ(dxt−1|xt,y1:t−1) = pNθ(dxt−1|y1:t−1)fψ(xt|xt−1)∫pNθ(dxt−1|y1:t−1)fψ(xt|xt−1).

## 4 Numerical examples

### 4.1 Multiple basis selection model

In this simple basis selection model, determines which columns of are selected to contribute to the intensity of the Poisson distribution for observations. For ,

 X1(k)∼μ(⋅),Prob(Xt(k)=i|Xt−1(k)=j)=P(j,i),

where is a distribution over and is such that and . Estimation of can be done by calculating

 ˆS3,T =Eθ[T∑i=1s3,i(Xi−1,Xi)∣∣ ∣∣Y1:T=y1:T], s3,t(xt,xt−1) =K∑k=1⎡⎢ ⎢ ⎢ ⎢⎣I(0,0)(xt−1(k),xt(k))I0(xt(k))I(1,1)(xt−1(k),xt(k))I1(xt(k))⎤⎥ ⎥ ⎥ ⎥⎦

and applying the maximisation rule where for this model is defined as

 Λ(ˆS3,t)=(ˆS3,t(1)/ˆS3,t(2),ˆS3,t(3)/ˆS3,t(4)).

Figure 1 shows the estimation results of the exact implementation of online EM (with and ) for the matrix (assuming known) given the matrix which is simulated .

### 4.2 A relaxation of the multiple basis selection model

In this model, the process is not a discrete one, but it is a Markov process on the unit interval . The law of the Markov chain for is as follows: for , , and

 Xt+1(k)|(Xt(k)=x) ∼ρ(x)U(0,x)+(1−ρ(x))U(x,1), ρ(x) ={α, if x≤0.51−α, if x>0.5.

When is close to , the process will spend most of its time around and with a strong correlation. (Figure 2 shows a realisation of for time steps when .) For estimation of , one needs to calculate

 ˆS3,T =E