1 Introduction
With the advancement of sensor and storage technologies, and with the cost of data acquisition dropping significantly, we are able to collect and record vast amounts of raw data. Arguably, the grand challenge facing computation in the 21st century is the effective handling of such large data sets to extract meaningful information for scientific, financial, political or technological purposes (Donoho, 2000). Unfortunately, classical batch processing methods are unable to deal with very large data sets due to memory restrictions and slow computational time.
One key approach for the analysis of large datasets is based on the matrix and tensor factorisation paradigm. Given an observed dataset
, where is a matrix of a certain dimension and each element of it corresponds to an observed data point, the matrix factorisation problem is the computation of matrix factors and such that is approximated by the matrix product , i.e.,(Later we will make our notation and inferential goals more precise.) Indeed, many standard statistical methods such as clustering, independent components analysis (ICA), nonnegative matrix factorisation (NMF), latent semantic indexing (LSI), collaborative filtering can be expressed and understood as matrix factorisation problems
(Lee and Seung, 1999; Singh and Gordon, 2008; Koren et al., 2009).Matrix factorisation models also have well understood probabilistic/statistical interpretations as probabilistic generative models and many standard algorithms mentioned above can also be derived as maximum likelihood or maximum aposteriori parameter estimation procedures (Fevotte and Cemgil, 2009; Salakhutdinov and Mnih, 2008; Cemgil, 2009). The advantage of this interpretation is that it enables one to incorporate domain specific prior knowledge in a principled and consistent way. This can be achieved by building hierarchical statistical models to fit the specifics of the application at hand. Moreover, the probabilistic/statistical approach also provides a natural framework for sequential processing which is desirable for developing online algorithms that pass over each data point only once. While the development of effective online algorithms for matrix factorisation are of interest on their own, the algorithmic ideas can be generalised to more structured models such as tensor factorisations (e.g. see (Kolda and Bader, 2009)).
In this paper our primary interest is estimation of (rather than and ), which often is the main objective in NMF problems. We formulate the NMF problem as a maximum likelihood estimation (MLE) problem for hidden Markov models (HMMs). The advantage of doing so is that the asymptotic properties of MLE for HMM’s has been studied in the past by many authors and these results may be adapted to the NMF framework. We propose a sequential Monte Carlo (SMC) based online EM algorithm (Cappé, 2009; Del Moral et al., 2009) for the NMF problem. SMC introduces a layer of bias which decreases as the number of particles in the SMC approximation is increased.
In the literature, several online algorithms have been proposed for online computation of matrix factorisations. Mairal et al. (2010) propose an online optimisation algorithm, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples. A proof of convergence is presented for the Gaussian case. There are similar formulations applied to other matrix factorisation formulations, notably NMF (Lefevre et al., 2011) and Latent Dirichlet Allocation (Hoffman et al., 2010), as well as alternative views for NMF which are based on incremental subspace learning (Bucak and Gunsel, 2009). Although the empirical results of these methods suggest good performance, their asymptotic properties have not been established.
1.1 Notation
Let be a matrix. The ’th element of is . If (or ) is 1, then (or ). The ’th row of is . If and are both matrices, denotes elementbyelement multiplication, i.e., ; (or ) means elementbyelement division, in a similar way. () is a matrix of ’s (’s), where is abbreviated to . and
are the sets of nonnegative integers and real numbers. Random variables will be defined by using capital letters, such as
, etc., and their realisations will be corresponding small case letters (, etc.). The indicator function if , otherwise it is ; also, for a set , if , otherwise it is .2 The Statistical Model for NMF
Consider the following HMM comprised of the latent processes and the observation process . The process is a Markov process of
nonnegative vectors with an initial density
and the transition density for(1) 
where is a finite dimensional parameter which parametrizes the law of the Markov process. is a matrix of nonnegative integers, and its elements are independent conditioned on as follows:
where is an nonnegative matrix. Here
denotes the Poisson distribution on
with intensity parameterThe observation vector is conditioned on in a deterministic way
This results in the conditional density of given , denoted by , being a multivariate Poisson density
(2) 
Hence the likelihood of given can analytically be evaluated. Moreover, the conditional posterior distribution of given and has a factorized closed form expression:
(3)  
where and denotes a multinomial distribution defined by
where is a realisation of the vector valued random variable , , and . It is a standard result that the marginal mean of the ’th component is .
Let denote all the parameters of the HMM. We can write the joint density of given as
(4)  
From (4), we observe that the joint density of
defines the law of another HMM comprised of the latent process , with initial and transitional densities and , and the observation process with the observation density . Finally, the likelihood of data is given by
(5) 
In this paper, we treat as unknown and seek for the MLE solution for it, which satisfies
(6) 
2.1 Relation to the classical NMF
In the classical NMF formulation (Lee and Seung, 1999, 2000), given a nonnegative matrix , we want to factorize it to and nonnegative matrices and such that the difference between and is minimised according to a divergence
(7) 
One particular choice for is the generalised KullbackLeibler (KL) divergence which is written as
Noticing the similarity between the generalised KL divergence and the Poisson distribution, (Lee and Seung, 1999) showed that the minimisation problem can be formulated in a MLE sense. More explicitly, the solution to
(8) 
is the same as the solution to (7). In our formulation of the NMF problem,
is not a static parameter but it is a random matrix whose columns constitute a Markov process. Therefore, the formulation for MLE in our case changes to maximising the expected value of the likelihood in (
2.1) over the parameter with respect to (w.r.t.) the law of(9) 
It is obvious that (6) and (9) are equivalent. We will see in Section 3 that the introduction of the additional process is necessary to perform MLE using the EM algorithm (see Lee and Seung (2000) for its first use for the problem stated in (7)).
3 EM algorithms for NMF
Our objective is to estimate the unknown given . The EM algorithm can be used to find the MLE for . We first introduce the batch EM algorithm and then explain how an online EM version can be obtained.
3.1 Batch EM
With the EM algorithm, given the observation sequence we increase the likelihood in (5) iteratively until we reach a maximal point on the surface of the likelihood. The algorithm is as follows:
Choose for initialisation. At iteration

Estep:
Calculate the intermediate function which is the expectation of the log joint distribution of
with respect to the law of given . 
Mstep: The new estimate is the maximiser of the intermediate function
With a slight modification of the update rules found in Cemgil (2009, Section 2), one can show that for NMF models the update rule for reduces to calculating the expectations
and updating the parameter estimate for as
Moreover, if the transition density belongs to an exponential family, the update rule for becomes calculating the expectation of a vector valued function
and updating the estimate for using a maximisation rule
Note that and
depend on the NMF model, particularly to the probability laws in (
1) defining the Markov chain for
. Therefore, we have to find the mean estimates of the following sufficient statistics at time .(10) 
Writing the sufficient statistics in additive forms as in (3.1) enables us to use a forward recursion to find the expectations of the sufficient statistics in an online manner. This leads to an online version of the EM algorithm as we shall see in the following section.
3.2 Online EM
To explain the methodology in a general sense, assume that we want to calculate the expectations of sufficient statistics of the additive form
(11) 
w.r.t. the posterior density for a given parameter value . Letting for simplicity, we define the intermediate function
One can show that we have the forward recursion (Del Moral et al., 2009; Cappé, 2011)
(12)  
with the convention . Hence, can be computed online, so are the estimates
We can decompose the backward transition density and the filtering density as
(13)  
(14) 
where is defined in (3). From (3.1) we know that the required sufficient statistics are additive in the required form; therefore, the recursion in (12) is possible for the NMF model. The recursion for depends on the choice of the transition density ; however the recursions for and are the same for any model regardless of the choice of . For this reason, we shall have a detailed look at (12) for the first two sufficient statistics and .
For , notice from (3.2) that, does not depend on . Moreover, the sufficient statistic is not a function of . Therefore, in (3.2) integrates out, and is a function of only. Hence we will write it as . To sum up, we have the recursion
For , we claim that where is a nonnegative matrix valued function depending on but not , and the recursion for is expressed as
This claim can be verified by induction. Start with . Since , we immediately see that where . For general , assume that . Using (3.2),
Now, observe that the ’th element of the integral is . So, we can write the integral as
So we are done. Using a similar derivation and substituting (14) into (3.2), we can show that
The online EM algorithm is a variation over the batch EM where the parameter is reestimated each time a new observation is received. In this approach running averages of the sufficient statistics are computed (Elliott et al., 2002; Mongillo and Deneve, 2008; Cappé, 2009, 2011), (Kantas et al., 2009, Section 3.2.). Specifically, let , called the stepsize sequence, be a positive decreasing sequence satisfying and . A common choice is for . Let be the initial guess of before having made any observations and at time , let be the sequence of parameter estimates of the online EM algorithm computed sequentially based on . Letting again to show for the general case, when is received, online EM computes
(15)  
(16) 
and then applies the maximisation rule using the estimates . The subscript on the densities and indicates that these laws are being computed sequentially using the parameter at time , . (See Algorithm 3.3 for details.) In practice, the maximisation step is not executed until a burnin time for added stability of the estimators as discussed in Cappé (2009).
The online EM algorithm can be implemented exactly for a linear Gaussian statespace model (Elliott et al., 2002) and for finite statespace HMM’s. (Mongillo and Deneve, 2008; Cappé, 2011). An exact implementation is not possible for NMF models in general, therefore we now investigate SMC implementations of the online EM algorithm.
3.3 SMC implementation of the online EM algorithm
Recall that is also a HMM with the initial and transition densities and in (1), and the observation density in (2). Since the conditional density has a close form expression, it is sufficient to have a particle approximation to only . This approximation can be performed in an online manner using a SMC approach. Suppose that we have the particle approximation to at time with particles
(17) 
where is the ’th path particle with weight and is the dirac measure concentrated at . The particle approximation of the filter at time can be obtained from by marginalization
At time , for each we draw from a proposal density with a possible implicit dependency on . We then update the weights according to the recursive rule:
To avoid weight degeneracy, at each time one can resample from (17) to obtain a new collection of particles with weights , and then proceed to the time . Alternatively, this resampling operation can be done according to a criterion which measures the weight degeneracy (Doucet et al., 2000). The SMC online EM algorithm for NMF models executing (3.2) and (16) based on the SMC approximation of in (17) is presented Algorithm 3.3. SMC online EM algorithm for NMF models

Estep: If t = 1, initialise ; sample , and set , , , , . If ,

For , sample and compute

Resample from particles for according to the weights to get for each with weight .


Mstep: If , set . Else, calculate using the particles before resampling
update the parameter , , .
Algorithm 3.3 is a special application of the SMC online EM algorithm proposed in Cappé (2009) for a general statespace HMM, and it only requires computations per time step. Alternatively, one can implement an SMC approximation to the online EM algorithm, see Del Moral et al. (2009) for its merits and demerits over the current implementation. The is made possible by plugging the following SMC approximation to into (12)
4 Numerical examples
4.1 Multiple basis selection model
In this simple basis selection model, determines which columns of are selected to contribute to the intensity of the Poisson distribution for observations. For ,
where is a distribution over and is such that and . Estimation of can be done by calculating
and applying the maximisation rule where for this model is defined as
Figure 1 shows the estimation results of the exact implementation of online EM (with and ) for the matrix (assuming known) given the matrix which is simulated .
4.2 A relaxation of the multiple basis selection model
In this model, the process is not a discrete one, but it is a Markov process on the unit interval . The law of the Markov chain for is as follows: for , , and
When is close to , the process will spend most of its time around and with a strong correlation. (Figure 2 shows a realisation of for time steps when .) For estimation of , one needs to calculate
Comments
There are no comments yet.