1 Introduction
Matrix decomposition (a.k.a. matrix factorization) plays fundamental roles in machine learning and data mining. As we know, the basic task of machine learning and data mining is to discover knowledge from primitive data. In many cases, primitive datasets or observations are organized in matrix forms. For example, an image can be stored in a matrix of pixels; a corpus is often represented by a documentterm matrix that describes the frequency of terms occur in the collection of documents. The primitive datasets are generally big and noisy matrices. Extracting useful information from primitive datasets directly is often infeasible. Therefore, matrix decomposition is becoming a powerful tool to get a more compact and meaningful representation of observed matrices. Matrix decomposition has been successfully applied to various fields including image analysis [1, 2, 3], gene expression analysis [4, 5, 6], collaborative filtering [7, 8, 9], text mining [10, 11] and many other areas.
Generally, most matrix decomposition methods assume a data matrix is from one single data source. The conceptual idea underlying matrix decomposition is that the observed data matrix can be approximated by a product of two or more lowrank matrices. Given an observed data matrix, a general matrix decomposition method can be summarized as follows:
(1) 
which is further formulated into an optimization problem:
(2) 
where D is a divergence function. Notice that for each column of , , where is the number of columns of , i.e. each column of the observed matrix is approximated by a linear combination of all columns of . Therefore, is often referred as a basis matrix, while is often referred as a coefficient matrix.
Generally, one can get different matrix decomposition methods by placing different constraints on and . For example, means clustering may be considered as the simplest matrix decomposition method [12]. In means clustering, contains clustering centroid matrix, which has no restriction; is a binary matrix and each column of indicates the cluster membership by a single one. Seminonnegative matrix factorization (semiNMF) [12] relaxes the restriction of means. It only restricts to be nonnegative. The nonnegative restriction on usually improves the interpretability of the basis matrix. To decompose nonnegative data like images, nonnegative matrix factorization (NMF) has been explored [13, 1]. NMF restricts both and
to be nonnegative. The nonnegative constraints lead to partbased feature extraction and sort of sparseness naturally
[1]. Moreover, due to its effectiveness in clustering and data representation, NMF has been extended to many variant forms including various sparse NMF models [2, 14] and regularized NMF models [15, 16].However, there are several main drawbacks of those standard matrix factorization methods. First, they are prone to overfitting the observed data matrices. Second, the regularization parameters of models require to be carefully tuned. Bayesian approaches were imposed to matrix factorization [17, 18, 19, 20, 21]
to address those weaknesses by incorporating priors with model parameters and hyperparameters. Third, the above methods only focus on a data matrix from only one source.
Currently, multiview data collected from different sources is very common. Such data from different sources might be complementary to each other. Moreover, they are usually of different quality (i.e. noise level is different in each source). How to make full use of those data is an important but challenging topic. Many studies have devoted their efforts to developing mathematical methods to tackle the underlying issues for data integration. For example, Zhang et al. [22] proposed joint NMF (jNMF) to perform integrative analysis of multiple types of genomic data to discover the combinatorial patterns that would have been ignored with only a single type of data. jNMF assumes the different data shares a common basis matrix for each data set. Moreover, Zhang et al. [23] also developed a sparsityconstrained and networkregularized joint NMF for multiple genomics data integration. Liu et al. [24] developed multiview NMF (MultiNMF) for multiview clustering. Unlike jNMF and others shared oneside factor among multiview data matrices, MultiNMF uses a consensus coefficient matrix to give a multiview clustering results. More recently, Chalise and Fridley [25] proposed a weighted joint NMF model (intNMF) for integrative analysis, which assigns higher weight to data from more reliable source. However, few of these methods address the heterogeneity of noise explicitly. jNMF ignores the noise heterogeneity, MultiNMF and intNMF address this problem implicitly by giving a lower weight to more noisy data source, but the weight of each source is difficult to decide. Thus, powerful computational methods to well consider the heterogeneous noise of multiview data for data integration are urgently needed.
To this end, we propose a powerful joint matrix decomposition framework (BJMD), which models the heterogeneity of noise explicitly by Gaussian distribution in a Bayesian framework. Mathematically, it still remains the interpretability and sparsity of the basis matrix as regularized matrix decomposition does. We develop two algorithms to solve this model: one is a variational Bayesian inference algorithm, which makes full use of the proposal distribution to approximate posterior distribution; and another is a maximum a posterior algorithm (MAP), which turns to be an iterative optimization algorithm, enabling it is more efficient and scalable. We extensively compare the two algorithms and discuss their advantages and disadvantages. Extensive experiments on synthetic and realworld datasets show the effectiveness of BJMD for modeling the noise and demonstrate that considering the heterogeneity of noise leads to superior performance to the stateoftheart methods.
2 Related Work
2.1 Joint NMF and Other Variants
Joint NMF (jNMF) is a natural extension to NMF for integrating multiple datasets[22]. For data matrices , the optimization format of jNMF is formulated as:
(3)  
s.t. 
where is the common or shared basis matrix and is the corresponding coefficient matrix. The common or shared patterns in the multiview data matrices from different sources can be reflected by the shared basis matrix. One can easily see that jNMF is mathematically equivalent to NMF by setting and . Therefore, jNMF ignores the heterogeneity of noise in different data sources. Straz̆ar et al. [26] extended jNMF by adding orthogonalityregularized terms on coefficient matrices to predict proteinRNA interactions (iONMF). The orthogonality regularization prevents multicollinearity and iONMF was stated to outperform other NMF models in predicting proteinRNA interactions. However, the heterogeneity of noise among different data types is still ignored.
MultiNMF extends jNMF to multiview clustering and requires the coefficient matrices learned from various views to be approximately common. Specifically, it is formulated as follows:
(4)  
s.t. 
where is the consensus coefficient matrix and is the weight parameter to tune the relative importance among different views. In image processing, Jing et al. [27] proposed a supervised joint matrix factorization model for image classification and annotation (SNMFCA). SNMFCA factorizes regionimage matrix and annotationimage matrix simultaneously and incorporates the label information as a networkregularized term. SNMFCA also uses weight hyperparameters to balance the importance of each data source. Recently, Chalise and Fridley [25] proposed intNMF which is an extension of jNMF by specifying weight hyperparameter to each data source. Specifically, it is formulated as follows:
(5)  
s.t. 
where is a user predefined weight parameter. Those methods use weight parameters to regularize the importance of each data source. However, how to choose those weight parameters is still not sufficiently studied in literature.
2.2 Glad
Bayesian methods for matrix decomposition have been studied for about twenty years. In 1999, Bishop [17]
proposed Bayesian principal component analysis (BPCA), which can automatically determine an effective number of principal components as part of the inference procedure. Later, Bishop
[28] gives an effective variational inference method for BPCA. Welling and Kurihara [18] introduced Bayesian means, which retains the ability of selecting the model structure by incorporating prior on model parameters. Schmidt et al. [19] proposed a Bayesian variant for NMF by introducing exponential priors on the basis matrix and coefficient matrix respectively (BNMF). Moreover, BNMF assumes Gaussian noise over the product of the basis matrix and coefficient matrix. However, the Gaussian noise may lead to the entry of data matrix to be negative. In collaborative filtering, Salakhutdinov and Mnih [20] proposed Bayesian probabilistic matrix factorization (BPMF), which places Gaussian prior over both basis and coefficient matrices to predict user preference for movies. BPMF has attracted great interests of researchers from diverse fields. Moreover, other Bayesian methods for matrix decomposition in collaborative filtering field have also been proposed [29, 30, 31].GLAD [21] is a mixedmembership model, which utilizes three typical distributions (i.e., Gaussian distribution, Laplace distribution and Dirichlet distribution). GLAD can be described as a matrix decomposition model with Laplace prior on the basis matrix, Dirichlet prior on the coefficient matrix and Gaussian noise on the observed data matrix. Specifically, GLAD assumes that the observed data matrix is generated as follows:
(6) 
where each element of the basis matrix , each column of the coefficient matrix and indicates the Gaussian noise . Mathematically, the Laplace prior over enforces the sparsity of the basis matrix. And has a Dirichlet distribution, providing a distribution over clusters for each column of
. Inference for GLAD focuses on estimating the posterior distribution over latent variables:
(7) 
However, due to the nonconjugate prior, it is intractable to calculate the posterior distribution, which involves high dimensional integral. GLAD seeks to a nonconjugate variational inference (VI). Instead of calculating the integral exactly, VI uses a distribution
parameterized by , to approximate the true posterior distribution. Specifically, VI aims at minimizing the KullbackLeibler (KL) divergence between the approximation distribution and the true posterior distribution :(8) 
where denotes the KL divergence between and , and
denotes the family of probability distribution functions (PDFs) to make the integral trackable. Because of involving posterior distribution, the KL divergence is still difficult to optimize. Instead, VI maximize the evidence lower bound (ELBO):
(9) 
where the first term is the expectation of joint distribution over the approximation distribution, and the second term is the entropy of approximation distribution. Therefore, if the expectation of (
9) can be calculated analytically, VI turns the inference problem to be an optimization problem. Unfortunately, expectation is still difficult to compute for GLAD. To address this issue, GLAD further introduces a Laplace approximation to address the expectation in ELBO.The inference of GLAD has two main drawbacks. First, it approximates the ELBO, which is only an approximation of true posterior. Consequently, the gap between approximation distribution and true posterior distribution might be widened. Second, it is still computationally expensive for involving a nonlinear optimization problem at each iteration. Thus, GLAD takes considerable time to converge for even small data, which makes it unrealistic for realworld datasets.
3 Bayesian Joint Matrix Decomposition
In this section, we present the Bayesian joint matrix decomposition framework (BJMD) (Fig. 1a), and develop two algorithms to solve it.
3.1 Model Construction
Suppose that we have multiview data matrices: . Inspired by previous studies, the generative model of BJMD is formulated as follows:
(10) 
where is a shared basis matrix and models noise of different data sources by Gaussian distribution. To enforce the sparsity of the basis matrix[32], we place a Laplace prior with zero mean on it:
(11) 
where
(12) 
Nonnegative restriction on the coefficient matrix of semiNMF [12] leads to better interpretability of the basis matrix. However, columns of the basis matrix are usually not comparable without proper normalization, which weakens the interpretability. In fact, one can see that semiNMF is invariant with column scaling on or row scaling on . Consider a diagonal matrix with size , is equal to . To this end, we additionally restrict each column of coefficient matrices summing to one. As the support of Dirichlet distribution is a unit simplex, we place a Dirichlet prior on each column of coefficient matrices :
(13) 
where
(14) 
is a
dimensional vector and
indicates the th element of . Moreover, we use Gaussian distribution in each data source modeling the heterogeneity of noise explicitly:(15) 
where
(16) 
To get a complete Bayesian model, we also need to introduce prior distribution for variance of Gaussian distribution. For convenience, we choose conjugate prior for
, i.e. , where ,are two hyperparameters of inverse Gamma distribution and can be set to a small positive value such as
or in a noninformative fashion [33].Under the BJMD model, the posterior is proportional to the complete likelihood, which can be written as:
(17) 
which could help us to understand it from an optimization perspective. Hence, we write down the loglikelihood. For simplicity, we formulate the complete negative loglikelihood retaining only terms involving , as follows:
(18) 
with the unit simplex constraints and for any , , . The first term in (18) can be viewed as a divergence function between the observation and the approximated value . The variance of Gaussian distribution is the weight parameter, which gives a higher weight for the data points from source with lower noise level. Different from the weight hyperparameters of intNMF and MultiNMF, can be automatically inferred during the optimization process. The second term of this equation is a norm regularization on the basis matrix , which regularizes the sparsity of . The last term is a regularization term on the coefficient matrices. If we omit the first two terms, one can find that the last term is minimized when , which is actually the expectation of Dirichlet prior. Therefore, the last term enforces to the expectation of prior and reduces the risk of overfitting.
3.2 Variational Inference
In this subsection, we adopt a variational inference algorithm for the BJMD model. For further comparison, we first extend the GLAD to multiview data matrices by a shared basis matrix and different coefficient matrices. Note that the GLAD and the extended GLAD model for multiview data are not in a full Bayesian framework. Both models treat the variance of Gaussian distribution as a hyperparameter rather than a distribution. Thus, we refer the extended GLAD model as joint matrix decomposition (JMD) later.
We can easily see that the inference algorithm of GLAD can be naturally extended to JMD. However, the extended inference algorithm of JMD still suffers the aforementioned problems of GLAD: (a) the approximation of ELBO might not be a good approximation to the true posterior; (b) the expensive computational cost makes it unrealistic for realworld datasets. Recent advances in stochastic variational methods, including black box variational inference (BBVI) [34], autoencoding variational Bayes (AEVB) [18] and automatic differentiation variational inference (ADVI) [35], provide powerful tools to tackle those problems. These stochastic variational inference methods avoid calculating the ELBO directly. Instead, they take derivative of ELBO and push the derivative operator into the expectation:
(19) 
where denotes the observed data, denotes the latent variable, denotes the variational distribution parameterized by , denotes the joint distribution of latent variables and observed data, and denotes the ELBO. The gradient of ELBO (19) is expressed as an expectation over variational distribution, which can be approximated by a Monte Carlo method:
(20) 
where is the number of samples and . Now we can update the variational parameters with the noisy gradients with simple computation.
We implement ADVI for optimizing the BJMD model with python library PyMC3 [36]
. ADVI converts all constrained latent variables to unconstrained ones and uses Gaussian distributions to approximate the ELBO of unconstrained random variables. Following previous notations, ADVI can be briefly summarized as follows:

[label=()]

Firstly, ADVI transforms the constrained latent variables into unconstrained realvalued variables by a onetoone mapping. ADVI aims at minimizing in which one can use a single family of distributions for all models.

Then ADVI converts the gradient of variational objective function as expectation over , which can be estimated by a Monte Carlo sampling strategy as in (20).

Furthermore, ADVI reparameterizes the gradient in terms of standard Gaussian. This transformation turns the Monte Carlo procedure to sampling from standard Gaussian distribution, enabling it can be sampled effectively.

Finally, ADVI uses the noisy gradients to optimize the variational distribution. As all transformation mentioned in the former three steps are onetoone mappings, to get the variational parameters we concern, we can map the variational distribution to the original space of the constrained latent variable .
ADVI can automatically infer probabilistic models by the above steps. One should notice that the variational distribution is optimized by noisy gradients, and thus it is possible for ADVI to escape from the local optimum and will not be overfitted. The disadvantage of noisy gradients is that the ELBO will not converge. Instead, the ELBO will fluctuate in a small interval. Therefore, it is difficult to decide when to stop iteration. Empirically, stopping iteration immediately after the ELBO becoming stable often leads to poor performance. As we use the coefficient matrices for further clustering and representation, we stop iteration if the coefficient matrices are stable. ADVI empirically takes tens of thousands of iterations to get a welltrained model. Then checking the relative change of basis matrices at each iteration leads to an expensive computational cost. Thus, we check the coefficient matrices every iterations to ensure that the model is sufficiently trained:
(21) 
where is a tolerance threshold, which is a small positive value, is the coefficient matrices at th iteration (). Note that relative change of will not converge to zero due to the noisy gradient. Thus, the iteration should be stopped if the criterion (21) is satisfied or iteration exceeds , where is a user predefined maximum times of iteration.
3.3 Maximum A Posterior
ADVI makes it possible to apply BJMD framework to realworld datasets, yet it still takes tens of thousands of iterations before stopping criterion is satisfied. To get a more efficient and scalable algorithm, we develop a maximum a posterior (MAP) algorithm, which directly maximizes the posterior in (18) and treat it as an optimization problem. The key difference between VI and MAP is that: VI approximates the posterior distribution by a family of distribution; MAP approximates posterior distribution by its mode. VI makes full use of posterior distribution, while it pays a more expensive computational cost. In contrast, MAP is a point estimation with less computational cost. We present a MAP algorithm for the BJMD model in the following.
3.3.1 Model Reformulation
Due to the absolute value factor in Laplace distribution, it is inconvenient for posterior inference within a Bayesian framework for (18). To address this computational issue, we reformulate the model by utilizing a twolevel hierarchical representation of Laplace distribution. Specifically, a random variable following a Laplace distribution
can be reformulated as Gaussian scale mixtures with exponential distribution prior to the variance
[37]. That is:(22) 
where the term is the PDF of exponential distribution.
Therefore, we can replace the Laplace prior of with the twolevel hierarchical representation to eliminate the absolute value factors. This is a useful technique to simplify computation, which has been used in other studies [38]. Under this reformulation, the complete log likelihood is:
(23) 
where random variable is an auxiliary variable. Substituting corresponding PDF into (23), the negative log complete likelihood is formulated as follows:
(24) 
with the unit simplicial constraints and for any , , . Note that the absolute value bars are eliminated after reformulation. Instead, the original norm regularized term is converted to a weighted norm regularized term, which is easier to optimize. Fig. 1b shows the graphical model of the reformulated BJMD.
Equivalently, we minimize the complete nonnegative loglikelihood (24) in the following optimization form:
(25) 
3.3.2 Update Rules
Here, we propose a powerful iterative update procedure to solve problem (25), which can be easily paralleled. Note that this optimization problem (25) is not convex. Therefore, we optimize it in a blockcoordinate manner. Specifically, we update one latent variable with others fixed until convergence.
Updating and fixing others: Let the partial derivative for be zeros, . This is a simple quadratic equation in one variable. Considering the constraint , we have:
(26) 
Updating and fixing others: We only care about at this step. Therefore, retaining terms only involving , the optimization problem is equivalent to:
(27) 
with no constraints. Note that the above optimization problem is separable. For each row of , the subproblem can be written as:
(28) 
where denotes the diagonal function and
denotes the elementwise root of square. This subproblem is close to ridge regression
[39]. The difference is that the elements of vector are equal in ridge regression. The diagonal matrix has the special form . While in our case, each element of follows the exponential distribution with parameter . Let the derivative of (28) be zero:(29) 
The optimal solution of subproblem (28) is:
(30) 
Updating and fixing others: Each column of follows a Dirichlet distribution, whose support is a dimensional simplex. Retaining terms involving , our problem can be written as:
(31) 
If , the above is a convex optimization problem. Otherwise, it would be difficult to solve this problem due to the nonconvexity. We always set in the following experiments. Note that above problem is separable. We can divide it into subproblems by column, and optimize each subproblem in a parallel manner. Subproblem for column is as follows:
(32) 
For convenience, we denote . The subproblem (32) is very close to a quadratic programming problem with unit simplex constraints. The objective function is just a quadratic term coupled with a logarithmic barrier function. The difference is that, for the constraint quadratic programming (QP) problem, is a smooth parameter to enforce the path to be interior within the feasible region. is usually small or gradually decreasing in QP. In our case, is the hyperparameter related to the Dirichlet prior. If we are confident with our prior knowledge, is not necessarily small.
This subproblem can be solved by a interiorpoint algorithm [40]. Omitting constants, the Lagrange function is:
(33) 
where . The KKT condition is as follows:
where . For a given , the KKT condition is a system of nonlinear equations (the nonlinearity is due to the product ). There are variables . Let’s set
(34) 
The system can be solved in a Newtonlike method. At each iteration, to choose a good direction , we use Jacobian matrix as linear approximation. Then, the following linear equation should be solved,
(35) 
where the Jacobian matrix has the special form:
(36) 
Here we summarize the algorithm of updating in subalgorithm 1.
Updating and fixing others: Due to the conjugate prior, can be easily updated by:
(37) 
Stop criterion: It is easy to see that each step of the algorithm is convex. The value of objective function is ensured to be monotonically nonincreasing. We stop iteration when the relative change of objective function is small enough:
(38) 
where is the value of objective function at iteration and is the tolerance, which is a small positive value (e.g. ).
3.3.3 Computational Complexity
Here we briefly discuss the computational complexity of the proposed MAP algorithm for the BJMD model. At each iteration, the most computational expensive step is inferring s, which involves solving a constrained QP problem. To infer , each iteration of Subalgorithm 1 needs to solve a linear system, which is of . If QP stops after iterations, the total cost of updating s is , where . To infer , a linear system should be solved, which leads to cost in total. The computational cost of updating and variance are of and respectively. Altogether, each iteration of MAP is of , and the total complexity of the algorithm is thus , where is the upper bound of iteration. Generally, and is empirically small, and s and s can be optimized in parallel. Therefore, the proposed MAP algorithm is scalable for largescale problems.
4 Experimental Results
Here we first evaluate JMD and BJMD with two algorithms ADVI and MAP (denoted as VIBJMD and MAPBJMD respectively) using synthetic data. We demonstrate that both methods can estimate the variance of noise accurately and BJMD leads to superior performance compared to JMD. Then we apply our methods to three realworld data including 3Source Text dataset, Extended Yale Face Database B and METABRIC Breast Cancer Expression dataset, and compare them with several stateoftheart methods including means, JointNMF, SemiNMF and BNMF. All experiments have been performed on a desktop computer with a 2GHz Intel Xeon E52683 v3 CPU, a GTX 1080 GPU card, 16GB memory and Ubuntu 16.04 (Operating System).
4.1 Evaluation Metric
To evaluate the effect of our methods in terms of clustering ability, we adopt an evaluation metric based on the area under the ROC curve (AUC). Specifically,
is the coefficient matrix from source , and is the corresponding true label indicator matrix of . is binary and equals to one means that sample from data source belongs to the th cluster. Note sample belongs to one or more clusters, and thus has no zero column. We define the following metric:(39) 
where AUC takes as labels and as predictions. To evaluate the performance of model in data source , we calculate from to and take average as follows:
(40) 
4.2 Synthetic Experiments
In synthetic experiments, we compare JMD and BJMD. Note JMD is an extension of GLAD, and BJMD can be solved by two algorithms ADVI and MAP.
4.2.1 Synthetic Data Sets
To this end, we generate synthetic data inspired by [41]. Specifically, we first generate the ground truth dictionary matrix in the following step:
(41) 
where is a constant, denotes the number of nonzero entries in the first columns and denotes the length of coherence between basis and . Note that that all entries of the th column of equal to zero. Next, we generate the entries of coefficient matrix for source .
follows a Bernoulli distribution with parameter
, namely for . For the th column of , if all entries of equal to zero, set . Then we normalize such that the sum of column equals to one. Finally, we generate an observed data matrix by:(42) 
We generated synthetic data by the above procedure in both small and large scales.
SamllScale Synthetic Data Sets: The data were generated with , , , , , , . and are 1.0, 2.5 respectively and ranges from 1.5 to 5.5 in step of 0.5. Consequently, we obtained nine datasets and each one consists of , , .
LargeScale Synthetic Data Sets: Similarly, the data were generated with , , , , , , . and are 1.0, 2.5 respectively and ranges from 1.5 to 5.5 in step of 0.5. Consequently, we obtained nine datasets and each one consists of , , .
In the following experiments, we applied JMD, VIBJMD, MAPBJMD to the synthetic data. We also concatenated the , , by column to get the combined data matrix. We applied GLAD, VIBJMD and MAPBJMD to this concatenated data matrix for comparison. We denote the results of the two algorithms as VIcatBNJMD and MAPcatBJMD respectively for clarity. The AUC value is the average of the best 5 of 20 runs (based on the values of objective function).
4.2.2 Convergence Behavior of VIBJMD
We first investigate the convergence behavior of variational inference of BJMD by numerical simulation. We applied VIBJMD onto a smallscale synthetic dataset with =4.0. We can see that the ELBO becomes stable and begins to fluctuate in a small interval after around 20,000 iterations (Fig. 2a), while the relative change of the coefficient matrices stays stable after around 80000 iterations (Fig. 2b). We also plot the steps of iteration versus the AUC performance of VIBJMD in each source (Fig. 2c). It can be observed that the AUC performance of VIBJMD in each source is increasing as iteration progressed. Moreover, the AUC decreases correspondingly if the noise level increases. However, the AUC performance is still increasing when the ELBO gets stable at about 20000 iterations. Therefore, if we stop iteration immediately after the ELBO becoming stable, the model has a relative poor performance in terms of AUCs. ELBO is a lower bound of evidence, which approximates the posterior distribution. Thus, it is possible that the variational distribution is approaching to the true posterior distribution, while the ELBO stays stable. As we use the coefficient matrix as predictions, it is obvious that the performance of VIBJMD will stay stable after the relative change of coefficient matrices become stable. Therefore, our stopping criterion in (21) of variational algorithm is a better choice than examining the change of ELBO. One should note that the relative change of coefficient matrices will not approach to zero because of the noisy gradients in optimization. Therefore, we stop iteration if the tolerance is satisfied or the maximum times of iteration is exceeded. In the following experiments, we always set .
True noise histograms and estimated PDF (shown with curves) by JMD (top), VIBJMD (middle) and MAPBJMD (bottom) respectively. The true standard deviation of source 1, source 2, source 3 are 1.0, 2.5, 4.0 respectively.
Next, since our model is constructed under the heterogeneity of noise assumption, it is necessary to verify whether our proposed models can estimate the noise level of data from different sources accurately. To verify the ability of discovering the heterogeneity of noise in different data sources, we applied JMD, VIBJMD and MAPBJMD to the smallscale synthetic dataset with =4.0. We can clearly see that the estimated probability density curves are close to the true noise histograms, and thus all those three methods can estimate the variance of Gaussian distribution accurately (Fig. 3).
4.2.3 Considering the Heterogeneity of Noise Leads to Superior Performance
In the following, we conduct experiments to demonstrate that considering the heterogeneity of noise in different data sources leads to superior performance. We evaluate our methods on the small and largescale synthetic datasets both with =4.0 respectively. JMD and GLAD are omitted on the largescale synthetic data due to their huge computational cost.
In order to facilitate comparison, we plot the objective function value (, note ELBO is always negative) versus iteration number and log likelihood versus iteration number respectively (Fig. 4). It is natural that methods considering the heterogeneity of noise get better objective function values because those methods have extra parameters to model noise of each source. Each iteration of variational inference calculates the noisy gradient by Monte Carlo sampling and can be significantly accelerated by GPU and it takes thousands of iterations to satisfy the stopping criterion. On the contrast, JMD and MAPBJMD pay more computational cost at each iteration and converge after a few iterations.
We can clearly see that the methods (JMD, VIBJMD, MAPBJMD) considering the heterogeneity of noise demonstrate significantly better performance than the others (GLAD, VIcatBJMD, MAPcatBJMD) in terms of AUCs (TABLE I). Moreover, the proposed methods MAPBJMD and VIBJMD obtain better performance than GLAD and JMD. We can also see that both JMD and GLAD cost considerable time even in such a small synthetic data (TABLE I). Thus, it is impractical to apply them to largescale realworld datasets. We utilized a GPU card to accelerate the variational inference. The synthetic data is small so that VIBJMD can not utilize all cores of the GPU card. While concatenating all the matrices will significantly accelerate the speed of iteration. In this situation, VIcatBJMD only spends around a third of the running time of VIBJMD. If the size of data matrix of each source is big enough, there will be no significant difference between VIBJMD and VIcatBJMD in terms of iteration speed. MAPBJMD and MAPcatBJMD are much faster than the other methods. Besides, MAPBJMD gets about the same performance as VIBJMD.
The performance on the largescale synthetic dataset is consistent with that on the smallscale synthetic dataset (Fig. 5 and TABLE II). VIcatBJMD exceeds the maximum iteration , while MAPBJMD and MAPcatBJMD take less than ten steps to converge (Fig. 5). Note that MAPBJMD and VIBJMD outperform MAPcatBJMD and VIcatBJMD respectively (TABLE II). Besides, VIBJMD obtains better performance than MAPBJMD. Since the coefficient matrices are relative large, the iteration speed of VIcatBJMD is only slightly faster than VIBJMD. Both VIcatBJMD and VIBJMD take more than 1000 seconds, but they are still applicable to realworld data. The MAP algorithm of BJMD are much faster than the VI one. It is more scalable for largescale data.
Algorithm  AUC  Time (s)  
Source 1  Source 2  Source 3  
GLAD  85.25  74.32  66.35  5068.10 
JMD  90.23  80.96  72.06  5709.29 
VIcatBJMD  86.99  78.51  72.79  84.44 
VIBJMD  99.66  90.87  79.16  230.21 
MAPcatBJMD  92.04  82.98  69.79  5.81 
MAPBJMD  99.67  90.31  77.85  4.20 
Algorithm  AUC  Time (s)  
Source 1  Source 2  Source 3  
VIcatBJMD  89.46  84.26  77.97  1461.89 
VIBJMD  99.84  96.96  91.85  1154.06 
MAPcatBJMD  89.79  80.16  72.45  27.26 
MAPBJMD  98.20  94.29  89.09  43.16 
4.2.4 The Effects of Heterogeneous Noise
To further investigate the effects of heterogeneous noise on model performance, we evaluate our methods on the small and largescale synthetic datasets respectively. JMD and GLAD are omitted on the largescale synthetic datasets. The results are showed in Fig 6. We can clearly see that:

Considering the heterogeneity of noise leads to superior performance. For example, AUCs of JMD are always higher than those of GLAD, and it is the same for VIBJMD and MAPBJMD.

The gap of performance between the algorithm considering heterogeneity of noise and those ignoring it is widened with the increase of noise level in Source 3.

AUCs of VIBJMD and MAPBJMD in Source 1 and Source 2 are not influenced by the increasing noise level in Source 3, which indicates that our proposed algorithms are more suitable for integrating data with different noise level.

The AUCs of VIcatBJMD in Source 1 and Source 2 suffers rapid decreasing after the noise level of Source 3 is greater than 2.5. The reason is that when the noise level of Source 3 is greater than 2.5, the variances of noise in Source 1 and Source 2 are overestimated, and thus the gradients of the coefficient matrices of Source 1 and Source 2 are smaller than the true gradient. As a consequence, VIcatBJMD needs much more iterations.

AUCs of VIBJMD are slightly better than those of MAPBJMD. VIBJMD approximates the posterior distribution by variational distribution, while MAPBJMD approximates the posterior distribution by its mode of density function. As a result, VI should more robust than MAP when the model is misspecified. One can observe that the AUCs of VIcatBJMD are better than those of MAPcatBJMD, when the noise level of Source 3 is less than 3.0.
The results on the largescale synthetic datasets are consistent with those on the smallscale ones (Fig 7). We can still observe that considering the heterogeneity of noise leads to superior performance. When the noise level on Source 3 is more than 3.5, the AUC performance of VIcatBJMD decreases sharply. Compared to the result on smallscale ones, VIBJMD outperforms MAPBJMD significantly. The performance curves of VIBJMD on Source 1 and Source 2 are very stable while the noise level of Source 3 increasing. It implies that VIBJMD successfully protects Source 1 and Source 2 from the influence of highly noisy source. However, the performance curves of MAPBJMD fluctuates slightly. Moreover, the performance of VIBJMD is consistently better than MAPBJMD in all sources.
Dataset  Dimensionality ()  Size ()  of groups  of classes 

3Sources  4750  948  3  6 
Extended Yale Face Database B  2016  1244  2  38 
METABRIC  19466  1911  3  5 
4.3 RealWorld Experiments
In this subsection, we applied VIBJMD and MAPBJMD to three realworld datasets and compared their performance with means, jointNMF, BNMF and semiNMF. GLAD and JMD are omitted due to their huge computational cost. The three datasets and data preprocessing are summarized below (TABLE III).
3Sources Dataset: The first dataset is the socalled 3Sources text data obtained from http://mlg.ucd.ie/datasets/3sources.html. This dataset is collected from three wellknown online news sources: BBC, Reuters and The Guardian, which contains 948 news articles covering 416 distinct news stories. Each story is annotated manually with one or more of six labels. BBC, Reuters and The Guardian are represented by termdocument matrices of size , , respectively. To conduct an integrative clustering, we combined the terms of those data matrices. As a result, the data matrices are of sizes , , respectively. Note that this dataset is very sparse. Each article is represented by a 4750dimensional vector and only a few entries are nonzeros.
Extended Yale Face Database B Dataset: The next dataset is the socalled Extended Yale Face Database B data [42, 43], which contains images of 38 individuals collected under different illumination conditions. For each image, the illumination condition is recorded by azimuth and elevation angles of the source of light. We used images whose azimuth and elevation angles are smaller than 60 degrees and resized them to . Those images can be divided into two groups: center lighting, the source of light is near to the center with azimuth angle and there are images in total; side lighting, the light is from side with azimuth angle and there are images in total. Therefore, the data matrices for center lighting and side lighting are of sizes and respectively. We normalized the data matrices by dividing its average for further analysis. Images collected under side lighting have more shadow, and thus it is natural to assumes that the noise level of the two groups is different.
Algorithm  3Sources  Yale Face B  METABRIC  Filtered METABRIC  
BBC  Guardian  Reuters  Center  Side  Low  Moderate  High  Low  Moderate  High  
VIBJMD  0.3226  0.3390  0.2922  0.1186  0.1804        0.5987  0.5998  0.6348 
MAPBJMD  0.3210  0.3370  0.2909  0.1106  0.1790  0.3455  0.3454  0.3684  0.5987  0.5987  0.6346 
Algorithm  AUC (%)  Time (s)  
BBC  Guardian  Reuters  
means  65.60  66.14  65.60  4.40 
JointNMF  80.95  80.61  80.94  0.65 
SemiNMF  76.42  73.70  76.42  41.99 
BNMF  82.84  83.51  82.87  77.37 
VIBJMD  90.25  90.78  90.33  2274.80 
MAPBJMD  90.57  88.82  89.46  6.55 
Algorithm  AUC (%)  Time (s)  
Center  Side  
means  61.97  55.77  6.05 
JointNMF  93.69  83.37  3.91 
SemiNMF  96.81  91.98  53.79 
BNMF  94.63  85.50  189.55 
VIBJMD  97.11  90.56  1444.34 
MAPBJMD  97.47  92.58  112.20 
METABRIC Breast Cancer Dataset: The third dataset is the socalled METABRIC data [44, 45]
, which contains a large breast cancer patient cohort of around 2000 samples with detailed clinical measurements and genomewide molecular profiles. Using gene expression to classify invasive breast cancers into biologically and clinically distinct subtypes has been studied for over a decade. In 2009, Parker
et al. developed an efficient classifier called PAM50, to distinguish the intrinsic subtypes of breast cancer [46]. PAM50 is now commonly employed in breast cancer studies [47, 48, 49]. Thus, we used the PAM50 subtypes as reference to evaluate the results of clustering. The cellularity of breast tumor is a concerned clinical indicator in treatment [50, 51, 52]. Therefore, we used the cellularity to divide samples into groups. We first mapped probes of gene expression to gene names by average pooling. Then we removed samples that PAM50 subtypes or cellularity is not available. There are 19466 genes and 1911 samples left after preprocessing. The samples are divided into 3 groups: low cellularity, moderate cellularity and high cellularity with size of , and , respectively. The number of features of this dataset is much larger than the number of samples.We applied our algorithms to these three datasets. In all experiments, we set the Dirichlet prior and the Laplace prior . The tolerance of MAPBJMD and VIBJMD is and , respectively. We used means, jointNMF, BNMF and semiNMF for comparison. Among these four algorithms, BNMF is a Bayesian approach. For each experiment, the AUC is reported by the average of the best 5 of 20 runs.
TABLE IV shows the estimated noise level on each dataset. TABLE VI, VI and VIII show the clustering performance on 3Sources, Extended Yale Face Database B and METABRIC dataset, respectively. The AUCs and running time are reported. VIBJMD takes much time on the METABRIC dataset and thus it is not reported (TABLE VIII).
These experiments reveal some important and interesting points:

Both MAPBJMD and VIBJMD can achieve similar or better performance than the competing methods over all datasets.

The algorithm of SemiNMF involves calculating the inverse of matrix. As we mentioned earlier, 3Sources data is very sparse. Therefore, SemiNMF is not robust on this dataset. MAPBJMD involves calculating the inverse of matrix in updating basis matrix too, but the Laplace priors can protect the inverse of matrix from singular. The Bayesian methods including BNMF, VIBJMD and MAPBJMD outperform other methods on 3Sources dataset. It suggests the superiority of Bayesian priors for very sparse data.

All methods except MAPBJMD have relative poor performance on METABRIC data. This suggests that MAPBJMD is more robust than other methods. We note that the number of features is much larger than the number of samples in the METABRIC dataset. We believe that feature selection will be needed to improve this situation.

Although VIBJMD outperforms MAPBJMD in synthetic experiments. However, VIBJMD is not guaranteed to outperform MAPBJMD in realwold data. Realworld data is much more complicated and VIBJMD often exceeds the maximum steps.

Bayesian methods including VIBJMD and BNMF take more time than other methods, while the speed of MAPBJMD is comparable to other ones. Thus, MAPBJMD is more scalable and efficient.

The estimated noise levels of VIBJMD and MAPBJMD are very consistent. Moreover, the estimated noise level can be used to guide feature selection as discussed in the next subsection.
4.4 Adaptive Feature Selection
In this subsection, we discuss how to use the estimated noise level to select features adaptively. In biological applications, the features are often redundant. It is common that we have thousands of features but only tens or hundreds of samples. Thus, feature selection is needed. Perhaps the most straightforward approach to select features is to choose those variance greater than a threshold. However, it is difficult to specify the threshold value for unsupervised tasks. Luckily, we can use the estimated noise level as the threshold to conduct an adaptive feature selection.
Algorithm  AUC (%)  Time (s)  
Low  Moderate  High  
means  62.99  66.82  72.73  24.74 
JointNMF  65.81  71.85  75.07  45.58 
SemiNMF  58.90  55.02  53.48  311.24 
BNMF  64.56  66.27  68.59  423.15 
MAPBJMD  82.22  79.05  81.45  27.17 
Algorithm  AUC (%)  Time (s)  
Low  Moderate  High  
means  67.01  70.16  78.85  5.53 
JointNMF  82.70  79.93  82.51  18.76 
SemiNMF  63.18  62.89  59.89  76.28 
BNMF  82.71  80.32  85.43  101.96 
VIBJMD  84.93  81.81  85.16  3623.24 
MAPBJMD  84.34  79.55  84.03  20.70 
Specifically, as MAPBJMD is much faster than VIBJMD, we can use MAPBJMD to estimate the variance of Gaussian noise for each source. If the variance of a feature is less than the variance of noise. It is natural to assume that it has no significant effect on clustering. The procedure of adaptive feature selection is summarized as follows: Given a dataset with C sources, we first use MAPBJMD to estimate the variance of each source. Then, for each source we conduct
individual hypothesis tests with null hypothesis
: variance of feature
Comments
There are no comments yet.