Bayesian Joint Matrix Decomposition for Data Integration with Heterogeneous Noise

by   Chihao Zhang, et al.

Matrix decomposition is a popular and fundamental approach in machine learning and data mining. It has been successfully applied in various fields. Most matrix decomposition methods focus on decomposing a data matrix from one single source. However, it is common that data come from different sources with heterogeneous noise. A few matrix decomposition methods have been extended for such multi-view data integration and pattern discovery, but only few of them were designed to explicitly consider the heterogeneity of noise in such multi-view data. To this end, we propose a joint matrix decomposition framework (BJMD), which models the heterogeneity of noise by Gaussian distributions in a Bayesian framework. We develop two algorithms to solve this model: one is a variational Bayesian inference algorithm, which makes full use of the posterior distribution; the other is a maximum a posteriori algorithm, which is more scalable and can be easily parallelized. Extensive experiments on synthetic and real-world datasets demonstrate that BJMD, by considering the heterogeneity of noise, is superior or competitive to the state-of-the-art methods.








1 Introduction

Matrix decomposition (a.k.a. matrix factorization) plays a fundamental role in machine learning and data mining. As we know, the basic task of machine learning and data mining is to discover knowledge from primitive data. In many cases, primitive datasets or observations are organized in matrix form. For example, an image can be stored as a matrix of pixels; a corpus is often represented by a document-term matrix that describes the frequency with which terms occur in the collection of documents. The primitive datasets are generally big and noisy matrices, and extracting useful information from them directly is often infeasible. Therefore, matrix decomposition has become a powerful tool for obtaining a more compact and meaningful representation of observed matrices. Matrix decomposition has been successfully applied to various fields including image analysis [1, 2, 3], gene expression analysis [4, 5, 6], collaborative filtering [7, 8, 9], text mining [10, 11] and many other areas.

Generally, most matrix decomposition methods assume that the data matrix comes from one single source. The conceptual idea underlying matrix decomposition is that the observed data matrix can be approximated by a product of two or more low-rank matrices. Given an observed data matrix $X \in \mathbb{R}^{M \times N}$, a general matrix decomposition method can be summarized as:

$$X \approx UV,$$

which is further formulated into an optimization problem:

$$\min_{U, V} D(X \,\|\, UV),$$

where $D$ is a divergence function. Notice that for each column of $X$, $x_{\cdot n} \approx U v_{\cdot n}$, $n = 1, \dots, N$, where $N$ is the number of columns of $X$, i.e., each column of the observed matrix is approximated by a linear combination of all columns of $U$. Therefore, $U \in \mathbb{R}^{M \times K}$ is often referred to as a basis matrix, while $V \in \mathbb{R}^{K \times N}$ is often referred to as a coefficient matrix.
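As a minimal numerical sketch (our own toy example, not a method from the paper), the generic decomposition $X \approx UV$ can be obtained for the Frobenius divergence via truncated SVD; the names `U`, `V` and the rank `K` below are illustrative assumptions:

```python
import numpy as np

# Approximate a data matrix X by a rank-K product U @ V via truncated SVD,
# illustrating the generic decomposition X ~= U V, where each column of X
# is a linear combination of the columns of the basis matrix U.
rng = np.random.default_rng(0)
M, N, K = 20, 30, 3
X = rng.standard_normal((M, K)) @ rng.standard_normal((K, N))  # exactly rank K

Uf, s, Vt = np.linalg.svd(X, full_matrices=False)
U = Uf[:, :K] * s[:K]          # basis matrix (M x K)
V = Vt[:K, :]                  # coefficient matrix (K x N)

err = np.linalg.norm(X - U @ V) / np.linalg.norm(X)
print(err < 1e-8)  # True: an exactly rank-K matrix is recovered exactly
```

Constrained variants (k-means, semi-NMF, NMF) replace the unconstrained SVD factors with structured $U$ and $V$, as discussed next.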

Generally, one can obtain different matrix decomposition methods by placing different constraints on $U$ and $V$. For example, $k$-means clustering may be considered the simplest matrix decomposition method [12]. In $k$-means clustering, $U$ contains the clustering centroids and has no restriction; $V$ is a binary matrix and each column of $V$ indicates the cluster membership by a single one. Semi-nonnegative matrix factorization (semi-NMF) [12] relaxes the restriction of $k$-means: it only restricts $V$ to be nonnegative. The nonnegative restriction on $V$ usually improves the interpretability of the basis matrix. To decompose nonnegative data like images, nonnegative matrix factorization (NMF) has been explored [13, 1]. NMF restricts both $U$ and $V$ to be nonnegative. The nonnegative constraints lead to part-based feature extraction and a degree of natural sparseness [1]. Moreover, due to its effectiveness in clustering and data representation, NMF has been extended to many variant forms including various sparse NMF models [2, 14] and regularized NMF models [15, 16].

However, those standard matrix factorization methods have several main drawbacks. First, they are prone to overfitting the observed data matrices. Second, their regularization parameters require careful tuning. Bayesian approaches have been applied to matrix factorization [17, 18, 19, 20, 21] to address those weaknesses by incorporating priors over model parameters and hyperparameters. Third, the above methods only focus on a data matrix from a single source.

Currently, multi-view data collected from different sources are very common. Such data from different sources might be complementary to each other. Moreover, they are usually of different quality (i.e., the noise level differs across sources). How to make full use of those data is an important but challenging topic. Many studies have devoted their efforts to developing mathematical methods for data integration. For example, Zhang et al. [22] proposed joint NMF (jNMF) to perform integrative analysis of multiple types of genomic data to discover combinatorial patterns that would have been ignored with only a single type of data. jNMF assumes that the different datasets share a common basis matrix. Moreover, Zhang et al. [23] also developed a sparsity-constrained and network-regularized joint NMF for the integration of multiple genomics datasets. Liu et al. [24] developed multi-view NMF (MultiNMF) for multi-view clustering. Unlike jNMF and other methods that share a one-side factor among multi-view data matrices, MultiNMF uses a consensus coefficient matrix to produce a multi-view clustering result. More recently, Chalise and Fridley [25] proposed a weighted joint NMF model (intNMF) for integrative analysis, which assigns higher weights to data from more reliable sources. However, few of these methods address the heterogeneity of noise explicitly. jNMF ignores the noise heterogeneity; MultiNMF and intNMF address this problem implicitly by giving lower weights to noisier data sources, but the weight of each source is difficult to determine. Thus, powerful computational methods that properly account for the heterogeneous noise of multi-view data during integration are urgently needed.

To this end, we propose a powerful joint matrix decomposition framework (BJMD), which models the heterogeneity of noise explicitly by Gaussian distributions in a Bayesian framework. Mathematically, it retains the interpretability and sparsity of the basis matrix, as regularized matrix decomposition does. We develop two algorithms to solve this model: one is a variational Bayesian inference algorithm, which makes full use of a proposal distribution to approximate the posterior distribution; the other is a maximum a posteriori (MAP) algorithm, which reduces to an iterative optimization procedure and is therefore more efficient and scalable. We extensively compare the two algorithms and discuss their advantages and disadvantages. Extensive experiments on synthetic and real-world datasets show the effectiveness of BJMD for modeling the noise and demonstrate that considering the heterogeneity of noise leads to superior performance over the state-of-the-art methods.

2 Related Work

2.1 Joint NMF and Other Variants

Joint NMF (jNMF) is a natural extension of NMF for integrating multiple datasets [22]. For data matrices $X^{(1)}, \dots, X^{(C)}$, the optimization format of jNMF is formulated as:

$$\min_{U \ge 0,\; V^{(c)} \ge 0} \; \sum_{c=1}^{C} \big\| X^{(c)} - U V^{(c)} \big\|_F^2,$$

where $U$ is the common or shared basis matrix and $V^{(c)}$ is the corresponding coefficient matrix. The common or shared patterns in the multi-view data matrices from different sources can be reflected by the shared basis matrix. One can easily see that jNMF is mathematically equivalent to NMF by setting $X = [X^{(1)}, \dots, X^{(C)}]$ and $V = [V^{(1)}, \dots, V^{(C)}]$. Therefore, jNMF ignores the heterogeneity of noise in different data sources. Stražar et al. [26] extended jNMF by adding orthogonality-regularized terms on the coefficient matrices to predict protein-RNA interactions (iONMF). The orthogonality regularization prevents multicollinearity, and iONMF was stated to outperform other NMF models in predicting protein-RNA interactions. However, the heterogeneity of noise among different data types is still ignored.
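The stated equivalence between jNMF and plain NMF on the concatenated matrix can be checked numerically. Below is a small sketch (our own example; the matrix sizes are arbitrary assumptions) showing that the jNMF objective equals the single-matrix objective on $[X^{(1)}, \dots, X^{(C)}]$, which is precisely why jNMF cannot distinguish source-specific noise levels:

```python
import numpy as np

# The jNMF objective sum_c ||X_c - U V_c||_F^2 equals the single-matrix
# objective ||[X_1,...,X_C] - U [V_1,...,V_C]||_F^2 for any shared basis U.
rng = np.random.default_rng(1)
M, K = 10, 4
Ns = [8, 12, 5]                        # columns per source (arbitrary)
U = rng.random((M, K))
Vs = [rng.random((K, n)) for n in Ns]  # per-source coefficient matrices
Xs = [rng.random((M, n)) for n in Ns]  # per-source data matrices

joint = sum(np.linalg.norm(X - U @ V) ** 2 for X, V in zip(Xs, Vs))
concat = np.linalg.norm(np.hstack(Xs) - U @ np.hstack(Vs)) ** 2
print(np.isclose(joint, concat))  # True
```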

MultiNMF extends jNMF to multi-view clustering and requires the coefficient matrices learned from various views to be approximately common. Specifically, it is formulated as follows:

$$\min_{U^{(c)} \ge 0,\; V^{(c)} \ge 0,\; V^* \ge 0} \; \sum_{c=1}^{C} \big\| X^{(c)} - U^{(c)} V^{(c)} \big\|_F^2 + \theta_c \big\| V^{(c)} - V^* \big\|_F^2,$$

where $V^*$ is the consensus coefficient matrix and $\theta_c$ is the weight parameter tuning the relative importance of the different views. In image processing, Jing et al. [27] proposed a supervised joint matrix factorization model for image classification and annotation (SNMFCA). SNMFCA factorizes the region-image matrix and the annotation-image matrix simultaneously and incorporates the label information as a network-regularized term. SNMFCA also uses weight hyperparameters to balance the importance of each data source. Recently, Chalise and Fridley [25] proposed intNMF, an extension of jNMF that specifies a weight hyperparameter for each data source. Specifically, it is formulated as follows:

$$\min_{U \ge 0,\; V^{(c)} \ge 0} \; \sum_{c=1}^{C} \theta_c \big\| X^{(c)} - U V^{(c)} \big\|_F^2,$$

where $\theta_c$ is a user-predefined weight parameter. These methods use weight parameters to regulate the importance of each data source. However, how to choose those weight parameters is still not sufficiently studied in the literature.

2.2 GLAD

Bayesian methods for matrix decomposition have been studied for about twenty years. In 1999, Bishop [17] proposed Bayesian principal component analysis (BPCA), which can automatically determine an effective number of principal components as part of the inference procedure. Later, Bishop [28] gave an effective variational inference method for BPCA. Welling and Kurihara [18] introduced Bayesian $k$-means, which retains the ability to select the model structure by incorporating priors on model parameters. Schmidt et al. [19] proposed a Bayesian variant of NMF by introducing exponential priors on the basis matrix and the coefficient matrix respectively (BNMF). Moreover, BNMF assumes Gaussian noise over the product of the basis matrix and the coefficient matrix. However, the Gaussian noise may cause entries of the data matrix to be negative. In collaborative filtering, Salakhutdinov and Mnih [20] proposed Bayesian probabilistic matrix factorization (BPMF), which places Gaussian priors over both the basis and coefficient matrices to predict user preferences for movies. BPMF has attracted great interest among researchers from diverse fields. Moreover, other Bayesian methods for matrix decomposition in the collaborative filtering field have also been proposed [29, 30, 31].

GLAD [21] is a mixed-membership model that utilizes three typical distributions (i.e., Gaussian, Laplace and Dirichlet distributions). GLAD can be described as a matrix decomposition model with a Laplace prior on the basis matrix, a Dirichlet prior on the coefficient matrix and Gaussian noise on the observed data matrix. Specifically, GLAD assumes that the observed data matrix is generated as follows:

$$x_{mn} = \sum_{k=1}^{K} u_{mk} v_{kn} + \varepsilon,$$

where each element of the basis matrix $u_{mk} \sim \mathrm{Laplace}(0, \lambda)$, each column of the coefficient matrix $v_{\cdot n} \sim \mathrm{Dir}(\alpha)$, and $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ indicates the Gaussian noise. Mathematically, the Laplace prior over $U$ enforces the sparsity of the basis matrix, and $v_{\cdot n}$ has a Dirichlet distribution, providing a distribution over clusters for each column of $X$. Inference for GLAD focuses on estimating the posterior distribution over the latent variables:

$$p(U, V \mid X) = \frac{p(X \mid U, V)\, p(U)\, p(V)}{p(X)}.$$
However, due to the non-conjugate priors, it is intractable to calculate the posterior distribution, which involves a high-dimensional integral. GLAD therefore resorts to non-conjugate variational inference (VI). Instead of calculating the integral exactly, VI uses a distribution $q(U, V; \theta)$, parameterized by $\theta$, to approximate the true posterior distribution. Specifically, VI aims at minimizing the Kullback-Leibler (KL) divergence between the approximating distribution and the true posterior distribution:

$$\min_{q \in \mathcal{Q}} \; \mathrm{KL}\big( q(U, V; \theta) \,\|\, p(U, V \mid X) \big),$$

where $\mathrm{KL}(q \,\|\, p)$ denotes the KL divergence between $q$ and $p$, and $\mathcal{Q}$ denotes a family of probability distribution functions (PDFs) that makes the integral tractable. Because it involves the posterior distribution, the KL divergence is still difficult to optimize. Instead, VI maximizes the evidence lower bound (ELBO):

$$\mathcal{L}(\theta) = \mathbb{E}_{q}\big[ \log p(X, U, V) \big] + \mathbb{H}\big[ q(U, V; \theta) \big], \tag{9}$$

where the first term is the expectation of the joint distribution over the approximating distribution, and the second term is the entropy of the approximating distribution. Therefore, if the expectation in (9) can be calculated analytically, VI turns the inference problem into an optimization problem. Unfortunately, this expectation is still difficult to compute for GLAD. To address this issue, GLAD further introduces a Laplace approximation to handle the expectation in the ELBO.

The inference of GLAD has two main drawbacks. First, it approximates the ELBO, which is itself only a surrogate for the true posterior; consequently, the gap between the approximating distribution and the true posterior distribution may widen. Second, it is still computationally expensive, as it involves a non-linear optimization problem at each iteration. Thus, GLAD takes considerable time to converge even for small data, which makes it unrealistic for real-world datasets.

3 Bayesian Joint Matrix Decomposition

In this section, we present the Bayesian joint matrix decomposition framework (BJMD) (Fig. 1a), and develop two algorithms to solve it.

Fig. 1: Graphical models of (a) BJMD and (b) reformulated BJMD.

3.1 Model Construction

Suppose that we have multi-view data matrices $X^{(1)}, \dots, X^{(C)}$ with $X^{(c)} \in \mathbb{R}^{M \times N_c}$. Inspired by previous studies, the generative model of BJMD is formulated as follows:

$$X^{(c)} = U V^{(c)} + E^{(c)}, \quad c = 1, \dots, C,$$

where $U \in \mathbb{R}^{M \times K}$ is a shared basis matrix and $E^{(c)}$ models the noise of each data source by a Gaussian distribution. To enforce the sparsity of the basis matrix [32], we place a Laplace prior with zero mean on it:

$$p(u_{mk}) = \mathrm{Laplace}(u_{mk}; 0, \lambda) = \frac{1}{2\lambda} \exp\Big( -\frac{|u_{mk}|}{\lambda} \Big).$$
The non-negative restriction on the coefficient matrix of semi-NMF [12] leads to better interpretability of the basis matrix. However, the columns of the basis matrix are usually not comparable without proper normalization, which weakens the interpretability. In fact, one can see that semi-NMF is invariant to column scaling on $U$ or row scaling on $V$: for a diagonal matrix $D$ of size $K \times K$, $(UD)(D^{-1}V)$ is equal to $UV$. To this end, we additionally restrict each column of the coefficient matrices to sum to one. As the support of the Dirichlet distribution is a unit simplex, we place a Dirichlet prior on each column of the coefficient matrices $V^{(c)}$:

$$p\big( v^{(c)}_{\cdot n} \big) = \mathrm{Dir}\big( v^{(c)}_{\cdot n}; \alpha \big),$$

where $\alpha$ is a $K$-dimensional vector and $\alpha_k$ indicates the $k$th element of $\alpha$. Moreover, we use a Gaussian distribution in each data source to model the heterogeneity of noise explicitly:

$$x^{(c)}_{mn} \sim \mathcal{N}\Big( \sum_{k=1}^{K} u_{mk} v^{(c)}_{kn},\; \sigma_c^2 \Big).$$
To obtain a complete Bayesian model, we also need to introduce a prior distribution for the variance of the Gaussian distribution. For convenience, we choose the conjugate prior for $\sigma_c^2$, i.e., $\sigma_c^2 \sim \mathrm{IG}(a, b)$, where $a$ and $b$ are the two hyperparameters of the inverse-gamma distribution and can be set to small positive values in a noninformative fashion [33].

Under the BJMD model, the posterior is proportional to the complete likelihood, which can be written as:

$$p\big( U, \{V^{(c)}\}, \{\sigma_c^2\} \mid \{X^{(c)}\} \big) \propto p(U) \prod_{c=1}^{C} p\big( X^{(c)} \mid U, V^{(c)}, \sigma_c^2 \big)\, p\big( V^{(c)} \big)\, p\big( \sigma_c^2 \big),$$

which helps us understand it from an optimization perspective. Hence, we write down the log-likelihood. For simplicity, we formulate the complete negative log-likelihood retaining only terms involving $U$, $V^{(c)}$ and $\sigma_c$, as follows:

$$\sum_{c=1}^{C} \Big[ \frac{1}{2\sigma_c^2} \big\| X^{(c)} - U V^{(c)} \big\|_F^2 + M N_c \log \sigma_c \Big] + \frac{1}{\lambda} \sum_{m=1}^{M} \sum_{k=1}^{K} |u_{mk}| - \sum_{c=1}^{C} \sum_{n=1}^{N_c} \sum_{k=1}^{K} (\alpha_k - 1) \log v^{(c)}_{kn}, \tag{18}$$

with the unit simplex constraints $\sum_k v^{(c)}_{kn} = 1$ and $v^{(c)}_{kn} \ge 0$ for any $c$, $k$, $n$. The first term in (18) can be viewed as a divergence function between the observation $X^{(c)}$ and the approximation $U V^{(c)}$. The variance $\sigma_c^2$ of the Gaussian distribution acts as a weight parameter, giving higher weight to data points from sources with lower noise levels. Different from the weight hyperparameters of intNMF and MultiNMF, $\sigma_c^2$ can be automatically inferred during the optimization process. The second term of this equation is an $\ell_1$-norm regularization on the basis matrix $U$, which promotes the sparsity of $U$. The last term is a regularization term on the coefficient matrices. If we omit the first two terms, one can find that the last term is minimized when $v^{(c)}_{kn} = (\alpha_k - 1)/\sum_{k'} (\alpha_{k'} - 1)$, the mode of the Dirichlet prior. Therefore, the last term pulls $v^{(c)}_{\cdot n}$ toward the prior and reduces the risk of overfitting.

3.2 Variational Inference

In this subsection, we adopt a variational inference algorithm for the BJMD model. For further comparison, we first extend GLAD to multi-view data matrices using a shared basis matrix and different coefficient matrices. Note that GLAD and the extended GLAD model for multi-view data are not in a full Bayesian framework: both models treat the variance of the Gaussian distribution as a hyperparameter rather than a random variable. Thus, we refer to the extended GLAD model as joint matrix decomposition (JMD) hereafter.

We can easily see that the inference algorithm of GLAD can be naturally extended to JMD. However, the extended inference algorithm of JMD still suffers from the aforementioned problems of GLAD: (a) the approximation of the ELBO might not be a good approximation to the true posterior; (b) the expensive computational cost makes it unrealistic for real-world datasets. Recent advances in stochastic variational methods, including black box variational inference (BBVI) [34], auto-encoding variational Bayes (AEVB) [18] and automatic differentiation variational inference (ADVI) [35], provide powerful tools to tackle those problems. These stochastic variational inference methods avoid calculating the ELBO directly. Instead, they take the derivative of the ELBO and push the derivative operator into the expectation:

$$\nabla_\theta \mathcal{L}(\theta) = \mathbb{E}_{q(z; \theta)} \big[ \nabla_\theta \log q(z; \theta) \big( \log p(x, z) - \log q(z; \theta) \big) \big], \tag{19}$$

where $x$ denotes the observed data, $z$ denotes the latent variables, $q(z; \theta)$ denotes the variational distribution parameterized by $\theta$, $p(x, z)$ denotes the joint distribution of latent variables and observed data, and $\mathcal{L}(\theta)$ denotes the ELBO. The gradient of the ELBO (19) is expressed as an expectation over the variational distribution, which can be approximated by a Monte Carlo method:

$$\nabla_\theta \mathcal{L}(\theta) \approx \frac{1}{S} \sum_{s=1}^{S} \nabla_\theta \log q(z_s; \theta) \big( \log p(x, z_s) - \log q(z_s; \theta) \big), \tag{20}$$

where $S$ is the number of samples and $z_s \sim q(z; \theta)$. Now we can update the variational parameters with these noisy gradients using simple computations.
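The noisy-gradient idea in (20) can be illustrated on a toy conjugate model (our own example, not the BJMD model): for $z \sim \mathcal{N}(0,1)$, $x \mid z \sim \mathcal{N}(z,1)$ and variational family $q(z; \mu) = \mathcal{N}(\mu, 1)$, the exact ELBO gradient is $x - 2\mu$, so the Monte Carlo score-function estimator can be checked against it:

```python
import numpy as np

# Score-function Monte Carlo estimate of the ELBO gradient for a toy model:
#   z ~ N(0,1),  x|z ~ N(z,1),  q(z; mu) = N(mu, 1).
# The exact gradient of the ELBO with respect to mu is x - 2*mu.
rng = np.random.default_rng(42)
x, mu, S = 1.0, 0.3, 200_000

z = rng.normal(mu, 1.0, size=S)
log_joint = -(x - z) ** 2 / 2 - z ** 2 / 2   # log p(x, z) up to a constant
log_q = -(z - mu) ** 2 / 2                   # log q(z; mu) up to a constant
score = z - mu                               # d/dmu log q(z; mu)
grad_est = np.mean(score * (log_joint - log_q))

print(grad_est)  # noisy estimate of the exact gradient x - 2*mu = 0.4
```

The estimate fluctuates around the exact value, which is exactly the behavior that makes the stopping criterion discussed below non-trivial.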

We implement ADVI for optimizing the BJMD model with the Python library PyMC3 [36]. ADVI converts all constrained latent variables to unconstrained ones and uses Gaussian distributions to approximate the ELBO of the unconstrained random variables. Following the previous notation, ADVI can be briefly summarized as follows:

  (i) First, ADVI transforms the constrained latent variables into unconstrained real-valued variables by a one-to-one mapping. ADVI minimizes the KL divergence in the unconstrained space, in which a single family of distributions can be used for all models.

  (ii) Then ADVI expresses the gradient of the variational objective function as an expectation over $q$, which can be estimated by a Monte Carlo sampling strategy as in (20).

  (iii) Furthermore, ADVI reparameterizes the gradient in terms of a standard Gaussian. This transformation turns the Monte Carlo procedure into sampling from a standard Gaussian distribution, so that it can be sampled efficiently.

  (iv) Finally, ADVI uses the noisy gradients to optimize the variational distribution. As all the transformations mentioned in the previous steps are one-to-one mappings, we can map the variational distribution back to the original space of the constrained latent variables to obtain the variational parameters of interest.

ADVI can automatically infer probabilistic models by the above steps. One should notice that the variational distribution is optimized by noisy gradients, and thus it is possible for ADVI to escape from local optima and avoid overfitting. The disadvantage of noisy gradients is that the ELBO will not converge; instead, it fluctuates within a small interval. Therefore, it is difficult to decide when to stop iterating. Empirically, stopping immediately after the ELBO becomes stable often leads to poor performance. As we use the coefficient matrices for further clustering and representation, we stop iterating when the coefficient matrices are stable. ADVI empirically takes tens of thousands of iterations to obtain a well-trained model, so checking the relative change of the coefficient matrices at every iteration would incur an expensive computational cost. Thus, we check the coefficient matrices every $T$ iterations to ensure that the model is sufficiently trained:

$$\max_{c} \frac{\big\| V^{(c)}_{t} - V^{(c)}_{t-T} \big\|_F}{\big\| V^{(c)}_{t-T} \big\|_F} < \tau, \tag{21}$$

where $\tau$ is a tolerance threshold (a small positive value) and $V^{(c)}_{t}$ is the coefficient matrix of source $c$ at the $t$th iteration ($t \ge T$). Note that the relative change of the coefficient matrices will not converge to zero due to the noisy gradients. Thus, the iteration is stopped when criterion (21) is satisfied or the iteration count exceeds $t_{\max}$, a user-predefined maximum number of iterations.

3.3 Maximum A Posteriori

ADVI makes it possible to apply the BJMD framework to real-world datasets, yet it still takes tens of thousands of iterations before the stopping criterion is satisfied. To obtain a more efficient and scalable algorithm, we develop a maximum a posteriori (MAP) algorithm, which directly minimizes the negative log-posterior (18), treating inference as an optimization problem. The key difference between VI and MAP is that VI approximates the posterior distribution by a family of distributions, whereas MAP approximates the posterior distribution by its mode. VI makes full use of the posterior distribution, but pays a more expensive computational cost. In contrast, MAP is a point estimate with less computational cost. We present a MAP algorithm for the BJMD model in the following.

3.3.1 Model Reformulation

Due to the absolute value in the Laplace distribution, posterior inference for (18) within a Bayesian framework is inconvenient. To address this computational issue, we reformulate the model by utilizing a two-level hierarchical representation of the Laplace distribution. Specifically, a random variable following a Laplace distribution can be represented as a Gaussian scale mixture with an exponential prior on the variance [37]. That is:

$$\mathrm{Laplace}(u; 0, \lambda) = \int_0^{\infty} \mathcal{N}(u; 0, \eta)\, \mathrm{Exp}\Big( \eta; \frac{1}{2\lambda^2} \Big)\, d\eta, \tag{22}$$

where the term $\mathrm{Exp}(\eta; \theta) = \theta e^{-\theta \eta}$ is the PDF of the exponential distribution.
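The scale-mixture identity in (22) can be verified empirically (a sanity-check sketch of our own, not part of the algorithm): drawing a variance from the exponential prior and then a Gaussian with that variance should reproduce Laplace statistics, e.g. $\mathbb{E}|u| = \lambda$ for $\mathrm{Laplace}(0, \lambda)$:

```python
import numpy as np

# Sample eta ~ Exponential(rate 1/(2*lam^2)), then u ~ N(0, eta).
# The marginal of u should be Laplace(0, lam), for which E|u| = lam.
rng = np.random.default_rng(7)
lam, S = 1.5, 200_000

eta = rng.exponential(scale=2 * lam ** 2, size=S)  # numpy's scale = 1/rate
u = rng.normal(0.0, np.sqrt(eta))                  # Gaussian given its variance

print(np.mean(np.abs(u)))  # should be close to lam = 1.5
```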

Therefore, we can replace the Laplace prior of $U$ with the two-level hierarchical representation to eliminate the absolute value factors. This is a useful technique for simplifying computation, which has been used in other studies [38]. Under this reformulation, the complete log-likelihood is:

$$\log p = \sum_{c=1}^{C} \Big[ \log p\big( X^{(c)} \mid U, V^{(c)}, \sigma_c^2 \big) + \log p\big( V^{(c)} \big) + \log p\big( \sigma_c^2 \big) \Big] + \sum_{m,k} \Big[ \log \mathcal{N}(u_{mk}; 0, \eta_{mk}) + \log \mathrm{Exp}\Big( \eta_{mk}; \frac{1}{2\lambda^2} \Big) \Big], \tag{23}$$

where the random matrix $H = (\eta_{mk})$ is an auxiliary variable. Substituting the corresponding PDFs into (23), the negative complete log-likelihood is formulated as follows:

$$\sum_{c=1}^{C} \Big[ \frac{1}{2\sigma_c^2} \big\| X^{(c)} - U V^{(c)} \big\|_F^2 + M N_c \log \sigma_c + \frac{b}{\sigma_c^2} + (a+1) \log \sigma_c^2 \Big] + \sum_{m,k} \Big[ \frac{u_{mk}^2}{2\eta_{mk}} + \frac{1}{2} \log \eta_{mk} + \frac{\eta_{mk}}{2\lambda^2} \Big] - \sum_{c,n,k} (\alpha_k - 1) \log v^{(c)}_{kn} + \mathrm{const}, \tag{24}$$

with the unit simplex constraints $\sum_k v^{(c)}_{kn} = 1$ and $v^{(c)}_{kn} \ge 0$ for any $c$, $k$, $n$. Note that the absolute value bars are eliminated after the reformulation; the original $\ell_1$-norm regularization term is converted into a weighted $\ell_2$-norm regularization term, which is easier to optimize. Fig. 1b shows the graphical model of the reformulated BJMD.

Equivalently, we minimize the negative complete log-likelihood (24) in the following optimization form:

$$\min_{U,\; \{V^{(c)}\},\; \{\eta_{mk}\},\; \{\sigma_c^2\}} \; (24) \quad \text{s.t.} \quad \sum_k v^{(c)}_{kn} = 1, \; v^{(c)}_{kn} \ge 0, \; \eta_{mk} > 0. \tag{25}$$
3.3.2 Update Rules

Here, we propose an iterative update procedure to solve problem (25), which can be easily parallelized. Note that the optimization problem (25) is not convex. Therefore, we optimize it in a block-coordinate manner. Specifically, we update one latent variable with the others fixed until convergence.

Updating $\eta_{mk}$ and fixing others: Setting the partial derivative with respect to $\eta_{mk}$ to zero gives $\eta_{mk}^2 + \lambda^2 \eta_{mk} - \lambda^2 u_{mk}^2 = 0$. This is a simple quadratic equation in one variable. Considering the constraint $\eta_{mk} > 0$, we have:

$$\eta_{mk} = \frac{-\lambda^2 + \sqrt{\lambda^4 + 4 \lambda^2 u_{mk}^2}}{2}. \tag{26}$$
Updating $U$ and fixing others: We only care about $U$ at this step. Therefore, retaining terms only involving $U$, the optimization problem is equivalent to:

$$\min_{U} \; \sum_{c=1}^{C} \frac{1}{2\sigma_c^2} \big\| X^{(c)} - U V^{(c)} \big\|_F^2 + \sum_{m,k} \frac{u_{mk}^2}{2\eta_{mk}},$$

with no constraints. Note that the above optimization problem is separable. For each row $u_{m \cdot}$ of $U$, the subproblem can be written as:

$$\min_{u_{m \cdot}} \; \sum_{c=1}^{C} \frac{1}{2\sigma_c^2} \big\| x^{(c)}_{m \cdot} - u_{m \cdot} V^{(c)} \big\|_2^2 + \frac{1}{2} u_{m \cdot} D_m^{-1} u_{m \cdot}^{\top}, \qquad D_m = \mathrm{diag}(\eta_{m1}, \dots, \eta_{mK}), \tag{28}$$

where $\mathrm{diag}(\cdot)$ denotes the diagonal function. This subproblem is close to ridge regression [39]. The difference is that in ridge regression the elements of the penalty vector are equal, i.e., the diagonal matrix has the special form $D = \lambda I$, while in our case each element $\eta_{mk}$ of $D_m$ follows the exponential distribution with parameter $\frac{1}{2\lambda^2}$. Setting the derivative of (28) to zero:

$$\sum_{c=1}^{C} \frac{1}{\sigma_c^2} \big( u_{m \cdot} V^{(c)} - x^{(c)}_{m \cdot} \big) V^{(c)\top} + u_{m \cdot} D_m^{-1} = 0.$$

The optimal solution of subproblem (28) is:

$$u_{m \cdot} = \Big( \sum_{c=1}^{C} \frac{1}{\sigma_c^2} x^{(c)}_{m \cdot} V^{(c)\top} \Big) \Big( \sum_{c=1}^{C} \frac{1}{\sigma_c^2} V^{(c)} V^{(c)\top} + D_m^{-1} \Big)^{-1}. \tag{30}$$
Updating $V^{(c)}$ and fixing others: Each column of $V^{(c)}$ follows a Dirichlet distribution, whose support is a $(K-1)$-dimensional simplex. Retaining terms involving $V^{(c)}$, our problem can be written as:

$$\min_{V^{(c)}} \; \frac{1}{2\sigma_c^2} \big\| X^{(c)} - U V^{(c)} \big\|_F^2 - \sum_{n,k} (\alpha_k - 1) \log v^{(c)}_{kn} \quad \text{s.t.} \quad \sum_k v^{(c)}_{kn} = 1.$$

If $\alpha_k \ge 1$ for all $k$, the above is a convex optimization problem; otherwise, it would be difficult to solve due to non-convexity. We always set $\alpha_k > 1$ in the following experiments. Note that the above problem is separable: we can divide it into subproblems by column and optimize each subproblem in a parallel manner. The subproblem for column $n$ is as follows:

$$\min_{v} \; \frac{1}{2\sigma_c^2} \big\| x^{(c)}_{\cdot n} - U v \big\|_2^2 - \sum_{k=1}^{K} \beta_k \log v_k \quad \text{s.t.} \quad \mathbf{1}^{\top} v = 1, \tag{32}$$

where, for convenience, we denote $\beta_k = \alpha_k - 1$. Subproblem (32) is very close to a quadratic programming (QP) problem with unit simplex constraints: the objective function is just a quadratic term coupled with a logarithmic barrier function. The difference is that, in constrained QP, the barrier coefficient is a smoothing parameter that enforces the iterates to stay in the interior of the feasible region and is usually small or gradually decreasing, while in our case $\beta_k$ is a hyperparameter related to the Dirichlet prior. If we are confident in our prior knowledge, $\beta_k$ is not necessarily small.

This subproblem can be solved by an interior-point algorithm [40]. Omitting constants, the Lagrange function is:

$$L(v, \nu) = \frac{1}{2\sigma_c^2} \big\| x^{(c)}_{\cdot n} - U v \big\|_2^2 - \sum_{k=1}^{K} \beta_k \log v_k + \nu \big( \mathbf{1}^{\top} v - 1 \big),$$

where $\nu \in \mathbb{R}$ is the Lagrange multiplier. The KKT conditions are as follows:

$$\frac{1}{\sigma_c^2} U^{\top} \big( U v - x^{(c)}_{\cdot n} \big) - w + \nu \mathbf{1} = 0, \qquad w_k v_k = \beta_k, \qquad \mathbf{1}^{\top} v = 1,$$

where $w_k = \beta_k / v_k$. For a given $\beta$, the KKT conditions form a system of non-linear equations (the nonlinearity is due to the product $w_k v_k$). There are $2K + 1$ variables $(v, w, \nu)$. Let us set

$$F(v, w, \nu) = \begin{pmatrix} \frac{1}{\sigma_c^2} U^{\top}(U v - x^{(c)}_{\cdot n}) - w + \nu \mathbf{1} \\ \big( w_k v_k - \beta_k \big)_{k=1}^{K} \\ \mathbf{1}^{\top} v - 1 \end{pmatrix} = 0.$$

The system can be solved by a Newton-like method. At each iteration, to choose a good direction $(\Delta v, \Delta w, \Delta \nu)$, we use the Jacobian matrix as a linear approximation. Then, the following linear equation should be solved:

$$J \begin{pmatrix} \Delta v \\ \Delta w \\ \Delta \nu \end{pmatrix} = -F(v, w, \nu), \tag{35}$$

where the Jacobian matrix has the special form:

$$J = \begin{pmatrix} \frac{1}{\sigma_c^2} U^{\top} U & -I & \mathbf{1} \\ \mathrm{diag}(w) & \mathrm{diag}(v) & 0 \\ \mathbf{1}^{\top} & 0 & 0 \end{pmatrix}.$$

We summarize the algorithm for updating $v^{(c)}_{\cdot n}$ in Subalgorithm 1.
Updating $\sigma_c^2$ and fixing others: Due to the conjugate prior, $\sigma_c^2$ can be easily updated by:

$$\sigma_c^2 = \frac{b + \frac{1}{2} \big\| X^{(c)} - U V^{(c)} \big\|_F^2}{a + \frac{M N_c}{2} + 1}, \tag{37}$$

which is the mode of the inverse-gamma posterior.
Stopping criterion: It is easy to see that each step of the algorithm solves a convex subproblem, so the value of the objective function is guaranteed to be monotonically non-increasing. We stop iterating when the relative change of the objective function is small enough:

$$\frac{| J_t - J_{t-1} |}{| J_{t-1} |} < \tau, \tag{38}$$

where $J_t$ is the value of the objective function at iteration $t$ and $\tau$ is the tolerance, a small positive value.

The MAP algorithm updates the latent variables in turn by the above update rules until stopping criterion (38) is satisfied. We summarize the MAP algorithm for the BJMD model in Algorithm 1.

Algorithm 1: Maximum a posteriori for the BJMD model
Input: data matrices $X^{(1)}, \dots, X^{(C)}$; initialized $U$, $V^{(c)}$, $\{\eta_{mk}\}$, $\{\sigma_c^2\}$; hyperparameters $\lambda$, $\alpha$, $a$, $b$
Output: $U$, $V^{(1)}, \dots, V^{(C)}$, $\sigma_1^2, \dots, \sigma_C^2$
repeat
    for $m = 1, \dots, M$ do                        ▷ can be in parallel
        update $u_{m \cdot}$ by (30)
    end for
    for $c = 1, \dots, C$ do
        for $n = 1, \dots, N_c$ do                  ▷ can be in parallel
            update $v^{(c)}_{\cdot n}$ by Subalgorithm 1
        end for
    end for
    update $\{\eta_{mk}\}$ by (26)
    update $\sigma_1^2, \dots, \sigma_C^2$ by (37)
until (38) is satisfied

Subalgorithm 1: Interior-point algorithm for $v^{(c)}_{\cdot n}$
Input: $x^{(c)}_{\cdot n}$, $U$, $\sigma_c^2$, $\beta$; initialized $v$, $w$, $\nu$ and other parameters
Output: the $n$th column $v$ of coefficient matrix $V^{(c)}$
repeat
    solve the system of equations (35)
    determine the search direction $(\Delta v, \Delta w, \Delta \nu)$
    choose the largest step length $s$ such that $v + s \Delta v > 0$ and $w + s \Delta w > 0$
    set $(v, w, \nu) \leftarrow (v, w, \nu) + s (\Delta v, \Delta w, \Delta \nu)$
until the change of objective function (32) is small

3.3.3 Computational Complexity

Here we briefly discuss the computational complexity of the proposed MAP algorithm for the BJMD model. At each iteration, the most computationally expensive step is inferring the $V^{(c)}$s, which involves solving a constrained QP problem for each column. To infer a column $v^{(c)}_{\cdot n}$, each iteration of Subalgorithm 1 needs to solve a linear system of size $2K + 1$, which is of $O(K^3)$. If the QP stops after $t_1$ iterations, the total cost of updating the $V^{(c)}$s is $O(t_1 N K^3)$, where $N = \sum_c N_c$. To infer each row of $U$, a $K \times K$ linear system should be solved, which leads to an $O(M K^3)$ cost in total. The computational costs of updating $\{\eta_{mk}\}$ and the variances $\{\sigma_c^2\}$ are $O(MK)$ and $O(MNK)$ respectively. Altogether, each iteration of MAP is of $O(t_1 N K^3 + M K^3 + MNK)$, and the total complexity of the algorithm is thus $O\big(t_2 (t_1 N K^3 + M K^3 + MNK)\big)$, where $t_2$ is the upper bound on the number of outer iterations. Generally, $K \ll \min(M, N)$, $t_1$ is empirically small, and the rows $u_{m \cdot}$ and columns $v^{(c)}_{\cdot n}$ can be optimized in parallel. Therefore, the proposed MAP algorithm is scalable for large-scale problems.

4 Experimental Results

Here we first evaluate JMD and BJMD with the two algorithms ADVI and MAP (denoted as VI-BJMD and MAP-BJMD respectively) using synthetic data. We demonstrate that both methods can estimate the variance of the noise accurately and that BJMD leads to superior performance compared to JMD. Then we apply our methods to three real-world datasets, including the 3-Source Text dataset, the Extended Yale Face Database B and the METABRIC Breast Cancer Expression dataset, and compare them with several state-of-the-art methods including $k$-means, Joint-NMF, Semi-NMF and BNMF. All experiments have been performed on a desktop computer with a 2GHz Intel Xeon E5-2683 v3 CPU, a GTX 1080 GPU card, 16GB memory and Ubuntu 16.04.

4.1 Evaluation Metric

To evaluate our methods in terms of clustering ability, we adopt an evaluation metric based on the area under the ROC curve (AUC). Specifically, $V^{(c)}$ is the coefficient matrix from source $c$, and $L^{(c)}$ is the corresponding true label indicator matrix of $X^{(c)}$. $L^{(c)}$ is binary, and $l^{(c)}_{kn} = 1$ means that sample $n$ from data source $c$ belongs to the $k$th cluster. Note that a sample belongs to one or more clusters, and thus $L^{(c)}$ has no zero column. We define the following metric:

$$\mathrm{AUC}^{(c)}_k = \mathrm{AUC}\big( l^{(c)}_{k \cdot},\; v^{(c)}_{k \cdot} \big),$$

where AUC takes $l^{(c)}_{k \cdot}$ as labels and $v^{(c)}_{k \cdot}$ as predictions. To evaluate the performance of a model in data source $c$, we calculate $\mathrm{AUC}^{(c)}_k$ from $k = 1$ to $K$ and take the average as follows:

$$\mathrm{AUC}^{(c)} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{AUC}^{(c)}_k.$$
4.2 Synthetic Experiments

In the synthetic experiments, we compare JMD and BJMD. Note that JMD is an extension of GLAD, and BJMD can be solved by the two algorithms ADVI and MAP.

4.2.1 Synthetic Data Sets

To this end, we generate synthetic data inspired by [41]. Specifically, we first generate the ground-truth basis matrix $U^*$ such that each of its first $K - 1$ columns has a fixed number of non-zero entries equal to a constant, with a prescribed length of coherence (overlap) between consecutive basis columns $u^*_{\cdot k}$ and $u^*_{\cdot k+1}$. Note that all entries of the $K$th column of $U^*$ equal zero. Next, we generate the entries of the coefficient matrix $V^{(c)}$ for source $c$: each $v^{(c)}_{kn}$, $k = 1, \dots, K-1$, follows a Bernoulli distribution. For the $n$th column of $V^{(c)}$, if all entries of $v^{(c)}_{\cdot n}$ equal zero, we set $v^{(c)}_{Kn} = 1$. Then we normalize $V^{(c)}$ such that each column sums to one. Finally, we generate an observed data matrix by:

$$X^{(c)} = U^* V^{(c)} + E^{(c)}, \qquad e^{(c)}_{mn} \sim \mathcal{N}(0, \sigma_c^2).$$
We generated synthetic data by the above procedure in both small and large scales.

Small-Scale Synthetic Datasets: The data were generated by the above procedure with three sources. $\sigma_1$ and $\sigma_2$ are 1.0 and 2.5 respectively, and $\sigma_3$ ranges from 1.5 to 5.5 in steps of 0.5. Consequently, we obtained nine datasets, each consisting of $X^{(1)}$, $X^{(2)}$, $X^{(3)}$.

Large-Scale Synthetic Datasets: Similarly, the data were generated by the same procedure with larger matrix sizes. $\sigma_1$ and $\sigma_2$ are 1.0 and 2.5 respectively, and $\sigma_3$ ranges from 1.5 to 5.5 in steps of 0.5. Consequently, we obtained nine datasets, each consisting of $X^{(1)}$, $X^{(2)}$, $X^{(3)}$.

In the following experiments, we applied JMD, VI-BJMD and MAP-BJMD to the synthetic data. We also concatenated $X^{(1)}$, $X^{(2)}$, $X^{(3)}$ by column to obtain a combined data matrix, and applied GLAD, VI-BJMD and MAP-BJMD to this concatenated matrix for comparison. We denote the results of the latter two algorithms on the concatenated matrix as VI-catBJMD and MAP-catBJMD respectively for clarity. The reported AUC value is the average of the best 5 of 20 runs (based on the values of the objective function).

4.2.2 Convergence Behavior of VI-BJMD

We first investigate the convergence behavior of the variational inference algorithm of BJMD by numerical simulation. We applied VI-BJMD to a small-scale synthetic dataset with =4.0. The ELBO becomes stable and begins to fluctuate within a small interval after around 20,000 iterations (Fig. 2a), while the relative change of the coefficient matrices stabilizes only after around 80,000 iterations (Fig. 2b). We also plot the iteration number versus the AUC performance of VI-BJMD in each source (Fig. 2c). The AUC of VI-BJMD in each source increases as the iteration progresses, and decreases correspondingly as the noise level increases. However, the AUC is still increasing when the ELBO becomes stable at about 20,000 iterations. Therefore, if we stopped the iteration immediately after the ELBO stabilizes, the model would have relatively poor performance in terms of AUC. The ELBO is a lower bound of the evidence, which the variational distribution approximates. Thus, it is possible that the variational distribution is still approaching the true posterior distribution while the ELBO stays stable. Since we use the coefficient matrix as predictions, the performance of VI-BJMD stabilizes only after the relative change of the coefficient matrices stabilizes. Therefore, our stopping criterion in (21) for the variational algorithm is a better choice than examining the change of the ELBO. Note that the relative change of the coefficient matrices will not approach zero because of the noisy gradients in optimization. Therefore, we stop the iteration when the tolerance is satisfied or the maximum number of iterations is exceeded. In the following experiments, we always set .
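The stopping rule can be sketched as follows. The Frobenius norm and the default tolerance and iteration budget are assumptions, since the paper's exact values in (21) were not recoverable from this copy.

```python
import numpy as np

def relative_change(H_new, H_old):
    """Relative change of the per-source coefficient matrices, in percent."""
    num = sum(np.linalg.norm(a - b) for a, b in zip(H_new, H_old))
    den = sum(np.linalg.norm(b) for b in H_old)
    return 100.0 * num / den

def converged(H_new, H_old, tol=1.0, step=0, max_iter=100000):
    # Stop when the relative change falls below the tolerance or the
    # iteration budget is exhausted; the change never reaches zero
    # because the stochastic gradients keep the iterates jittering.
    return relative_change(H_new, H_old) < tol or step >= max_iter
```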

Fig. 2: Illustration of the convergence behavior of VI-BJMD on a small-scale synthetic dataset with =4.0. (a) The ELBO versus the iteration number; (b) the relative change of the coefficient matrices in percentage; and (c) the AUCs versus the iteration number for each source.
Fig. 3: True noise histograms and estimated PDFs (shown as curves) by JMD (top), VI-BJMD (middle) and MAP-BJMD (bottom), respectively. The true standard deviations of source 1, source 2 and source 3 are 1.0, 2.5 and 4.0, respectively.

Next, since our model is built on the heterogeneity-of-noise assumption, it is necessary to verify whether the proposed models can accurately estimate the noise level of data from different sources. To this end, we applied JMD, VI-BJMD and MAP-BJMD to the small-scale synthetic dataset with =4.0. The estimated probability density curves are close to the true noise histograms, and thus all three methods estimate the variance of the Gaussian distribution accurately (Fig. 3).
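A sanity check of the same flavor can be done with a plug-in estimate of each source's noise standard deviation from the factorization residuals. This is a simplified illustration rather than the paper's exact update: `W` stands for the shared basis matrix and `H_list` for the per-source coefficient matrices.

```python
import numpy as np

def estimate_noise_std(X_list, W, H_list):
    """Plug-in estimate of each source's Gaussian noise std from the
    residuals X_g - W @ H_g (a simplification of the model's update)."""
    stds = []
    for X, H in zip(X_list, H_list):
        R = X - W @ H
        stds.append(np.sqrt((R ** 2).mean()))
    return stds
```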

4.2.3 Considering the Heterogeneity of Noise Leads to Superior Performance

In the following, we conduct experiments to demonstrate that considering the heterogeneity of noise in different data sources leads to superior performance. We evaluate our methods on the small- and large-scale synthetic datasets, both with =4.0. JMD and GLAD are omitted on the large-scale synthetic data due to their huge computational cost.

To facilitate comparison, we plot the objective function value (, note that the ELBO is always negative) and the log likelihood versus the iteration number, respectively (Fig. 4). It is natural that the methods considering the heterogeneity of noise achieve better objective function values, because they have extra parameters to model the noise of each source. Each iteration of variational inference calculates a noisy gradient by Monte Carlo sampling; this can be significantly accelerated by a GPU, but it takes thousands of iterations to satisfy the stopping criterion. In contrast, JMD and MAP-BJMD incur more computational cost per iteration but converge after a few iterations.
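The noisy Monte Carlo gradient mentioned above can be illustrated for a single Gaussian variational factor via the reparameterization trick. This is a generic ADVI-style sketch, not the paper's implementation; `logp_grad` stands for the gradient of the log joint density with respect to the latent variable.

```python
import numpy as np

def elbo_grad_sample(mu, log_sigma, logp_grad, n_samples=10, rng=None):
    """One noisy gradient of the ELBO w.r.t. (mu, log_sigma) for a Gaussian
    variational posterior q = N(mu, sigma^2), using the reparameterization
    z = mu + sigma * eps with eps ~ N(0, 1)."""
    rng = rng or np.random.default_rng()
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=n_samples)
    z = mu + sigma * eps
    g = logp_grad(z)                  # d log p(x, z) / dz at the samples
    grad_mu = g.mean()
    # the Gaussian entropy term contributes +1 to the log_sigma gradient
    grad_log_sigma = (g * eps * sigma).mean() + 1.0
    return grad_mu, grad_log_sigma
```

At the optimum of a standard-normal target, both gradient estimates fluctuate around zero, which is the jitter that keeps the relative change of the iterates from ever reaching zero.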

We can clearly see that the methods considering the heterogeneity of noise (JMD, VI-BJMD, MAP-BJMD) perform significantly better than the others (GLAD, VI-catBJMD, MAP-catBJMD) in terms of AUC (TABLE I). Moreover, the proposed methods MAP-BJMD and VI-BJMD outperform GLAD and JMD. Both JMD and GLAD cost considerable time even on such a small synthetic dataset (TABLE I); thus, it is impractical to apply them to large-scale real-world datasets. We utilized a GPU card to accelerate the variational inference. The synthetic data is so small that VI-BJMD cannot utilize all cores of the GPU card, whereas concatenating all the matrices significantly accelerates the iteration speed. In this situation, VI-catBJMD spends only around a third of the running time of VI-BJMD. If the data matrix of each source is large enough, there is no significant difference between VI-BJMD and VI-catBJMD in terms of iteration speed. MAP-BJMD and MAP-catBJMD are much faster than the other methods. Besides, MAP-BJMD achieves about the same performance as VI-BJMD.

Fig. 4: The objective function values versus iteration numbers for methods on a small-scale synthetic dataset with =4.0.
Fig. 5: The objective function values versus iteration numbers for methods on a large-scale synthetic dataset with =4.0.

The performance on the large-scale synthetic dataset is consistent with that on the small-scale one (Fig. 5 and TABLE II). VI-catBJMD exceeds the maximum number of iterations , while MAP-BJMD and MAP-catBJMD take fewer than ten steps to converge (Fig. 5). Note that MAP-BJMD and VI-BJMD outperform MAP-catBJMD and VI-catBJMD, respectively (TABLE II). Besides, VI-BJMD obtains better performance than MAP-BJMD. Since the coefficient matrices are relatively large, the iteration speed of VI-catBJMD is only slightly faster than that of VI-BJMD. Both VI-catBJMD and VI-BJMD take more than 1000 seconds, but they are still applicable to real-world data. The MAP algorithm of BJMD is much faster than the VI algorithm and is more scalable for large-scale data.

Algorithm   | AUC: Source 1 | AUC: Source 2 | AUC: Source 3 | Time (s)
GLAD        | 85.25         | 74.32         | 66.35         | 5068.10
JMD         | 90.23         | 80.96         | 72.06         | 5709.29
VI-catBJMD  | 86.99         | 78.51         | 72.79         | 84.44
VI-BJMD     | 99.66         | 90.87         | 79.16         | 230.21
MAP-catBJMD | 92.04         | 82.98         | 69.79         | 5.81
MAP-BJMD    | 99.67         | 90.31         | 77.85         | 4.20
TABLE I: Performance Comparison on a Small-Scale Synthetic Dataset with =4.0
Algorithm   | AUC: Source 1 | AUC: Source 2 | AUC: Source 3 | Time (s)
VI-catBJMD  | 89.46         | 84.26         | 77.97         | 1461.89
VI-BJMD     | 99.84         | 96.96         | 91.85         | 1154.06
MAP-catBJMD | 89.79         | 80.16         | 72.45         | 27.26
MAP-BJMD    | 98.20         | 94.29         | 89.09         | 43.16
TABLE II: Performance Comparison on a Large-Scale Synthetic Dataset with =4.0

4.2.4 The Effects of Heterogeneous Noise

To further investigate the effects of heterogeneous noise on model performance, we evaluate our methods on the small- and large-scale synthetic datasets, respectively. JMD and GLAD are omitted on the large-scale synthetic datasets. The results are shown in Fig. 6. We can clearly see that:

  • Considering the heterogeneity of noise leads to superior performance. For example, the AUCs of JMD are always higher than those of GLAD, and the same holds for VI-BJMD and MAP-BJMD over their concatenated counterparts.

  • The performance gap between the algorithms considering the heterogeneity of noise and those ignoring it widens as the noise level in Source 3 increases.

  • The AUCs of VI-BJMD and MAP-BJMD in Source 1 and Source 2 are not influenced by the increasing noise level in Source 3, which indicates that our proposed algorithms are more suitable for integrating data with different noise levels.

  • The AUCs of VI-catBJMD in Source 1 and Source 2 decrease rapidly once the noise level of Source 3 exceeds 2.5. The reason is that when the noise level of Source 3 is greater than 2.5, the noise variances of Source 1 and Source 2 are overestimated, and thus the gradients of the coefficient matrices of Source 1 and Source 2 are smaller than the true gradients. As a consequence, VI-catBJMD needs many more iterations.

  • The AUCs of VI-BJMD are slightly better than those of MAP-BJMD. VI-BJMD approximates the posterior distribution by a variational distribution, while MAP-BJMD approximates the posterior by the mode of its density function. As a result, VI should be more robust than MAP when the model is misspecified. One can also observe that the AUCs of VI-catBJMD are better than those of MAP-catBJMD when the noise level of Source 3 is less than 3.0.

The results on the large-scale synthetic datasets are consistent with those on the small-scale ones (Fig. 7). We can still observe that considering the heterogeneity of noise leads to superior performance. When the noise level of Source 3 exceeds 3.5, the AUC performance of VI-catBJMD decreases sharply. Compared with the results on the small-scale datasets, VI-BJMD outperforms MAP-BJMD significantly. The performance curves of VI-BJMD on Source 1 and Source 2 are very stable as the noise level of Source 3 increases, which implies that VI-BJMD successfully protects Source 1 and Source 2 from the influence of a highly noisy source, whereas the performance curves of MAP-BJMD fluctuate slightly. Moreover, the performance of VI-BJMD is consistently better than that of MAP-BJMD in all sources.

Fig. 6: Performance comparison of JMD, VI-BJMD and MAP-BJMD with their respective versions ignoring heterogeneous noise on the small-scale synthetic datasets. The standard deviation of the Gaussian noise is fixed in Source 1 and Source 2, while the noise level of Source 3 ranges from 1.5 to 5.5. The performance of the algorithms considering differences in noise level (JMD, VI-BJMD, MAP-BJMD) is plotted with dashed lines and that of the others with solid lines.
Fig. 7: Performance comparison of VI-BJMD and MAP-BJMD (as well as VI-catBJMD, MAP-catBJMD) on the large-scale synthetic datasets. The standard deviation of Gaussian noise is fixed in Source 1 and Source 2. The noise level of Source 3 ranges from 1.5 to 5.5.
Dataset                       | Dimensionality () | Size () | No. of groups | No. of classes
3Sources                      | 4750              | 948     | 3             | 6
Extended Yale Face Database B | 2016              | 1244    | 2             | 38
METABRIC                      | 19466             | 1911    | 3             | 5
TABLE III: Summary of the Three Data Sets

4.3 Real-World Experiments

In this subsection, we applied VI-BJMD and MAP-BJMD to three real-world datasets and compared their performance with k-means, joint-NMF, BNMF and semi-NMF. GLAD and JMD are omitted due to their huge computational cost. The three datasets and the data preprocessing are summarized below (TABLE III).

3Sources Dataset: The first dataset is the so-called 3Sources text data. It is collected from three well-known online news sources: BBC, Reuters and The Guardian, and contains 948 news articles covering 416 distinct news stories. Each story is annotated manually with one or more of six labels. BBC, Reuters and The Guardian are represented by term-document matrices of sizes , , , respectively. To conduct integrative clustering, we combined the terms of those data matrices; as a result, the data matrices are of sizes , , , respectively. Note that this dataset is very sparse: each article is represented by a 4750-dimensional vector, and only a few entries are non-zero.

Extended Yale Face Database B Dataset: The next dataset is the so-called Extended Yale Face Database B data [42, 43], which contains images of 38 individuals collected under different illumination conditions. For each image, the illumination condition is recorded by the azimuth and elevation angles of the light source. We used images whose azimuth and elevation angles are smaller than 60 degrees and resized them to . These images can be divided into two groups: center lighting, where the light source is near the center (azimuth angle ), with images in total; and side lighting, where the light comes from the side (azimuth angle ), with images in total. Therefore, the data matrices for center lighting and side lighting are of sizes and , respectively. We normalized the data matrices by dividing by their average for further analysis. Images collected under side lighting contain more shadow, and thus it is natural to assume that the noise levels of the two groups differ.

Algorithm | 3Sources (BBC / Guardian / Reuters) | Yale Face B (Center / Side) | METABRIC (Low / Moderate / High) | Filtered METABRIC (Low / Moderate / High)
VI-BJMD   | 0.3226 / 0.3390 / 0.2922            | 0.1186 / 0.1804             | - / - / -                        | 0.5987 / 0.5998 / 0.6348
MAP-BJMD  | 0.3210 / 0.3370 / 0.2909            | 0.1106 / 0.1790             | 0.3455 / 0.3454 / 0.3684         | 0.5987 / 0.5987 / 0.6346
TABLE IV: The Noise Level Estimated in Each Dataset
Algorithm | AUC (%): BBC | Guardian | Reuters | Time (s)
k-means   | 65.60 | 66.14 | 65.60 | 4.40
Joint-NMF | 80.95 | 80.61 | 80.94 | 0.65
Semi-NMF  | 76.42 | 73.70 | 76.42 | 41.99
BNMF      | 82.84 | 83.51 | 82.87 | 77.37
VI-BJMD   | 90.25 | 90.78 | 90.33 | 2274.80
MAP-BJMD  | 90.57 | 88.82 | 89.46 | 6.55
TABLE V: Performance Comparison of Different Methods on the 3Sources Dataset

Algorithm | AUC (%): Center | Side  | Time (s)
k-means   | 61.97 | 55.77 | 6.05
Joint-NMF | 93.69 | 83.37 | 3.91
Semi-NMF  | 96.81 | 91.98 | 53.79
BNMF      | 94.63 | 85.50 | 189.55
VI-BJMD   | 97.11 | 90.56 | 1444.34
MAP-BJMD  | 97.47 | 92.58 | 112.20
TABLE VI: Performance Comparison of Different Methods on the Extended Yale Face B Dataset

METABRIC Breast Cancer Dataset: The third dataset is the so-called METABRIC data [44, 45], which contains a large breast cancer patient cohort of around 2000 samples with detailed clinical measurements and genome-wide molecular profiles. Using gene expression to classify invasive breast cancers into biologically and clinically distinct subtypes has been studied for over a decade. In 2009, Parker et al. developed an efficient classifier called PAM50 to distinguish the intrinsic subtypes of breast cancer [46]. PAM50 is now commonly employed in breast cancer studies [47, 48, 49]. Thus, we used the PAM50 subtypes as the reference to evaluate the clustering results. The cellularity of a breast tumor is a clinical indicator of concern in treatment [50, 51, 52]. Therefore, we used cellularity to divide the samples into groups. We first mapped gene expression probes to gene names by average pooling. Then we removed samples for which the PAM50 subtype or cellularity was not available, leaving 19466 genes and 1911 samples after preprocessing. The samples are divided into three groups: low, moderate and high cellularity, of sizes , and , respectively. The number of features of this dataset is much larger than the number of samples.

We applied our algorithms to these three datasets. In all experiments, we set the Dirichlet prior and the Laplace prior . The tolerances of MAP-BJMD and VI-BJMD are and , respectively. We used k-means, joint-NMF, BNMF and semi-NMF for comparison. Among these four algorithms, BNMF is a Bayesian approach. For each experiment, the reported AUC is the average of the best 5 of 20 runs.

TABLE IV shows the estimated noise level on each dataset. TABLES V, VI and VII show the clustering performance on the 3Sources, Extended Yale Face Database B and METABRIC datasets, respectively; the AUCs and running times are reported. VI-BJMD takes too much time on the METABRIC dataset and thus is not reported there (TABLE VII).

These experiments reveal some important and interesting points:

  • Both MAP-BJMD and VI-BJMD can achieve similar or better performance than the competing methods over all datasets.

  • The algorithm of Semi-NMF involves computing a matrix inverse. As mentioned earlier, the 3Sources data is very sparse; therefore, Semi-NMF is not robust on this dataset. MAP-BJMD also involves a matrix inverse in updating the basis matrix, but the Laplace priors prevent the matrix from being singular. The Bayesian methods, including BNMF, VI-BJMD and MAP-BJMD, outperform the other methods on the 3Sources dataset, which suggests the superiority of Bayesian priors for very sparse data.

  • All methods except MAP-BJMD have relatively poor performance on the METABRIC data, which suggests that MAP-BJMD is more robust than the other methods. We note that the number of features is much larger than the number of samples in the METABRIC dataset. We believe that feature selection is needed to improve this situation.

  • Although VI-BJMD outperforms MAP-BJMD in the synthetic experiments, it is not guaranteed to outperform MAP-BJMD on real-world data. Real-world data is much more complicated, and VI-BJMD often exceeds the maximum number of iterations.

  • The Bayesian methods VI-BJMD and BNMF take more time than the other methods, while the speed of MAP-BJMD is comparable to theirs. Thus, MAP-BJMD is more scalable and efficient.

  • The estimated noise levels of VI-BJMD and MAP-BJMD are very consistent. Moreover, the estimated noise level can be used to guide feature selection as discussed in the next subsection.

4.4 Adaptive Feature Selection

In this subsection, we discuss how to use the estimated noise level to select features adaptively. In biological applications, features are often redundant: it is common to have thousands of features but only tens or hundreds of samples. Thus, feature selection is needed. Perhaps the most straightforward approach is to choose the features whose variance is greater than a threshold. However, it is difficult to specify the threshold value for unsupervised tasks. Fortunately, we can use the estimated noise level as the threshold to conduct adaptive feature selection.
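Such an adaptive filter can be sketched as follows. This is an illustration, not the paper's exact rule: it keeps a feature when its empirical variance exceeds the estimated noise variance in at least one source, and the "at least one source" criterion is our assumption.

```python
import numpy as np

def select_features(X_list, noise_std):
    """Keep a feature if its variance exceeds the estimated noise
    variance in at least one source (an assumed reading of the rule)."""
    keep = np.zeros(X_list[0].shape[0], dtype=bool)
    for X, s in zip(X_list, noise_std):
        keep |= X.var(axis=1) > s ** 2
    return [X[keep] for X in X_list], keep
```

Here `noise_std` would hold the per-source standard deviations estimated by MAP-BJMD, so the variance threshold adapts to each dataset rather than being hand-tuned.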

Algorithm | AUC (%): Low | Moderate | High | Time (s)
k-means   | 62.99 | 66.82 | 72.73 | 24.74
Joint-NMF | 65.81 | 71.85 | 75.07 | 45.58
Semi-NMF  | 58.90 | 55.02 | 53.48 | 311.24
BNMF      | 64.56 | 66.27 | 68.59 | 423.15
MAP-BJMD  | 82.22 | 79.05 | 81.45 | 27.17
TABLE VII: Performance Comparison of Different Methods on the METABRIC Dataset

Algorithm | AUC (%): Low | Moderate | High | Time (s)
k-means   | 67.01 | 70.16 | 78.85 | 5.53
Joint-NMF | 82.70 | 79.93 | 82.51 | 18.76
Semi-NMF  | 63.18 | 62.89 | 59.89 | 76.28
BNMF      | 82.71 | 80.32 | 85.43 | 101.96
VI-BJMD   | 84.93 | 81.81 | 85.16 | 3623.24
MAP-BJMD  | 84.34 | 79.55 | 84.03 | 20.70
TABLE VIII: Performance Comparison of Different Methods on the Filtered METABRIC Dataset

Specifically, as MAP-BJMD is much faster than VI-BJMD, we can use MAP-BJMD to estimate the variance of the Gaussian noise for each source. If the variance of a feature is less than the variance of the noise, it is natural to assum