Group Component Analysis for Multiblock Data: Common and Individual Feature Extraction

12/17/2012
by Guoxu Zhou, et al.

Very often, the data we encounter in practice are a collection of matrices rather than a single matrix. Such multi-block data are naturally linked and hence often share some common features while, at the same time, possessing their own individual features, owing to the background in which they are measured and collected. In this study we propose a new scheme of common and individual feature analysis (CIFA) that processes multi-block data in a linked way, aiming at discovering and separating their common and individual features. Depending on whether the number of common features is given or not, two efficient algorithms are proposed to extract the common basis shared by all data. Feature extraction is then performed on the common and the individual spaces separately, by incorporating techniques such as dimensionality reduction and blind source separation. We also discuss how the proposed CIFA can significantly improve the performance of classification and clustering tasks by exploiting the common and individual features of samples, respectively. Experimental results on synthetic and real data show some encouraging properties of the proposed methods in comparison with state-of-the-art methods.


1 Introduction and Motivation

Massive high-dimensional data are increasingly prevalent in many areas of science. A variety of data analysis tools have been proposed for different purposes, such as data representation, interpretation, information retrieval, etc. Recently, multi-block data analysis has attracted increasing attention [1, 2, 3, 4]. Multi-block data are encountered when multiple measurements are taken from a set of experiments on the same subject using various techniques, or on multiple subjects under similar configurations. For example, in biomedical studies, human electrophysiological signals responding to some pre-designed stimuli are collected from different individuals and trials, and a number of different technologies and devices may be used to collect diverse information from different aspects. All of this results in naturally linked multi-block data. These data should share some common information due to the background in which they are collected, while at the same time possessing their own individual features. It is consequently much more meaningful to analyze the data in a connected and linked way rather than separately. This study is devoted to this interesting and promising topic.

Several methods have already been developed for multi-block data analysis. For example, canonical correlation analysis (CCA) was proposed to maximize the correlations between the random variables in two data sets [5, 6, 7]. Later, CCA was generalized to analyze multiple data sets and applied to joint blind source separation and feature extraction [8, 1, 9, 10]. In contrast to CCA, Partial Least Squares (PLS) maximizes the covariance rather than the correlations [11, 12, 13]. To analyze image populations, a framework named Population Value Decomposition (PVD) was proposed for data sets that have exactly the same size [3]. It turns out that PVD can actually be studied in the more general framework of tensor (Tucker) decompositions, which is another hot topic for high-dimensional data analysis and exploration in recent years; see [14, 15] and references therein. A method named Joint and Individual Variation Explained (JIVE) was proposed for the integrated analysis of multiple data types [2], together with a new algorithm that extracts their joint and individual components simultaneously. To the best of our knowledge, however, the potential of these methods as common and individual feature analysis tools has not been fully exploited.

In this study a general framework of Common and Individual Feature Analysis (CIFA) is proposed for multi-block data analysis. Compared with existing works, our main contributions include:

  1. New efficient algorithms are proposed to extract a common orthogonal basis from multi-block data, depending on whether the number of common components is given or not.

  2. A detailed analysis of the relationship between the proposed methods and related methods such as CCA and principal component analysis (PCA) is provided. Our results show that common feature extraction can be interpreted as high-correlation analysis, and that it performs PCA on the common space shared by all data rather than on the whole data, as ordinary PCA does.

  3. Within the proposed framework, various well-established data analysis methods developed for a single matrix, e.g., dimensionality reduction [16, 17], Blind Source Separation (BSS) [18], and Nonnegative Matrix Factorization (NMF) [15], can easily be applied to the common and individual spaces separately in order to extract components with desired features and properties, which provides a flexible and versatile facility for multi-block data analysis tasks.

  4. Two important applications of CIFA, i.e., classification and clustering, are discussed, illustrating how the extracted common and individual features are able to improve the performance of data analysis.

The rest of the paper is organized as follows. In Section 2 common orthogonal basis extraction (COBE) is discussed, including the problem statement, the model, the algorithms, and their relationship with CCA, PCA, and other related methods. In Section 3 the general framework of common and individual feature analysis (CIFA) is presented. The applications of CIFA in classification and clustering are discussed in Section 4. In Section 5, simulations on synthetic and real data justify the efficiency and validity of the proposed methods. Finally, Section 6 provides some concluding remarks and suggestions for future work.

2 Common Orthogonal Basis Extraction

2.1 Problem Formulation

Given a set of matrices $\mathbf{Y}_n \in \mathbb{R}^{I \times J_n}$, $n = 1, 2, \ldots, N$, consider the following matrix factorization problem for each matrix $\mathbf{Y}_n$:

$\mathbf{Y}_n = \mathbf{A}_n \mathbf{B}_n, \qquad n = 1, 2, \ldots, N,$    (1)

where the columns of $\mathbf{A}_n \in \mathbb{R}^{I \times R_n}$ consist of the latent variables in $\mathbf{Y}_n$ (sources, basis, etc.) and $\mathbf{B}_n \in \mathbb{R}^{R_n \times J_n}$ denotes the corresponding coefficient matrix (mixing, encoding, etc.). $R_n$ is the number of latent components, with $R_n < I$, which generally corresponds to a compact/compressed representation of $\mathbf{Y}_n$. The necessity and justification of this assumption will be discussed in Section 2.4.

So far, a very wide variety of matrix factorization techniques has been proposed for (1), such as PCA, independent component analysis (ICA) [19, 20], BSS [18], etc. In these methods the matrices $\mathbf{Y}_n$ are treated independently and separately. Here we consider the case where the data are naturally linked and share some common components such that

$\mathbf{A}_n = [\bar{\mathbf{A}}_C \ \ \tilde{\mathbf{A}}_n],$    (2)

where $\bar{\mathbf{A}}_C \in \mathbb{R}^{I \times C}$, $\tilde{\mathbf{A}}_n \in \mathbb{R}^{I \times (R_n - C)}$, and $0 \le C \le \min_n R_n$. In (2), $\bar{\mathbf{A}}_C$ contains the common components shared by all matrices in $\{\mathbf{Y}_n\}$ while $\tilde{\mathbf{A}}_n$ contains the individual information present only in $\mathbf{Y}_n$. In this way, the matrices in $\{\mathbf{Y}_n\}$ are factorized in a linked way such that

$\mathbf{Y}_n = \bar{\mathbf{A}}_C \bar{\mathbf{B}}_n + \tilde{\mathbf{A}}_n \tilde{\mathbf{B}}_n, \qquad n = 1, 2, \ldots, N,$    (3)

where $\bar{\mathbf{B}}_n$ and $\tilde{\mathbf{B}}_n$ form the compatible partition of $\mathbf{B}_n$. In other words, each matrix $\mathbf{Y}_n$ is represented by two parts: the common space $\bar{\mathbf{A}}_C \bar{\mathbf{B}}_n$ and the individual space $\tilde{\mathbf{A}}_n \tilde{\mathbf{B}}_n$, which are spanned by the common components (i.e., the columns of $\bar{\mathbf{A}}_C$) existing in all $\mathbf{Y}_n$ ($n = 1, 2, \ldots, N$) and by the individual components present only in $\mathbf{Y}_n$, respectively. Our problem is to seek $\bar{\mathbf{A}}_C$ and $\tilde{\mathbf{A}}_n$ from a given set of matrices $\mathbf{Y}_n$, $n = 1, 2, \ldots, N$, without knowledge of $\mathbf{A}_n$, $\mathbf{B}_n$, and possibly the number $C$ of common components. Note that two special cases of (3) have been extensively studied in the past decades:

  • $C = 0$. No common components exist in $\{\mathbf{Y}_n\}$ and the problem is simply equivalent to factorizing each matrix $\mathbf{Y}_n$ separately.

  • $C = R_n$ for all $n$. The problem is equivalent to ordinary matrix factorization of a large matrix created by stacking all matrices $\mathbf{Y}_n$. This will be further detailed at the end of this sub-section.

Note that the solution is not unique, since $(\bar{\mathbf{A}}_C \mathbf{Z}, \mathbf{Z}^{-1} \bar{\mathbf{B}}_n)$ is also a solution to (3) for an arbitrary invertible matrix $\mathbf{Z}$ of proper size. To shrink the solution space and simplify the computation, we let $\bar{\mathbf{A}}_C = \mathbf{Q}\mathbf{R}$ be the QR-decomposition of $\bar{\mathbf{A}}_C$ such that $\mathbf{Q}^T\mathbf{Q} = \mathbf{I}$ (the matrix $\mathbf{I}$ denotes the identity matrix of proper size; where the size should be specified explicitly we use $\mathbf{I}_C$ to denote the $C$-by-$C$ identity matrix). Substituting into (3) we have

$\mathbf{Y}_n = \mathbf{Q}(\mathbf{R}\bar{\mathbf{B}}_n) + \tilde{\mathbf{A}}_n \tilde{\mathbf{B}}_n.$    (4)

Comparing (3) and (4), we can hereafter assume that $\bar{\mathbf{A}}_C^T \bar{\mathbf{A}}_C = \mathbf{I}$ in (3), without any loss of generality.

Taking our purpose into consideration, we further assume that $\bar{\mathbf{A}}_C^T \tilde{\mathbf{A}}_n = \mathbf{0}$, where $\mathbf{0}$ is the zero matrix of proper size. This assumption means that there is no interaction between the common features and the individual features, and it does not cause any additional factorization error. To see this, suppose that $\bar{\mathbf{A}}_C^T \tilde{\mathbf{A}}_n \ne \mathbf{0}$ for some $n$. From (3) and the fact that

$\tilde{\mathbf{A}}_n = \bar{\mathbf{A}}_C \bar{\mathbf{A}}_C^T \tilde{\mathbf{A}}_n + (\mathbf{I} - \bar{\mathbf{A}}_C \bar{\mathbf{A}}_C^T) \tilde{\mathbf{A}}_n,$    (5)

we have

$\mathbf{Y}_n = \bar{\mathbf{A}}_C (\bar{\mathbf{B}}_n + \bar{\mathbf{A}}_C^T \tilde{\mathbf{A}}_n \tilde{\mathbf{B}}_n) + (\mathbf{I} - \bar{\mathbf{A}}_C \bar{\mathbf{A}}_C^T) \tilde{\mathbf{A}}_n \tilde{\mathbf{B}}_n.$    (6)

Comparing (6) with (3) and defining $\hat{\bar{\mathbf{B}}}_n = \bar{\mathbf{B}}_n + \bar{\mathbf{A}}_C^T \tilde{\mathbf{A}}_n \tilde{\mathbf{B}}_n$ and $\hat{\tilde{\mathbf{A}}}_n = (\mathbf{I} - \bar{\mathbf{A}}_C \bar{\mathbf{A}}_C^T) \tilde{\mathbf{A}}_n$, we immediately have $\bar{\mathbf{A}}_C^T \hat{\tilde{\mathbf{A}}}_n = \mathbf{0}$. As a result, it is reasonable to assume that $\bar{\mathbf{A}}_C^T \tilde{\mathbf{A}}_n = \mathbf{0}$.

Furthermore, we consider the truncated singular value decomposition (SVD) of

, where , , and is invertible. Then . Define and . We have and .

Based on the above analysis, the general problem we consider can be formally formulated as

$\min_{\bar{\mathbf{A}}_C, \{\bar{\mathbf{B}}_n, \tilde{\mathbf{A}}_n, \tilde{\mathbf{B}}_n\}} \ \sum_{n=1}^{N} \bigl\| \mathbf{Y}_n - \bar{\mathbf{A}}_C \bar{\mathbf{B}}_n - \tilde{\mathbf{A}}_n \tilde{\mathbf{B}}_n \bigr\|_F^2 \quad \text{s.t.} \quad \bar{\mathbf{A}}_C^T \bar{\mathbf{A}}_C = \mathbf{I}, \ \ \bar{\mathbf{A}}_C^T \tilde{\mathbf{A}}_n = \mathbf{0}.$    (7)

In (7) we have implicitly assumed that the number of common components $C$ is known. How to estimate $C$ in practice and how to solve (7) will be discussed in Sections 2.2 and 2.3. Compared with (1), the procedure of (2)-(6) does not cause any additional decomposition error. Hence the rank restriction on the individual parts implicitly guarantees that we are seeking common components: once $\bar{\mathbf{A}}_C$ contains information other than common components, the total decomposition error will increase under this rank restriction.

It is worth noticing that once $C = R_n$ for all $n$, the problem reduces to ordinary PCA or, equivalently, to low-rank approximation of matrices. In this case $\bar{\mathbf{A}}_C$ can be found by solving

$\min_{\bar{\mathbf{A}}_C, \{\bar{\mathbf{B}}_n\}} \ \sum_{n=1}^{N} \bigl\| \mathbf{Y}_n - \bar{\mathbf{A}}_C \bar{\mathbf{B}}_n \bigr\|_F^2 \quad \text{s.t.} \quad \bar{\mathbf{A}}_C^T \bar{\mathbf{A}}_C = \mathbf{I}.$    (8)

Let $\mathbf{Y} = [\mathbf{Y}_1 \ \mathbf{Y}_2 \ \cdots \ \mathbf{Y}_N]$ be the matrix formed by stacking all matrices $\mathbf{Y}_n$ horizontally, and similarly let $\bar{\mathbf{B}} = [\bar{\mathbf{B}}_1 \ \bar{\mathbf{B}}_2 \ \cdots \ \bar{\mathbf{B}}_N]$. Then (8) can be viewed as a partitioned version of PCA:

$\min_{\bar{\mathbf{A}}_C, \bar{\mathbf{B}}} \ \bigl\| \mathbf{Y} - \bar{\mathbf{A}}_C \bar{\mathbf{B}} \bigr\|_F^2 \quad \text{s.t.} \quad \bar{\mathbf{A}}_C^T \bar{\mathbf{A}}_C = \mathbf{I}.$    (9)

If $\mathbf{Y}$ is too large to fit into physical memory, we may solve (8) instead of (9) in practice.
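To make this remark concrete, the following minimal sketch (Python with NumPy; the function name and interface are illustrative assumptions, not from the paper) computes the leading directions of the horizontal concatenation by accumulating the Gram matrix block by block, so that $\mathbf{Y}$ is never formed in memory:

```python
import numpy as np

def pca_of_concatenation_blockwise(Y_list, C):
    """Top-C principal directions of [Y_1, ..., Y_N] without stacking:
    they are the leading eigenvectors of G = sum_n Y_n Y_n^T, which can be
    accumulated one block at a time (i.e., solving (8) rather than (9))."""
    I = Y_list[0].shape[0]
    G = np.zeros((I, I))
    for Y in Y_list:                 # stream over the blocks
        G += Y @ Y.T
    vals, vecs = np.linalg.eigh(G)   # eigenvalues in ascending order
    return vecs[:, ::-1][:, :C]      # columns = top-C directions
```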

When $C < R_n$, model (7) is distinguished from (8) by the involved individual parts and the rank restriction discussed above. In this sense, (7) can also be interpreted as extracting the principal components of the common space of $\{\mathbf{Y}_n\}$, i.e., of the residuals obtained after removing the individual components. Unfortunately, as the individual parts are also unknown and can have very large variance (energy), we cannot solve (7) by running standard PCA on $\{\mathbf{Y}_n\}$.

We use two steps to solve (7): in step 1, each matrix $\mathbf{Y}_n$ is replaced by its optimal rank-$R_n$ approximation, obtained by solving (1) separately for each $n$. To distinguish them, we call the original matrices the raw data and their reduced versions the cleaned data. In step 2, (7) is solved using the cleaned data. As (2)-(6) show that no additional error arises from the separation of the individual and common spaces, theoretically we have

(10)

In Sections 2.2 and 2.3 we will focus on the second step.
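As a minimal sketch of step 1, each raw block can be replaced by its best rank-$R_n$ approximation via a truncated SVD; the function name and the list of per-block ranks below are assumptions made purely for illustration:

```python
import numpy as np

def clean_blocks(Y_list, ranks):
    """Step 1 (cleaning): replace each raw block Y_n by its optimal
    rank-R_n approximation, computed with a truncated SVD."""
    cleaned = []
    for Y, r in zip(Y_list, ranks):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        cleaned.append(U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :])
    return cleaned
```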

2.2 The COBE Algorithm: the Number of Common Components is Unknown

From (7) (or (10)), once $\bar{\mathbf{A}}_C$ has been estimated, $\bar{\mathbf{B}}_n$ can be computed from (note: if (10) is exact, Equation (11) is also exact; otherwise (11) is interpreted as the corresponding least-squares solution, and similarly for equations (12) and (13))

(11)

After that, the individual parts can be computed via a truncated singular value decomposition (tSVD) of the residual matrices $\mathbf{Y}_n - \bar{\mathbf{A}}_C\bar{\mathbf{B}}_n$, $n = 1, 2, \ldots, N$. In other words, estimating $\bar{\mathbf{A}}_C$ plays the central role in solving (7). In this section we focus on how to estimate $\bar{\mathbf{A}}_C$ efficiently.

For any and , the optimal and in (7) satisfy that

(12)

where (as in Eq.(2)), , and denotes the Moore-Penrose pseudo inverse of a matrix. Let such that (For each matrix this only needs to be computed once by using, e.g., QR decomposition or truncated SVD of ). Then we define , and (12) is equivalent to

(13)

and hence for any , , there holds that

(14)

where and are the th column of and , respectively.

According to (14), the first column of , i.e., , can be obtained by solving the following optimization model:

(15)

We use alternating least-squares (ALS) iterations to solve (15). Fixing one group of variables first, the optimal value of the other is given by

(16)

and then is normalized to have unit norm. Then is fixed and we get

(17)

We run (16) and (17) alternately until convergence. If the resulting residual falls below a very small threshold, a common column is found. Otherwise, no common basis exists in the data, and we terminate the procedure.

Now suppose that several common basis vectors have been found and we seek the next one. To avoid extracting repeated common basis vectors, we consider a useful property of the solution. Let the matrix consisting of the common basis vectors found so far be given. From (13) we have

(18)

which means that , i.e., is in the null space of . Hence we update as

(19)

where . Then can be found by solving

(20)

Repeating the procedure used for (15) (note: the updated matrix is no longer orthogonal; however, it can be verified that the corresponding matrix is still the Moore-Penrose pseudoinverse, thereby leading to the least-squares solution), the minimum of the cost can be obtained. Again, there are two cases:

  1. The minimum falls below the threshold. In this case a new common basis vector is found. Update using (19) and then solve (20) to seek the next common basis vector.

  2. Otherwise, no further common basis vector exists and the common orthogonal basis vectors found so far are returned.

In this way, an orthogonal basis of the common space can be found sequentially. This procedure is called common orthogonal basis extraction (COBE) and is presented as Algorithm 1.

0:  , , .
1:  Let = such that for all .
2:  , , and .
3:  while  do
4:      if , .
5:     while not converged do
6:        ;
7:        , ;
8:     end while
9:     ;
10:     ;
11:     ;
12:  end while
13:  return  , where .
Algorithm 1 The COBE Algorithm
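The following Python sketch illustrates one way the sequential extract-and-deflate procedure described above could be implemented. It is a hedged reconstruction under stated assumptions (per-block orthonormal bases Q_n, an alternating update for each candidate direction, and acceptance whenever the average residual stays below a threshold eps), not the authors' exact listing; all names are illustrative.

```python
import numpy as np

def cobe_unknown(Y_list, eps=1e-3, max_common=None, n_iter=200, seed=0):
    """Sequential common-basis extraction when the number of common
    components is unknown (a sketch of the COBE idea)."""
    rng = np.random.default_rng(seed)
    # Orthonormal column basis of each (cleaned) data block, e.g. thin QR.
    Q = [np.linalg.qr(Y)[0] for Y in Y_list]
    I, N = Q[0].shape[0], len(Q)
    W = []                                    # accepted common basis vectors
    max_common = max_common or min(q.shape[1] for q in Q)

    while len(W) < max_common:
        w = rng.standard_normal(I)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            # Best approximation of w inside each block's column space.
            Z = [q @ (q.T @ w) for q in Q]
            # New candidate: average of the per-block fits ...
            w_new = np.sum(Z, axis=0)
            # ... projected onto the orthogonal complement of found vectors.
            for u in W:
                w_new -= (u @ w_new) * u
            nrm = np.linalg.norm(w_new)
            if nrm < 1e-12:
                break
            w_new /= nrm
            if np.linalg.norm(w_new - w) < 1e-8:
                w = w_new
                break
            w = w_new
        # Residual: how far w is from lying in every block's column space.
        residual = sum(np.linalg.norm(w - q @ (q.T @ w))**2 for q in Q) / N
        if residual > eps:
            break                             # no further common direction
        W.append(w)
    return np.array(W).T if W else np.zeros((I, 0))   # I x C common basis
```

The acceptance test mirrors the role of the threshold discussed below: a direction is kept only if every block can represent it almost exactly.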

The parameter controls how identical the extracted components are: if it is set to zero, the extracted components are exactly the same across all data sets; otherwise, approximately identical (or equivalently, highly correlated) components will be extracted (see Section 2.5 for a detailed discussion). We can adopt the SORTE method proposed in [21] to select this parameter automatically. Basically, SORTE detects the gap between the eigenvalues corresponding to the signal space and those corresponding to the noise space; here we can similarly detect the gap between the common space and the individual space. We will illustrate this in the simulations.

2.3 The COBE Algorithm With Specified Number of Common Components

We briefly discuss the case where is given. Following the analysis in section 2.1, we solve the following model:

(21)

Again, we optimize with respect to the two groups of variables alternately. When the common basis is fixed, the optimal coefficient matrices are computed from

(22)

And when , , are fixed, (21) is equivalent to

(23)

where denotes the trace of a matrix and

(24)

Let the truncated SVD of the corresponding matrix be given, with a diagonal matrix of singular values. Motivated by the work in [22] (page 601) (note: the main difference is that here the matrix is not necessarily square), we show that the optimal solution of (23) is

(25)

In fact,

As and , we have , which means that . Clearly, when , there holds that , and reaches its upper bound .

The pseudo-code is presented in Algorithm 2.

0:   and , .
1:  Let = such that for all .
2:  Initialize randomly.
3:  while not converged do
4:     .
5:     =, where .
6:     .
7:  end while
8:  return  .
Algorithm 2 The COBE Algorithm
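A minimal sketch of the fixed-number variant follows, based only on the textual description above: for a fixed basis the coefficients are updated in closed form, and for fixed coefficients the basis is updated by an orthogonal-Procrustes (truncated SVD) step. Function and variable names are illustrative, not the paper's.

```python
import numpy as np

def cobe_known(Y_list, C, n_iter=100, seed=0):
    """Common-basis extraction with a given number C of components
    (a sketch of the alternating LS / Procrustes scheme described above)."""
    rng = np.random.default_rng(seed)
    Q = [np.linalg.qr(Y)[0] for Y in Y_list]        # orthonormal basis per block
    I = Q[0].shape[0]
    W, _ = np.linalg.qr(rng.standard_normal((I, C)))  # random init, W^T W = I
    for _ in range(n_iter):
        # For fixed W, the best representation of W inside each block is
        # Q_n Z_n with Z_n = Q_n^T W (ordinary least squares).
        Z = [q.T @ W for q in Q]
        # For fixed Z_n, minimizing sum_n ||Q_n Z_n - W||_F^2 over W^T W = I
        # is a trace-maximization problem solved via the SVD of the sum.
        J = sum(q @ z for q, z in zip(Q, Z))        # I x C
        U, _, Vt = np.linalg.svd(J, full_matrices=False)
        W = U @ Vt                                  # Procrustes-type update
    return W                                        # I x C common basis
```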

2.4 Pre-processing: Dimensionality Reduction

Like CCA, COBE loses its meaning if $R_n = I$ for all $n$, because in this case, for any invertible matrix there always exist coefficient matrices that represent each $\mathbf{Y}_n$ exactly, i.e., any invertible matrix forms a common basis. Hence $R_n < I$ is required for all $n$ in model (7). Fortunately, this requirement is not as restrictive as it may seem: in practice, although the observed data can be of very high dimensionality, the latent rank is often significantly lower than the dimensionality of the observations [23]. Even if this condition does not hold, we can perform dimensionality reduction, such as PCA, on the raw data by solving (1) before running COBE, as stated at the end of Section 2.1. Through this dimensionality reduction step, only the principal components are retained and the subsequent computational complexity is significantly reduced. Another strong reason for performing dimensionality reduction is to reduce noise; indeed, the significance of dimensionality reduction has been extensively justified in the literature. In this sense, if (1) is interpreted as the PCA of each matrix $\mathbf{Y}_n$, COBE simply rotates/transforms the columns of the resulting basis such that the common basis and the individual basis are completely separated.

One of the most widely used dimensionality reduction methods is principal component analysis (PCA), which is based on the assumption that the noise is drawn from independent identical Gaussian distributions. If the noise is instead very sparse, we may consider robust PCA (RPCA) [24]. Moreover, we may use SORTE [21] or related techniques to estimate the number of latent components and then use PCA to perform the dimensionality reduction.

2.5 Relation With Other Methods

COBE has a very close relation with canonical correlation analysis (CCA). For two given sets of data, CCA seeks projection vectors such that the correlation between the projected variables is maximized. In COBE, however, only the components with correlation higher than a specified threshold are extracted. Let the data be row-centered (i.e., zero-mean) random variables. We have

Proposition 1

Suppose that with , . Then

(26)
Proof:

From and , we have

(27)

Moreover, , there holds that

(28)

Hence,

(29)

From (27) and (29), we have

(30)

This ends the proof.

From the proposition, once the residuals in (15) and (20) are upper bounded, the correlations between the projected variables are consequently lower bounded; in particular, the correlation tends to 1 as the residual tends to 0. This shows that COBE can actually be interpreted as a high-correlation analysis (HCA), which differs from canonical correlation analysis (CCA) for multiple data sets.

Fig. 1 illustrates the relationship between COBE and CCA. Given two matrices, let the first column of each latent source matrix be the same sine wave, while the entries of the other components were drawn from independent standard normal distributions. Each source matrix was mixed via a different mixing matrix whose entries were also drawn from independent standard normal distributions. The red line in Fig. 1(a) shows the common component extracted by COBE, and the corresponding components projected onto each data set match the projected components obtained by CCA very well. In this sense, COBE realizes CCA from another perspective. However, in COBE only the components with very high correlations will be extracted, as stated in Proposition 1. From the figure, the extracted common component can be interpreted as the principal component of the highly correlated projections, information that is not provided by CCA. Note also that in the proposed method the common components (highly correlated features) form an orthonormal set, which makes COBE resemble a regularized CCA [25, 9, 26]. Finally, due to its close relationship with CCA, COBE inherits most of CCA's connections with, and differences from, other related methods such as PLS [27], alternating conditional expectation (ACE) [28], etc.

Fig. 1: Illustration of the relation between COBE and CCA. Generally, COBE focuses on highly correlated components and returns the principal components of them.

From (7) and (8), we know that COBE also has a close relation with PCA. Fig. 2 illustrates the difference between COBE and PCA using the above data (the principal component is computed from the concatenated matrix defined in (9)). Basically, COBE seeks the principal components of the common space (spanned by common or very similar basis vectors) of all data, whereas PCA seeks the principal components of all data, which makes COBE quite useful for finding highly relevant and related information from a large number of sets of signals. Moreover, as (7) can be interpreted as the PCA of the common space of all data sets, or the PCA of the individual space of each single data set, we may optimize (7) by using a series of alternating truncated SVDs (PCA), which is the approach adopted by the JIVE method [2]. This approach involves frequent SVDs of huge matrices formed from all data when computing the common space and hence is quite time consuming. Compared with JIVE, the COBEc method is more efficient in optimization and more intuitive and flexible in the estimation of the number of common components.

Fig. 2: Illustration of the relation between COBE and PCA. The COBE method finds the principal components of the highly correlated columns whereas PCA finds the principal components of all columns.

2.6 Scalability For Large-Scale Problems

In multi-block data analysis it is common that the data we encounter are huge. Here, huge means that both the number of rows $I$ and the numbers of columns $J_n$ ($n = 1, 2, \ldots, N$) are quite large. Note that in (15), (20), and (21) we actually use the dimensionality-reduced matrices with $R_n \ll J_n$; hence the value of $J_n$ is generally not a big issue. In the case where $I$ is extremely large, we consider the following way to significantly reduce the time and memory consumption of COBE. Let $\mathbf{P}$ be a random matrix with far fewer rows than $I$. From (12) we may solve the following model first:

(31)

where the number of rows of $\mathbf{P}$ is much smaller than $I$. After the common coefficients have been estimated by applying COBE or COBEc to the compressed data, the corresponding common features can be computed from the original data. Obviously, as long as the compression is of sufficient rank, this approach will not lose any common features. In the worst case, however, (31) may give fake common features when an individual component occasionally lies in the null space of $\mathbf{P}$. Fortunately, this rarely happens in practice, and such fake common features can be easily detected by examining the value of the residual.
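A minimal sketch of this row-compression idea follows, under the assumption that one shared random matrix left-multiplies every block; the names P and m are illustrative, not taken from the paper.

```python
import numpy as np

def compress_blocks(Y_list, m, seed=0):
    """Row compression for very large I: left-multiply every block by one
    shared m x I random matrix (m << I) and run COBE on the results."""
    rng = np.random.default_rng(seed)
    I = Y_list[0].shape[0]
    P = rng.standard_normal((m, I)) / np.sqrt(m)   # shared random projection
    return [P @ Y for Y in Y_list], P
```

Common coefficients estimated on the compressed blocks can then be mapped back by a least-squares fit against the original (uncompressed) data, as described above.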

3 Common And Individual Feature Analysis

3.1 Linked BSS with Pre-whitening

So far we have only imposed orthogonality on the common components. In this case the common components are not unique, since rotating the columns of the common basis by any orthogonal matrix yields another common orthogonal basis. Sometimes we want to project the common components onto a feature space with some desired property or uniqueness. This can typically be done by, for example, blind source separation (BSS) [18]. BSS is the problem of finding latent variables from their linear mixtures such that

(32)

where the operator denotes a BSS algorithm and the unknown matrix is the mixing matrix; the remaining permutation matrix and diagonal matrix denote the unavoidable ambiguities of BSS. In other words, by using BSS methods the sources can be exactly recovered from their mixtures, with only a scale and permutation ambiguity remaining and without any knowledge of the mixing matrix. Hence BSS is quite attractive and has served as a feature extraction tool in a wide range of applications, such as pattern recognition, classification, etc. If we assume that the latent features (sources) satisfy

(33)

where is defined in (3). From we have

(34)

Consequently, the columns of are just the linear mixtures of and hence can be estimated via

(35)

by using a proper BSS algorithm. In this case the input is actually a pre-whitened version of (33), from (34) and the fact that the common basis is orthonormal. By using BSS, we may obtain common features with desired properties such as sparsity, independence, temporal correlation, or nonnegativity by imposing proper penalties, or even nonlinear common features by using kernel tricks [20]. We call the above BSS procedure linked BSS because we perform BSS on multi-block linked data. Note that the JBSS method in [1] also performs BSS involving multi-block data. It extracts a group of signals with the highest correlations each time and requires that the extracted groups have distinct correlations. In other words, the JBSS method is actually a way to realize BSS by applying multiple-set CCA. In contrast, the linked BSS method extracts the common basis first and then applies ordinary BSS to it to discover common components with some desired property and diversity.
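As a small illustration of the linked-BSS step, the estimated common basis can be handed to any off-the-shelf BSS routine to resolve the rotational ambiguity. The sketch below uses FastICA from scikit-learn purely as an example of such a routine; the function name and arguments are illustrative, and a recent scikit-learn version is assumed.

```python
from sklearn.decomposition import FastICA   # stand-in for any BSS algorithm

def linked_bss(A_common):
    """Resolve the rotational ambiguity of the orthonormal common basis
    A_common (I x C) by treating its columns as linear mixtures and letting
    an ordinary BSS/ICA routine recover components with a desired diversity
    (here: statistical independence)."""
    C = A_common.shape[1]
    ica = FastICA(n_components=C, whiten="unit-variance", random_state=0)
    # Rows of A_common act as "observations"; columns are the mixed channels.
    common_sources = ica.fit_transform(A_common)   # I x C, up to scale/permutation
    return common_sources
```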


Fig. 3: Flow diagram of the general common and individual feature analysis (CIFA).

3.2 Common Nonnegative Features Extraction

In the case where the common components are required to be nonnegative, we cannot run NMF methods on the common basis directly; instead, we use two steps to extract the nonnegative common components. First, from (11) the common space can be extracted. Then we consider the following low-rank-approximation-based (semi-)nonnegative matrix factorization (NMF) model [29]:

(36)

By using low-rank NMF (if the other factor is also nonnegative) or low-rank semi-NMF (where the other factor is arbitrary), we can extract the common nonnegative components. For example, by applying the following multiplicative update rules iteratively, both factors remain nonnegative:

(37)

where the two operators denote the element-wise product and division of matrices, respectively. See [29] for a detailed convergence analysis.
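The exact rules in (37) are specific to the low-rank common part; as a hedged stand-in, the sketch below shows the standard multiplicative-update NMF iteration, which likewise keeps both factors nonnegative when the input matrix is nonnegative. All names are illustrative.

```python
import numpy as np

def nmf_multiplicative(Y, R, n_iter=500, eps=1e-9, seed=0):
    """Multiplicative-update NMF, Y ~= S H with S, H >= 0 (assumes Y >= 0).
    This is a generic illustration of the update style, not the paper's
    exact rules (37)."""
    rng = np.random.default_rng(seed)
    I, J = Y.shape
    S = rng.random((I, R)) + eps
    H = rng.random((R, J)) + eps
    for _ in range(n_iter):
        # Element-wise multiplicative updates keep S and H nonnegative.
        H *= (S.T @ Y) / (S.T @ S @ H + eps)
        S *= (Y @ H.T) / (S @ H @ H.T + eps)
    return S, H
```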

3.3 Individual Feature Extraction (IFE)

In the above sections we discussed common feature extraction (CFE). Besides the common features, each data block also has its own individual features contained in its individual part. These individual features are often quite helpful in classification and recognition tasks. Although the individual part has the same size as the original data, it is rank deficient; hence dimensionality reduction on it should be a top priority before further analysis. We can run any dimensionality reduction method discussed in Section 2.4 on each individual part separately, and then use BSS or related methods to extract its features. However, there is a major difference between the dimensionality reduction considered here and that in the pre-processing stage. In the pre-processing stage, dimensionality reduction is rather general-purpose and relatively simple, whereas at this stage it is closely tied to the specific task at hand. For example, if we want to visualize the data in a low-dimensional space, we may consider the methods discussed in [17]; for classification and recognition tasks, we may need to preserve as much discriminative information, neighborhood structure, etc., as possible [30]. See also [6] for a unified least-squares framework for various component analysis methods. In summary, a careful selection of the dimensionality reduction method at this stage is critical to achieving the ultimate goal. The above procedure is called individual feature extraction (IFE), as the extracted features are present only in each individual data block.
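A minimal sketch of IFE under the simplest choice of dimensionality reduction (plain SVD/PCA on the residual obtained after projecting out the common space); the names are illustrative, and a task-specific method could replace the SVD step:

```python
import numpy as np

def individual_features(Y_list, A_common, n_features):
    """Individual feature extraction sketch: remove the common space spanned
    by A_common from each block, then reduce the residual's dimensionality."""
    feats = []
    for Y in Y_list:
        residual = Y - A_common @ (A_common.T @ Y)   # remove common space
        U, s, Vt = np.linalg.svd(residual, full_matrices=False)
        feats.append(U[:, :n_features])              # individual basis vectors
    return feats
```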

Finally, we give the flow diagram of the proposed common and individual feature analysis (CIFA) in Fig.3.

4 Two Applications

4.1 Classification Using Common Features

In classification and pattern recognition tasks, we have a set of training data consisting of training samples and their labels. It is natural that objects belonging to the same category share some common features. Let the common features extracted from the $k$th category be given. Then, for a new test sample, we compute its matching score with the common features of each category:

(38)

As samples in the same class should share some common features, the label of the test sample is estimated as

(39)

There are many choices for defining the matching score, such as the Euclidean distance or the correlation (angle) between the test sample and the space spanned by the class's common features, which can be computed via least squares and CCA, respectively.

Note that for the linear discriminant analysis (LDA) classifier, the number of features should be significantly less than the number of samples to ensure the positive definiteness of the covariance matrix. The proposed method has no such limitation.
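The matching-score classifier described above can be sketched as follows, using the least-squares residual to the subspace spanned by each class's common features as the score (a correlation/CCA-based score would be an alternative); all names are illustrative:

```python
import numpy as np

def classify_by_common_features(x, class_bases):
    """Assign test sample x to the class whose common-feature subspace
    matches it best. class_bases[k] is an I x C_k matrix whose columns are
    the common features of class k."""
    scores = []
    for D in class_bases:
        coef, *_ = np.linalg.lstsq(D, x, rcond=None)   # project x onto span(D)
        scores.append(np.linalg.norm(x - D @ coef))    # smaller = better match
    return int(np.argmin(scores))
```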

4.2 Clustering Using Individual Features

Clustering is the task of assigning a set of objects to clusters such that objects belonging to the same cluster are the most similar. Cluster analysis is widely applied in data mining, machine learning, information retrieval, and bioinformatics. Unlike classification, clustering is a typical unsupervised learning approach; that is, no training data are available. In cluster analysis we need to compare the similarity between samples. In many practical applications, all the samples may share some common features, even though they belong to different clusters and certainly exhibit some dissimilarity. For example, in human face image analysis, every face has common facial parts such as cheeks, a nose, eyes, and a mouth, and faces often share similar features to some extent, reflecting the shapes and locations of these parts. The common features present in all samples are useless for clustering, as they do not provide any discriminative information. It is therefore reasonable to remove these common/similar features first and then use the individual features to cluster the objects. Intuitively, this should significantly improve the clustering accuracy when all objects share common features.


Fig. 4: Flow diagram of classification by using common features extracted from each class.
Fig. 5: Illustration of how COBE incorporating CNFE is able to extract the common features on the PIE database. (a) Common faces. (b) The first 64 samples of individual faces obtained by removing the common components. Their local individual features are accentuated.

Fig. 5 shows how COBE combined with CNFE is able to extract common faces (features) from the PIE database (details of the PIE database are given in the next section). Here we manually set the number of common components and used CNFE to extract the common nonnegative components. From the common faces shown in Fig. 5(a) we can observe the basic profile of a human face, while in Fig. 5(b) the individual local features are accentuated. These individual features should be quite helpful for improving the accuracy of clustering and recognition tasks. Generally, our individual-feature-based clustering method follows the steps below (a code sketch is given after the list):

  1. Randomly split the samples into groups to construct , where and .

  2. Run COBE to extract the common features of .

  3. Remove their common features from by letting .

  4. Perform dimensionality reduction and feature extraction on to obtain features .

  5. Run clustering algorithms on , where are the columns of corresponding to the original objects .

See Fig.3 for more details. Note that the dimensionality reduction and feature extraction methods used here should be chosen so as to substantially benefit the clustering task.
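The code sketch referred to above strings the listed steps together, assuming a common basis A_common has already been obtained by a COBE-style routine (such as the earlier sketches); PCA and k-means are used only as the simplest choices for steps 4 and 5, and all names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def cluster_with_individual_features(samples, A_common, n_clusters, n_features=20):
    """samples: I x M matrix whose columns are the objects to cluster;
    A_common: I x C common basis shared by all objects."""
    # Step 3: remove the common features shared by all objects.
    residual = samples - A_common @ (A_common.T @ samples)
    # Step 4: dimensionality reduction / feature extraction on the
    # individual part (plain PCA here; n_features <= min(I, M) assumed).
    feats = PCA(n_components=n_features).fit_transform(residual.T)   # M x d
    # Step 5: run an off-the-shelf clustering algorithm on the features.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(feats)
    return labels
```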

5 Simulations and Experiments

Linked BSS. In this simulation we generated a total of ten matrices whose first four columns were the speech signals included in the ICALAB benchmark Speech4.mat [31], and whose other six components were drawn from independent standard normal distributions. The entries of the mixing matrices were also drawn from independent standard normal distributions, and white Gaussian noise was added (SNR = 20 dB). We first used the COBE, JIVE [2], JBSS [1], and PCA methods to extract the common components, and then ran the SOBI method [32] to extract the latent speech signals (as JBSS did not perform well in this simulation, we also applied SOBI to improve its results). TABLE I shows the simulation results averaged over 50 Monte-Carlo runs, where the signal-to-interference ratio (SIR) of each estimated signal is defined as follows to evaluate the separation accuracy:

(40)

where the source and its estimate are normalized random variables with zero mean and unit variance. It can be seen that JIVE and COBE achieved higher SIRs than JBSS and PCA, although the performance of JBSS improved after incorporating the SOBI method compared with its original version. Moreover, although PCA has a close relation with COBE, the table again shows that the common features extracted by PCA are often contaminated by individual features. COBE and JIVE achieved almost the same separation accuracy, but COBE was much faster. In particular, the performance of JIVE is quite sensitive to the estimated ranks of the joint/common and individual components: if the ranks are given accurately, JIVE performs well; otherwise its efficiency is significantly reduced. For example, in this instance, when the ranks of the individual components were specified as 7 instead of the true value 6 (the second JIVE row in TABLE I), JIVE took more than 77 seconds to converge. In [2] a method to estimate the number of components was proposed; however, it is quite time consuming and its performance depends on a skillful selection of its parameters (for this instance, JIVE took more than two hours to estimate the ranks). For the COBE method, the total time consumption depends mainly on the number of common components and the size of the problem, which makes COBE much more efficient than JIVE. Moreover, the estimation of the number of common components is simpler and more intuitive: we can generally estimate it by tracking the value of the residual, as illustrated in Fig. 6. As there is often a big gap between the residual values corresponding to the common components and the others, we can use SORTE [21] to detect the number of components. Note also that the threshold bounds the correlations between the common components (see Proposition 1), i.e., how identical they are, which provides another intuitive way to select the parameter.
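Since the precise form of (40) is not reproduced here, the following is a hedged sketch of a standard SIR computation consistent with the description (zero-mean, unit-variance normalization, sign ambiguity resolved, result in dB); the exact definition in (40) may differ in detail.

```python
import numpy as np

def sir_db(s_true, s_est):
    """Signal-to-interference ratio (dB) between a source and its estimate,
    after zero-mean/unit-variance normalization and sign matching."""
    s = (s_true - s_true.mean()) / s_true.std()
    e = (s_est - s_est.mean()) / s_est.std()
    e = np.sign(s @ e) * e                    # fix the BSS sign ambiguity
    return 10.0 * np.log10(np.sum(s**2) / np.sum((s - e)**2))
```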

Algorithm  SIR1  SIR2  SIR3  SIR4  Runtime (s)
COBE 21.1 23.5 23.9 24.6 0.5
COBE 21.1 23.3 24.2 24.7 0.6
JIVE 21.2 23.8 24.2 25.0 7.4
JIVE 21.2 23.8 24.1 24.9 77.5
JBSS 15.1 15.4 15.9 16.3 1.7
PCA 15.8 17.1 17.8 19.4 0.5
TABLE I: Performance comparison in linked BSS. The latent signals were estimated by applying the SOBI method to the common components extracted by each algorithm.
Fig. 6: Illustration of how to detect the number of common components by locating the gap between the residual values, under different noise levels.

In Fig. 7 we show the performance of COBE, in terms of running time and separation accuracy, when the observations are projected into a lower-dimensional space by multiplication with a random matrix. The results were averaged over 50 independent runs; in each run the entries of the projection matrix were drawn from independent standard normal distributions. As the figure shows, when the projected dimension increases, the running time increases approximately linearly whereas the improvement in accuracy levels off, which justifies the analysis in Section 2.6. Based on this fact, we may use projection to significantly improve the efficiency of COBE when the data dimension is extremely large.

Fig. 7: Illustration of the averaged performance of COBE after projecting the -dimensional observations onto a lower -dimensional space over 50 Monte-Carlo runs.

Dual-energy X-ray image decomposition. Accurate detection of lung nodules using dual-energy chest X-ray imaging is an important diagnostic task for finding early signs of lung cancer [33]. Unfortunately, the presence of ribs and clavicles overlapping with soft tissue, together with environmental noise, makes it quite challenging to detect subtle nodules, and an accurate separation of bone from soft tissue is quite helpful for a correct diagnosis. In this experiment we assumed that we had a series of X-ray images which were mixtures of soft tissue, bone tissue, and noise; the mixed soft and bone tissues formed their nonnegative common components, and our aim was to extract the separated soft and bone tissues. We generated four sets of sources whose first two (common) components were, respectively, the soft and bone tissues, while the other eight components were drawn from independent uniform distributions between 0 and 1 to model interference. They were mixed via different mixing matrices whose elements were also drawn from independent uniform distributions between 0 and 1. The sources in this example are highly correlated and consequently cannot be separated by ICA methods; due to the presence of random dense noise, they are also difficult to separate by running ordinary NMF algorithms on each single set of mixtures. As the soft and bone tissues exist in all images, we ran COBE to extract the basis of the common sources and then used CNFE to extract the soft and bone tissues. One typical realization is shown in Fig. 8(b). Fig. 8(d) displays four samples of the nonnegative components extracted by the nLCA-IVM method [33]. Due to the presence of dense noise (so that the identifiability conditions of nLCA-IVM are not satisfied here), nLCA-IVM cannot extract the desired source images in this example. This experiment shows how the proposed method can be used to extract common nonnegative features, or equivalently, be used as a nonnegative high-correlation analysis.

Fig. 8: Illustration of common nonnegative feature extraction. (a) The sources. (b) The images extracted by CNFE. (c) Samples of the observations/mixtures. (d) Partial images extracted by nLCA-IVM.
k Accuracy (%) Normalized Mutual Information (%)
PCA tSNE GNMF MMCut CIFA PCA tSNE GNMF MMCut CIFA
11 40.5±0.2 47.1±4.0 28.8±0.7 45.4±0.5 51.7±3.9 45.0±0.8 48.9±3.1 26.6±0.4 48.1±0.7 54.7±3.8
12 49.8±1.2 48.4±3.8 29.5±0.2 47.8±0.3 49.9±2.8 53.9±1.0 50.9±2.9 27.1±0.1 47.9±0.6 53.1±2.3
13 49.5±0.9 50.0±3.4 29.4±0.1 43.3±0.1 49.8±2.7 56.2±1.1 52.2±2.4 29.9±0.2 47.1±0.1 54.0±2.4
14 46.2±0.6 48.4±2.9 28.6±0.4 42.9±0.5 47.3±2.6 52.4±0.5 51.8±2.1 32.4±0.0 48.3±0.6 53.8±2.2
15 40.1±0.8 45.3±2.3 27.2±0.4 42.4±0.0 45.6±2.6 48.3±0.9 51.1±1.8 31.7±0.4 50.8±0.1 52.7±2.2
Avg. 45.2 47.8 28.7 44.4 48.9 51.2 51.0 29.5 48.4 53.7
TABLE II: Clustering Performance on Yale
k Accuracy (%) Normalized Mutual Information (%)
PCA tSNE GNMF MMCut CIFA PCA tSNE GNMF MMCut CIFA
20 69.0±0.1 69.7±3.7 53.5±0.0 64.8±1.8 70.3±3.4 78.3±0.1 80.5±1.9 65.2±0.5 75.1±0.9 81.1±2.1
25 64.8±0.3 67.6±3.1 55.1±0.7 65.3±0.7 76.8±3.5 79.6±0.3 82.0±1.4 70.3±0.5 77.8±0.1 86.2±2.0
30 67.90.8 67.12.7 51.00.1 71.2