1 Introduction and Motivation
Massive high-dimensional data are increasingly prevalent in many areas of science. A variety of data analysis tools have been proposed for different purposes, such as data representation, interpretation, and information retrieval. Recently, multi-block data analysis has attracted increasing attention [1, 2, 3, 4]. Multi-block data arise when multiple measurements are taken from a set of experiments on the same subject using various techniques, or on multiple subjects under similar configurations. For example, in biomedical studies, human electrophysiological signals responding to pre-designed stimuli are collected from different individuals and trials, and a number of different technologies and devices may be used to collect diverse information from different aspects. All of these result in naturally linked multi-block data. Such data share some common information due to the background in which they are collected, while each block also possesses its own individual features. It is therefore meaningful to analyze the data in a connected and linked way rather than separately. This study is devoted to this promising topic.
Several methods have already been developed for multi-block data analysis. For example, canonical correlation analysis (CCA) was proposed to maximize the correlations between the random variables in two data sets [5, 6, 7]. CCA was later generalized to analyze multiple data sets and applied to joint blind source separation and feature extraction [8, 1, 9, 10]. In contrast to CCA, Partial Least Squares (PLS) maximizes the covariance rather than the correlation [11, 12, 13]. To analyze image populations, a framework named Population Value Decomposition (PVD) was proposed for data sets that have exactly the same size. It turns out that PVD can be studied in the more general framework of tensor (Tucker) decompositions, which is another hot topic for high-dimensional data analysis and exploration in recent years; see [14, 15] and references therein. A method named Joint and Individual Variation Explained (JIVE) was proposed for integrated analysis of multiple data types, together with a new algorithm which extracts their joint and individual components simultaneously. To the best of our knowledge, however, their potential as a common and individual feature analysis tool has not been fully exploited.
In this study a general framework of Common and Individual Feature Analysis (CIFA) is proposed for multi-block data analysis. Compared with existing works, our main contributions include:
New efficient algorithms are proposed to extract a common orthogonal basis from multi-block data, depending on whether the number of common components is given or not.
A detailed analysis of the relationship between the proposed methods and other related methods such as CCA and principal component analysis (PCA) is given. Our results show that common feature extraction can be interpreted as high-correlation analysis, and that it performs PCA on the common space shared by all data rather than on the whole data, as ordinary PCA does.
In the proposed framework, various well-established data analysis methods proposed for a single matrix, e.g., dimensionality reduction [16, 17], Blind Source Separation (BSS), and Nonnegative Matrix Factorization (NMF), can be easily applied to the common and individual spaces separately in order to extract components with desired features and properties, which provides a flexible and versatile facility for multi-block data analysis tasks.
Two important applications of CIFA, i.e., classification and clustering, are discussed, illustrating how the extracted common and individual features can improve the performance of data analysis.
The rest of the paper is organized as follows. In Section 2 the common orthogonal basis extraction (COBE) is discussed, including the problem statement, model, algorithms, and its relationship with CCA, PCA, and other related methods. In Section 3 the general framework of common and individual feature analysis (CIFA) is presented. The applications of CIFA in classification and clustering are discussed in Section 4. In Section 5 simulations on synthetic and real data justify the efficiency and validity of the proposed methods. Finally, we provide some concluding remarks and suggestions for future work in Section 6.
2 Common Orthogonal Basis Extraction
2.1 Problem Formulation
Given a set of matrices , , consider the following matrix factorization problem of each matrix :
where the columns of consist of the latent variables in (sources, basis, etc ), denotes the corresponding coefficient matrix (mixing, encoding, etc ). is the number of latent components with , which generally corresponds to a compact/compressed representation of . The necessity and justification of this assumption will be discussed in Section 2.4.
So far a very wide variety of matrix factorization techniques has been proposed for (1), such as PCA, independent component analysis (ICA) [19, 20], BSS, etc. In these methods the matrices are treated independently and separately. Here we consider the case where the data are naturally linked and share some common components such that
where , , and . In (2), contains the common components shared by all matrices in while contains the individual information only presented in . In this way, the matrices in are factorized in a linked way such that
where and are the compatible partition of . In other words, each matrix is represented by two parts: the common space and the individual space , which are spanned by the common components (i.e., columns of ) existing in all () and its individual components only presented in , respectively. Our problem is to seek and from a given set of matrices , , without the knowledge of and possibly the number . Note that two special cases of (3) have been extensively studied in the past decades:
. No common components exist in and the problem is simply equivalent to factorizing each matrix separately.
for all . The problem is equivalent to ordinary matrix factorization of a large matrix created by stacking all matrices . This will be further detailed in the end of this sub-section.
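For reference, the linked factorization described above can be written compactly as follows. The symbols are assumptions chosen here for illustration: blocks $Y_n$ with $I$ rows and $J_n$ columns, factors $A_n$ and $B_n$, ranks $R_n$, and $c$ common components, with bars marking common parts and breves marking individual parts. The tags follow the equation numbers cited in the text on the assumption that (1), (2), and (3) denote the single-block factorization, the partition of the basis, and the linked factorization, respectively.

```latex
\begin{align}
Y_n &\approx A_n B_n^{T},
  \qquad A_n \in \mathbb{R}^{I \times R_n},\; B_n \in \mathbb{R}^{J_n \times R_n}, \tag{1}\\
A_n &= \begin{bmatrix} \bar{A} & \breve{A}_n \end{bmatrix},
  \qquad \bar{A} \in \mathbb{R}^{I \times c},\; \breve{A}_n \in \mathbb{R}^{I \times (R_n - c)}, \tag{2}\\
Y_n &\approx \bar{A}\,\bar{B}_n^{T} + \breve{A}_n \breve{B}_n^{T},
  \qquad B_n = \begin{bmatrix} \bar{B}_n & \breve{B}_n \end{bmatrix}. \tag{3}
\end{align}
```

In this assumed notation the two special cases read $c = 0$ (no shared columns) and $c = R_n$ for all $n$ (every column shared).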
Note that the solution is not unique since is also a solution to (3) for an arbitrary invertible matrix with proper size. To shrink the solution space and simplify the computation we let be the QR-decomposition of such that (the matrix denotes the identity matrix with proper size. In the case where the size should be specified explicitly we use to denote the -by- identity matrix). Substituting them into (3) we have
Taking our purpose into consideration, we further assume that , where is the zero matrix with proper size. This assumption means that there is no interaction between common features and individual features, and it does not cause any additional factorization error. To see this, if , from (3) and the fact that
Furthermore, we consider the truncated singular value decomposition (SVD) of, where , , and is invertible. Then . Define and . We have and .
Based on the above analysis, the general problem we consider can be formally formulated as:
In (7) we have implicitly assumed that the number of common components is known. How to estimate in practice and how to solve (7) will be discussed in Sections 2.2 and 2.3. Compared with (1), the procedure of (2)-(6) does not cause any additional decomposition error. Hence the restriction of implicitly guarantees that we are seeking common components. Indeed, once contains information other than common components, the total decomposition error will increase under this rank restriction.
It is worth noting that once for all , the problem reduces to ordinary PCA, or equivalently, low-rank approximation of matrices. In this case can be found by solving
Let be the matrix by stacking all matrices horizontally, and similarly let . Then (8) can be viewed as a partitioned version of PCA
When , model (7) is distinguished from (8) by the individual parts involved and the rank restriction discussed above. In this sense (7) can also be interpreted as the principal components of their common space , i.e., the residuals after removing their individual components. Unfortunately, as the individual parts are also unknown and can have very large variance (energy), we cannot solve by running standard PCA on .
We use two steps to solve (7). In step 1, the matrices in (7) are replaced by their optimal rank- approximations, obtained by solving (1) separately for each . To distinguish the two, we call the original the raw data and the reduced version the cleaned data. In step 2, (7) is solved by using the cleaned data. Because of (2)-(6), no additional error arises from the separation of individual and common spaces, so theoretically we have
In Sections 2.2 and 2.3 we focus on the second step.
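Step 1 of the procedure above (replacing each raw block by its best low-rank approximation) can be sketched with a truncated SVD. This is a minimal illustration, assuming the per-block ranks are already known; the function names `clean_block` and `clean_blocks` are hypothetical and not taken from the paper.

```python
import numpy as np

def clean_block(Y, rank):
    """Best rank-`rank` approximation of Y in the Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    # Keep only the leading `rank` singular triplets.
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

def clean_blocks(blocks, ranks):
    """Apply the rank reduction to every block separately (ranks are assumed given)."""
    return [clean_block(Y, r) for Y, r in zip(blocks, ranks)]
```

The cleaned blocks would then be passed to the second step; how the ranks are chosen is left open here.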
2.2 The COBE Algorithm: the Number of Common Components is Unknown
From (7) (or (10)), once has been estimated, can be computed from (Footnote 1: If is exact as in (10), Equation (11) is also exact. Otherwise (11) is interpreted as the least-squares solution of . Similarly for Equations (12) and (13).)
After that can be computed via truncated singular value decomposition (tSVD) of the residual matrix , . In other words, estimating plays the central role to solve (7). In this section we focus on the problem of how to estimate efficiently.
For any and , the optimal and in (7) satisfy that
where (as in Eq.(2)), , and denotes the Moore-Penrose pseudo inverse of a matrix. Let such that (For each matrix this only needs to be computed once by using, e.g., QR decomposition or truncated SVD of ). Then we define , and (12) is equivalent to
and hence for any , , there holds that
where and are the th column of and , respectively.
According to (14), the first column of , i.e., , can be obtained by solving the following optimization model:
We use alternating least-square (ALS) iterations to solve (15). Fix first and the optimal is given by
and then is normalized to have unit norm. Then is fixed and we get
Now suppose that have been found and we seek the next common basis . To avoid finding repeated common basis, we consider a useful property of . Let denote the matrix consisting of the first columns of . From (13) we have
which means that , i.e., is in the null space of . Hence we update as
where . Then can be found by solving
Repeating the procedure done for (15), the minimum of can be obtained (Footnote 2: For the matrix is not orthogonal any more. However, it can be verified that is the Moore-Penrose pseudoinverse of , thereby leading to the least-squares solution ). Again, there are two cases:
Otherwise, no common basis vector exists any more and a total of common orthogonal basis vectors are found as .
In this way an orthogonal basis of the common space can be found sequentially. This procedure is called common orthogonal basis extraction (COBE) and is presented as Algorithm 1.
The parameter controls how identical the extracted components are. If , the extracted components are exactly the same. Otherwise, approximately identical (or equivalently, highly correlated) components will be extracted (see Section 2.5 for a detailed discussion). We can adopt the SORTE method to select the parameter automatically. Basically, SORTE detects the gap between the eigenvalues corresponding to the signal space and those corresponding to the noise space. Here we can detect the gap between the common space and the individual space similarly. We will illustrate this in simulations.
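The sequential procedure of this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a transcription of Algorithm 1: the names `orth_basis` and `cobe`, the symbols `Qs`, `zs`, `a`, the threshold `eps` applied to the summed squared residual, and the random initialization are all choices made here. Each block is replaced by an orthonormal basis of its column space, the two alternating least-squares updates are iterated until the candidate vector stabilizes, the vector is accepted if its residual is at most `eps`, and the bases are then deflated so that the next vector is sought in the orthogonal complement.

```python
import numpy as np

def orth_basis(Y, rtol=1e-10):
    """Orthonormal basis of the column space of Y (thin SVD, numerical rank)."""
    U, s, _ = np.linalg.svd(Y, full_matrices=False)
    return U[:, s > rtol * s[0]]

def cobe(blocks, eps=1e-6, n_iter=1000, tol=1e-13, seed=0):
    """Sequentially extract common orthonormal basis vectors from `blocks`.

    blocks : list of 2-D arrays with the same number of rows.
    eps    : assumed acceptance threshold on f = sum_n ||Q_n z_n - a||^2.
    Returns an array whose columns are the accepted vectors (possibly none).
    """
    rng = np.random.default_rng(seed)
    Qs = [orth_basis(Y) for Y in blocks]
    n_rows = blocks[0].shape[0]
    A = np.zeros((n_rows, 0))
    for _ in range(min(Q.shape[1] for Q in Qs)):
        # Random start, kept orthogonal to the vectors already found.
        a = rng.standard_normal(n_rows)
        a -= A @ (A.T @ a)
        a /= np.linalg.norm(a)
        for _it in range(n_iter):
            zs = [Q.T @ a for Q in Qs]                  # fix a, update each z_n
            a_new = sum(Q @ z for Q, z in zip(Qs, zs))  # fix z_n, update a
            nrm = np.linalg.norm(a_new)
            if nrm == 0.0:
                break
            a_new /= nrm                                # unit-norm constraint
            converged = np.linalg.norm(a_new - a) < tol
            a = a_new
            if converged:
                break
        zs = [Q.T @ a for Q in Qs]
        f = sum(np.linalg.norm(Q @ z - a) ** 2 for Q, z in zip(Qs, zs))
        if f > eps:
            break                                       # no further common vector
        A = np.column_stack([A, a])
        # Deflation: remove the accepted direction from every basis.
        Qs = [Q - a[:, None] * (a @ Q) for Q in Qs]
    return A
```

With noisy data `eps` would have to be chosen larger, for instance from the gap in the sequence of residual values mentioned above; the value used here is only suitable for nearly exact common components.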
2.3 The COBE Algorithm With Specified Number of Common Components
We briefly discuss the case where is given. Following the analysis in Section 2.1, we solve the following model:
Again, we optimize with respect to and alternately. When is fixed, the optimal is computed from
And when , , are fixed, (21) is equivalent to
where denotes the trace of a matrix and
Let be the truncated SVD of , where is a diagonal matrix with . Motivated by the work in (page 601; the main difference is that here is not necessarily square), we show that the optimal solution of (23) is
As and , we have , which means that . Clearly, when , there holds that , and reaches its upper bound .
The pseudo-code is presented in Algorithm 2.
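The alternating scheme for a given number of common components can be sketched as follows. This is an illustration under assumptions rather than Algorithm 2 itself: the names `orth_basis` and `cobe_c` and the variables `Qs`, `Zs`, `W` are introduced here, and the basis update uses the product of the left and right singular vectors of the (generally non-square) matrix `W`, which is the standard maximizer of a trace objective under an orthonormality constraint.

```python
import numpy as np

def orth_basis(Y, rtol=1e-10):
    """Orthonormal basis of the column space of Y (thin SVD, numerical rank)."""
    U, s, _ = np.linalg.svd(Y, full_matrices=False)
    return U[:, s > rtol * s[0]]

def cobe_c(blocks, c, n_iter=1000, tol=1e-13, seed=0):
    """Extract c common orthonormal basis vectors when c is specified."""
    rng = np.random.default_rng(seed)
    Qs = [orth_basis(Y) for Y in blocks]
    # Random orthonormal start (an assumption; any orthonormal start would do).
    A, _ = np.linalg.qr(rng.standard_normal((blocks[0].shape[0], c)))
    for _ in range(n_iter):
        Zs = [Q.T @ A for Q in Qs]                  # fix A, least-squares update of Z_n
        W = sum(Q @ Z for Q, Z in zip(Qs, Zs))      # tall matrix, not square in general
        U, _, Vt = np.linalg.svd(W, full_matrices=False)
        A_new = U @ Vt                              # maximizes trace(A^T W) with A^T A = I
        converged = np.linalg.norm(A_new - A) < tol
        A = A_new
        if converged:
            break
    Zs = [Q.T @ A for Q in Qs]
    return A, Zs
```

Because all `c` vectors are updated jointly, no deflation step is needed in this variant.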
2.4 Pre-processing: Dimensionality Reduction
Like CCA, COBE becomes meaningless if for all , because in this case for any invertible matrix there always exist matrices such that , i.e., any invertible matrix forms a common basis. Hence in model (7) is required for all . Fortunately, this requirement is not as restrictive as it may seem, because in practice, although observation data can be of very high dimensionality, the latent rank is often significantly lower than the dimensionality of the observation data . Even if this condition does not hold, we perform dimensionality reduction, such as PCA, on the raw data by solving (1) before running COBE, as stated at the end of Section 2.1. With the dimensionality reduction step only principal components are targeted, and the subsequent computational complexity can be significantly reduced. Another strong reason for running dimensionality reduction is to reduce noise. Indeed, the significance of dimensionality reduction has been extensively justified in the literature. In this sense, if in (1) is interpreted as the PCA of each matrix , in COBE we simply rotate/transform the columns of such that the common basis and the individual basis are completely separated.
One of the most widely used dimensionality reduction methods is principal component analysis (PCA), which is based on the assumption that the noise is independent and identically Gaussian distributed. If instead the noise is very sparse, we may consider robust PCA (RPCA). Moreover, we may use SORTE or related techniques to estimate the number of latent components and then use PCA to perform dimensionality reduction.
2.5 Relation With Other Methods
COBE has a very close relation with canonical correlation analysis (CCA). For two given sets of data and , CCA seeks vectors and such that the correlation is maximized. In COBE, however, only the components with correlation higher than a specified threshold are extracted. Let be row-centered (i.e., zero-mean) random variables. We have
Suppose that with , . Then
This ends the proof.
From the proposition, once in (15) and (20) are upper bounded, the correlations between the projected variables are consequently lower bounded. Particularly, as . This shows that COBE actually can be interpreted as high correlation analysis (HCA) that differs from canonical correlation analysis (CCA) for multiple data sets.
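The link between a small residual and a high correlation can be seen from a generic identity for unit-norm vectors. This is a sketch of the mechanism only, not a restatement of Proposition 1: the vectors $u$, $v$ and the bound $\epsilon$ are symbols introduced here for illustration, standing for two zero-mean, unit-norm projected components and an upper bound on their squared distance.

```latex
\begin{align*}
\|u - v\|_2^2 &= \|u\|_2^2 + \|v\|_2^2 - 2\,u^{T}v = 2 - 2\,u^{T}v
  \qquad (\|u\|_2 = \|v\|_2 = 1),\\
\|u - v\|_2^2 \le \epsilon \;&\Longrightarrow\;
  \operatorname{corr}(u, v) = u^{T}v \ge 1 - \tfrac{\epsilon}{2}.
\end{align*}
```

Under these assumptions the correlation tends to 1 as $\epsilon \to 0$, so bounding the residual from above bounds the correlation from below.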
Fig.1 illustrates the relationship between COBE and CCA. Given two matrices , let , i.e., the first column of , be the sine wave and , where . The entries of the other components were drawn from independent standard normal distributions. Each matrix was mixed via a different matrix whose entries were drawn from independent standard normal distributions such that (). The red line in Fig.1(a) shows the common components extracted by COBE, and the corresponding components projected onto , i.e., , , match the projected components obtained by CCA very well. In this sense, COBE realizes CCA from another aspect. However, in COBE only the components with very high correlations are extracted, as stated in Proposition 1. From the figure, can be interpreted as the principal component of , information that is not provided by CCA. Note also that in the proposed method the common components (highly correlated features) satisfy , which makes COBE resemble a regularized CCA [25, 9, 26]. Finally, due to its close relationship with CCA, COBE inherits most of the connections and differences between CCA and other related methods such as PLS, alternating conditional expectation (ACE), etc.
From (7) and (8), we know that COBE also has a close relation with PCA. Fig.2 illustrates the difference between COBE and PCA using the above data (the principal component is computed from the concatenated version defined in (9)). Basically, COBE seeks the principal components of the common space (spanned by common or very similar basis vectors) of all data whereas PCA seeks the principal components of all data, which makes COBE quite useful for finding highly relevant and related information in a large number of sets of signals. Moreover, as (7) can be interpreted as the PCA of the common space of all data sets, or the PCA of the individual space of each single data set, we may optimize (7) by using a series of alternating truncated SVDs (PCA), which is the approach adopted by the JIVE method. This approach involves frequent SVDs of huge matrices formed by all data in the computation of the common space and hence is quite time consuming. Compared with JIVE, the COBEc method is more efficient in optimization and more intuitive and flexible in the estimation of the number of common components.
2.6 Scalability For Large-Scale Problems
In multi-block data analysis the data encountered are often huge. Here huge means that both and () are quite large. Note that in (15), (20), and (21) we actually use the dimensionality-reduced matrices with . Hence the value of is generally not a big issue. In the case where is extremely large, we consider the following way to significantly reduce the time and memory consumption of COBE. Let be a random matrix with . From (12) we may solve the following model first:
where is much smaller than , and . After have been estimated by using COBE or COBEc, the corresponding common features can be computed from . Obviously, as long as . In other words, this approach will not lose any common features. In the worst case, however, (31) may give fake common features when occasionally lies in the null space of . Fortunately this rarely happens in practice, and these fake common features can be easily detected by examining the value of .
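The projection idea can be sketched as follows: the common subspace is estimated on randomly projected blocks, and the result is lifted back to the original space through the coefficients of each block. This is a rough illustration under assumptions, not the paper's procedure: `projected_cobe` and `common_subspace` are hypothetical names, the number of common components `c` and the projected dimension `k` are assumed given, a simple orthogonal iteration stands in for COBE on the small problem, and the returned `spread` (disagreement between the lifted estimates of different blocks) is one possible check for spurious common features.

```python
import numpy as np

def orth_basis(Y, rtol=1e-10):
    """Orthonormal basis of the column space of Y (thin SVD, numerical rank)."""
    U, s, _ = np.linalg.svd(Y, full_matrices=False)
    return U[:, s > rtol * s[0]]

def common_subspace(Qs, c, n_iter=1000, tol=1e-13, seed=0):
    """Orthonormal basis of the c-dimensional subspace best shared by span(Q_n)."""
    rng = np.random.default_rng(seed)
    A, _ = np.linalg.qr(rng.standard_normal((Qs[0].shape[0], c)))
    for _ in range(n_iter):
        W = sum(Q @ (Q.T @ A) for Q in Qs)
        U, _, Vt = np.linalg.svd(W, full_matrices=False)
        A_new = U @ Vt
        converged = np.linalg.norm(A_new - A) < tol
        A = A_new
        if converged:
            break
    return A

def projected_cobe(blocks, c, k, seed=0):
    """Estimate the common basis from k-dimensional random projections of the blocks."""
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((k, blocks[0].shape[0]))   # random projection, k << rows
    small = [P @ Y for Y in blocks]                    # projected blocks (k rows each)
    A_small = common_subspace([orth_basis(S) for S in small], c, seed=seed)
    lifted = []
    for Y, S in zip(blocks, small):
        # Coefficients X with S X ~= A_small, then lift to the original space.
        X, *_ = np.linalg.lstsq(S, A_small, rcond=1e-10)
        lifted.append(Y @ X)
    spread = max(np.linalg.norm(L - lifted[0]) for L in lifted)
    A_bar, _ = np.linalg.qr(sum(lifted) / len(lifted))  # orthonormalize the average
    return A_bar, spread
```

Only matrices with `k` rows enter the iterative part, so its cost no longer depends on the original number of rows; a large `spread` would indicate that the projected blocks share directions that the original blocks do not.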
3 Common And Individual Feature Analysis
3.1 Linked BSS with Pre-whitening
So far we have only imposed orthogonality on the components . In this case the common components are not unique, as the columns of also form a common orthogonal basis for any orthogonal matrix. Sometimes we want to project the common components onto a feature space with some desired property or uniqueness. This can typically be done by, for example, blind source separation (BSS). BSS is the problem of finding latent variables from their linear mixtures such that
where denotes a BSS algorithm and is the mixing matrix. and are a permutation matrix and a diagonal matrix, respectively, denoting the unavoidable ambiguities of BSS. In other words, by using BSS methods the sources can be recovered from their mixtures up to a scale and permutation ambiguity, without any knowledge of the mixing matrix . Hence BSS is quite attractive and has served as a feature extraction tool in a wide range of applications, such as pattern recognition and classification. If we assume that the latent features (sources) satisfy that
where is defined in (3). From we have
Consequently, the columns of are just the linear mixtures of and hence can be estimated via
by using a proper BSS algorithm . In this case is actually the pre-whitened version of (33), from (34) and the fact that . By using BSS, we may obtain common features with desired properties such as sparsity, independence, temporal correlations, nonnegativity, etc., by imposing proper penalties on , or even nonlinear common features by using kernel tricks. We call the above BSS procedure linked BSS because we perform BSS on multi-block linked data . Note that the JBSS method in also performs BSS involving multi-block data. It extracts a group of signals with the highest correlations each time, and it requires that the extracted groups have distinct correlations. In other words, the JBSS method is actually a way to realize BSS by applying multiple-set CCA. In contrast, the linked BSS method extracts a common basis first and then applies ordinary BSS to it to discover common components with some desired property and diversity.
3.2 Common Nonnegative Features Extraction
In the case where is required to be nonnegative, we cannot run NMF methods on directly. In this case we use two steps to extract nonnegative common components. First, from (11) the common space, i.e., , can be extracted. Then we consider the following low-rank approximation based (semi-) nonnegative matrix factorization (NMF) model :
By using low-rank NMF (if is also nonnegative) or low-rank semiNMF (where is arbitrary) we can extract the common nonnegative components . For example, by using the following multiplicative update rules iteratively both and are nonnegative:
where and are element-wise product and division of matrices. See  for detailed convergence analysis.
3.3 Individual Feature Extraction (IFE)
In the above section we discussed common feature extraction (CFE). Besides the common features or , each data set also has its own individual features contained in the matrix . These individual features are often quite helpful in classification and recognition tasks. Although has the same size as , it is rank deficient and satisfies that . Hence dimensionality reduction on should be a top priority before further analysis. We can run any dimensionality reduction method discussed in Section 2.4 on each separately to estimate and , and then use BSS or related methods to extract the features in and . However, there is a major difference between the dimensionality reduction methods considered here and those in the pre-processing stage. In the pre-processing stage dimensionality reduction is general-purpose and relatively simple, whereas in this stage it is more closely tied to the specific purpose of the task at hand. For example, if we want to visualize the data in a low-dimensional space, we may consider the methods discussed in . For classification and recognition tasks, we may need to extract discriminative information, neighbor relationships, etc., as much as possible. See also for a unified least-squares framework for various component analysis methods. In summary, careful selection of dimensionality reduction methods in this stage is critical to achieving the ultimate purpose. The above procedure is called individual feature extraction (IFE), as the extracted features are present only in each individual data set.
Finally, we give the flow diagram of the proposed common and individual feature analysis (CIFA) in Fig.3.
4 Two Applications
4.1 Classification Using Common Features
In classification and pattern recognition tasks, we have a set of training data consisting of training samples and their labels. It is natural that the objects belonging to a same category must share some common features. Let denote the common features extracted from the th category, . Then for a new test sample , we compute its matching score with each :
As the samples in a same class should share some same features, the label of is estimated as
There are many choices to define , such as the Euclidean distance or correlation (angle) between and the space spanned by , which can be solved via least-square and CCA, respectively.
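One of the choices mentioned above, the Euclidean distance between a test sample and the space spanned by the common features of each class, can be sketched with an ordinary least-squares fit. The function names `matching_score` and `classify`, and the convention that a smaller score means a better match, are assumptions made for this illustration.

```python
import numpy as np

def matching_score(y, A_k):
    """Distance from sample y to the span of the columns of A_k (least squares)."""
    coef, *_ = np.linalg.lstsq(A_k, y, rcond=None)
    return np.linalg.norm(y - A_k @ coef)

def classify(y, class_bases):
    """Return the index of the class whose common-feature space is closest to y."""
    scores = [matching_score(y, A_k) for A_k in class_bases]
    return int(np.argmin(scores))
```

A correlation-based score would replace the residual norm by the angle between `y` and its projection, with the largest correlation selected instead of the smallest distance.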
Note that for the linear discriminant analysis (LDA) classifier, the number of features should be significantly smaller than the number of samples to ensure the positive definiteness of the covariance matrix. The proposed method has no such limitation.
4.2 Clustering Using Individual Features
Clustering is the task of assigning a set of objects to clusters such that the objects belonging to the same cluster are the most similar. Cluster analysis is widely applied in data mining, machine learning, information retrieval, and bioinformatics. Unlike classification, clustering is a typical unsupervised learning approach, that is, no training data are available. In cluster analysis, we need to compare the similarity between samples. In many practical applications, all the samples may have some common features, although they are in different clusters and certainly have some dissimilarity. For example, in human face image analysis, every face has common facial organs such as cheeks, nose, eyes, and mouth, and faces often share some features to some extent, reflecting their shapes and locations. The common features present in all samples are useless for clustering, as they do not provide any discriminative information. It is therefore reasonable to remove these common/similar features first and then use the individual features to cluster the objects. Intuitively, this should significantly improve the clustering accuracy when all objects have common features.
In Fig.5 we show how COBE incorporating CNFE is able to extract common faces (features) on the PIE database (Details of the PIE database are given in the next section). Here we manually set and used CNFE to extract the common nonnegative components. From the common faces shown in Fig.5(a), we can observe some basic profile of human faces. In Fig.5(b), their individual local features are accentuated. These individual features should be quite helpful to improve the accuracy of clustering and recognition tasks. Generally, in our individual features based clustering method we follow the steps below:
Randomly split the samples into groups to construct , where and .
Run COBE to extract the common features of .
Remove their common features from by letting .
Perform dimensionality reduction and feature extraction on to obtain features .
Run clustering algorithms on , where are the columns of corresponding to the original objects .
See Fig.3 for more details. Note that the dimensionality reduction and feature extraction methods considered here should be able to substantially benefit the clustering purpose.
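The splitting and common-feature removal steps listed above can be sketched as follows. This is an illustration under assumptions: samples are stored as the columns of a matrix `X`, the orthonormal common basis `A_bar` is taken as already extracted (for instance by COBE), and the function names are hypothetical. The later steps, feature extraction and the clustering algorithm itself, are left to the method of choice.

```python
import numpy as np

def split_into_blocks(X, n_groups, seed=0):
    """Randomly split the columns (samples) of X into n_groups blocks."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[1])
    groups = np.array_split(perm, n_groups)        # column indices of each block
    return [X[:, idx] for idx in groups], groups

def individual_part(blocks, groups, A_bar):
    """Remove the common part from every block and restore the original sample order."""
    n_samples = sum(len(idx) for idx in groups)
    X_ind = np.empty((A_bar.shape[0], n_samples))
    for Y, idx in zip(blocks, groups):
        # Subtract the least-squares fit of the block on the orthonormal common basis.
        X_ind[:, idx] = Y - A_bar @ (A_bar.T @ Y)
    return X_ind
```

Keeping the index groups makes it possible to map each column of the result back to its original object before clustering.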
5 Simulations and Experiments
Linked BSS. In this simulation we generated a total of ten matrices , , whose first four columns were the speech signals included in the benchmark of ICALAB (named Speech4.mat), and whose other six components were drawn from independent standard normal distributions. The entries of the mixing matrices were also drawn from independent standard normal distributions. Finally let , where models white Gaussian noise (SNR=20dB). We first used the COBE, JIVE, JBSS, and PCA methods to extract the common components. Then we ran the SOBI method to extract the latent speech signals (as JBSS did not perform well in this simulation, we also used SOBI to improve its results). TABLE I shows the simulation results averaged over 50 Monte-Carlo runs, where SIR, i.e., the signal-to-interference ratio of the th estimated signal, is defined as follows to evaluate the separation accuracy:
where , are normalized random variables with zero mean and unit variance, and is an estimate of . It can be seen that JIVE and COBE achieved higher SIRs than JBSS and PCA, although the performance of JBSS improved after incorporating the SOBI method compared with its original version. Moreover, although PCA has a close relation with COBE, it can be seen again from the table that the common features extracted by PCA are often contaminated by individual features. COBE and JIVE achieved almost the same separation accuracy, but COBE was much faster. In particular, the performance of JIVE is quite sensitive to the estimate of the rank of the joint/common and individual components. If the rank is given accurately, JIVE performs well. Otherwise its efficiency is significantly reduced. For example, for this instance, if the ranks of the individual components were specified as 7 (denoted as JIVE in TABLE I) while they were actually 6, JIVE took more than 77 seconds to converge. A method to estimate the number of components has also been proposed; however, it is quite time consuming and its performance depends on skillful selection of its parameters (for this instance, JIVE took more than two hours to estimate the rank). For the COBE method, first of all, the total time consumption depends on the number of common components and the size of the problem. This makes COBE much more efficient than JIVE. Moreover, the estimation of the number of common components is simpler and more intuitive. Generally, we can estimate the number of components by tracking the value of , as illustrated in Fig.6. As there is often a large gap between the values of corresponding to the common components and the others, we can use SORTE to detect the number of components. Note also that the threshold bounds the correlations between the common components (see Proposition 1), or how identical they are. This provides another intuitive way to select the parameter.
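A signal-to-interference ratio of this kind can be computed as sketched below. The exact formula is an assumption here: the commonly used energy-ratio form in decibels is applied to the standardized source and its standardized estimate, and the sign of the estimate is aligned first because BSS leaves the sign undetermined. The names `standardize` and `sir_db` are hypothetical.

```python
import numpy as np

def standardize(x):
    """Return x with zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def sir_db(s, s_hat):
    """SIR in dB between a source s and its estimate s_hat (assumed energy-ratio form)."""
    s, s_hat = standardize(s), standardize(s_hat)
    if s @ s_hat < 0:
        s_hat = -s_hat                      # resolve the sign ambiguity (assumption)
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum((s - s_hat) ** 2))
```

For standardized signals this quantity depends only on the correlation between the source and its estimate, so a higher SIR corresponds directly to a more highly correlated estimate.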
Fig.7 shows the performance of COBE, in terms of running time and separation accuracy, when the observations were projected into a lower -dimensional space by multiplying by a random matrix . The results were averaged over 50 independent runs. In each run the entries of the projection matrix were drawn from independent standard normal distributions. From the figure, as the value of increases, the running time increases approximately linearly whereas the improvement in accuracy levels off, which supports the analysis in Section 2.6. Based on this fact, we may use projection to significantly improve the efficiency of COBE when is extremely large.
Dual-energy X-ray image decomposition. Accurate detection of lung nodules using dual-energy chest X-ray imaging is an important diagnostic task for finding early signs of lung cancer. Unfortunately, the presence of ribs and clavicles overlapping with soft tissues, together with environmental noise, makes it quite challenging to detect subtle nodules. Accurate separation of bone from soft tissue is quite helpful for making a correct diagnosis. In this experiment, we assumed that we had a series of X-ray images which were mixtures of soft and bone tissues and noise. The mixed soft and bone tissues formed their nonnegative common components. Our aim was to extract the separated soft and bone tissues. We generated four sets of sources whose first two common components were respectively the soft and bone tissues, and whose other eight components were drawn from independent uniform distributions between 0 and 1 to model interference. They were mixed via different mixing matrices whose elements were drawn from independent uniform distributions between 0 and 1. It is known that the sources in this example are highly correlated and consequently cannot be separated by using ICA methods. Due to the presence of random dense noise, they are also difficult to separate by using ordinary NMF algorithms on each single set of mixtures. As the soft and bone tissues existed in all images, we ran COBE to extract the basis of the common sources and then used CNFE to extract the soft and bone tissues. One typical realization is shown in Fig.8(b). Fig.8(d) displays four samples of nonnegative components extracted by using the nLCA-IVM method. Due to the presence of dense noise (so that the identifiability conditions of nLCA-IVM are not satisfied here), nLCA-IVM cannot extract the desired source images in this example. This experiment shows how the proposed method can be used to extract common nonnegative features, or equivalently, as nonnegative high correlation analysis.
|k||Accuracy (%)||Normalized Mutual Information (%)|