1 Introduction
In recent years, an increasing number of data mining applications have arisen which deal with the problem of finding patterns in data points for which several blocks of high-dimensional measurements, or “views”, have been obtained. Each view generally provides a different quantitative representation of the available random samples. For instance, in genomics, one view may represent the expression levels of all genes in the genome, and the paired view may represent the genetic alterations in each gene observed on the same biological samples [31]; in web mining applications, web pages may be represented by the text content of the page and the hyperlinks pointing to it [3]. The fast-developing area of multi-view learning [7, 8] is concerned with how to make combined use of the information provided by the available views in order to perform specific data mining tasks, such as clustering and predictive modelling.
A recurrent objective is that of detecting naturally occurring data clusters. When multiple views are available, there may be reasons to believe that the observations should cluster in the same way under each of the available views, hence the process of inferring the true clustering can be improved by making joint use of all views. A growing number of multi-view clustering algorithms have been proposed, and two main approaches seem to have emerged in the literature: late and early
fusion methods. The late fusion methods first recover the clusters independently from each view (e.g. by using the K-means algorithm), and then attempt to infer a “consensus” clustering by combining the partitionings obtained within each view such that some measure of disagreement between the individual partitionings is minimised [17, 15, 18, 6]. On the other hand, early fusion methods start by learning any common patterns that may be shared by the views and that could yield a joint clustering [3, 29, 9, 10]. A common assumption is that the data under each view are generated from a mixture of distributions where the mixing proportions which determine the cluster assignments are unknown but shared between views. Several methods rely on a two-step approach where the clusters are ultimately defined in terms of geometric separation between points lying in low-dimensional projections: first, a joint dimensionality reduction step is performed using both views, generally using canonical correlation analysis (CCA), which recovers latent factors explaining the correlation between the views; second, K-means is used to detect the clusters among data points in the lower dimensional space found in the first stage. Using CCA to initially perform dimensionality reduction has been shown to increase the separation between the clusters [9], and some nonlinear extensions have also been explored [10].
These multi-view clustering methods have become particularly popular for a broad class of problems in web mining where data of many different types may co-occur on the same page, for example text, hyperlinks, images and video. A common application is that of clustering web pages [3]
. Pages can be clustered together based on the similarity of both their text content and link structure, and the clusters identify broad subject categories. Another problem in the web mining domain involves clustering and annotating images represented as a vector of pixel intensities as well as a bag of words containing the corresponding textual annotation
[6]. The ability to accurately cluster these data has implications in web search and advertising. A different multi-view learning scenario arises when one representation of the data is treated as a high-dimensional predictor or “explanatory” vector, and the paired view represents a high-dimensional “response” vector. The task then consists in fitting a regression model such that, when a new observation has been observed under the explanatory view, the corresponding representation of that observation in the response view can be optimally predicted.
In settings where both the explanatory and response views are high-dimensional, two-block partial least squares (TB-PLS) regression has proved to be a particularly useful method for modelling a linear predictive relationship between two high-dimensional views [30, 25]. TB-PLS performs dimensionality reduction in both predictor and response views simultaneously by assuming that each multivariate representation of the data can be factorised into a set of mutually orthogonal latent factors that maximise the covariance between the views. This regression model overcomes problems relating to multicollinearity by estimating least squares regression coefficients using the lower dimensional projections. Among other applications, the model has been successfully used in genomics, where regions of DNA that are highly predictive of gene expression need to be detected
[16], and in computational finance, where the returns of several indices have been predicted using a large basket of assets [21]. In this article we also assume that two views are available, and set out to discover any potential partitions of the data by using them jointly. Whereas other multi-view clustering approaches rely on geometrical structures, we assume that any two points should be assigned to the same cluster if they both appear to be modelled equally well by the same regression model. In this respect, multi-view clustering is framed as a problem of learning the unknown number of multi-response regression models in high dimensions. This is accomplished by first introducing a novel criterion for quantifying the predictive influence of an observation under a TB-PLS regression model, whereby a data point is deemed unusual for the model if it has high predictive influence. The rationale is that, under a given cluster-specific regression model, any unusual observation should be removed from that cluster and allocated to a different one.
The article is organised as follows. In Section 2 we review the two-block PLS regression model, and describe the problem of modelling heterogeneous data. In Section 3 we introduce a measure of predictive influence for TB-PLS regression, and address the predictive partitioning problem; an objective function is first proposed, and an iterative multi-view predictive partitioning (MVPP) algorithm is presented. To the best of our knowledge, no other multi-view predictive clustering algorithm has been proposed in the literature. In Section 4 we describe a number of Monte Carlo simulation settings that will be used to illustrate the performance of the proposed methods under different scenarios. The results, as well as comparisons to alternative multi-view clustering algorithms, are then presented in Section 5. The applications to real web page and academic paper clustering in Section 6 demonstrate the performance of the algorithm on real data. Concluding remarks are found in Section 7.
2 High-dimensional multi-response regression
2.1 Two-block partial least squares regression
Suppose we have observed a random sample of independent and identically distributed data points, $(x_i, y_i)$, for $i = 1, \ldots, n$, where each $x_i \in \mathbb{R}^p$ is the “explanatory view” and each $y_i \in \mathbb{R}^q$ is the “response view” observed on the $i^{th}$ sample. The dimensions of both views, $p$ and $q$, are allowed to be very large. The observations can then be arranged in two paired matrices, one containing all the explanatory variables observed in the $n$ samples, $X \in \mathbb{R}^{n \times p}$, and one containing the corresponding response variables, $Y \in \mathbb{R}^{n \times q}$. The variables in both views are centred and scaled so as to have zero mean and unit variance.
The TB-PLS regression model assumes that the predictor and response views are noisy realisations of linear combinations of hidden variables, or latent factors. The specific form of the TB-PLS model is given in the following definition.
Definition 1.
The TB-PLS model assumes the existence of $R$ pairs of orthogonal latent factors, $t^{(r)}$ and $s^{(r)}$, for $r = 1, \ldots, R$, such that
(1) $X = \sum_{r=1}^{R} t^{(r)} p^{(r)\top} + E_x, \qquad Y = \sum_{r=1}^{R} s^{(r)} q^{(r)\top} + E_y,$
where $E_x \in \mathbb{R}^{n \times p}$ and $E_y \in \mathbb{R}^{n \times q}$ are matrices of residuals. For each $r$, the latent factors are $t^{(r)} = X u^{(r)}$ and $s^{(r)} = Y v^{(r)}$, where $u^{(r)} \in \mathbb{R}^{p}$ and $v^{(r)} \in \mathbb{R}^{q}$ are weight vectors of unit length. The vectors $p^{(r)} \in \mathbb{R}^{p}$ and $q^{(r)} \in \mathbb{R}^{q}$ are the factor loadings.
For any given $r$, each pair of latent factors provides a one-dimensional representation of both views and is obtained by identifying the directions on which the projected views have maximal covariance. Therefore, the paired latent factors satisfy the property that
(2) $\{u^{(r)}, v^{(r)}\} = \underset{u,\, v}{\arg\max}\ \operatorname{Cov}(X u, Y v)$
under the constraints that $\|u^{(r)}\| = \|v^{(r)}\| = 1$ and $t^{(r)\top} t^{(l)} = s^{(r)\top} s^{(l)} = 0$ for all $l < r$. For $r = 1$ this optimisation problem is equivalent to
(3) $\underset{u^{(1)},\, v^{(1)}}{\max}\ u^{(1)\top} X^{\top} Y\, v^{(1)} = \delta_1$
under the same constraints posed on the weights. Here, $\delta_1$ is the largest singular value of $X^{\top} Y$ and the weights $u^{(1)}$ and $v^{(1)}$ are the corresponding left and right singular vectors. The weight vectors that satisfy Eq. (2) can then be found by computing the singular value decomposition (SVD) of $S = X^{\top} Y$, that is $S = U \Delta V^{\top}$, where $U$ and $V$ are orthonormal matrices whose columns are the left and right singular vectors of $S$, respectively. $\Delta$ is a diagonal matrix whose entries are the ordered singular values of $S$. Therefore $u^{(r)}$ and $v^{(r)}$ are taken to be the $r^{th}$ left and right singular vectors of $S$, respectively. The predictive relationship between the two views is driven by a linear regression model involving the
$R$ pairs of latent factors. For each $r$, the response latent variable depends on the explanatory latent variable as follows:
(4) $s^{(r)} = b^{(r)} t^{(r)} + e^{(r)},$
where each $b^{(r)}$ is a scalar regression coefficient which describes the projection of the latent factor relating to the response onto the latent factor relating to the predictors, and each $e^{(r)} \in \mathbb{R}^{n}$ is a vector of residual errors. Since the latent factors are assumed to have zero mean, there is no intercept term. Using the inner regression models (4), the TB-PLS model (1) can now be rewritten in the more familiar form
(5) $Y = X \beta + E,$
where the regression coefficients have been defined as
(6) $\beta = \sum_{r=1}^{R} b^{(r)} u^{(r)} q^{(r)\top}$
and $\beta$ depends on the parameter sets $\{u^{(r)}, v^{(r)}, b^{(r)}\}$, with $r = 1, \ldots, R$. Each one of the factor loadings is obtained by performing univariate regressions,
(7) $p^{(r)} = \frac{X^{\top} t^{(r)}}{t^{(r)\top} t^{(r)}}, \qquad q^{(r)} = \frac{Y^{\top} s^{(r)}}{s^{(r)\top} s^{(r)}},$
and each of the regression coefficients $b^{(r)}$, from the inner model of Eq. (4), is estimated by least squares regression of $s^{(r)}$ on $t^{(r)}$, so that
(8) $b^{(r)} = \frac{t^{(r)\top} s^{(r)}}{t^{(r)\top} t^{(r)}}.$
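As an illustration of these estimation steps, the first latent-factor pair and the resulting rank-one prediction can be computed with a few lines of NumPy. This is a minimal sketch on toy data, not the paper's implementation; all sizes and variable names are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two views sharing one latent factor (sizes are illustrative).
n, p, q = 100, 5, 4
t = rng.normal(size=n)
X = np.outer(t, rng.normal(size=p)) + 0.1 * rng.normal(size=(n, p))
Y = np.outer(t, rng.normal(size=q)) + 0.1 * rng.normal(size=(n, q))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

# First pair of weight vectors: leading left/right singular vectors
# of the cross-covariance matrix X'Y.
U, svals, Vt = np.linalg.svd(X.T @ Y)
u, v = U[:, 0], Vt[0, :]

# Latent factors and the inner regression coefficient b = t's / t't.
t_hat = X @ u
s_hat = Y @ v
b = (t_hat @ s_hat) / (t_hat @ t_hat)

# Response loadings by univariate regression, then the rank-one fit.
q_load = (Y.T @ s_hat) / (s_hat @ s_hat)
Y_pred = np.outer(b * t_hat, q_load)
```

By construction the projected views achieve the maximal covariance over all unit directions, which is exactly the leading singular value of $X^{\top} Y$.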
In high-dimensional settings, such as the one we consider, it is generally appropriate to assume the data has spherical covariance within each view [9], so that $X^{\top} X \propto I_p$ and $Y^{\top} Y \propto I_q$. Although clearly incorrect in many real-world applications, this assumption has been shown to provide better results, particularly in classification problems, than attempting to estimate the true covariance matrices, especially when the number of variables exceeds the number of observations [2]. This can be seen as an extreme form of regularisation which introduces a large bias to reduce the variance in the estimated parameters, and has been widely used in applications involving genomic data [28, 11, 23, 31].
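The practical motivation for this assumption is easy to demonstrate: when the number of variables exceeds the number of samples, the sample covariance matrix is rank-deficient and cannot be inverted, so whitening steps such as those used by CCA break down, whereas treating the within-view covariance as a multiple of the identity sidesteps the inversion entirely. A small sketch (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# With p > n the sample covariance matrix has rank at most n - 1
# (one degree of freedom is lost to centering), hence it is singular.
n, p = 20, 50
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)

S = (X.T @ X) / (n - 1)          # p x p sample covariance
rank = np.linalg.matrix_rank(S)  # at most n - 1, strictly less than p
```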
2.2 Modelling heterogeneous data
The TB-PLS regression model rests on the assumption that the independent samples are representative of a single, homogeneous population. Under this assumption, the latent factors that determine the regression coefficients in Eq. (6) can be optimally estimated using all the available data. However, in many applications the observations may be representative of a number of different populations, each one characterised by a different between-views covariance structure. Failing to recognise this would lead to a biased estimation of the latent factors, which would in turn provide a suboptimal predictive model.
We are interested in situations in which the observations have been sampled from $K$ different subpopulations, where the exact value of $K$ may be unknown. It can be noted that in general the optimal dimension $R$ is not necessarily the same across clusters. The problem involves simultaneously recovering the cluster assignments and their parameter sets, as well as learning the optimal $K$. Learning the optimal dimensionality $R_k$ in each cluster is a much harder problem which we address later.
A simple illustrative example is given in Figure 1(a), where the explanatory view is three-dimensional and the data points belong to one of two clusters. Here, under the explanatory view, the points are uniformly distributed along either one of two lower dimensional subspaces, a line and a plane, both embedded in the three-dimensional space. To generate data points under the response view that can be linearly predicted using the explanatory view, we take a linear combination of variables in the explanatory view and add some Gaussian noise. Clearly, fitting a global TB-PLS model would be inappropriate here, as shown in Figure 1(b): the estimated subspaces differ from the true ones, so the predictive ability of the model is suboptimal. We will revisit this example in Section 5 and show that our multi-view clustering algorithm recovers the true subspaces, as in Figure 6.

3 Predictive partitioning
3.1 A PRESS-based predictive influence measure
The issue of detecting influential observations has been extensively studied in the context of OLS regression [1, 22]. A common approach is based on examining the elements of the “hat matrix”, $H = X (X^{\top} X)^{-1} X^{\top}$. The term $h_{ii}$ is known as the leverage of the $i^{th}$ point, and determines the contribution of that point in estimating its associated response. The partial leverage $h_{ij}$ is a related quantity which gives the contribution of the $j^{th}$ point for estimating the response associated to the $i^{th}$ point. These quantities are used to detect influential observations which have a larger contribution to the estimated responses relative to other points. However, $H$ does not take into account any information from the response, and so these leverage terms alone are not always sufficient to determine which observations are influential [1].
After fitting the regression model, a seemingly obvious way to identify influential observations might be to examine the residual error. However, it has been observed that points which exert a large leverage on the regression may obtain relatively smaller residual errors compared to other points as a result of overfitting [24].
A more effective approach to assessing the influence of a particular observation considers the effects of its removal from the regression model. This involves estimating the regression parameters $n$ times, leaving out each observation in turn, and then evaluating the prediction error on the unused observation. If we let $\hat{\beta}^{(-i)}$ be the OLS regression coefficient estimated by using all but the $i^{th}$ observation, the corresponding leave-one-out (LOO) error is $e^{(-i)} = y_i - x_i^{\top} \hat{\beta}^{(-i)}$. An observation can then be labelled as influential if its LOO error is particularly large. The choice of threshold for identifying an observation as influential is an open question and many strategies have been suggested in the literature [1].
The approach above is related to the leave-one-out cross-validation (LOOCV) error, which is often used to quantify the predictive performance of a regression model [26], and is defined as the mean of the individual squared prediction errors,
(9) $\text{LOOCV} = \frac{1}{n} \sum_{i=1}^{n} \left( e^{(-i)} \right)^2.$
For OLS, it is well known that each prediction error featuring in Eq. (9) can be computed without the need to remove an observation and refit the regression model. This can be accomplished through a closed-form expression known as the PRESS [1], which gives
(10) $e^{(-i)} = \frac{y_i - x_i^{\top} \hat{\beta}}{1 - h_{ii}}.$
In this form, the leave-one-out residual can be seen as the $i^{th}$ residual scaled by one minus its leverage, $h_{ii}$. Since the PRESS only depends on quantities estimated using least squares, it has a computational cost in the order of a single least squares fit and, as such, is extremely efficient to compute.
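The identity in Eq. (10) can be checked numerically: the closed-form PRESS residuals match explicit leave-one-out refits exactly. A short sketch (toy sizes and variable names of our own):

```python
import numpy as np

rng = np.random.default_rng(2)

# Verify the closed-form PRESS for OLS: the leave-one-out residual
# equals the ordinary residual divided by (1 - leverage).
n, p = 40, 3
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.5 * rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
e = y - H @ y                           # ordinary residuals
e_loo_closed = e / (1.0 - np.diag(H))   # PRESS residuals, no refitting

# Explicit leave-one-out refits, for comparison.
e_loo_explicit = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    e_loo_explicit[i] = y[i] - X[i] @ beta_i
```

The closed form replaces $n$ separate model fits with a single fit plus a per-observation rescaling, which is the source of the efficiency noted above.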
In previous studies, the PRESS has also been used for identifying influential observations in the context of PLS regression with univariate responses [32, 19]. However, in practice its computation requires the regression model to be fit $n$ times, each time using $n - 1$ data points. A similar strategy for the evaluation of the PRESS in a TB-PLS model, when the response is multivariate, would require $n$ SVD computations, each one having a computational cost that scales with the dimensions of both views [12]. This approach is particularly expensive when the dimension of the data in either view is large, as in our settings.
Recently, we proposed a closed-form expression for computing the PRESS statistic under a TB-PLS model which reduces the computational cost of explicitly evaluating the leave-one-out errors [20]. We overcome the need to recompute the SVD $n$ times by approximating the leave-one-out estimates of the singular vectors with the singular vectors estimated from the full data.
Definition 2.
The derivation of Eq. (11) is provided in Appendix A. The error introduced by approximating the leave-one-out estimates of the singular vectors is small. The denominator of Eq. (11) is a scaling term related to the contribution of each data point to the latent factors $t^{(r)}$ and $s^{(r)}$. In this form, it can be seen that the TB-PLS PRESS has similarities with the PRESS for OLS regression in Eq. (10), where these scaling terms are related to the leverage each point exerts on the regression [1].
Using Eq. (11), we now consider how to measure the influence each point exerts on the TB-PLS model. Since we are interested in the predictive performance of the TB-PLS model, we aim to identify influential points as those observations having the greatest effect on the prediction error. In order to quantify this effect, we define the predictive influence of an observation as the rate of change of the PRESS at that point whilst all other observations remain constant.
Definition 3.
The predictive influence of a data point $(x_i, y_i)$, which we denote as $\pi_i$, is the total derivative of the PRESS with respect to the variables in $x_i$ and the variables in $y_i$,
(12) $\pi_i = \left[ \frac{\partial\, \text{PRESS}}{\partial x_i},\ \frac{\partial\, \text{PRESS}}{\partial y_i} \right].$
The closed-form expression for the computation of this quantity is reported in Appendix B. The predictive influence offers a way of measuring how much the prediction error would increase in response to an incremental change in the observation or, alternatively, the sensitivity of the prediction error with respect to that observation. The rate of change of the PRESS at this point is given by the magnitude of the predictive influence vector, $\|\pi_i\|$. If this quantity is large, a small change in the observation will result in a large change in the prediction error relative to other points. In this case, removing such a point from the model would cause a large improvement in the prediction error. We can then identify influential observations as those for which the increase in the PRESS is large, relative to other observations.
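To make the idea concrete without reproducing the closed form of Appendix B, the following sketch computes an analogous quantity for ordinary least squares: the finite-difference gradient of the OLS PRESS with respect to a single observation, whose norm plays the role of the predictive influence. This is an illustrative analogue of our own construction, not the TB-PLS expression from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy OLS data (sizes and coefficients are illustrative).
n, p = 30, 2
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=n)

def press(X, y):
    """Sum of squared leave-one-out residuals, via the closed form."""
    H = X @ np.linalg.solve(X.T @ X, X.T)
    e_loo = (y - H @ y) / (1.0 - np.diag(H))
    return np.sum(e_loo ** 2)

def predictive_influence(X, y, i, h=1e-5):
    """Finite-difference gradient of the PRESS w.r.t. (x_i, y_i)."""
    grad = np.empty(p + 1)
    for j in range(p):
        Xp = X.copy(); Xp[i, j] += h
        Xm = X.copy(); Xm[i, j] -= h
        grad[j] = (press(Xp, y) - press(Xm, y)) / (2 * h)
    yp = y.copy(); yp[i] += h
    ym = y.copy(); ym[i] -= h
    grad[p] = (press(X, yp) - press(X, ym)) / (2 * h)
    return grad

influence = np.array([np.linalg.norm(predictive_influence(X, y, i))
                      for i in range(n)])
```

Perturbing one observation so that it breaks the predictive relationship makes the PRESS much more sensitive to that point, which is exactly the behaviour the predictive influence is designed to capture.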
In the remainder of this section we develop further the idea of using the predictive influence measure for multi-view clustering.
3.2 The MVPP clustering algorithm
Initially we assume that the number of clusters, $K$, is known. As mentioned in Section 2.2, we want to allocate each observation $(x_i, y_i)$ into one of $K$ non-overlapping clusters such that the $k^{th}$ cluster contains exactly $n_k$ observations, with $\sum_{k=1}^{K} n_k = n$, and these points are as similar as possible in a predictive sense. Accordingly, we first define a suitable objective function to be minimised.
Definition 4.
The within-clusters sum of predictive influences is
(13) $J = \sum_{k=1}^{K} \sum_{i \in \mathcal{C}_k} \left\| \pi_i^{(k)} \right\|^2,$
where $\pi_i^{(k)}$ is the predictive influence of a point $(x_i, y_i)$ under the $k^{th}$ TB-PLS model and $\mathcal{C}_k$ denotes the set of indices of the points assigned to cluster $k$.
Clearly, when Eq. (13) is minimised, each cluster consists of points that exert minimal predictive influence for that specific TBPLS model, and therefore the overall prediction error is minimised. We refer to these optimal clusters as predictive clusters. If the true cluster assignments were known a priori, fitting these models and thus minimising the objective function would be trivial. However, since the true partitioning is unknown, there is no analytic solution to this problem, and we resort to an iterative algorithm that alternates between finding optimal cluster assignments and optimal model parameters. Specifically, the algorithm we suggest alternates between the following two steps:

Given $K$ TB-PLS models with parameters $\{\theta^{(k)}\}_{k=1}^{K}$, and keeping these fixed, find the cluster assignments $\{\mathcal{C}_k\}_{k=1}^{K}$ which solve
(14) $\min_{\{\mathcal{C}_k\}} \sum_{k=1}^{K} \sum_{i \in \mathcal{C}_k} \left\| \pi_i^{(k)} \right\|^2.$
Given a set of cluster assignments, and keeping these fixed, estimate the parameters $\{\theta^{(k)}\}_{k=1}^{K}$ of the $K$ predictive models which solve
(15) $\min_{\{\theta^{(k)}\}} \sum_{k=1}^{K} \sum_{i \in \mathcal{C}_k} \left\| \pi_i^{(k)} \right\|^2.$
We summarise the entire algorithm below.
Initialisation (I): At iteration $\tau = 0$, an initial, random partitioning of the data, $\{\mathcal{C}_k^{(0)}\}_{k=1}^{K}$, is generated; both the TB-PLS model parameters, $\theta^{(k)}$, and the predictive influences, $\pi_i^{(k)}$, are computed for all clusters and observations.
At each subsequent iteration the following two steps are performed in sequence until convergence:
Partitioning (P): Keeping the model parameters fixed, the cluster assignments that minimise (14) are obtained by assigning each point to the cluster for which its predictive influence is smallest,
(16) $i \in \mathcal{C}_k \iff k = \underset{l = 1, \ldots, K}{\arg\min} \left\| \pi_i^{(l)} \right\|.$
Estimation (E): Keeping the cluster allocations fixed, the parameters that minimise (15) are estimated using the data points in each cluster $\mathcal{C}_k$, for all $k$, according to Eqs. (2), (7) and (8).
The computational cost of each iteration of the MVPP algorithm is of the order of computing an SVD in each of the $K$ clusters.
In order for the algorithm to converge to a local minimum we require that at each P step and each E step the objective function is decreasing. In Step P we assign observations to clusters based on the assignment rule (16), which minimises the predictive influence, and so this decreases the objective function by definition. In Step E we do not directly seek to minimise the predictive influence; instead we estimate the parameters in each cluster using TB-PLS. In order for these parameters to decrease the objective function it must be the case that they are closer to the optimal MVPP parameters than the parameters estimated using TB-PLS at the previous iteration.
To see why this will be the case, we must consider what happens when points are reassigned to clusters. A large magnitude predictive influence is assigned to points which are influential under a given TB-PLS model. Therefore in Step E, points which have been newly assigned to cluster $k$ will be influential under the $k^{th}$ TB-PLS model, relative to other points in that cluster. If we estimate a new TB-PLS model using the updated cluster, these points will be assigned a smaller magnitude predictive influence and so the sum of squared predictive influences within each cluster will be decreased. The algorithm converges to a local minimum of the objective function for any initial cluster configuration.
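The alternating scheme above can be sketched in simplified form. In this sketch, univariate least-squares models and squared residuals stand in for the TB-PLS models and predictive influences (a substitution of ours, for illustration only); the initialisation and the P/E alternation, including the monotone decrease of the objective, mirror the structure described above.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two clusterwise linear models with opposite slopes (illustrative data).
n_per, K = 60, 2
x = rng.uniform(-1, 1, size=2 * n_per)
y = np.repeat([3.0, -3.0], n_per) * x + 0.1 * rng.normal(size=2 * n_per)

z = rng.integers(0, K, size=x.size)   # I: random initial partition
objective = []
for _ in range(50):
    # E step: per-cluster least-squares slope (minimises the loss
    # for fixed assignments).
    betas = np.array([(x[z == k] @ y[z == k]) / (x[z == k] @ x[z == k])
                      if np.any(z == k) else 0.0 for k in range(K)])
    # P step: reassign each point to the model under which its squared
    # residual (our stand-in for predictive influence) is smallest.
    sq_resid = (y[:, None] - np.outer(x, betas)) ** 2
    z_new = sq_resid.argmin(axis=1)
    objective.append(sq_resid[np.arange(x.size), z_new].sum())
    if np.array_equal(z_new, z):
        break
    z = z_new
```

Because each step minimises the same loss with the other block of variables held fixed, the recorded objective is non-increasing, which is the same argument that underpins the convergence of the full algorithm.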
3.3 Model selection
Model selection in both clustering and TB-PLS are challenging problems which have previously only been considered separately. Within our framework, the PRESS statistic provides a robust method for efficiently evaluating the fit of the TB-PLS models to each cluster. A straightforward application of the PRESS allows us to identify the optimal number of clusters, $K$. We also apply a similar intuition to attempt to learn the number of latent factors, $R_k$, of each TB-PLS model.
Since our algorithm aims to recover predictive relationships on subsets of the data, the number of clusters is inherently linked to its predictive performance. If $K$ is estimated correctly, the resulting prediction error should be minimised since the correct model has been found. We therefore propose a method to select the number of clusters by minimising the out-of-sample prediction error, which overcomes the issue of overfitting as we increase $K$. The strategy consists in running the MVPP algorithm for a range of values of $K$ up to some maximum value, $K_{\max}$. We then select the value of $K$ for which the mean PRESS value is minimised. This is possible due to our computationally efficient formulation of the PRESS for TB-PLS and the fact that we aim to recover clusters which are maximally predictive. The performance of this approach using simulated data is discussed in Section 5.5.
In the case where there is little noise in the data, the number of latent factors can be learned by simply evaluating the PRESS in each cluster at each iteration. Therefore, in the $k^{th}$ cluster, the value of $R_k$ is selected such that the PRESS is minimised. Since we select the value of $R_k$ which minimises the PRESS, this is also guaranteed to decrease the objective function. However, as the amount of noise in the data increases, selecting each optimal value becomes a more difficult task due to the iterative nature of the algorithm. In this case, setting $R_k = 1$ tends to capture the important predictive relationships which define the clusters, whereas increasing each $R_k$ can actually be detrimental to clustering performance. This issue is discussed in Section 5.5.
4 Monte Carlo simulation procedures
4.1 Overview
In order to evaluate the performance of predictive partitioning and compare it to other multi-view clustering methods, we devise two different simulation settings which are designed to highlight situations both where current approaches to multi-view clustering are expected to succeed and where they are expected to fail.
Commonly, clusters are considered to be formed by geometrically distinct groups of data points. This notion of geometric distance is also encountered implicitly in mixture models. Separation conditions have been developed for the exact recovery of mixtures of Gaussian distributions, for instance, for which the minimum required separation between the means of the clusters is proportional to the cluster variances [14, 9]. In scenario A, we construct clusters according to the assumption that data points have a similar geometric structure under both views, which should be recovered by existing multi-view clustering algorithms. We assess the performance as a function of the signal-to-noise ratio. As the level of noise is increased, the between-cluster separation diminishes to the point that all clusters are undetectable using a notion of geometric distance, whereas a clustering approach based on predictive influence is expected to be more robust against noise. On the other hand, under scenario B the clustering of data points is not defined by geometric structures. We simulate data under clusterwise regression models where the geometric structure is different in each view. In this situation, clustering based on geometric separation is expected to perform poorly regardless of the signal-to-noise ratio. In all of these settings we set the number of latent factors to $R = 1$ and the number of clusters to $K = 2$. A detailed description of these two settings is given below.
4.2 Scenario A: geometric clusters
The first simulation setting involves constructing geometric clusters (up until the addition of noise). We simulate each of the $K$ pairs of latent factors $t_k$ and $s_k$, with $k = 1, \ldots, K$, from a bivariate normal distribution. Each element $(t_{k,i}, s_{k,i})$ is simulated as $\mathcal{N}(\mu_k, \Sigma)$, where the mean of the latent factors, $\mu_k$, defines the separation between clusters. The covariance matrix $\Sigma$ has unit variances and off-diagonal elements $\rho$. In order to induce a covariance structure in the loadings, we first generate a vector $z$ where each of the elements is sampled from a uniform distribution. The elements of the $X$ loadings and the elements of the $Y$ loadings are then simulated as noisy functions of $z$, so that the loadings of the two views are correlated, and the loading vectors are normalised to have unit length. Finally, for $i = 1, \ldots, n$, each pair of observations $(x_i, y_i)$ is generated from the TB-PLS model of Eq. (1), where each element of the noise matrices $E_x$ and $E_y$ is sampled i.i.d. from a normal distribution, $\mathcal{N}(0, \sigma^2)$. The signal-to-noise ratio (SNR), and thus the geometric separation between clusters, is decreased by increasing $\sigma^2$.
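A hedged sketch of a scenario-A style generator, with sizes, cluster means and noise level that are illustrative choices of our own rather than the paper's exact values:

```python
import numpy as np

rng = np.random.default_rng(5)

# K clusters sharing the TB-PLS loading structure but differing in the
# means of their latent factors (all constants are illustrative).
n_per, p, q, K, rho, sigma = 50, 30, 20, 2, 0.9, 0.1
mus = [np.array([-2.0, -2.0]), np.array([2.0, 2.0])]  # cluster separation
Sigma = np.array([[1.0, rho], [rho, 1.0]])

# Correlated loading vectors built from a shared uniform vector z.
z = rng.uniform(size=max(p, q))
p_load = z[:p] + 0.1 * rng.normal(size=p)
q_load = z[:q] + 0.1 * rng.normal(size=q)
p_load /= np.linalg.norm(p_load)
q_load /= np.linalg.norm(q_load)

X_blocks, Y_blocks = [], []
for k in range(K):
    # Paired latent factors (t, s) with a cluster-specific mean.
    ts = rng.multivariate_normal(mus[k], Sigma, size=n_per)
    X_blocks.append(np.outer(ts[:, 0], p_load)
                    + sigma * rng.normal(size=(n_per, p)))
    Y_blocks.append(np.outer(ts[:, 1], q_load)
                    + sigma * rng.normal(size=(n_per, q)))

X, Y = np.vstack(X_blocks), np.vstack(Y_blocks)
labels = np.repeat(np.arange(K), n_per)
```

Increasing `sigma` shrinks the SNR and merges the geometric clusters, reproducing the progressive loss of separation described above.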
Figure 2 shows an example of data points generated under this simulation setting; the SNR is large and the geometric clusters are well separated. As the SNR decreases, the geometric clusters become less well separated and so this setting tests the suitability of the predictive influence for clustering when the data is noisy.
4.3 Scenario B: predictive clusters
The second setting directly tests the predictive nature of the algorithm by breaking the link between geometric and predictive clusters. In this setting, the geometric position of the clusters in $X$ and the predictive relationship between $X$ and $Y$ are no longer related. We start by constructing the data as in the previous section with $K = 2$. However, we now split the first cluster in $X$ space into three equal parts and translate each of the parts by a constant. We then split the second cluster in $X$ space into two equal parts and perform a similar translation operation with a different constant. The result is that there are now four distinct geometric clusters in $X$ space but still only two clusters which are predictive of the points in $Y$ space. Parametrising the data simulation procedure to depend on the translation constants means that we can generate artificial datasets where some of the geometric clusters belonging to one predictive cluster are geometrically much closer to the other predictive cluster, while the predictive relationship remains unchanged. We call these structures “confounding clusters”.
Figure 3 shows an example of this simulation setting when the SNR is large. In this setting, noise is only added in the response, which preserves the confounding geometric clusters in $X$ but removes the separation between clusters in $Y$. Therefore we expect methods which do not take into account predictive influence to fail to recover the true clusters and instead only recover the confounding geometric clusters.
5 Examples and simulation results
5.1 Influential observations
Initially we assess the ability of our criterion to detect influential observations under a TB-PLS model, and demonstrate why using residuals only is unsuitable. For this assessment, we assume a homogeneous population consisting of bivariate points under each view, so $p = q = 2$. We also assume that only one latent factor is needed to explain a large portion of the covariance between the views.
In order to generate data under the TB-PLS model, we first create the paired latent factors by simulating their elements from a bivariate normal distribution with zero mean, unit variances and off-diagonal elements $\rho$. The corresponding factor loadings are simulated independently from a uniform distribution. We then randomly select three observations in one view and add standard Gaussian noise to each, so that the between-view predictive relationship for those observations is perturbed. Figure 4(a) shows a plot of the predictors and the responses. The three influential observations are circled in each view. Since these observations are only different in terms of their predictive relationships, they are undetectable by visually exploring this scatter plot.
Using all points, we fit a TB-PLS model with $R = 1$ and compute both the residual error and the predictive influence of each observation. In Figure 4(b), the observations are plotted against their corresponding residuals (shown in the left-hand plot) and predictive influences (shown in the right-hand plot). Since TB-PLS aims to minimise the residual error of all observations, including the influential observations results in a biased model fit; although the influential observations exhibit large residuals, this is not sufficient to distinguish them from non-influential observations. On the other hand, the predictive influence of each point is computed by implicitly performing leave-one-out cross-validation and, as a consequence of this, the predictive influence of those points is larger than that of any of the other points. This simple example provides a clear indication that the influential observations can be identified by comparing the relative predictive influence between all points.
We also perform a more systematic and realistic evaluation in higher dimensions. For this study, we simulate independent datasets, whereby each dataset has high-dimensional views and three influential observations, and we follow a similar simulation procedure as the one described before. Once the TB-PLS model has been fit, all points are ranked in decreasing order of their predictive influence and, separately, of their residual. We then select the first $m$ top-ranked observations and define a true positive as any truly influential observation that is among the selected ones; all other observations among those $m$ are considered false positives.
Figure 5 compares the receiver operating characteristic (ROC) curves obtained using the predictive influence and the residual error for this task. The figure shows that the predictive influence consistently identifies the true influential observations with fewer false positives than when the residual is used. It also suggests that using the residuals to detect influential observations in high dimensions is approximately equivalent to a random guess, and clearly demonstrates the superiority of the proposed predictive influence measure for this task.
5.2 Confounding geometric structures
Here we revisit the illustrative example described in Section 2.2. The predictors shown in Figure 1 consist of three-dimensional points sampled uniformly along a line and a plane, and these two subspaces intersect. The response consists of a noisy linear combination of the points in each cluster. Using the same simulated data, we can explore the performance of both our MVPP algorithm and a different multi-view clustering algorithm, MVCCA [9]. MVCCA fits a single, global CCA model which assumes that all points belong to the same low-dimensional subspace and that clusters are well separated geometrically in this subspace.
Figure 6 shows the clustering results on this example dataset using both MVCCA and MVPP. The MVCCA result in Figure 6(a) highlights the weakness of a global, geometric distance-based method, since the existence of clusters only becomes apparent when local latent factors are estimated using the correct subset of the data. MVCCA fits a single plane to the data, similar to the one estimated by a global TBPLS model in Figure 1(b). The points are then clustered based on their geometric separation on that plane, which results in an incorrect cluster allocation.
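For concreteness, the dimensionality-reduction step underlying MVCCA-style methods can be sketched as follows. This is a minimal CCA via the SVD of the whitened cross-covariance (our own implementation with a small ridge term for numerical stability, not the authors' code); K-means would then be run on the projected points:

```python
import numpy as np

def cca_projections(X, Y, k=1, reg=1e-6):
    """Project both views onto their top-k canonical directions.

    The directions are obtained from the SVD of the whitened
    cross-covariance; `reg` is a small ridge term that keeps the
    whitening stable when the views are high-dimensional.
    """
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):                      # inverse matrix square root
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, _, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    return X @ (Wx @ U[:, :k]), Y @ (Wy @ Vt[:k].T)
```

Clustering then amounts to grouping the projected points by geometric proximity, which is exactly where such methods fail when the clusters are not geometrically separated in the canonical subspace.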
In comparison, Figure 6(b) shows the result of clustering with MVPP, which recovers the true clusters, and therefore deals with the confounding geometric structures, by inferring the true underlying predictive models. Moreover, since the noise in the data is low, in this example we are able to let MVPP learn the true number of latent factors in each cluster using the procedure described in Section 3.3.
5.3 Clustering performance
Using data simulated under scenarios A and B, we assess the mean clustering and predictive performance of the MVPP algorithm in comparison with several multiview clustering algorithms over 200 Monte Carlo simulations. In each simulation, the latent factors, loadings and noise are randomly generated as described in Section 4. We also examine issues relating to model selection in the MVPP algorithm.
Figure 7 shows the comparison of clustering accuracy between methods in scenario A. A high SNR indicates that the signal variance is much larger than the noise variance, so the clusters in both views are well separated, whereas a low SNR indicates that the clusters overlap almost completely. It can be seen that when the noise level is low, MVPP is able to correctly recover the true clusters. As the noise increases, and the geometric separation between clusters is removed, the clustering accuracy of the competing methods decreases at a faster rate than that of MVPP.
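A simulation of this kind can be sketched as follows (our own parametrization, scaling the noise to reach a target SNR; not necessarily the exact generative model of Section 4):

```python
import numpy as np

def two_view_data(n, p, q, k, snr, rng):
    """Simulate clustered two-view data at a target signal-to-noise ratio.

    Each cluster has its own latent factor shared between the views;
    noise is scaled so that var(signal) / var(noise) is roughly `snr`.
    """
    labels = rng.integers(0, k, n)
    X, Y = np.zeros((n, p)), np.zeros((n, q))
    for c in range(k):
        idx = labels == c
        t = rng.normal(size=idx.sum())       # cluster-specific latent factor
        X[idx] = np.outer(t, rng.normal(size=p))
        Y[idx] = np.outer(t, rng.normal(size=q))
    X += rng.normal(scale=np.sqrt(X.var() / snr), size=X.shape)
    Y += rng.normal(scale=np.sqrt(Y.var() / snr), size=Y.shape)
    return X, Y, labels
```

Lowering `snr` erodes the geometric separation between clusters in both views while leaving the cluster-specific predictive structure in place.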
Since MVCCA assumes that the clusters are well separated geometrically, as the noise increases the estimated latent factor becomes biased, which decreases the separation between the clusters. Another reason for the difference in performance between MVCCA and MVPP lies in how the multiple views are used for clustering. Although MVCCA derives a low-dimensional representation of the data using both views, the actual clustering is performed using the latent factors of only one view. MVPP considers the predictive contribution of both views in constructing the predictive influences, so clustering occurs jointly between the views.
The MVkernel method [10] relies on the Euclidean distance between points in constructing the similarity matrix. This method works well only when the clusters are well separated in each view. Computing the Euclidean distance between points in high dimensions before performing dimensionality reduction means that the MVkernel method is affected by the curse of dimensionality, and so its performance degrades rapidly as the SNR decreases.
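The effect can be demonstrated directly: for i.i.d. points, the relative spread of pairwise Euclidean distances shrinks as the dimension grows, so a similarity matrix built from raw distances carries less and less information (a self-contained illustration, not tied to the MVkernel implementation):

```python
import numpy as np

def relative_spread(dim, n=100, rng=None):
    """Std/mean of pairwise Euclidean distances between n i.i.d. points."""
    rng = rng or np.random.default_rng(0)
    X = rng.normal(size=(n, dim))
    G = X @ X.T                                 # Gram matrix
    sq = np.diag(G)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * G, 0.0)
    d = np.sqrt(D2[np.triu_indices(n, k=1)])    # distinct pairwise distances
    return d.std() / d.mean()

low_dim, high_dim = relative_spread(2), relative_spread(500)
# distances concentrate: high_dim is much smaller than low_dim
```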
WCC [17] clusters each view separately using K-means and combines the partitions to obtain a consensus. Since it does not take into account the relationship between the two views, when the data are noisy this can result in two very different partitions being recovered in the two views and therefore a poor consensus clustering.
Figure 8 shows the comparison between methods in scenario B. It can be seen that MVPP consistently clusters the observations correctly in this challenging setting and is extremely robust to noise due to the implicit use of cross-validation. Since none of the other methods takes into account the predictive relationship between the clusters, and instead they only find geometric clusters, they all consistently fail to identify the true clusters. Their similar performance at low levels of noise corresponds to these methods consistently misclustering the points based on their geometric position. As the noise increases, the performance of WCC, MVCCA and MVkernel remains fairly constant. This confirms that these methods do not correctly utilise the important information in the second view of the data even when the predictive clusters in the response are well separated.
5.4 Predictive performance
Since only MVPP considers the predictive performance of the clustering, by evaluating the PRESS error in each cluster, in order to test the predictive performance of the competing multiview clustering algorithms we must perform clustering and prediction in two steps. We therefore first perform clustering with each of the methods on the full dataset and then train a TBPLS model in each of the obtained clusters. We then test predictive ability by evaluating the leave-one-out cross-validation (LOOCV) error within each cluster. For comparison, we also evaluate the LOOCV error of a global TBPLS model fitted using all of the data.
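The two-step evaluation can be sketched as follows. Here ridge regression with the exact hat-matrix LOOCV shortcut stands in for the TBPLS model (a deliberate simplification, since the TBPLS PRESS of the text involves the latent-factor structure):

```python
import numpy as np

def loocv_mse(X, Y, lam=1e-3):
    """Exact leave-one-out MSE for ridge regression via the hat-matrix
    shortcut e_i / (1 - h_ii), avoiding n separate refits."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    loo = (Y - H @ Y) / (1.0 - np.diag(H))[:, None]
    return float(np.mean(loo ** 2))

def per_cluster_loocv(X, Y, labels):
    """Fit a separate model in each recovered cluster and average the
    within-cluster LOOCV errors."""
    return float(np.mean([loocv_mse(X[labels == c], Y[labels == c])
                          for c in np.unique(labels)]))
```

A clustering that matches the true predictive structure yields a lower per-cluster score than a single global model fitted to all the data.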
Figure 9 shows the predictive performances under scenario A. MVPP achieves the lowest prediction error amongst the multiview clustering methods. This is to be expected, since its clusters are specifically obtained to be maximally predictive through implicit cross-validation. The prediction error of the competing multiview methods is larger than that of MVPP, which indicates that these methods are not recovering the truly predictive clusters. As the noise increases, the prediction performance of all methods decreases; however, since MVPP is more robust to noise than the competing methods, its relative decrease in performance is smaller. It can be noted that for low levels of noise the global predictive model performs worst of all. This further supports the notion of attempting to uncover locally predictive models within the data.
Figure 10 shows the prediction performance in scenario B. Since MVPP is able to accurately recover the predictive clusters, it displays the lowest prediction error amongst the multiview clustering methods. As noted above, the other multiview clustering methods only recover the geometric clusters and so their prediction performance is worse. The relative performance difference between the competing methods stays similar as the noise increases; however, since MVPP is also affected by the noise, its predictive performance decreases relative to the other methods.
5.5 Model selection
The ability of MVPP to learn the true number of clusters in the data is assessed using the procedure in Section 3.3. In this experiment, the data were simulated under setting A with a known true number of clusters. Figure 11 shows a comparison between the PRESS prediction error and the objective function for different numbers of clusters over 200 Monte Carlo simulations. As expected, the objective function decreases monotonically as the number of clusters is increased, whereas the PRESS exhibits a global minimum at the true number of clusters.
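The contrast between the two criteria mirrors a familiar fact about cross-validation, illustrated here with an analogous toy problem (polynomial degree in least squares, not the MVPP experiment itself): the training objective decreases monotonically with model complexity, while the exact leave-one-out error penalises overfitting.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 60)
# true model is a degree-2 polynomial plus noise
y = 1.0 + 2.0 * x - 1.5 * x ** 2 + rng.normal(scale=0.1, size=x.size)

def errors(degree):
    """Training MSE and exact LOOCV MSE of a degree-`degree` LS fit."""
    A = np.vander(x, degree + 1)
    H = A @ np.linalg.pinv(A)                 # hat matrix of the LS fit
    resid = y - H @ y
    train = np.mean(resid ** 2)
    loo = np.mean((resid / (1.0 - np.diag(H))) ** 2)
    return train, loo

train_err, loo_err = zip(*(errors(d) for d in range(8)))
```

The training error keeps falling as the degree grows, while the LOOCV error rises again once the fit starts chasing noise, in direct analogy to the objective function and PRESS curves of Figure 11.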
In the above simulation settings, the number of latent factors was held fixed. According to the TBPLS model in Section 2.1, the first latent factor is the linear combination of each of the views which explains maximal covariance between them; it is therefore the most important for prediction. Each successive latent factor explains a decreasing amount of the covariance between the views and so contributes less to the predictive relationship.
Figure 12 shows the effect of the number of latent factors on the clustering accuracy of MVPP in scenario A. It can be seen that for low levels of noise, when the clusters are well separated, increasing the number of latent factors has little effect on the clustering accuracy. As the noise increases, the first latent factor appears to capture all of the important predictive relationships in the data, whereas subsequent latent factors only fit the noise, which has a detrimental effect on the clustering accuracy as more latent factors are added.
6 An application to web data
6.1 Data description
The proposed MVPP method, as well as the alternative multiview clustering algorithms, have been tested on two real-world datasets. The first is the WebKB dataset (http://www.cs.cmu.edu/~webkb), which consists of a collection of interconnected web pages taken from the computer science departments of four universities: Cornell, Texas, Washington and Wisconsin. This dataset is commonly used to test multiview clustering algorithms, and therefore provides an ideal benchmark [3, 13, 10]. We treat each web page as an observation, where the predictor vector is the text of the page and the response vector is the text of the hyperlinks pointing to that page. The dimensions of these vectors for each university are given in Table 1. Both views of the pages consist of a bag-of-words representation of the important terms with stop words removed. Each word has been normalised according to its frequency in each page and the inverse of the frequency with which pages containing that word occur (term frequency-inverse document frequency, or TF-IDF, normalisation).
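The TF-IDF weighting used for both views can be sketched as follows (one common variant; the paper does not specify the exact formula used):

```python
import numpy as np

def tfidf(counts):
    """TF-IDF weighting of a document-term count matrix
    (rows = pages, columns = terms); one common variant."""
    counts = np.asarray(counts, dtype=float)
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    df = (counts > 0).sum(axis=0)             # pages containing each term
    idf = np.log(counts.shape[0] / np.maximum(df, 1.0))
    return tf * idf

# a term appearing in every page (last column) receives zero weight
W = tfidf([[2, 0, 1],
           [0, 3, 1]])
```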
There are two separate problems associated with the WebKB dataset. The first, which we denote WebKB2, involves clustering the pages into two groups consisting of “course” and “non-course” related pages respectively. The second, WebKB4, involves clustering the pages into four groups consisting of “course”, “student”, “staff” and “faculty” related pages. It is known that a predictive relationship exists between the views [27], and so we expect the results obtained by MVPP to reflect its ability to exploit that relationship in order to correctly identify the clusters.
We also evaluate the clustering and prediction performance of MVPP and the competing methods on a second benchmark dataset, the Citeseer dataset [4], which consists of scientific publications belonging to one of six classes of approximately equal sizes. The predictor view consists of a bag-of-words representation of the text of each publication, in the same form as for the WebKB dataset. We perform two different analyses: in the first, the response view comprises a binary vector of the incoming references between a paper and the other publications in the dataset; in the second, the response view comprises a binary vector of the outgoing references from each paper.
6.2 Experimental results
For the WebKB2 clustering problem there are two true clusters of approximately equal size. We again compare MVPP with the WCC, MVCCA and MVkernel clustering methods. For each method, we evaluate the leave-one-out prediction error for the recovered clusterings. We also evaluate the leave-one-out prediction error of a global PLS model estimated using all the data.
Table 2 shows the results of clustering and prediction on the WebKB2 dataset. In all cases, MVPP achieves almost 100% clustering accuracy, whereas the other methods achieve considerably lower accuracy; this suggests that there is a predictive relationship between the text view of the web page and the incoming links which MVPP is able to exploit to recover the true clusterings. MVPP also achieves a much lower prediction error than the other clustering methods, whose errors vary widely. This suggests that, since the dimensionality of the problem is large, a small error in cluster assignment can lead to fitting a poor predictive model.
For the WebKB4 clustering problem there are four true clusters, one of which is much larger than the others. This poses a particularly challenging scenario, since K-means based techniques favour clusters of similar size. Table 3 details the results on this dataset. Again, in all cases, MVPP achieves the highest clustering accuracy. On this dataset, the clustering accuracy of MVPP is approximately 15% lower than on WebKB2, due to the irregular cluster sizes and the poorer separation between clusters. The other methods also generally achieve poorer clustering accuracy, although their relative decrease is not as large. As with the previous dataset, better clustering performance among the multiview methods does not necessarily translate into better prediction performance. Despite the relatively poorer clustering accuracy, fitting four clusters instead of two greatly improves the prediction performance of all clustering methods.
Table 4 shows the results for clustering and prediction on the Citeseer dataset. It can be seen that in both configurations MVPP achieves the highest clustering accuracy, although the relative difference is not as large as for the WebKB dataset. Here too, MVPP achieves the lowest prediction error of all methods. The large variance in prediction error between the multiview clustering methods, despite their similar clustering accuracy, again suggests that incorrectly clustering observations can severely affect prediction performance due to the high dimensionality of the data.
7 Conclusions
In this work, we have considered the increasingly common situation in machine learning of identifying clusters in data by combining information from multiple views. We have highlighted cases where the notion of a predictive cluster can better uncover the true partitioning of the data. To exploit this, our work unifies predictive modelling and cluster analysis, which were previously mostly considered separately in the multiview learning literature.
In order to identify the true predictive models in the data, we have developed a novel method for assessing the predictive influence of observations under a TBPLS model. We then perform multiview clustering by grouping together observations which are similarly important for prediction. The resulting algorithm, MVPP, is evaluated on data simulated under the TBPLS model such that the true clusters are predictive rather than geometric. The results demonstrate that geometric distance-based multiview clustering methods are unable to uncover the true partitions even when those methods explicitly assume the data are constructed from latent factors. On the other hand, MVPP is able to uncover the true clusters with a high degree of accuracy even in the presence of noise and confounding geometric structure. Furthermore, the clusters obtained by MVPP provide the basis of a better predictive model than the clusters obtained by the competing methods. An application to real web page and academic paper data shows similar results.
The computational complexity of MVPP is considerably higher than that of the CCA-based multiview clustering algorithms. This computational cost is incurred due to the need to iteratively fit a predictive model in each cluster, which can be expensive when the dimensionality of the data is high. However, as shown by our results on simulated and real datasets, such a strategy appears necessary in order to recover an accurate partitioning of the data.
Determining an initial partitioning which performs better than a random initialisation is a difficult problem, since identifying the local predictive relationships a priori is not always possible using global methods. An obvious choice would be to initialise the algorithm using the results of the MVCCA method. However, in certain situations (such as scenario B), MVCCA results in a poor cluster assignment, which may cause the MVPP algorithm to become stuck in a poor local minimum.
We have also attempted to unify the difficult issues of model selection in clustering and in TBPLS, which have previously only been considered separately. We have shown that our prediction-based clustering criterion can be used to learn the true number of clusters. However, we have also seen that learning the number of latent factors in each of the TBPLS models remains a difficult problem due to the effects of noise and the iterative nature of the algorithm.
The idea of multiview clustering based on prediction has not been explored before in the literature, although there are examples of clustering using mixtures of linear regression models in which the response is univariate [5]. However, it is well known that the least squares solution is prone to overfitting and does not represent the true predictive relationship between the views. Furthermore, least squares regression applies only to a univariate response variable and is not suitable for situations where the response is high dimensional.
A possible extension of the MVPP method for high-dimensional and noisy data is to apply an additional constraint on the norm of the TBPLS weights estimated in Eq. (2). Such a constraint induces sparsity in the TBPLS solution, so that only a small number of variables contribute to the predictive relationship between the views. This can be achieved, for example, using the sparse PLS method of [21]. However, this requires the specification of additional tuning parameters which cannot easily be learned from the data.
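The core mechanism behind such a sparsity constraint is the soft-thresholding operator, sketched below (the sparsity device used in sparse PLS-type methods, not the full algorithm of [21]):

```python
import numpy as np

def soft_threshold(w, lam):
    """Soft-thresholding: shrink each weight towards zero by lam and
    set the small ones exactly to zero, inducing sparsity."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([0.9, -0.05, 0.4, 0.01, -0.6])
w_sparse = soft_threshold(w, 0.1)   # small weights are zeroed out
```

Applied to the weight vectors at each iteration, this zeroes out variables whose contribution falls below the threshold `lam`, which is the tuning parameter referred to above.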
Appendix A PLS PRESS
For one latent factor, the leave-one-out error of each observation is defined in terms of a model estimated using all but that observation. The difference between the singular vectors estimated using all the data and their leave-one-out estimates is small [20], so that when the number of observations is large the full-data estimates can be used in place of the leave-one-out estimates.
Using the matrix inversion lemma, we can obtain recursive update equations which do not require an explicit leave-one-out step, in the following way:
(17) 
where the relevant quantity is given by Equation (8). In the same way we derive a corresponding expression involving the quantity given by Equation (7). Equation (11) is then obtained by substituting these values and simplifying.
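For reference, the analogous closed-form recursion for ordinary least squares, obtained from the Sherman-Morrison formula, is the classical identity (stated here as background; the TBPLS-specific expressions are those derived in [20]):

```latex
e_{-i} \;=\; \frac{e_i}{1 - h_{ii}},
\qquad
h_{ii} = x_i^{\top}\left(X^{\top}X\right)^{-1} x_i,
\qquad
\mathrm{PRESS} \;=\; \sum_{i=1}^{N} \left(\frac{e_i}{1 - h_{ii}}\right)^{2},
```

where e_i is the residual of the model fitted on all N observations and h_ii is the leverage of the i-th observation.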
Appendix B Predictive influence
The predictive influence is obtained by taking the partial derivative of the PRESS function with respect to an observation. Taking derivatives of the constituent parts of the PRESS in Equation (11), we obtain
(18) 
The derivation of the predictive influence with respect to the other view follows the same argument.
References
[1] D. Belsley and E. Kuh. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley, New York, 2004.
[2] P. Bickel and E. Levina. Some theory for Fisher's linear discriminant function, "naive Bayes", and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010, 2004.
[3] S. Bickel and T. Scheffer. Multi-view clustering. In Proceedings of the IEEE International Conference on Data Mining, 2004.
[4] S. Bickel and T. Scheffer. Estimation of mixture models using Co-EM. In Machine Learning: ECML 2005, volume 3720, pages 35–46. Springer, 2005.
 [5] C. Bishop. Pattern recognition and machine learning, volume 4. Springer New York, Aug. 2006.
[6] E. Bruno and S. Marchand-Maillet. Multiview clustering: A late fusion approach using latent models. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 736–737. ACM, 2009.
 [7] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[8] N. Cesa-Bianchi, D. Hardoon, and G. Leen. Guest editorial: Learning from multiple sources. Machine Learning, 79(1):1–3, 2010.
[9] K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan. Multi-view clustering via canonical correlation analysis. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), pages 1–8, 2009.
[10] V. de Sa, P. Gallagher, J. Lewis, and V. Malave. Multi-view kernel construction. Machine Learning, 79(1):47–71, 2010.
 [11] S. Dudoit, J. Fridlyand, and T. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. Technical report, UC Berkeley, 2000.
 [12] G. Golub and C. Van Loan. Matrix computations. Johns Hopkins Univ Pr, 1996.
[13] C. Hou, C. Zhang, Y. Wu, and F. Nie. Multiple view semi-supervised dimensionality reduction. Pattern Recognition, 43(3):720–730, 2010.
 [14] R. Kannan and S. Vempala. Spectral Algorithms. Foundations and Trends in Theoretical Computer Science, 4(3–4):157–288, 2008.
 [15] T. Lange and J. Buhmann. Fusion of similarity data in clustering. In Advances in Neural Information Processing Systems, volume 18, page 723. Citeseer, 2006.
[16] K.-A. Lê Cao, P. G. P. Martin, C. Robert-Granié, and P. Besse. Sparse canonical methods for biological data integration: application to a cross-platform study. BMC Bioinformatics, 10:34, 2009.
 [17] T. Li and C. Ding. Weighted consensus clustering. In Proceedings of the 2008 SIAM International Conference on Data Mining, volume 1, pages 798–809. Citeseer, 2008.

[18] B. Long, P. Yu, and Z. Zhang. A general model for multiple view unsupervised learning. In Proceedings of the 2008 SIAM International Conference on Data Mining, 2008.
[19] H. Martens. Modified Jackknife estimation of parameter uncertainty in bilinear modelling by partial least squares regression (PLSR). Food Quality and Preference, 11(1-2):5–16, 2000.
[20] B. McWilliams and G. Montana. A PRESS statistic for two-block partial least squares regression. In Computational Intelligence (UKCI), 2010 UK Workshop on, pages 1–6. IEEE, 2010.
 [21] B. McWilliams and G. Montana. Sparse partial least squares regression for online variable selection with multivariate data streams. Statistical Analysis and Data Mining, 3(3):170–193, 2010.
 [22] M. Meloun. Detection of single influential points in OLS regression model building. Analytica Chimica Acta, 439(2):169–191, July 2001.
 [23] E. Parkhomenko, D. Tritchler, and J. Beyene. Sparse Canonical Correlation Analysis with Application to Genomic Data Integration. Statistical Applications in Genetics and Molecular Biology, 8(1), 2009.
 [24] D. Pregibon. Logistic regression diagnostics. The Annals of Statistics, 9(4):705–724, 1981.

[25] R. Rosipal and N. Krämer. Overview and recent advances in partial least squares. In Subspace, Latent Structure and Feature Selection, pages 34–51. Springer, 2006.
[26] M. Stone. Cross-validation and multinomial prediction. Biometrika, 61(3):509–515, 1974.
 [27] Y. Tian, T. Huang, and W. Gao. Robust Collective Classification with Contextual Dependency Network Models. Advanced Data Mining and Applications, pages 173–180, 2006.
[28] R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Statistical Science, 18(1):104–117, 2003.

[29] G. Tzortzis and A. Likas. Convex mixture models for multi-view clustering. In Artificial Neural Networks - ICANN 2009, pages 205–214, 2009.
[30] J. Wegelin. A survey of partial least squares (PLS) methods, with emphasis on the two-block case. Technical report, University of Washington, 2000.
[31] D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.
 [32] S. Wold, A. Ruhe, H. Wold, and W. J. Dunn. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM journal on Scientific Computing, 5(3):735–743, 1984.