1 Introduction
Given a set of observable data, Latent Variable Models (LVMs) aim to extract a reduced set of hidden variables that summarise the information in a low-dimensional space. These models have become crucial in multi-view problems (Atrey et al., 2010; Sharma et al., 2012; Li et al., 2019), where data are represented by different modalities or views, since LVMs are able to explain the information shared among all the modalities.
Classical Multivariate Analysis (MVA) methods, such as Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA)
(Pearson, 1901; Hotelling, 1936), aim to exploit the data correlations to obtain a low-dimensional latent representation of the data. Their usage has been generalised thanks to their easy non-linear extension by means of kernel methods (Schölkopf et al., 1998; Zhang et al., 2016). Supporting a kernel formulation allows these methods to learn arbitrarily complex non-linear models with a complexity determined by the number of training points (Yu et al., 2013), which makes them highly convenient in scenarios with high-dimensional data.
Factor Analysis (FA) (Harman, 1976) emerges as a linear Bayesian framework in which one can obtain the desired latent representation together with a measure of its uncertainty. Among its many variants, such as Probabilistic PCA (Tipping and Bishop, 1999), Supervised PCA (Yu et al., 2006), Bayesian Factor Regression (Bernardo et al., 2003) or Bayesian CCA (Klami et al., 2013), Inter-Battery FA models (Klami et al., 2013) stand out for their capability of handling not only latent variables associated with the common information among all the views, but also of modelling the intra-view information. This model has recently been extended in (Anonymous, 2020) as Sparse Semi-supervised Heterogeneous Inter-Battery Bayesian Analysis (SSHIBA), incorporating missing attributes, feature sparsity/selection, and the ability to handle heterogeneous data such as categorical or multi-dimensional binary data.
The use of kernel methods in Bayesian approaches has mostly been developed with Gaussian Processes (Williams and Rasmussen, 2006) and their unsupervised version for dimensionality reduction, GP latent variable models (GPLVMs) (Lawrence, 2005). These approaches combine the advantages of kernel methods, exploiting the non-linear relationships among the data, with those of a probabilistic framework. In (Damianou et al., 2012), the authors propose a shared GPLVM approach, called Manifold Relevance Determination (MRD), to provide a non-linear latent representation for multi-view learning problems. This model is extended in (Damianou et al., 2016) with an Automatic Relevance Determination (ARD) prior (Neal, 2012) over the kernel formulation, endowing it with feature selection capabilities.
GPLVMs come with practical scalability drawbacks that need to be addressed. Their cubic complexity with the number of training points requires the use of inducing points and variational approaches (Titsias, 2009). Selecting the number of inducing points, and where to place them in the latent space, is still a challenging problem; a common solution is to place them on a regular grid along the latent space and only optimise the pseudo-observations at those points (Wilson and Nickisch, 2015). Furthermore, to the best of our knowledge, there is no versatile state-of-the-art implementation of a multi-view GPLVM able to handle heterogeneous observations (integer, categorical, real and positive observations) and missing values.
In this paper we propose a novel method to implement non-linear probabilistic LVMs that still builds upon a linear generative model, hence inheriting its computational and scalability properties. Instead of using a kernel method, i.e. a GP, to move from the latent representation to the observed data, we propose to reformulate probabilistic FA so that it generates kernel relationships instead of data observations. In the same way that Kernelized PCA (KPCA) or Kernelized CCA (KCCA) are able to generate non-linear latent variables by linearly combining the elements of a kernel vector, here, from a Bayesian generative point of view, we first sample i.i.d. latent representations and project them onto an N-dimensional space (N being the number of training points) using a weight matrix representing the dual parameters. We apply this trick over the SSHIBA formulation (Anonymous, 2020) to exploit its functionalities within this kernelized formulation. Thanks to that, we can efficiently face semi-supervised heterogeneous multi-view problems combining linear and non-linear data representations; in this way, one can combine kernelized views, to deal with non-linear relationships, with linearly kernelized views to work with high-dimensional problems. Besides, we can force the automatic selection of Support Vectors (SVs) to obtain a scalable solution, as well as include an ARD prior over the kernel to obtain feature selection capabilities.
2 Bayesian sparse factor analysis with kernelized observations
Let us consider a multi-view problem where we have $N$ data samples represented in $M$ different modalities, $\{\mathbf{X}^{(m)}\}_{m=1}^{M}$, and our goal is to find an inter- and intra-view non-linear latent representation, $\mathbf{Z}$. That is, given that $\mathbf{x}_{n,:}^{(m)}$ is the $n$-th data point of the $m$-th view, $\mathbf{z}_{n,:}$ has to compress, in a low-dimensional space of size $K_c$, both the common and particular information of $\mathbf{x}_{n,:}^{(m)}$ over all the views, exploiting the correlations among the data.¹

¹Given a matrix $\mathbf{M}$, we denote its $i$-th row by $\mathbf{m}_{i,:}$ and its $j$-th column by $\mathbf{m}_{:,j}$.
Whereas kernel LVMs obtain this latent representation as a linear combination, through some dual variables, of the kernel representation of the $n$-th data point, here we propose to reformulate this idea from a generative point of view. In particular, we start from the SSHIBA formulation (Anonymous, 2020) and consider that there exist some latent variables $\mathbf{z}_{n,:}$ which are linearly combined with a set of dual variables $\mathbf{A}^{(m)}$ to generate a kernel vector, $\mathbf{k}_{n,:}^{(m)}$, as:
$$\mathbf{k}_{n,:}^{(m)} = \mathbf{z}_{n,:}\,\mathbf{A}^{(m)\top} + \boldsymbol{\epsilon}_{n,:}^{(m)} \qquad (1)$$
where $\boldsymbol{\epsilon}_{n,:}^{(m)}$ is zero-mean Gaussian noise whose noise power $\tau^{(m)}$ follows a Gamma distribution of parameters $a^{\tau^{(m)}}$ and $b^{\tau^{(m)}}$, and $\mathbf{k}_{n,:}^{(m)}$ is the kernel representation of the $n$-th data point; that is, given a mapping function $\phi(\cdot)$ and its associated kernel function $k(\cdot,\cdot)$, $\mathbf{k}_{n,:}^{(m)}$ is the vector of kernel evaluations between $\mathbf{x}_{n,:}^{(m)}$ and all the training data. The dual variable matrix $\mathbf{A}^{(m)}$ plays the role of the linear projection matrix and is defined using the same structured ARD prior considered in both (Klami et al., 2013) and (Anonymous, 2020): an ARD prior that promotes the cancellation of full rows of this matrix, so that the product in (1) selects the appropriate set of latent factors.

Figure 1 shows the graphical model of KSSHIBA. Following (Klami et al., 2013), for the data views that are directly explained given the latent projection we have $\mathbf{x}_{n,:}^{(m)} = \mathbf{z}_{n,:}\,\mathbf{W}^{(m)\top} + \boldsymbol{\epsilon}_{n,:}^{(m)}$, where the weight matrix $\mathbf{W}^{(m)}$ follows the same structured ARD prior mentioned above. We refer to these as primal observations. For other views, we might be interested in explaining them indirectly through a kernelized observation following (1). This conversion can be of interest when the view's dimensionality is much larger than the number of data points $N$. When both primal and kernelized observations are used, the learned latent projection attempts to faithfully reconstruct each of the primal views, as well as the joint relation between each pair of data points through the reconstruction of the kernel matrix. The posterior distribution of all model parameters and latent projections is approximated using variational inference with a fully factorized posterior, as detailed in the Supplementary Material, where the computational cost of each update is also analysed.
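As a minimal illustration, the generative step in (1) can be sketched in NumPy as follows; the dimensions and noise precision are illustrative assumptions, and the priors over the latent variables, dual variables and noise are replaced by fixed samples:

```python
import numpy as np

rng = np.random.default_rng(0)
N, Kc = 50, 3          # number of training points and latent factors (illustrative)
tau = 100.0            # noise precision; in the model it follows a Gamma prior

Z = rng.standard_normal((N, Kc))   # i.i.d. latent representations z_n
A = rng.standard_normal((N, Kc))   # dual variables A

# Each kernelized-observation row is generated as in (1):
#   k_n = z_n A^T + eps_n,  with eps_n ~ N(0, tau^{-1} I)
K_obs = Z @ A.T + rng.normal(scale=tau ** -0.5, size=(N, N))

print(K_obs.shape)
```

Note that a matrix sampled this way is, in general, neither symmetric nor a valid kernel matrix, which is precisely the point discussed next.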
Note that sampling from the model in (1) does not ensure a valid positive semi-definite kernel matrix. The kernel matrix is simply treated as an observation (a kernelized observation) and, as such, the model parameters are chosen to minimize the reconstruction error. The experimental results were surprising even to the authors, as fairly good kernel matrices are typically reconstructed after model training. In Figure 2 we include a graphical representation of both a kernelized observation and its reconstruction through (1) using the mean of the posterior distribution. Certainly, more appropriate models could be used to adapt the observation model to the properties of a kernelized observation. To address this issue, we have explored alternative formulations based on non-independent noise; for example, defining the noise distribution as an inverse-Wishart to have a full-rank noise covariance, or modelling its covariance as the product of two low-rank matrices. However, these schemes led to considerably more complicated (and less flexible) formulations which limited the rest of the properties of this proposal (such as the ones presented in the following sections). Henceforth, we restrict ourselves to the model in (1) and leave this line of work open for future research.
2.1 Automatic Bayesian support vector selection
On the basis of the full kernel, a more structured ARD prior allows us to achieve not only the shrinkage of the number of effective latent factors, but also a more compact representation of the data by means of a reduced kernel matrix in which only a reduced set of support vectors (SVs) is kept.
For this purpose, the proposed formulation can introduce a double ARD prior over the dual variables $\mathbf{A}^{(m)}$. This way, the first ARD variable continues forcing row-wise sparsity to automatically select the number of latent factors while, additionally, the second one induces column-wise sparsity in the dual variable matrix to learn the set of Bayesian SVs. This selection can be carried out during the inference process, removing the least relevant SVs (and their corresponding columns in $\mathbf{A}^{(m)}$) by setting a threshold, which provides an additional computational improvement.
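A toy sketch of this pruning step, assuming the dual-variable rows index training points (so pruning one of them removes the matching kernel column); the relevance proxy and threshold below are illustrative assumptions, not the model's actual ARD update:

```python
import numpy as np

rng = np.random.default_rng(1)
N, Kc = 50, 3
A = rng.standard_normal((N, Kc))        # dual variables: one row per training point
A[10:, :] *= 1e-4                       # pretend the ARD prior drove most of them to ~0

# Relevance proxy: norm of each row of A; in the model this role is
# played by the learned ARD precisions.
relevance = np.linalg.norm(A, axis=1)
keep = relevance > 1e-2 * relevance.max()   # hypothetical pruning threshold

A_pruned = A[keep]                      # drop the irrelevant dual variables ...
K = rng.standard_normal((N, N))         # stand-in for a kernelized observation
K_pruned = K[:, keep]                   # ... and the matching kernel columns

print(A_pruned.shape, K_pruned.shape)
```

Pruning both structures jointly is what reduces the effective kernel size, and hence the cost of the remaining inference updates.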
2.2 Automatic feature selection
Furthermore, we can additionally endow the proposed kernelized data representation with feature selection capabilities. Just as the double ARD structure allows us to cancel full rows or columns, an ARD kernel allows us to perform feature selection. In the ARD kernel, each feature of the original observations is multiplied by a variable $b_d$ in the kernel definition; for example, for an RBF kernel, $k(\mathbf{x}_{i,:},\mathbf{x}_{j,:}) = \exp\!\big(-\sum_{d} b_d\,(x_{i,d}-x_{j,d})^2\big)$. We can optimise these variables by direct maximisation of the variational lower bound of our mean-field approach. In our model, if the $m$-th view is kernelized, the only terms of the lower bound in which the ARD kernel is involved are (see Supplementary Material for details):
(2) 
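The optimisation loop can be illustrated with a toy surrogate objective standing in for the bound terms in (2). The sketch below uses plain NumPy with numerical gradients and hypothetical dimensions, learning rate and target; the actual implementation maximises the variational bound with PyTorch autograd and Adam:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 30, 5
X = rng.standard_normal((N, D))

def ard_rbf(X, b):
    # ARD-RBF kernel: every feature d is scaled by b[d] before the usual RBF
    Xs = X * b
    sq = ((Xs[:, None, :] - Xs[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq)

# Surrogate objective: fit the ARD kernel to a target kernel that only
# depends on the first two features (so features 3-5 are irrelevant).
K_target = ard_rbf(X, np.array([1.0, 1.0, 0.0, 0.0, 0.0]))
def objective(b):
    return -np.mean((ard_rbf(X, b) - K_target) ** 2)

b = np.ones(D)
lr, eps = 0.1, 1e-5
for _ in range(300):
    # central-difference gradient; autograd replaces this in practice
    grad = np.array([(objective(b + eps * np.eye(D)[d])
                      - objective(b - eps * np.eye(D)[d])) / (2 * eps)
                     for d in range(D)])
    b += lr * grad          # plain gradient ascent (Adam in the real code)

# |b| for irrelevant features shrinks, so thresholding |b| selects features
print(np.round(np.abs(b), 2))
```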
We alternate mean-field updates of the variational bound with direct maximisation of (2) with respect to the kernel variables using a gradient ascent method (we use PyTorch and Adam for these updates). Finally, by setting a threshold on the learnt variables, the feature selection can be done while training.

3 Results
Throughout this section, the presented model is analysed in terms of performance and interpretability of the inferred model parameters and latent projections. Results on other databases, as well as a more extensive description of the experimental setup, are available in the Supplementary Material. Furthermore, an exemplary notebook with the library will be uploaded to an open GitHub repository.²

²This notebook has been uploaded with the rest of the files to the review system.
3.1 Performance evaluation of KSSHIBA for multi-dimensional regression
KSSHIBA can be trained in a semi-supervised way, being capable of predicting by either sampling from the posterior or simply using its mean, as we do here. This section aims to analyse the performance of KSSHIBA for semi-supervised multi-dimensional regression in comparison with some state-of-the-art baselines. To do so, we used several multi-task datasets from the Mulan repository (Spyromitros-Xioufis et al., 2016; Karalič and Bratko, 1997; Džeroski et al., 2000). Table 1
shows the results obtained on these databases, comparing the proposed model with: (1) reference regression methods, such as a Support Vector Regression machine with Gaussian RBF kernel (SVR-RBF) and a Multi-Layer Perceptron (MLP); (2) KCCA+LR and KPCA+LR approaches, where KCCA/KPCA is used for feature extraction and a Linear Regressor (LR) for prediction. In these cases, the number of latent factors has been fixed to the maximum possible, $C$ (where $C$ is the number of classes), in KCCA and to those which explain a given fraction of the variance in KPCA; (3) Multi-view GPLVM (MRD), where the number of latent factors is set to twice $C$. Two versions of KSSHIBA are included: one in which the number of latent factors is automatically learnt, and one in which it is fixed.

We calculated the reported results with a nested 10-fold cross-validation (CV). The outer CV divides the dataset into training and test partitions, while the inner CV is in charge of validation and, therefore, divides the training partition into a second training set and a validation set. This way we were able to estimate the performance of the whole framework and, additionally, validate the model parameters. We used the R2 score to measure the performance of the methods. Further information is detailed in the Supplementary Material.
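The nested cross-validation protocol can be sketched as follows; this is a minimal NumPy illustration in which a ridge regressor and its regularisation grid stand in, as hypothetical placeholders, for the compared models and their hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, T = 100, 8, 2                       # samples, features, output tasks (toy data)
X = rng.standard_normal((N, D))
W = rng.standard_normal((D, T))
Y = X @ W + 0.1 * rng.standard_normal((N, T))

def ridge_fit(X, Y, alpha):
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ Y)

def r2(Y, Y_hat):
    return 1 - ((Y - Y_hat) ** 2).sum() / ((Y - Y.mean(0)) ** 2).sum()

def folds(n, k, rng):
    return np.array_split(rng.permutation(n), k)

alphas, outer_scores = [0.01, 0.1, 1.0, 10.0], []
for test_idx in folds(N, 10, rng):                       # outer CV: train/test split
    train_idx = np.setdiff1d(np.arange(N), test_idx)
    Xtr, Ytr = X[train_idx], Y[train_idx]

    def inner_score(a):                                  # inner CV: validation
        scores = []
        for val_idx in folds(len(train_idx), 5, rng):
            tr = np.setdiff1d(np.arange(len(train_idx)), val_idx)
            B = ridge_fit(Xtr[tr], Ytr[tr], a)
            scores.append(r2(Ytr[val_idx], Xtr[val_idx] @ B))
        return np.mean(scores)

    best = max(alphas, key=inner_score)                  # validated hyperparameter
    B = ridge_fit(Xtr, Ytr, best)
    outer_scores.append(r2(Y[test_idx], X[test_idx] @ B))

print(len(outer_scores))
```

The mean of `outer_scores` is the kind of R2 figure reported in Table 1: the outer loop estimates performance, the inner loop validates parameters.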
In particular, we can see that KSSHIBA outperforms most methods in terms of R2 score while providing dimensionality reduction. At the same time, the results obtained by KSSHIBA with a fixed number of latent factors imply that a less restrictive pruning would not deteriorate the results (except for edm, jura and enb). Besides providing dimensionality reduction, KSSHIBA proves able to perform as well as the MLP, or even outperform it, in terms of R2.
3.2 Evaluation of the solution in terms of SVs
Now we want to test the capability of KSSHIBA to automatically select a subset of the training points. For this purpose, we use the same databases and setup as in the previous evaluation, comparing to KPCA+LR and KCCA+LR. For these last two models, the number of SVs used to build the kernel matrix has been cross-validated following a Nyström (Williams and Seeger, 2001) subsampling technique.
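A minimal sketch of the Nyström subsampling used for these baselines, assuming an RBF kernel and randomly chosen landmark points (the dimensions and kernel width are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
N, D, m = 200, 5, 40                 # training points, features, landmark points (SVs)
X = rng.standard_normal((N, D))

def rbf(A, B, gamma=0.05):
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

sel = rng.choice(N, size=m, replace=False)   # random subsampling of the SVs
K_nm = rbf(X, X[sel])                        # kernel between all points and the SVs
K_mm = rbf(X[sel], X[sel])

# Nystrom reconstruction of the full N x N kernel from the m selected SVs
K_approx = K_nm @ np.linalg.pinv(K_mm) @ K_nm.T

K_full = rbf(X, X)
err = np.linalg.norm(K_full - K_approx) / np.linalg.norm(K_full)
print(round(float(err), 3))
```

The contrast with KSSHIBA is that here the landmarks are chosen at random (and cross-validated), whereas KSSHIBA learns the relevance of each SV during inference.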
Table 2 shows that the inclusion of the automatic SV selection in KSSHIBA keeps the original model performance for most databases, even improving it for oes97 and edm, while drastically reducing the model complexity; in fact, a detailed analysis shows that reducing the number of SVs favours an additional reduction in the final number of latent factors. When compared to KPCA+LR and KCCA+LR, KSSHIBA mostly needs a lower percentage of SVs to describe the kernel. This is due to the fact that KSSHIBA learns the relevance of each element and eliminates them accordingly, whereas KPCA and KCCA obtain these compact solutions with a random selection of SVs.
To complete this analysis, Figure 3 depicts the mean R2 over 10 folds of the analysed algorithms for the databases where KSSHIBA is outperformed in Table 2; the rest are available in the Supplementary Material. For the sake of comparison, we also include the MRD results as its percentage of inducing points is varied. Whereas MRD, KPCA+LR and KCCA+LR present fluctuations in their performance, requiring the number of SVs to be adjusted to obtain accurate results, KSSHIBA maintains a relatively constant R2 value. This phenomenon occurs because KSSHIBA learns the relevance of each SV and weights its influence on the parameter updates throughout the model inference.
3.3 Analysis of the feature selection
To test the feature selection extension (see Section 2.2), we now study KSSHIBA on different classification databases where the input view is an image and the output view is the category label. We used the Labeled Faces in the Wild (LFW) faces dataset (Huang et al., 2007) as well as warpAR10P, Yale and Olivetti, which can be found in the Feature Selection Repository³. We applied the feature selection extension over the input view, obtaining the masks shown in Figure 4. Despite the different image resolution of each database, we can see how the proposed extension is capable of focusing on the most relevant features (white). For instance, Figures 3(b) and 3(d) learn to focus on the area related to glasses, while Figures 3(a) and 3(c) learn the general face features of the images.

³http://featureselection.asu.edu/datasets.php
Feature masks learnt by the feature selection extension of KSSHIBA for different face recognition problems. The masks represent the importance of each pixel: lighter colours imply the pixel is more relevant, while darker ones mean it is less relevant.
3.4 Analysis of the extracted latent factors
In this section we evaluate the interpretability of the latent factors extracted by the proposed model in comparison to the MRD approach based on shared GPLVMs. We used the available MRD library (Damianou et al., 2012) to compare it with KSSHIBA on the oil classification database (Bishop and James, 1993). For this purpose, we trained both models with a fixed number of latent factors combined with ARD latent factor selection. KSSHIBA uses an RBF kernel for the input view, and MRD uses it for both its input and output views. Under these conditions, we measured the label prediction accuracy of both models. With the available MRD implementation (in Matlab), the computational time does not scale well with the number of data points; as seen in the Supplementary Material, there is a difference of two orders of magnitude in computational time.
Figure 5 shows the relevance parameter of each learnt latent factor for both models. MRD does not find any view-dependent latent factor, and all latents are shared by both views (Figure 4(a) shows the relevance of these common factors); besides, it mainly focuses on latents 12, 13 and 14. On the other hand, KSSHIBA presents independent weights for each view (see Figures 4(b) and 4(c)); these results indicate that certain latent factors are not relevant and could be pruned (latent 7), some are only relevant for the input view (latents 5 and 14), and the remaining ones are common (highlighting latents 0, 2 and 8).
3.5 Multi-view KSSHIBA
One of the main functionalities of KSSHIBA is its capacity to combine multiple views in a single model. We can take advantage of this property when reconstructing kernel representations to combine different types of kernels (one per view). To explore the possibilities of this formulation, we used a subset of 1,000 samples from the MNIST database (LeCun et al., 2010) and trained the model in three two-view scenarios, each with a different kernel type in the input kernelized view (a linear kernel, a Gaussian one and a second-degree polynomial kernel) and the labels as output view; additionally, we included a fourth scenario with four views, where each kernel forms an input view and the categories are the output view. The obtained results show that combining all the kernels increases the performance with respect to each kernel used individually.
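A sketch of how the three kernelized input views could be built; the data dimensions stand in for the flattened MNIST subset, the kernel hyperparameters are illustrative, and the label view is omitted:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 10))   # hypothetical stand-in for the MNIST subset

# one kernelized observation per input view, each with a different kernel type
linear = X @ X.T
sqdist = (X ** 2).sum(1)[:, None] + (X ** 2).sum(1)[None, :] - 2 * linear
gaussian = np.exp(-0.1 * sqdist)     # Gaussian (RBF) kernel
poly2 = (1.0 + linear) ** 2          # second-degree polynomial kernel

views = {"linear": linear, "rbf": gaussian, "poly2": poly2}
for name, K in views.items():
    print(name, K.shape)             # each view is an N x N kernelized observation
```

Each matrix is then treated as one input view of the model, so the shared latent space has to explain all three pairwise-similarity structures at once.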
Figure 6 shows the relevance of each latent factor in the joint scenario (all kernels used). Of the learnt latent factors, the output view only uses a subset, most of them private; in fact, we can observe that the improved performance of the model is obtained using only three factors common to all the views, two additional latents shared with the RBF kernel and another two shared by the linear and polynomial kernels. Besides, the polynomial and linear kernels share all their latent factors; although this may imply a possible redundancy in their information, we have checked that it actually reinforces the latent learning, since removing any of these views degrades the model performance.
4 Conclusions
We propose a novel probabilistic latent variable model that generates kernel relationships, instead of data observations, from a linear generative model. We introduce this model over the Bayesian inter-battery factor analysis approach proposed in (Anonymous, 2020) to show its capability to efficiently face semi-supervised heterogeneous multi-view problems combining linear and non-linear data representations. Besides, we extend the model formulation to provide the automatic selection of SVs, obtaining scalable solutions, and include an ARD prior over the kernel to obtain feature selection capabilities. The model performance is evaluated on multi-dimensional regression, feature selection over images and multiple-kernel learning problems, demonstrating that the inclusion of kernelized observations provides fruitful results.
Broader Impact
This article proposes a multi-view semi-supervised sparse model with kernelized observations that combines dimensionality reduction with estimation and classification functionalities. As such, it may impact potential solutions for problems characterised by multi-view data. Open source code with exemplary Python notebooks will be released to ensure maximal dissemination.
Acknowledgments and Disclosure of Funding
The work of Pablo M. Olmos is supported by the Spanish government MCI under grants PID2019-108539RB-C22 and RTI2018-099655-B-100, by Comunidad de Madrid under grants IND2017/TIC-7618, IND2018/TIC-9649 and Y2018/TCS-4705, by the BBVA Foundation under the Deep-DARWiN project, and by the European Union (FEDER) and the European Research Council (ERC) through the European Union's Horizon 2020 research and innovation program under Grant 714161. The work of C. Sevilla-Salcedo and V. Gómez-Verdejo has been partly funded by the Spanish MINECO grant TEC2017-83838-R.
References

Anonymous (2020). Sparse semi-supervised heterogeneous inter-battery Bayesian analysis. arXiv preprint arXiv:2001.08975.

Atrey et al. (2010). Multimodal fusion for multimedia analysis: a survey. Multimedia Systems, 16(6), 345–379.

Bernardo et al. (2003). Bayesian factor regression models in the "large p, small n" paradigm. Bayesian Statistics, 7, 733–742.

Bishop and James (1993). Analysis of multiphase flows using dual-energy gamma densitometry and neural networks. Nuclear Instruments and Methods in Physics Research Section A, 327(2-3), 580–593.

Damianou et al. (2012). Manifold relevance determination. arXiv preprint arXiv:1206.4610.

Damianou et al. (2016). Multi-view learning as a nonparametric nonlinear inter-battery factor analysis. arXiv preprint arXiv:1604.04939.

Džeroski et al. (2000). Predicting chemical parameters of river water quality from bioindicator data. Applied Intelligence, 13(1), 7–17.

Harman (1976). Modern Factor Analysis. University of Chicago Press.

Hotelling (1936). Relations between two sets of variates. Biometrika, 28(3-4), 321–377.

Huang et al. (2007). Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst.

Karalič and Bratko (1997). First order regression. Machine Learning, 26(2-3), 147–176.

Klami et al. (2013). Bayesian canonical correlation analysis. Journal of Machine Learning Research, 14(Apr), 965–1003.

Lawrence (2005). Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6(Nov), 1783–1816.

LeCun et al. (2010). MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist.

Li et al. (2019). Generative multi-view and multi-feature learning for classification. Information Fusion, 45, 215–226.

Neal (2012). Bayesian Learning for Neural Networks. Vol. 118, Springer Science & Business Media.

Pearson (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559–572.

Schölkopf et al. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299–1319.

Sharma et al. (2012). Generalized multiview analysis: a discriminative latent space. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2160–2167.

Spyromitros-Xioufis et al. (2016). Multi-target regression via input space expansion: treating targets as inputs. Machine Learning, 104(1), 55–98.

Tipping and Bishop (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611–622.

Titsias (2009). Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, PMLR 5, pp. 567–574.

Williams and Rasmussen (2006). Gaussian Processes for Machine Learning. MIT Press.

Williams and Seeger (2001). Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pp. 682–688.

Wilson and Nickisch (2015). Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In Proceedings of the 32nd International Conference on Machine Learning, PMLR 37, pp. 1775–1784.

Yu et al. (2006). Supervised probabilistic principal component analysis. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 464–473.

Yu et al. (2013). Kernel-based Data Fusion for Machine Learning. Springer.

Zhang et al. (2016). Multi-view dimensionality reduction via canonical random correlation analysis. Frontiers of Computer Science, 10(5), 856–869.