Bayesian Sparse Factor Analysis with Kernelized Observations

Latent variable models for multi-view learning attempt to find low-dimensional projections that fairly capture the correlations among the multiple views that characterise each datum. High-dimensional views in medium-sized datasets and non-linear problems are traditionally handled by kernel methods, which induce a (non-)linear function between the latent projection and the data itself. However, they usually come with scalability issues and exposure to overfitting. To overcome these limitations, here we propose an alternative method: instead of imposing a kernel function, we combine probabilistic factor analysis with what we refer to as kernelized observations, in which the model focuses on reconstructing not the data itself, but its correlation with other data points as measured by a kernel function. This model can combine several types of views (kernelized or not), can handle heterogeneous data, and can work in semi-supervised settings. Additionally, by including adequate priors, it can provide compact solutions for the kernelized observations (based on an automatic selection of Bayesian support vectors) and can include feature selection capabilities. Using several public databases, we demonstrate the potential of our approach (and its extensions) with respect to common multi-view learning models such as kernel canonical correlation analysis or manifold relevance determination Gaussian process latent variable models.

1 Introduction

Given a set of observable data, Latent Variable Models (LVMs) aim to extract a reduced set of hidden variables able to summarise the information into a low dimensional space. These models have become crucial in multi-view problems (Atrey et al., 2010; Sharma et al., 2012; Li et al., 2019), where data are represented by different modalities or views, since LVMs are able to explain the common information among all the modalities.

Classical MultiVariate Analysis (MVA) methods, such as Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA) (Pearson, 1901; Hotelling, 1936), aim to exploit the data correlation to obtain a low dimensional latent representation of the data. Their usage has become widespread due to their easy non-linear extension by means of kernel methods (Schölkopf et al., 1998; Zhang et al., 2016). Supporting a kernel formulation allows these methods to learn arbitrarily complex non-linear models with a complexity determined by the number of training points (Yu et al., 2013), which makes them highly convenient in scenarios with high dimensional data.

Factor Analysis (FA) (Harman, 1976) emerges as a linear Bayesian framework in which one can obtain the desired latent representation together with a measure of its uncertainty. Among its many variants, such as Probabilistic PCA (Tipping and Bishop, 1999), Supervised PCA (Yu et al., 2006), Bayesian Factor Regression (Bernardo et al., 2003) or Bayesian CCA (Klami et al., 2013), Inter-Battery FA models (Klami et al., 2013) stand out for their capability of handling not only latent variables associated to the common information among all the views, but also of modelling the intra-view information. This model has recently been extended as Sparse Semi-supervised Heterogeneous Inter-Battery Bayesian Analysis (SSHIBA) (Anonymous, 2020) to incorporate missing attributes, feature sparsity/selection, and the ability to handle heterogeneous data such as categorical or multi-dimensional binary data.

The use of kernel methods in Bayesian approaches has been mostly developed with Gaussian Processes (Williams and Rasmussen, 2006) and their unsupervised version for dimensionality reduction, GP Latent Variable Models (GPLVMs) (Lawrence, 2005). These approaches combine the advantages of kernel methods, exploiting the non-linear relationships among the data, with those of a probabilistic framework. In (Damianou et al., 2012), the authors propose a shared GPLVM approach, called Manifold Relevance Determination (MRD), to provide a non-linear latent representation for multi-view learning problems. This model is extended in (Damianou et al., 2016) by including an Automatic Relevance Determination (ARD) prior (Neal, 2012) over the kernel formulation to endow it with feature selection capabilities.

GPLVMs come with practical scalability drawbacks that need to be addressed. Their cubic complexity in the number of training points requires the use of inducing points and variational approaches (Titsias, 2009). Selecting the number of inducing points, and where to place them in the latent space, is still a challenging problem; a common solution is to place them on a regular grid along the latent space and only optimise the pseudo-observations at those points (Wilson and Nickisch, 2015). Furthermore, to the best of our knowledge, there is no versatile state-of-the-art implementation of a multi-view GPLVM able to handle heterogeneous observations (integer, categorical, real and positive observations) and missing values.

In this paper we propose a novel method to implement non-linear probabilistic LVMs that still builds upon a linear generative model, hence inheriting its computational and scalability properties. Instead of using a kernel method, i.e. a GP, to move from the latent representation to the observed data, we propose to reformulate probabilistic FA so that it generates kernel relationships instead of data observations. In the same way that Kernelized PCA (KPCA) or Kernelized CCA (KCCA) are able to generate non-linear latent variables by linearly combining the elements of a kernel vector, here, from a Bayesian generative point of view, we first sample i.i.d. latent representations and project them onto an $N$-dimensional space ($N$ being the number of training points) using a weight matrix that represents the dual parameters. We apply this trick over the SSHIBA formulation (Anonymous, 2020) to exploit its functionalities over this kernelized formulation. Thanks to that, we can efficiently face semi-supervised heterogeneous multi-view problems combining linear and non-linear data representations; in this way, one can combine kernelized views, to deal with non-linear relationships, with linearly kernelized views, to work with high dimensional problems. Besides, we can force the automatic selection of Support Vectors (SVs) to obtain a scalable solution, as well as include an ARD prior over the kernel to obtain feature selection capabilities.

2 Bayesian sparse factor analysis with kernelized observations

Let us consider a multi-view problem where we have $N$ data samples represented in $M$ different modalities, $\{\mathbf{x}_{n,:}^{(m)}\}_{m=1}^{M}$, and our goal is to find an inter- and intra-view non-linear latent representation, $\mathbf{Z} \in \mathbb{R}^{N \times K_c}$. That is, given that $\mathbf{x}_{n,:}^{(m)}$ is the $n$-th data point of the $m$-th view, $\mathbf{z}_{n,:}$ has to compress, in a low dimensional space of size $K_c$, both the common and particular information of $\mathbf{x}_{n,:}^{(m)}$ over all the views, exploiting the correlations among the data.¹

¹Given a matrix $\mathbf{M}$, we denote its $i$-th row by $\mathbf{m}_{i,:}$ and its $j$-th column by $\mathbf{m}_{:,j}$.

Whereas kernel LVMs obtain this latent representation as a linear combination, by some dual variables, of the kernel representation of the $n$-th data point, here we propose to reformulate this idea from a generative point of view. In particular, we start from the SSHIBA algorithm formulation (Anonymous, 2020) and consider that there exist some latent variables $\mathbf{z}_{n,:}$ which are linearly combined with a set of dual variables $\mathbf{A}^{(m)}$ to generate a kernel vector, $\mathbf{k}_{n,:}^{(m)}$, as:

$$\mathbf{k}_{n,:}^{(m)} = \mathbf{z}_{n,:}\,\mathbf{A}^{(m)} + \boldsymbol{\epsilon}_{n,:}^{(m)} \qquad (1)$$

where $\boldsymbol{\epsilon}_{n,:}^{(m)}$ is zero-mean Gaussian noise, with noise power $\tau^{(m)}$ following a Gamma distribution of parameters $a^{\tau}$ and $b^{\tau}$, and $\mathbf{k}_{n,:}^{(m)}$ is the kernel representation of the $n$-th data point; that is, given a mapping function $\phi^{(m)}(\cdot)$ and its associated kernel function $k^{(m)}(\cdot,\cdot)$, $\mathbf{k}_{n,:}^{(m)}$ is the vector of kernel values between $\mathbf{x}_{n,:}^{(m)}$ and all the training data $\{\mathbf{x}_{n',:}^{(m)}\}_{n'=1}^{N}$. The dual variable matrix $\mathbf{A}^{(m)} \in \mathbb{R}^{K_c \times N}$ plays the role of the linear projection matrix and is defined using the same structured ARD prior considered in both (Klami et al., 2013) and (Anonymous, 2020); namely, an ARD prior that promotes the cancellation of full rows of this matrix, i.e. $\mathbf{a}_{k,:}^{(m)} \sim \mathcal{N}\big(\mathbf{0}, (\gamma_k^{(m)})^{-1}\mathbf{I}\big)$ with $\gamma_k^{(m)} \sim \Gamma(a^{\gamma}, b^{\gamma})$, so that in the product in (1) the appropriate set of latent factors is selected.
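To make the generative reading of (1) concrete, the following NumPy sketch samples a kernelized observation matrix under our assumed notation (dimensions, hyperparameters and variable names are illustrative, not taken from the reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
N, Kc = 100, 5  # number of training points and latent factors (illustrative)

# Structured ARD prior: one Gamma-distributed precision per row (latent factor) of A
gamma = rng.gamma(shape=2.0, scale=1.0, size=Kc)
A = rng.normal(0.0, 1.0 / np.sqrt(gamma)[:, None], size=(Kc, N))  # dual variables (Kc x N)

Z = rng.normal(size=(N, Kc))           # i.i.d. latent projections
tau = rng.gamma(shape=2.0, scale=1.0)  # noise power

# Each row k_n is a noisy kernel vector: k_n = z_n A + eps_n
K_obs = Z @ A + rng.normal(0.0, 1.0 / np.sqrt(tau), size=(N, N))
```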


Figure 1: Graphical model of SSHIBA with kernelized views (KSSHIBA).

Figure 1 shows the graphical model of KSSHIBA. Following (Klami et al., 2013), for the data views that are directly explained given the latent projection we have $\mathbf{x}_{n,:}^{(m)} = \mathbf{z}_{n,:}\,\mathbf{W}^{(m)\top} + \boldsymbol{\epsilon}_{n,:}^{(m)}$, where the weight matrix $\mathbf{W}^{(m)}$ follows the same structured ARD prior mentioned above. We refer to these as primal observations. For some other views, we might be interested in explaining them indirectly through a kernelized observation following (1). This conversion can be of interest when the view's dimensionality is much larger than the number of data points $N$. When both primal and kernelized observations are used, the learned latent projection attempts to faithfully reconstruct each of the primal views, as well as the joint relation between each pair of data points through the reconstruction of the kernel matrix. The posterior distribution of all model parameters and latent projections is approximated using variational inference with a fully factorized posterior, as detailed in the Supplementary Material, where the computational cost of each update is also derived.
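As an illustration of the primal vs. kernelized choice, the sketch below (our own construction) replaces a high-dimensional view, stored as an N x D matrix with D much larger than N, by its N x N RBF kernel matrix, which then plays the role of the observation to reconstruct:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
N, D = 200, 10_000                 # few samples, very high-dimensional input view
X_view = rng.normal(size=(N, D))   # candidate for a kernelized observation
Y_view = rng.normal(size=(N, 3))   # low-dimensional view kept as a primal observation

# Kernelized observation: the model reconstructs this N x N matrix instead of X_view
K_view = rbf_kernel(X_view, gamma=1.0 / D)

print(X_view.shape, K_view.shape)  # (200, 10000) vs. (200, 200)
```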

Note that sampling from the model in (1) does not guarantee a valid positive semi-definite kernel matrix. The kernel matrix is simply treated as an observation (a kernelized observation) and, as such, the model parameters will be chosen to minimize the reconstruction error. The experimental results were also striking for us, as fairly good kernel matrices are typically reconstructed after model training. In Figure 2 we include a graphical representation of both a kernelized observation and its reconstruction through (1) using the posterior means of the model parameters. Certainly, more appropriate models could be used to adapt the observation model (given $\mathbf{z}_{n,:}$) to the properties of a kernelized observation. To address this issue, we have explored alternative formulations based on non-independent noise; for example, defining the noise distribution as an inverse-Wishart to obtain a full-rank noise covariance, or modelling its covariance as the product of two low-rank matrices. However, these schemes led to considerably more complicated (less flexible) formulations which limited the rest of the properties of this proposal (such as the ones presented in the following sections). Henceforth, we restrict ourselves to the model in (1) and leave this line of work open for future research.
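The qualitative check behind Figure 2 can be reproduced with a few lines; this is a sketch under our assumed notation (posterior means of the latents and dual variables) that reports the relative reconstruction error together with how far the reconstruction is from a symmetric, positive semi-definite matrix:

```python
import numpy as np

def kernel_reconstruction_report(Z_mean, A_mean, K_true):
    """Z_mean: N x Kc posterior mean of latents; A_mean: Kc x N posterior mean of duals."""
    K_hat = Z_mean @ A_mean
    err = np.linalg.norm(K_true - K_hat) / np.linalg.norm(K_true)   # relative Frobenius error
    asym = np.linalg.norm(K_hat - K_hat.T) / np.linalg.norm(K_hat)  # departure from symmetry
    min_eig = np.linalg.eigvalsh(0.5 * (K_hat + K_hat.T)).min()     # PSD check on symmetrized part
    return err, asym, min_eig
```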



(a) Original kernel matrix
(b) Reconstructed matrix
(c) Reconstruction error
Figure 2: Example of the generative properties of KSSHIBA to reconstruct a complete kernel matrix.

2.1 Automatic bayesian support vector selection

Starting from the full kernel, a more structured ARD prior allows us to achieve not only the shrinkage of the number of effective latent factors, but also a more compact representation of the data by means of a reduced kernel matrix in which only a reduced set of support vectors (SVs) is kept.

For this purpose, the proposed formulation introduces a double ARD prior over the dual variables, $a_{k,n}^{(m)} \sim \mathcal{N}\big(0, (\gamma_k^{(m)}\,\psi_n^{(m)})^{-1}\big)$, with Gamma-distributed precisions $\gamma_k^{(m)}$ and $\psi_n^{(m)}$. This way, $\gamma_k^{(m)}$ continues forcing row-wise sparsity to automatically select the number of latent factors and, additionally, $\psi_n^{(m)}$ induces column-wise sparsity in the dual weight matrix to learn the set of Bayesian SVs. This process can be carried out during the inference process, removing the least relevant SVs (and their corresponding columns in $\mathbf{A}^{(m)}$) by setting a threshold, which provides an additional computational improvement.
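A hedged sketch of this pruning step (the threshold rule and variable names are ours): columns of the dual matrix whose expected weights become negligible are dropped, together with the corresponding columns of the kernelized observation.

```python
import numpy as np

def prune_support_vectors(A_mean, K_obs, rel_threshold=1e-3):
    """Drop candidate SVs whose dual weights are negligible.

    A_mean : (Kc x N) posterior mean of the dual variables.
    K_obs  : (N x N) kernelized observation, columns ordered as the candidate SVs.
    """
    col_energy = np.sqrt((A_mean ** 2).sum(axis=0))       # per-column relevance
    keep = col_energy > rel_threshold * col_energy.max()  # relative threshold
    return A_mean[:, keep], K_obs[:, keep], np.flatnonzero(keep)
```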

2.2 Automatic feature selection

Furthermore, we can additionally endow the proposed kernelized data representation with feature selection capabilities. Just as the double ARD structure allows us to cancel full rows or columns of the dual variables, an ARD kernel allows us to perform feature selection. In the ARD kernel, each feature of the original observations is multiplied by a relevance variable in the kernel definition; for example, for an RBF kernel,
$$k\big(\mathbf{x}_{n,:}^{(m)}, \mathbf{x}_{n',:}^{(m)}\big) = \exp\Big(-\textstyle\sum_{d} \ell_d\,\big(x_{n,d}^{(m)} - x_{n',d}^{(m)}\big)^2\Big),$$
and we can optimise the relevances $\boldsymbol{\ell}$ by direct maximisation of the variational lower bound of our mean-field approach. In our model, if the $m$-th view is kernelized, then the only terms in the lower bound where the ARD kernel kicks in are (see Supplementary Material for details):

(2)

We alternate mean-field updates of the variational bound with direct maximisation of (2) w.r.t. $\boldsymbol{\ell}$ using any gradient ascent method (we use PyTorch and Adam for these updates). Finally, by setting a threshold on $\boldsymbol{\ell}$, feature selection can be done while training.
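A minimal PyTorch sketch of this alternating scheme, assuming the ARD-RBF kernel given above; the bound is replaced by a placeholder least-squares fit of the kernelized observation (standing in for the terms in (2)), and all names are ours rather than the library's:

```python
import torch

def ard_rbf_kernel(X, log_ell):
    """ARD-RBF kernel: each feature d is weighted by ell_d = exp(log_ell_d) >= 0."""
    Xw = X * torch.exp(log_ell).sqrt()           # scale features by sqrt(ell_d)
    d2 = torch.cdist(Xw, Xw) ** 2                # weighted squared distances
    return torch.exp(-d2)

def ard_step(X, Z_mean, A_mean, log_ell, optimizer):
    """One gradient-ascent update of the feature relevances (placeholder bound)."""
    optimizer.zero_grad()
    K = ard_rbf_kernel(X, log_ell)
    bound = -((K - Z_mean @ A_mean) ** 2).sum()  # stands in for the terms in (2)
    (-bound).backward()                          # Adam minimises, so negate
    optimizer.step()
    return bound.item()

# Hypothetical usage with fixed posterior means coming from the mean-field updates
N, D, Kc = 50, 10, 4
X = torch.randn(N, D)
Z_mean, A_mean = torch.randn(N, Kc), torch.randn(Kc, N)
log_ell = torch.zeros(D, requires_grad=True)
opt = torch.optim.Adam([log_ell], lr=1e-2)
for _ in range(100):
    ard_step(X, Z_mean, A_mean, log_ell, opt)
mask = torch.exp(log_ell) > 1e-2                 # thresholded feature-selection mask
```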

3 Results

Throughout this section the presented model is analysed in terms of performance and interpretability of the inferred model parameters and latent projections. Results on additional databases, as well as a more extensive description of the experimental setup, are available in the Supplementary Material. Furthermore, an exemplary notebook with the library will be uploaded to an open GitHub repository.²

²This notebook has been uploaded with the rest of the files to the review system.

3.1 Performance evaluation of KSSHIBA for multi-dimensional regression

KSSHIBA can be trained in a semi-supervised way, being capable of predicting by either sampling from the posterior or simply using its mean, as we do here. This section aims to analyse the performance of KSSHIBA for semi-supervised multi-dimensional regression in comparison with some state-of-the-art baselines. To do so, we used several multitask datasets from the Mulan repository (Spyromitros-Xioufis et al., 2016; Karalič and Bratko, 1997; Džeroski et al., 2000). Table 1 shows the results obtained on these databases, comparing the proposed model with: (1) reference regression methods, namely a Support Vector Regression machine with Gaussian RBF kernel (SVR-RBF) and a MultiLayer Perceptron (MLP); (2) KCCA+LR and KPCA+LR approaches, where KCCA/KPCA is used for feature extraction and a Linear Regressor (LR) for prediction; in these cases, the number of latent factors has been fixed to the maximum possible (determined by the number of output variables) in KCCA and to the components which explain a fixed percentage of the variance in KPCA; (3) a multi-view GPLVM (MRD), whose number of latent factors is set to twice that of KCCA. Two versions of KSSHIBA are included: one in which the number of latent factors $K_c$ is automatically learnt, and one in which $K_c$ is fixed beforehand.

We calculated the reported results with a nested 10-fold cross-validation (CV). The outer CV divides the dataset into training and test partitions, while the inner CV is in charge of validation and therefore divides the training partition into a second training set and a validation set. This way we were able to estimate the performance of the whole framework and, additionally, validate the model parameters. We used the R2 score to measure the performance of the methods. Further information is detailed in the Supplementary Material.
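This evaluation protocol can be reproduced with standard scikit-learn utilities; the sketch below runs the nested 10-fold CV with the R2 score for one of the baselines (SVR-RBF wrapped for multi-output regression), not for KSSHIBA itself, and the hyperparameter grid is illustrative:

```python
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

def nested_cv_r2(X, Y, seed=0):
    inner = KFold(n_splits=10, shuffle=True, random_state=seed)      # validates hyperparameters
    outer = KFold(n_splits=10, shuffle=True, random_state=seed + 1)  # estimates test performance
    model = GridSearchCV(
        MultiOutputRegressor(SVR(kernel="rbf")),
        param_grid={"estimator__C": [0.1, 1, 10],
                    "estimator__gamma": ["scale", 0.01, 0.1]},
        cv=inner, scoring="r2",
    )
    scores = cross_val_score(model, X, Y, cv=outer, scoring="r2")
    return scores.mean(), scores.std()
```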

[Table 1 body: numeric results omitted. Rows: at1pd, at7pd, oes97, oes10, edm, jura, wq, enb. Columns: KSSHIBA, KSSHIBA ($K_c$ fixed), MRD, KPCA+LR, KCCA+LR, SVR-RBF, MLP.]

Table 1: Results on multitask databases of KSSHIBA and the baselines. The white subrow represents the mean and standard deviation of the R2 score and the gray subrow the number of effective latent factors found.

In particular, we can see that KSSHIBA outperforms most methods in terms of R2 score while providing dimensionality reduction. At the same time, the results obtained by KSSHIBA with a fixed $K_c$ imply that a less restrictive pruning would not deteriorate the results (except for edm, jura and enb, where the automatically learnt number of latent factors is notably smaller). Besides providing dimensionality reduction, KSSHIBA proves able to perform as well as the MLP, or even outperform it, in terms of R2.

3.2 Evaluation of the solution in terms of SVs

Now we want to test the capability of the KSSHIBA approach to automatically select a subset of the training points. For this purpose, we use the same databases and setup as in the previous evaluation to compare against KPCA+LR and KCCA+LR. For these last two models, the number of SVs used to build the kernel matrix is cross-validated following a Nyström (Williams and Seeger, 2001) subsampling technique.
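A sketch of the kind of subsampled-kernel baseline used here (our own construction with scikit-learn, not the authors' exact pipeline): a Nyström feature map with a cross-validated number of landmark points, followed by PCA and a linear regressor; the 95% variance threshold is an assumption.

```python
from sklearn.pipeline import Pipeline
from sklearn.kernel_approximation import Nystroem
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

def kpca_lr_baseline(n_train):
    pipe = Pipeline([
        ("nystroem", Nystroem(kernel="rbf")),  # random subsample acting as SVs
        ("pca", PCA(n_components=0.95)),       # keep components explaining 95% variance (assumed)
        ("lr", LinearRegression()),
    ])
    # Cross-validate the fraction of training points used as Nystroem landmarks
    grid = {"nystroem__n_components": [max(2, int(f * n_train))
                                       for f in (0.1, 0.25, 0.5, 0.75)]}
    return GridSearchCV(pipe, grid, cv=10, scoring="r2")
```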

Table 2 shows that the inclusion of the automatic SV selection in KSSHIBA keeps the original model performance for most databases, even improving it for oes97 and edm. This is done while drastically reducing the model complexity; in fact, a detailed analysis shows that reducing the number of SVs favours an additional reduction in the final number of latent factors. When compared to KPCA+LR and KCCA+LR, KSSHIBA mostly needs a lower percentage of SVs to describe the kernel. This is due to the fact that KSSHIBA learns the relevance of each element and eliminates them accordingly, whereas KPCA and KCCA obtain these compact solutions with a random selection of SVs.

[Table 2 body: numeric results omitted. Rows: at1pd, at7pd, oes97, oes10, edm, jura, wq, enb. Column groups: Sparse KSSHIBA, KPCA+LR and KCCA+LR, each reporting R2 and the number of selected SVs.]

Table 2: Results on the multitask databases for the automatic SV selection. For each method, the first subcolumn shows, on the white subrow, the mean and standard deviation of the R2 score and, on the gray subrow, the number of effective latent factors ($K_c$); the second subcolumn includes the number of selected SVs.

To complete this analysis, Figure 3 depicts the mean R2 over 10 folds of the analysed algorithms for the databases where KSSHIBA is outperformed in Table 2; the rest are available in the Supplementary Material. For the sake of comparison, we also include the MRD results when its percentage of inducing points is varied. Whereas MRD, KPCA+LR and KCCA+LR present fluctuations in their performance, requiring the number of SVs to be adjusted to obtain accurate results, KSSHIBA has a relatively constant R2 value. This phenomenon occurs because KSSHIBA learns the relevance of each SV and weights its influence on the parameter updates throughout the model inference.

(a) at1pd database
(b) at7pd database
Figure 3: R2 results with different percentages of SVs in KSSHIBA, KCCA+LR and KPCA+LR, or of inducing points in MRD.

3.3 Analysis of the feature selection

In order to test the feature selection extension (see Section 2.2), we now study KSSHIBA on different classification databases where the input view is an image and the output view is the category label. We used the Labeled Faces in the Wild (LFW) faces dataset (Huang et al., 2007) and the warpAR10P, Yale and Olivetti datasets, which can be found in the Feature Selection Repository.³ We applied the feature selection extension over the input view, obtaining the masks in Figure 4. Despite the different image resolution of each database, we can see how the proposed extension is capable of focusing on the most relevant features (white). For instance, Figures 4(b) and 4(d) learn to focus on the area related to glasses, while Figures 4(a) and 4(c) learn the general facial features of the images.

³http://featureselection.asu.edu/datasets.php

(a) LFW
(b) warpAR10P
(c) Yale
(d) Olivetti
Figure 4: Feature masks learnt by the feature selection extension of KSSHIBA for different face recognition problems. The masks represent the importance of each pixel: lighter colours mean the pixel is more relevant, while darker ones mean it is less relevant.
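The masks in Figure 4 can be rendered directly from the learnt per-pixel relevances; a short sketch (the normalisation and colour map are our choices, and the relevance vector name is hypothetical):

```python
import matplotlib.pyplot as plt

def plot_feature_mask(ell, image_shape, title=""):
    """ell: learnt per-feature relevance (one value per pixel), reshaped to the image size."""
    mask = ell.reshape(image_shape)
    mask = (mask - mask.min()) / (mask.max() - mask.min() + 1e-12)  # normalise to [0, 1]
    plt.imshow(mask, cmap="gray")                                    # lighter = more relevant
    plt.title(title)
    plt.axis("off")
```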

3.4 Analysis of the extracted latent factors

In this section we evaluate the interpretability of the latent factors extracted by the proposed model in comparison to the MRD approach based on shared GPLVMs. We used the available MRD library (Damianou et al., 2012) to compare it with KSSHIBA on the oil classification database (Bishop and James, 1993). For this purpose, we trained both models with the same number of latent factors combined with ARD latent factor selection. KSSHIBA uses an RBF kernel for the input view and MRD uses it for both its input and output views. Under these conditions, we compared the label prediction accuracy of both models. With the available MRD implementation (in Matlab), the computational time does not scale well with the number of data points; as seen in the Supplementary Material, there is a difference of two orders of magnitude in computational time.

(a) MRD - common
(b) KSSHIBA - input
(c) KSSHIBA - output
Figure 5: Measure of relevance for each learnt latent factor on the oil database. Figure 5(a) shows the relevance of the common latent factors for the MRD model (all latent factors turned out to be shared by both views). Figures 5(b) and 5(c) show, respectively, the relevance for the input view and the output view for KSSHIBA.

Figure 5 shows the relevance parameter for each of the learnt latent factors for both models. MRD does not find any view-dependent latent factor and all latents are shared by both views (Figure 5(a) shows the relevance of all these common factors); besides, it mainly focuses on latents 12, 13 and 14. On the other hand, KSSHIBA presents independent weights for each view (see Figures 5(b) and 5(c)); these results indicate that certain latent factors are not relevant and could be pruned (latent 7), some are only relevant for the input view (latents 5 and 14) and the remaining ones are common (highlighting latents 0, 2 and 8).

3.5 Multi-view KSSHIBA

One of the main functionalities of KSSHIBA is its capacity to combine multiple views into a single model. We can take advantage of this property when reconstructing kernel representations to combine different types of kernels (one per view). To prove the possibilities of this formulation, we used a subset of 1,000 samples from the MNIST database (LeCun et al., 2010) and trained the model in three two-view scenarios with different kernel types in the input kernelized view (a linear kernel, a Gaussian one and a second-degree polynomial kernel), using the labels as output view; additionally, we include a fourth scenario with four views, where each kernel is an input view and the categories are the output view. The obtained results show that combining the three kernels increases the performance over any of the single-kernel scenarios.
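The four-view scenario can be assembled by computing one kernelized observation per kernel type over the same image subset; a sketch with scikit-learn kernels (the subset choice and kernel hyperparameters are illustrative):

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel, polynomial_kernel

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
idx = np.random.default_rng(0).choice(len(X), size=1000, replace=False)  # 1,000-sample subset
Xs, ys = X[idx] / 255.0, y[idx]

views = {
    "linear": linear_kernel(Xs),                     # input view 1
    "rbf": rbf_kernel(Xs, gamma=1.0 / Xs.shape[1]),  # input view 2
    "poly": polynomial_kernel(Xs, degree=2),         # input view 3
}
labels = (ys[:, None] == np.unique(ys)[None, :]).astype(float)  # one-hot output view
```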

Figure 6 shows the relevance of each latent factor in the joint scenario (all kernels used). Of the learnt latent factors, the output view only uses a small subset, most of the remaining ones being private to the input views; in fact, we can observe that the improved performance of the model is obtained using only three factors common to all the views, two additional latents shared with the RBF kernel and another two shared by the linear and polynomial kernels. Besides, the polynomial and linear kernels share all their latent factors; although this may imply a possible redundancy in their information, we have checked that it actually reinforces the latent learning, since removing any of these views degrades the model performance.


Figure 6: Measure of relevance on multiple views for each latent factor combining different kernels on the MNIST database.

4 Conclusions

We propose a novel probabilistic latent variable model that generates kernel relationships, instead of data observations, based on a linear generative model. We introduce this model using the Bayesian inter-battery factor analysis approach proposed in (Anonymous, 2020) to show its capability to efficiently face semi-supervised heterogeneous multi-view problems combining linear and non-linear data representations. Besides, we extend the model formulation to provide the automatic selection of SVs, obtaining scalable solutions, and include an ARD prior over the kernel to obtain feature selection capabilities. The model performance is evaluated on multi-dimensional regression, feature selection over images and multiple-kernel learning problems, demonstrating that the inclusion of kernelized observations provides fruitful results.

Broader Impact

This article proposes a multi-view semi-supervised sparse model with kernelized observations that combines dimensionality reduction with regression and classification functionalities. As such, it may impact potential solutions for problems characterised by multi-view data. Open source code with exemplary Python notebooks will be released to ensure maximal dissemination.

Acknowledgments and Disclosure of Funding

The work of Pablo M. Olmos is supported by the Spanish government (MCI) under grants PID2019-108539RB-C22 and RTI2018-099655-B-100, by Comunidad de Madrid under grants IND2017/TIC-7618, IND2018/TIC-9649 and Y2018/TCS-4705, by the BBVA Foundation under the Deep-DARWiN project, and by the European Union (FEDER and the European Research Council (ERC) through the European Union's Horizon 2020 research and innovation programme under Grant 714161). The work of C. Sevilla-Salcedo and V. Gómez-Verdejo has been partly funded by the Spanish MINECO grant TEC2017-83838-R.

References

  • A. Anonymous (2020) Sparse semi-supervised heterogeneous interbattery bayesian analysis. arXiv preprint arXiv:2001.08975. Cited by: §1, §1, §2, §4.
  • P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli (2010) Multimodal fusion for multimedia analysis: a survey. Multimedia systems 16 (6), pp. 345–379. Cited by: §1.
  • J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith, and M. West (2003) Bayesian factor regression models in the “large p, small n” paradigm. Bayesian statistics 7, pp. 733–742. Cited by: §1.
  • C. M. Bishop and G. D. James (1993) Analysis of multiphase flows using dual-energy gamma densitometry and neural networks. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 327 (2-3), pp. 580–593. Cited by: §3.4.
  • A. Damianou, C. Ek, M. Titsias, and N. Lawrence (2012) Manifold relevance determination. arXiv preprint arXiv:1206.4610. Cited by: §1, §3.4.
  • A. Damianou, N. D. Lawrence, and C. H. Ek (2016) Multi-view learning as a nonparametric nonlinear inter-battery factor analysis. arXiv preprint arXiv:1604.04939. Cited by: §1.
  • S. Džeroski, D. Demšar, and J. Grbović (2000) Predicting chemical parameters of river water quality from bioindicator data. Applied Intelligence 13 (1), pp. 7–17. Cited by: §3.1.
  • H. H. Harman (1976) Modern factor analysis. University of Chicago press. Cited by: §1.
  • H. Hotelling (1936) Relations between two sets of variates. Biometrika 28 (3-4), pp. 321–377. Cited by: §1.
  • G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller (2007) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical report Technical Report 07-49, University of Massachusetts, Amherst. Cited by: §3.3.
  • A. Karalič and I. Bratko (1997) First order regression. Machine learning 26 (2-3), pp. 147–176. Cited by: §3.1.
  • A. Klami, S. Virtanen, and S. Kaski (2013) Bayesian canonical correlation analysis. Journal of Machine Learning Research 14 (Apr), pp. 965–1003. Cited by: §1, §2, §2.
  • N. Lawrence (2005) Probabilistic non-linear principal component analysis with gaussian process latent variable models. Journal of machine learning research 6 (Nov), pp. 1783–1816. Cited by: §1.
  • Y. LeCun, C. Cortes, and C. Burges (2010) MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist. Cited by: §3.5.
  • J. Li, B. Zhang, G. Lu, and D. Zhang (2019) Generative multi-view and multi-feature learning for classification. Information Fusion 45, pp. 215–226. Cited by: §1.
  • R. M. Neal (2012) Bayesian learning for neural networks. Vol. 118, Springer Science & Business Media. Cited by: §1.
  • K. Pearson (1901) LIII. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (11), pp. 559–572. Cited by: §1.
  • B. Schölkopf, A. Smola, and K. Müller (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10 (5), pp. 1299–1319. Cited by: §1.
  • A. Sharma, A. Kumar, H. Daume, and D. W. Jacobs (2012) Generalized multiview analysis: a discriminative latent space. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2160–2167. Cited by: §1.
  • E. Spyromitros-Xioufis, G. Tsoumakas, W. Groves, and I. Vlahavas (2016) Multi-target regression via input space expansion: treating targets as inputs. Machine Learning 104 (1), pp. 55–98. Cited by: §3.1.
  • M. E. Tipping and C. M. Bishop (1999) Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61 (3), pp. 611–622. Cited by: §1.
  • M. Titsias (2009) Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, D. van Dyk and M. Welling (Eds.), Proceedings of Machine Learning Research, Vol. 5, pp. 567–574. Cited by: §1.
  • C. K. I. Williams and M. Seeger (2001) Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13 (NIPS 2000), T. K. Leen, T. G. Dietterich, and V. Tresp (Eds.), pp. 682–688. Cited by: §3.2.
  • C. K. Williams and C. E. Rasmussen (2006) Gaussian processes for machine learning. Vol. 2, MIT press Cambridge, MA. Cited by: §1.
  • A. Wilson and H. Nickisch (2015) Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, pp. 1775–1784. Cited by: §1.
  • S. Yu, L. Tranchevent, B. De Moor, and Y. Moreau (2013) Kernel-based data fusion for machine learning. Springer. Cited by: §1.
  • S. Yu, K. Yu, V. Tresp, H. Kriegel, and M. Wu (2006) Supervised probabilistic principal component analysis. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 464–473. Cited by: §1.
  • Y. Zhang, J. Zhang, Z. Pan, and D. Zhang (2016) Multi-view dimensionality reduction via canonical random correlation analysis. Frontiers of Computer Science 10 (5), pp. 856–869. Cited by: §1.