1 Introduction
Since the seminal work of [Lanckriet et al.2004a], data fusion has become an integral part of data analysis, especially in computational biology and bioinformatics. For instance, they showed that a set of proteins can be described by a number of relevant data sources, such as protein-protein interactions, gene expression, and amino acid sequences. This is because relevant data sources provide complementary perspectives or “views” of the objects and, together, these pieces of information present a bigger picture of the relations the objects have with each other. This notion of exploiting multiple views of the data for better learning is commonly known as multiview learning. The data sources, however, may come in various forms (e.g., strings, trees, or graphs), and kernel methods [Schölkopf and Smola2002, ShaweTaylor and Cristianini2004] provide a way of integrating such heterogeneous data by transforming them into a common format: kernel matrices. A Bayesian formulation for efficient multiple kernel learning was presented by [Gönen2012], while early works in computational biology utilized multiview learning to classify protein functions [Deng et al.2004, Lanckriet et al.2004a, Lanckriet et al.2004b, Noble and BenHur2008].
A shortcoming of multiview learning, however, is incomplete data. Incomplete data is relatively common in almost all research, no matter how well designed the experiments or the data gathering methods are. A few examples of how incomplete data arise are: a sensor may suddenly fail during a remote sensing experiment; participants may leave some questions in a questionnaire unanswered; and data acquisition errors may be unavoidable. Analysis of incomplete data may lead to invalid conclusions, since such data give only limited insight into the objects at hand.
Thus, in addition to dealing with the heterogeneity of the data, kernel methods are utilized in several studies to handle missing information. A data source with some missing information leads to an incomplete kernel matrix (i.e., a matrix with missing entries); however, complete kernel matrices derived from complete data sources can be exploited to provide solutions to the incomplete-data problem. Studies addressing this problem via kernel methods have progressed over time: from completion of a kernel matrix through a single complete kernel matrix [Kin et al.2004], through completion using multiple complete kernel matrices [Kato et al.2005], to simultaneous completion of multiple incomplete kernel matrices [Rivero et al.2017, Bhadra et al.2017]. The kernel completion technique in [Rivero et al.2017] associates the kernel matrices with the covariance of a zero-mean Gaussian distribution, and employs the expectation-maximization (EM) algorithm [Dempster et al.1977] to minimize the objective function. On the other hand, the technique in [Bhadra et al.2017] learns reconstruction weights to express a particular incomplete kernel matrix as a convex combination of the other kernel matrices. Although these two methods tackle a similar setting, the main difference between them is that [Bhadra et al.2017] employ the Euclidean metric to assess the distance between kernel matrices, which requires additional constraints to keep all kernel matrices positive definite. In [Rivero et al.2017], the LogDet divergence [Matsuzawa et al.2016, Davis et al.2007] is employed, and this not only preserves positive definiteness automatically but also brings a strong connection to the classical approach of estimating missing values in vectorial data. With the missing entries inferred, the completed kernel matrices can now be fused and utilized for tasks such as multiview clustering and classification.
In the previous solution to the task of multiple kernel matrix completion [Rivero et al.2017], a model matrix is introduced as a representative kernel matrix of the given multiple kernel matrices. The model matrix is allowed to move to any point in the positive definite cone, which is a very broad manifold in the set of symmetric matrices. As in the classical model fitting task, a model that is too flexible tends to overfit the given empirical data. The flexibility of the model should be adjusted to make the model generalize well, but this is impossible in the previous model.
In view of this, we present alternative approaches to the previous method for multiple kernel matrix completion by defining parametric models that can move only on a submanifold in the positive definite cone. Both the previous and the new methods can be related to a statistical framework. The previous method can be explained as maximum likelihood estimation of a full covariance Gaussian, which tends to overfit the data because of its large number of degrees of freedom. On the other hand, the proposed methods can be associated with a parametric model that imposes a restriction on the covariance matrix parameter. The number of degrees of freedom in the new models can be adjusted, thereby improving the generalization performance.
2 Problem Setting
Suppose we have data sources, some or all of which have missing information (Figure 1). The algorithm works on the corresponding incomplete kernel matrices of these data sources, each of size . The rows and columns with missing entries in each of the kernel matrices are rearranged such that the first objects in contain the available information, while information is unavailable for the remaining objects. This rearrangement of the rows and columns results in the following symmetric partitioned matrix:
(1) 
where , with denoting the set of strictly positive definite symmetric matrices. The algorithm then mutually infers the (missing) entries for the submatrices , , and , for .
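The rearrangement described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function name `partition_kernel`, the boolean mask `visible`, and the block names are introduced here for the example and are not notation from the paper.

```python
import numpy as np

def partition_kernel(Q, visible):
    """Reorder a symmetric matrix so visible objects come first, then split it.

    Q       : (n, n) symmetric array; entries for missing objects are placeholders.
    visible : (n,) boolean sequence, True where the object is observed
              (hypothetical name, not from the paper).
    Returns the permutation and the blocks (Q_vv, Q_vh, Q_hh).
    """
    visible = np.asarray(visible, dtype=bool)
    # Observed indices first, missing indices last, each in original order.
    perm = np.concatenate([np.flatnonzero(visible), np.flatnonzero(~visible)])
    Qp = Q[np.ix_(perm, perm)]
    m = int(visible.sum())
    return perm, Qp[:m, :m], Qp[:m, m:], Qp[m:, m:]

# Toy 4 x 4 symmetric matrix in which objects 1 and 3 are missing.
Q = np.arange(16, dtype=float).reshape(4, 4)
Q = (Q + Q.T) / 2
perm, Q_vv, Q_vh, Q_hh = partition_kernel(Q, [True, False, True, False])
```

Keeping the permutation makes it possible to restore the original ordering once the missing blocks have been inferred.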
3 FCMKMC: Existing Model
In the previous study [Rivero et al.2017], an algorithm for mutual kernel matrix completion had already been developed. Henceforth, we will refer to the previous method as the full covariance mutual kernel matrix completion (FCMKMC), and review the method in this section. To infer the missing values in incomplete kernel matrices, FCMKMC introduces an model matrix , and finds the set of kernel matrices that are as close to each other as possible through the model matrix . The objective function of FCMKMC is the sum of LogDet divergences [Matsuzawa et al.2016, Davis et al.2007]:
(2) 
where is the set of submatrices containing missing entries, and is the model matrix. The LogDet divergence is defined as
(3)  
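As a numerical illustration, the LogDet divergence can be computed as below. The sketch assumes the standard form tr(XY⁻¹) − log det(XY⁻¹) − n for symmetric positive definite X and Y; the argument order and the function name `logdet_divergence` are assumptions of this example rather than the paper's notation.

```python
import numpy as np

def logdet_divergence(X, Y):
    """LogDet divergence tr(X Y^{-1}) - log det(X Y^{-1}) - n for SPD X and Y."""
    n = X.shape[0]
    # X @ inv(Y), computed with a linear solve instead of an explicit inverse.
    XYinv = np.linalg.solve(Y.T, X.T).T
    sign, logabsdet = np.linalg.slogdet(XYinv)
    if sign <= 0:
        raise ValueError("both arguments must be positive definite")
    return np.trace(XYinv) - logabsdet - n

A = np.diag([2.0, 1.0])
I2 = np.eye(2)
d_same = logdet_divergence(A, A)   # divergence of a matrix from itself
d_diff = logdet_divergence(A, I2)  # divergence between two distinct matrices
```

The divergence vanishes when both arguments coincide and is positive otherwise, which is what makes it usable as a measure of closeness between kernel matrices.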
An advantage of using the LogDet divergence is that positive definiteness, a necessary property of valid kernel matrices, is ensured for the resulting completed kernel matrices. The approach of FCMKMC is essentially similar to the well-known probabilistic approach for classical incomplete data completion [McLachlan and Krishnan2008], where missing values in incomplete vectorial data are to be inferred. In the approach for the classical task, a probabilistic model is introduced to be fitted to the temporarily completed data, and the missing values are imputed with the most probable values using the current inference of the probabilistic model. The number of degrees of freedom of a probabilistic model provides an important perspective on the success or failure of data completion: models that are too rigid cannot capture the underlying data distribution, while models that are too flexible are often overfitted to the data set. In FCMKMC, the model matrix can take any values without restriction, with degrees of freedom. This model may be too flexible. In the next section, we shall present two new models in which the number of degrees of freedom can be adjusted.
4 Parametric Models
The model matrix of FCMKMC is too flexible and is not tunable. Hence, we introduce two types of model matrix: the PCA model and the FA model.
4.1 PCAMKMC
In the PCA model, the form of the model matrix is restricted to
(4) 
where the matrix and scalar are the adaptive parameters of the PCA model. The number of columns in , say , is arbitrary. Larger yields a more flexible model and vice versa. The number of degrees of freedom of this model is .
Meanwhile, the objective function of PCA model is expressed as
(5) 
Since the objective function is not jointly convex in the three arguments, , , and , the optimal solution cannot be given in closed form. Hence, we adopt a block coordinate descent method that repeats the following two steps:

Imputation step:
(6) 
Model update step:
(7)
Therein, the iteration number is the superscript of and , and the subscript of . By letting , the imputation step can be performed in the same fashion as that of FCMKMC [Rivero et al.2017]. For each data source, the rows and the columns in are reordered as , and partitioned to obtain , , and . Using these submatrices in the model matrix, and the known submatrix in the empirical matrix , the unknown submatrices in are re-estimated as
(8) 
and
(9) 
Finally, the submatrices are reordered back to to obtain a new solution that minimizes over missing values , with the model parameters and held fixed. Denote by the new value of at the th iteration.
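The imputation step can be sketched as follows. Expressions (8) and (9) are not reproduced in this text, so the code assumes the standard conditional-Gaussian completion, which minimizes the LogDet divergence from the completed matrix to the model matrix when the visible block is held fixed. The names `impute_missing_blocks`, `Q_vv`, `M`, and `m` are hypothetical and chosen for this example.

```python
import numpy as np

def impute_missing_blocks(Q_vv, M, m):
    """Complete a kernel matrix from its visible block and a model matrix.

    Q_vv : (m, m) known SPD block (visible objects, already ordered first).
    M    : (n, n) SPD model matrix in the same ordering.
    m    : number of visible objects.
    Assumed form (not copied from the paper):
      Q_vh = Q_vv M_vv^{-1} M_vh
      Q_hh = M_hh - M_hv M_vv^{-1} M_vh + M_hv M_vv^{-1} Q_vv M_vv^{-1} M_vh
    """
    M_vv, M_vh, M_hh = M[:m, :m], M[:m, m:], M[m:, m:]
    A = np.linalg.solve(M_vv, M_vh)          # M_vv^{-1} M_vh
    Q_vh = Q_vv @ A
    # Keep the model's Schur complement and add the part explained by Q_vv.
    Q_hh = M_hh - M_vh.T @ A + A.T @ Q_vv @ A
    return np.block([[Q_vv, Q_vh], [Q_vh.T, Q_hh]])

rng = np.random.default_rng(0)
n, m = 6, 4
A0 = rng.standard_normal((n, n))
M = A0 @ A0.T + n * np.eye(n)                # toy SPD model matrix
B0 = rng.standard_normal((m, m))
Q_vv = B0 @ B0.T + m * np.eye(m)             # toy SPD visible block
Q_full = impute_missing_blocks(Q_vv, M, m)
```

Under this form the completed matrix stays positive definite whenever the visible block and the model matrix are, and completing the model's own visible block returns the model matrix itself.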
In the model update step, the empirical kernel matrices are fixed, and the two model parameters, and , are optimized. We here denote by ‘const’ the terms independent of , and so the objective function can be rewritten as
(10) 
where we have defined
(11) 
Even though the missing values are fixed to , the function is still not convex on the space of the model parameters . Nevertheless, surprisingly enough, the joint optimal solution of the two model parameters and is given in closed form [Tipping and Bishop1999]. Let , , be the eigenvalues of . Assume that , and denote by their corresponding eigenvectors. It can be shown that the optimal , denoted by , is expressed as
(12) 
Now, let and . Here, for vector , we denote the diagonal matrix with diagonal entries by . Meanwhile, denotes an dimensional vector containing the diagonal entries in a square matrix . And so, the optimal value of , denoted by , is given by
(13) 
where is an arbitrary orthonormal matrix (i.e., ).
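A minimal sketch of the closed-form model update is given below. Since (12) and (13) are not reproduced in this text, the code assumes the classical probabilistic PCA solution of [Tipping and Bishop1999]: the noise variance is the mean of the discarded eigenvalues, and the loading matrix is built from the leading eigenpairs. The names `ppca_closed_form`, `S`, and `q` are hypothetical, the arbitrary orthonormal matrix is taken to be the identity, and `q` is assumed to be smaller than the matrix size.

```python
import numpy as np

def ppca_closed_form(S, q):
    """Maximum-likelihood PPCA parameters (W, sigma2) for an (n, n) SPD matrix S.

    Assumes 0 < q < n so that at least one eigenvalue is discarded.
    """
    lam, U = np.linalg.eigh(S)               # eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]           # reorder to descending
    sigma2 = lam[q:].mean()                  # mean of the discarded eigenvalues
    # Rotation matrix chosen as the identity, one valid choice among many.
    W = U[:, :q] * np.sqrt(np.maximum(lam[:q] - sigma2, 0.0))
    return W, sigma2

rng = np.random.default_rng(0)
W0 = rng.standard_normal((6, 2))
S = W0 @ W0.T + 0.5 * np.eye(6)              # matrix that has exact PPCA structure
W, sigma2 = ppca_closed_form(S, 2)
M = W @ W.T + sigma2 * np.eye(6)             # fitted model matrix
```

When the input already has the low-rank-plus-isotropic structure, the fitted model matrix reproduces it exactly; otherwise the choice of `q` controls how flexible the model is.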
4.2 FAMKMC
In this section, we introduce the FA model, which is an alternative variant of the PCA model. The FA model uses the following parametric model as a model matrix:
(14) 
This differs from the PCA model in the second term. In the PCA model, the second term is , whereas in the FA model, the term can take any diagonal matrix. The number of degrees of freedom of the FA model is , and the objective function is expressed as
(15) 
Similar to the fitting algorithm of the PCA model, we adopt the block coordinate descent method to fit the FA model to empirical kernel matrices. The imputation step is the same as in the PCA model. In the PCA model, when fixing , the optimal can be expressed in closed form. However, in the FA model, the optimal cannot be given in closed form even if is fixed. In the FA model, we just improve at the model update step.
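An improvement step of this kind can be sketched as below. The update rule of the paper is not reproduced in this text, so the code uses the classical EM update for factor analysis as a stand-in, applied to a fixed second-moment matrix `S`. All names (`fa_objective`, `fa_em_update`, `S`, `W`, `psi`) are assumptions of this example.

```python
import numpy as np

def fa_objective(S, W, psi):
    """tr(S M^{-1}) + log det M with M = W W^T + diag(psi)."""
    M = W @ W.T + np.diag(psi)
    return np.trace(np.linalg.solve(M, S)) + np.linalg.slogdet(M)[1]

def fa_em_update(S, W, psi):
    """One classical EM update of (W, psi) for a fixed SPD matrix S."""
    q = W.shape[1]
    M = W @ W.T + np.diag(psi)
    beta = np.linalg.solve(M, W).T                   # W^T M^{-1}, shape (q, n)
    Ezz = np.eye(q) - beta @ W + beta @ S @ beta.T   # averaged E[z z^T]
    W_new = np.linalg.solve(Ezz, beta @ S).T         # S beta^T Ezz^{-1}
    psi_new = np.diag(S - W_new @ beta @ S).copy()   # diagonal noise variances
    return W_new, psi_new

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 50))
S = X @ X.T / 50                                     # toy SPD second-moment matrix
W, psi = rng.standard_normal((8, 2)), np.ones(8)
values = [fa_objective(S, W, psi)]
for _ in range(20):
    W, psi = fa_em_update(S, W, psi)
    values.append(fa_objective(S, W, psi))
```

Each update does not minimize the objective exactly but should not increase it, which is the behaviour expected of an EM-type step.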
5 Statistical Interpretation
As described in [Rivero et al.2017], FCMKMC falls in a statistical framework. Concretely, FCMKMC is an algorithm that performs the maximum likelihood estimation of a model parameter of a probabilistic model , where is an dimensional random variate. In the statistical framework for FCMKMC, maximum likelihood estimation is performed by finding the maximizer of the log-likelihood function
(19) 
over the model parameter . Therein, is the subvectorial variate in associated with the visible objects in the th data source; is the submatrix of associated with ; and is the empirical distribution associated with the th data source such that the second moments satisfy
(20) 
From the log-likelihood function defined in (19), it is possible to derive an EM algorithm in which the E-step computes the expected value by (8) and (9), based on the current model parameter. In the EM algorithm, the M-step updates the model parameter by the maximizer of the expected complete-data log-likelihood function [McLachlan and Krishnan2008] over . However, the model matrix is too flexible, and therefore may be overfitted to the given empirical data.
5.1 EM Algorithm for PCA Model
Here, we present a connection between the FCMKMC algorithm and the classical statistical approach for missing value estimation. Let us discuss the case of replacing the full covariance model with the probabilistic principal component analysis (PPCA) model introduced in [Tipping and Bishop1999]. When employing the PPCA model with the mean parameter fixed to zero, the probabilistic densities of the dimensional random variate are defined as
(21)  
where is the submatrix of containing the rows associated with the visible objects. The log-likelihood function of this model is given by
(22) 
which is used in finding the maximum likelihood estimate (MLE) of the model parameters and of the PPCA model. The expected complete-data log-likelihood function, also known as the Q-function, can be written as
(23)  
where we have dropped the terms that do not depend on the model parameters. Therein, the operator takes mathematical expectation under the joint posterior densities defined from the current value of and . By letting , the negative Q-function is equal to up to constants, implying that the M-step of the EM algorithm is given by (12) and (13). Hence, we can say that the PCAMKMC algorithm presented in the previous section is an EM algorithm.
5.2 EM Algorithm for FA Model
This section is concluded by showing that FAMKMC is an EM algorithm for fitting the probabilistic factor analysis (PFA) model [Bartholomew et al.2008]. In the PFA model, a latent variable vector , drawn from the isotropic Gaussian , is introduced for each data source. Then, is generated by the process , where is a Gaussian noise drawn from . For this FA model, we treat as the complete data for th data source to develop an EM algorithm for maximum likelihood estimation. The probabilistic densities of the dimensional random variate are obtained by marginalizing and out from the joint densities of the complete data:
(24) 
The Q-function is written as
(25)  
where here is the mathematical expectation that operates under the joint posterior densities depending on the current value of and obtained at the iteration. It can be shown that, by letting , the expected values computed in the th iteration in the EM algorithm are expressed as
(26) 
In the M-step of the EM algorithm, the model parameters that maximize the Q-function are found. Setting the derivative of the Q-function with respect to the model parameters to zero, it turns out that the optimal factor loading matrix and noise variance vector are given by (17). Hence, FAMKMC is an EM algorithm. (See Sect. A.1 and Sect. A.2 for derivations of the E-step and M-step, respectively.)
Table 1: Average ROC scores over ten trials for each functional class after completion of 20% missing data.

Class  zeroSVM  meanSVM  FCMKMC  PCAGK   PCAK    FAGK    FAK
1      0.7914   0.7915   0.7995  0.8015  0.8022  0.8010  0.8006
2      0.7918   0.7925   0.7975  0.8025  0.8032  0.8014  0.8021
3      0.7941   0.7933   0.8000  0.8045  0.8052  0.8029  0.8032
4      0.8418   0.8431   0.8497  0.8529  0.8534  0.8516  0.8519
5      0.8839   0.8844   0.8956  0.8972  0.8979  0.8961  0.8967
6      0.7665   0.7669   0.7745  0.7780  0.7783  0.7770  0.7770
7      0.8321   0.8328   0.8414  0.8437  0.8444  0.8429  0.8440
8      0.7336   0.7336   0.7354  0.7407  0.7418  0.7391  0.7386
9      0.7621   0.7630   0.7651  0.7706  0.7714  0.7694  0.7695
10     0.7441   0.7445   0.7485  0.7551  0.7570  0.7525  0.7556
11     0.5766   0.5757   0.5825  0.5791  0.5807  0.5793  0.5772
12     0.9357   0.9347   0.9435  0.9448  0.9453  0.9443  0.9444
13     0.6818   0.6845   0.6794  0.6913  0.6911  0.6840  0.6838
6 Experimental Settings
To test how much information the kernel matrices retain after the completion processes, we subject the completed kernel matrices to a classification task: the functional classification prediction of yeast proteins. For this task, a collection of six kernel matrices representing different data types is used: the enriched kernel matrix ; the three interaction kernel matrices , , and ; a Gaussian kernel defined directly on gene expression profiles ; and the Smith-Waterman matrix ; as described in [Lanckriet et al.2004b]. While each kernel representation contains partial information on the similarities among yeast proteins, the combination of these kernel matrices is known to provide a bigger picture of the relationships among these proteins through the different views of the data [Lanckriet et al.2004a, Lanckriet et al.2004b, Noble and BenHur2008], making the combined form more suitable for the overall predictions in the function classification task.
Meanwhile, the 13 functional classes considered are listed in [Lanckriet et al.2004b], which include metabolism, transcription, and protein synthesis, among others. If, for example, a certain protein is known to carry out metabolism and protein synthesis, then this protein is labeled as in these categories and elsewhere. This setting can then be viewed as 13 binary classification tasks.
In this study we utilized related data sources for function prediction of yeast proteins. Initially, the kernel matrices may have missing rows and columns, which correspond to some missing information about the relationships among yeast proteins in the data sources. Our goal is to infer the missing entries in the kernel matrices, whilst retaining as much valuable information about the protein relationships as possible.
Our experiments consist of two stages: the kernel matrix completion (or missing data inference) stage, and the classification stage—the details of which are given in the subsequent sections.
6.1 Data Inference Stage
In this stage, mutual completion of the kernel matrices is performed. Since our data set has no missing entries, we generated incomplete kernel matrices by artificially removing some entries, following the process in [Rivero et al.2017]. Here, rows and the corresponding columns were randomly picked, and undetermined values (zeros for the zero-imputation method, and the unconditional mean for the mean-imputation method) were imputed; details are given in [Rivero et al.2017]. For numerical stability of the two EM-based methods, is transformed to at each iteration, a trick that is often used in Gaussian fitting. In our experiments, different percentages of missing entries were considered, and the incomplete kernel matrices were initialized by zero-imputation before proceeding with the completion processes, as specified in Alg. 1.
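The generation of an incomplete kernel matrix with zero-imputation can be sketched as follows. This is an illustration only: the function name `remove_rows_cols`, the parameter `missing_frac`, and the toy linear kernel are assumptions of this example, and the exact sampling procedure of [Rivero et al.2017] may differ.

```python
import numpy as np

def remove_rows_cols(K, missing_frac, rng):
    """Zero out a random subset of rows and the matching columns of K.

    Returns the zero-imputed matrix and a boolean mask of visible objects.
    """
    n = K.shape[0]
    n_missing = int(round(missing_frac * n))
    missing = rng.choice(n, size=n_missing, replace=False)
    visible = np.ones(n, dtype=bool)
    visible[missing] = False
    K_inc = K.copy()
    K_inc[~visible, :] = 0.0      # zero-imputation of the missing rows
    K_inc[:, ~visible] = 0.0      # and of the corresponding columns
    return K_inc, visible

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
K = X @ X.T                       # toy complete kernel matrix (linear kernel)
K_inc, visible = remove_rows_cols(K, 0.2, rng)
```

The returned mask records which objects remain visible, which is the information the completion algorithms need in order to partition each kernel matrix.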
6.2 Classification Stage
After the completion process, a support vector machine (SVM) [Cristianini and ShaweTaylor2000, Schölkopf and Smola2002] is used to predict whether a yeast protein belongs to a certain functional class or not. Since a yeast protein is not limited to a single functional class, the prediction problem is structured as 13 binary classification tasks, where an SVM classifier is trained on 20% randomly picked data points on the combined kernel matrices. We then assess, in each functional class, the classification performance of the algorithms via the receiver operating characteristic (ROC) score, a widely used performance measure for imbalanced data sets. A higher ROC score means better classification performance. The experiments were performed ten times, and the averages of the ROC scores across the ten trials are recorded in Tab. 1.
7 Experimental Results
In this section we present experimental comparisons among the five multiple kernel completion techniques: zeroSVM, meanSVM, FCMKMC, PCAMKMC, and FAMKMC. We refer to the completion methods zero-imputation and mean-imputation as zeroSVM and meanSVM, respectively, after an SVM classifier has been trained. In our experiments, we used two criteria in choosing the number of principal components for the PCA and FA models: the Guttman-Kaiser and Kaiser criteria, where is the number of all eigenvalues greater than the mean of the eigenvalues and greater than one, respectively [Jolliffe1986]. Henceforth, we will use PCAGK and PCAK to refer to PCAMKMC, and FAGK and FAK to refer to FAMKMC, with the number of principal components chosen via the Guttman-Kaiser criterion and the Kaiser criterion, respectively.
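The two selection rules can be sketched as below. The text does not say here which matrix the eigenvalues are taken from, so the input `S` is simply assumed to be a symmetric matrix; the function names are hypothetical.

```python
import numpy as np

def n_components_guttman_kaiser(S):
    """Number of eigenvalues of S greater than the mean eigenvalue."""
    lam = np.linalg.eigvalsh(S)
    return int(np.sum(lam > lam.mean()))

def n_components_kaiser(S):
    """Number of eigenvalues of S greater than one."""
    lam = np.linalg.eigvalsh(S)
    return int(np.sum(lam > 1.0))

# Toy symmetric matrix with eigenvalues 6.0, 1.5, 0.5, 0.3, 0.2 (mean 1.7).
S = np.diag([6.0, 1.5, 0.5, 0.3, 0.2])
q_gk = n_components_guttman_kaiser(S)
q_k = n_components_kaiser(S)
```

On this toy matrix the two rules disagree, which illustrates why both variants are reported separately in the experiments.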
The ROC scores of the completion methods after inferring the missing 20% of the data are summarized in Tab. 1. Note, however, that the experiments were done on different percentages of missing entries; due to lack of space, the results for the other cases are reported in a longer version of this paper.
In the case of completion of 20% missing data, the proposed method of restricting the model covariance achieves the highest ROC score in all classes except the 11th functional class, where FCMKMC obtains the highest ROC score; in this case, PCAGK and PCAK show no statistical difference from FCMKMC according to a one-sample test. It can also be noted that, in most cases, the classification performance of the restricted covariance models is not significantly lower than the highest ROC scores.
8 Conclusion
In this study we present new methods, called PCAMKMC and FAMKMC, to solve the problem of mutually inferring the missing entries of kernel matrices while controlling the flexibility of the model. In contrast to the full-covariance model parameter in the existing method, our algorithm imposes a restriction on the model covariance, capturing only the most relevant information in the data set through the principal components or factors of the combined kernel matrices. Moreover, utilizing the LogDet divergence ensures the positive definiteness of the resulting inferred kernel matrices. Our proposed method of restricting the model covariance matrix via PPCA and PFA resulted in significant improvements in the generalization performance, as shown in our empirical results for the function classification prediction task in yeast proteins.
Acknowledgment
This work was supported by JSPS KAKENHI Grant Number 40401236.
References
 [Bartholomew et al.2008] D.J. Bartholomew, F. Steele, J. Galbraith, and I. Moustaki. Analysis of Multivariate Social Science Data, Second Edition. Chapman & Hall/CRC Statistics in the Social and Behavioral Sciences. Taylor & Francis, 2008.
 [Bhadra et al.2017] Sahely Bhadra, Samuel Kaski, and Juho Rousu. Multiview kernel completion. Machine Learning, 106(5):713–739, May 2017.
 [Cristianini and ShaweTaylor2000] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.
 [Davis et al.2007] Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information-theoretic metric learning. In Proceedings of the International Conference on Machine Learning, pages 209–216. ACM, 2007.
 [Dempster et al.1977] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.
 [Deng et al.2004] Minghua Deng, Ting Chen, and Fengzhu Sun. An integrated probabilistic model for functional prediction of proteins. Journal of Computational Biology, 11(2–3):463–475, 2004.
 [Gönen2012] Mehmet Gönen. Bayesian efficient multiple kernel learning. In 29th International Conference on Machine Learning, 2012.
 [Jolliffe1986] I.T. Jolliffe. Principal Component Analysis. Springer Verlag, 1986.
 [Kato et al.2005] Tsuyoshi Kato, Koji Tsuda, and Kiyoshi Asai. Selective integration of multiple biological data for supervised network inference. Bioinformatics, 21(10):2488–2495, 2005.
 [Kin et al.2004] Taishin Kin, Tsuyoshi Kato, and Koji Tsuda. Protein classification via kernel matrix completion. In B. Schölkopf, K. Tsuda, and J.P. Vert (eds), Kernel Methods in Computational Biology, chapter 3, pages 261–274. The MIT Press, 2004.
 [Lanckriet et al.2004a] Gert R. G. Lanckriet, Tijl De Bie, Nello Cristianini, Michael I. Jordan, and William Stafford Noble. A statistical framework for genomic data fusion. Bioinformatics, 20(16):2626–2635, nov 2004. http://dx.doi.org/10.1093/bioinformatics/bth294.
 [Lanckriet et al.2004b] G.R.G. Lanckriet, M. Deng, N. Cristianini, M.I. Jordan, and W.S. Noble. Kernel-based data fusion and its application to protein function prediction in yeast, 2004.

 [Matsuzawa et al.2016] Tomoki Matsuzawa, Raissa Relator, Jun Sese, and Tsuyoshi Kato. Stochastic Dykstra algorithms for metric learning with positive definite covariance descriptors. In The 14th European Conference on Computer Vision (ECCV2016), pages 786–799, 2016.
 [McLachlan and Krishnan2008] Geoffrey J. McLachlan and Thriyambakam Krishnan. The EM algorithm and extensions, 2nd Edition. Wiley series in probability and statistics. Wiley, Hoboken, NJ, 2008.
 [Noble and BenHur2008] William Stafford Noble and Asa Ben-Hur. Integrating Information for Protein Function Prediction, chapter 35, pages 1297–1314. Wiley-VCH Verlag GmbH, Weinheim, Germany, Feb 2008.
 [Rivero et al.2017] Rachelle Rivero, Richard Lemence, and Tsuyoshi Kato. Mutual kernel matrix completion. IEICE Transactions on Information & Systems, E100-D(8):1844–1851, Aug 2017.
 [Schölkopf and Smola2002] B. Schölkopf and AJ. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, USA, dec 2002.
 [ShaweTaylor and Cristianini2004] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.
 [Tipping and Bishop1999] M. E. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analyzers. Neural Comput, 11(2):443–82, Feb 1999.
Appendix A Proofs and Derivations
A.1 Derivation of E-Step for FAMKMC
The joint densities of the complete data, say , are the Gaussian distribution with zero mean and covariance matrix given by
(27) 
Let . The expectation in the Q-function (25) at the th iteration is taken under the posterior distribution of the complete data given by
(28) 
where
(29) 
From the nature of Gaussian, we have
(30) 
In the E-step, the terms of the second moments contained in the Q-function are computed. Using a derivation similar to that described in [Rivero et al.2017], the second moment of under is expressed as
(31) 
This allows us to write the second moment of as
(32)  
and the second moment of as
(33)  
where
(34) 
A.2 Derivation of M-Step for FAMKMC
The derivatives of the function with respect to and for are written as
(35) 
and
(36) 
Setting them to zero yields (17).
A.3 Proof of Proposition 4.1
Let . The objective function of the model update step in FAMKMC algorithm can be rewritten as
(37)  
where the last equality follows because for any and any , it holds that
(38) 
Meanwhile, the Q-function in the EM algorithm can be rearranged as
(39)  
Recall that is determined to be the value maximizing , which implies that, if we denote by the Kullback-Leibler divergence between two probabilistic density functions, we have
(40)  
where the second inequality follows since the Kullback-Leibler divergence is nonnegative. The proof is now complete.