1 Introduction
In this paper we tackle the problem of supervised multiview learning, where each labeled example is observed under several views. These views might be not only correlated, but also complementary, redundant or contradictory. Thus, learning over all the views is expected to produce a final classifier (or regressor) that is better than each individual one. Multiview learning is well known in the semi-supervised setting, where the agreement among views is usually optimized [4, 28]. Yet, the supervised setting has proven to be interesting as well, independently of any agreement condition on views. Co-regularization and multiple kernel learning (MKL) are two well-known kernel-based frameworks for learning in the presence of multiple views of data [31]. The former attempts to optimize measures of agreement and smoothness between the views over labeled and unlabeled examples [26]; the latter tries to efficiently combine multiple kernels defined on each view to exploit information coming from different representations [11]. More recently, vector-valued reproducing kernel Hilbert spaces (RKHSs) have been introduced to the field of multiview learning to go further than MKL by incorporating in the learning model both within-view and between-view dependencies [20, 14]. It turns out that these kernels and their associated vector-valued reproducing kernel Hilbert spaces provide a unifying framework for a number of previous multiview kernel methods, such as co-regularized multiview learning and manifold regularization, and naturally allow encoding within-view as well as between-view similarities [21].
Kernels of vector-valued RKHSs are positive semidefinite matrix-valued functions. They have been applied with success in various machine learning problems, such as multitask learning [10], functional regression [15] and structured output prediction [5]. The main advantage of matrix-valued kernels is that they offer a higher degree of flexibility in encoding similarities between data points. However, finding the optimal matrix-valued kernel for a given application is difficult, as is the question of how to build one. In order to overcome the need for choosing a kernel before the learning process, we propose a supervised metric learning approach that learns a matrix-valued multiview kernel jointly with the decision function. We refer the reader to [3] for a review of metric learning. It is worth mentioning that algorithms for learning matrix-valued kernels have been proposed in the literature, see for example [9, 8, 17]. However, these methods mainly consider separable kernels, which are not suited to the multiview setting, as will be illustrated later in this paper.
The main contributions of this paper are: 1) we introduce and learn a new class of matrix-valued kernels designed to handle multiview data; 2) we give an iterative algorithm that learns simultaneously a vector-valued multiview function and a block-structured metric between views; 3) we provide a generalization analysis of our algorithm with a Rademacher bound; and 4) we show how matrix-valued kernels can be efficiently computed via a block-wise Nyström approximation in order to significantly reduce their high computational cost.
2 Preliminaries
We start here by briefly reviewing the basics of vector-valued RKHSs and their associated matrix-valued kernels. We then describe how they can be used for learning from multiview data.
2.1 Vector-valued RKHSs
Vector-valued RKHSs were introduced to the machine learning community by Micchelli and Pontil [19] as a way to extend kernel machines from scalar to vector outputs. In this setting, given a random training sample $\{x_i, y_i\}_{i=1}^n$ on $\mathcal{X} \times \mathcal{Y}$, the optimization problem
$$\arg\min_{f \in \mathcal{H}} \sum_{i=1}^n V(f, x_i, y_i) + \lambda \|f\|_{\mathcal{H}}^2, \tag{1}$$
where $f$ is a vector-valued function and $V$ is a loss function, can be solved in a vector-valued RKHS $\mathcal{H}$ by means of a vector-valued extension of the representer theorem. To see this more clearly, we recall some fundamentals of vector-valued RKHSs.
Definition 1.
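(vector-valued RKHS)
A Hilbert space $\mathcal{H}$ of functions from $\mathcal{X}$ to $\mathbb{R}^v$ is called a reproducing kernel Hilbert space if there is a positive definite $\mathbb{R}^{v \times v}$-valued kernel $K$ on $\mathcal{X} \times \mathcal{X}$ such that:

i) the function $z \mapsto K(x, z)y$ belongs to $\mathcal{H}$ for all $x \in \mathcal{X}$ and $y \in \mathbb{R}^v$,

ii) for all $f \in \mathcal{H}$, $x \in \mathcal{X}$ and $y \in \mathbb{R}^v$, $\langle f, K(x, \cdot)y \rangle_{\mathcal{H}} = \langle f(x), y \rangle_{\mathbb{R}^v}$ (reproducing property).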
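Definition 2.
(matrix-valued kernel)
An $\mathbb{R}^{v \times v}$-valued kernel $K$ on $\mathcal{X} \times \mathcal{X}$ is a function $K(\cdot, \cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^{v \times v}$; it is positive semidefinite if:

i) $K(x, z) = K(z, x)^\top$, where $\top$ denotes the transpose of a matrix,

ii) and, for every $r \in \mathbb{N}$ and all $\{(x_i, y_i)\}_{i=1,\ldots,r} \in \mathcal{X} \times \mathbb{R}^v$, $\sum_{i,j} \langle y_i, K(x_i, x_j) y_j \rangle_{\mathbb{R}^v} \geq 0$.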
Important results for matrix-valued kernels include the positive semidefiniteness of the kernel $K$ and the fact that a solution of the regularized optimization problem (1) is obtained via a representer theorem. It states that the solution $\hat{f} \in \mathcal{H}$ of a learning problem can be written as
$$\hat{f}(x) = \sum_{i=1}^n K(x, x_i) c_i, \quad \text{with } c_i \in \mathbb{R}^v.$$
Some well-known classes of matrix-valued kernels include separable and transformable kernels. Separable kernels are defined by
$$K(x, z) = k(x, z)\, T,$$
where $T$ is a matrix in $\mathbb{R}^{v \times v}$. This class of kernels is very attractive in terms of computational time, as it is easily decomposable. However, the matrix $T$ acts only on the outputs independently of the input data, which makes it difficult for these kernels to encode the necessary similarities in the multiview setting. Transformable kernels are defined by
$$[K(x, z)]_{lm} = k(S_m x, S_l z).$$
Here $m$ and $l$ are indices of the output matrix (views in the multiview setting) and the operators $\{S_t\}_{t=1}^v$ are used to transform the data. In contrast to separable kernels, here the $S_t$ operate on input data; however, choosing them is a difficult task. For further reading on matrix-valued reproducing kernels, see, e.g., [1, 6, 7, 15].
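As a minimal sketch of why separable kernels are computationally convenient, the block Gram matrix of a separable kernel over a sample is a Kronecker product of the output matrix and a scalar Gram matrix. The Gaussian scalar kernel, the random output matrix `T` and the view-major block ordering below are illustrative assumptions, not choices taken from the text.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Scalar Gaussian Gram matrix with entries exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def separable_block_gram(K_scalar, T):
    """Block Gram matrix of K(x, z) = k(x, z) T; block (l, m) equals T[l, m] * K_scalar."""
    return np.kron(T, K_scalar)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # 5 toy samples
K_scalar = gaussian_gram(X)
B = rng.normal(size=(2, 2))
T = B @ B.T                          # hypothetical PSD output matrix (2 outputs)
G = separable_block_gram(K_scalar, T)
min_eig = np.linalg.eigvalsh(G).min()
```

Every block of `G` is the same scalar Gram matrix up to a constant, which is the sense in which `T` acts on the outputs independently of the inputs.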
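2.2 Vector-valued multiview learning
This section reviews the setup for supervised multiview learning in vector-valued RKHSs [14, 21]. The main idea is to consider a kernel that measures not only the similarities between examples of the same view but also those coming from different views. Reproducing kernels of vector-valued Hilbert spaces allow encoding these similarities in a natural way, taking into account both within-view and between-view dependencies. Indeed, a kernel function $K$ in this setting outputs a matrix in $\mathbb{R}^{v \times v}$, with $v$ the number of views, so that $K(x_i, x_j)_{lm}$, $l, m = 1, \ldots, v$, is the similarity measure between examples $x_i$ and $x_j$ from the views $l$ and $m$.
More formally, consider a set of $n$ labeled data $\{(x_i, y_i) \in \mathcal{X} \times \mathcal{Y},\ i = 1, \ldots, n\}$, where $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Y} = \{-1, 1\}$ for classification or $\mathcal{Y} \subset \mathbb{R}$ for regression. Also assume that each input instance $x_i = (x_i^1, \ldots, x_i^v)$ is seen in $v$ views, where $x_i^l \in \mathbb{R}^{d_l}$ and $\sum_{l=1}^v d_l = d$. The supervised multiview learning problem can be thought of as trying to find the vector-valued function $\hat{f}(\cdot) = (\hat{f}^1(\cdot), \ldots, \hat{f}^v(\cdot))$, with $\hat{f}^l(x) \in \mathcal{Y}$, solution of
$$\arg\min_{f \in \mathcal{H},\, W} \sum_{i=1}^n V(y_i, W(f(x_i))) + \lambda \|f\|^2. \tag{2}$$
Here $f$ is a vector-valued function that groups $v$ learning functions, each corresponding to one view, and $W : \mathbb{R}^v \to \mathbb{R}$ is a combination operator for combining the results of the learning functions.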
While the vector-valued extension of the representer theorem provides an algorithmic way of computing the solution of the multiview learning problem (2), the question of choosing the multiview kernel remains crucial to take full advantage of the vector-valued learning framework. In [14], a matrix-valued kernel based on cross-covariance operators on RKHSs, which allows modeling variables of multiple types and modalities, was proposed. However, it has two major drawbacks: i) the kernel is fixed in advance and does not depend on the learning problem, and ii) it is computationally expensive and becomes infeasible when the problem size is very large. We avoid both of these issues by learning a block low-rank metric in kernel feature spaces.
3 Multi-View Metric Learning
Here we introduce an optimization problem for learning simultaneously a vector-valued multiview function and a positive semidefinite metric between kernel feature maps, as well as an operator for combining the answers from the views to yield the final decision. We then derive a three-step metric learning algorithm for multiview data and give a Rademacher bound for it. Finally we demonstrate how it can be implemented efficiently via a block-wise Nyström approximation and give a block-sparse version of our formulation.
3.1 Matrix-valued multiview kernel
We consider the following class of matrix-valued kernels that can operate over multiple views:
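$$K(x_i, x_j)_{lm} = \langle \Phi_l(x_i^l), C_{X_l X_m} \Phi_m(x_j^m) \rangle, \tag{3}$$
where $\Phi_l$ (resp. $\Phi_m$) is the feature map associated to the scalar-valued kernel $k_l$ (resp. $k_m$) defined on the view $l$ (resp. $m$). In the following we will leave out the view label from the data instance when the feature map or kernel function already has that information, e.g. instead of $\Phi_l(x_i^l)$ we write $\Phi_l(x_i)$. $C_{X_l X_m} : \mathcal{H}_m \to \mathcal{H}_l$ is a linear operator between the scalar-valued RKHSs $\mathcal{H}_l$ and $\mathcal{H}_m$ of kernels $k_l$ and $k_m$, respectively. The operator $C_{X_l X_m}$ allows one to encode both within-view and between-view similarities.
The choice of the operator $C_{X_l X_m}$ is crucial and depends on the multiview problem at hand. In the following we only consider operators $C_{X_l X_m}$ that can be written as $C_{X_l X_m} = \Phi_l A_{lm} \Phi_m^\top$, where $\Phi_s = (\Phi_s(x_1), \ldots, \Phi_s(x_n))$ with $s = l, m$ and $A_{lm} \in \mathbb{R}^{n \times n}$ is a positive definite matrix which plays the role of a metric between the two feature maps associated with kernels $k_l$ and $k_m$ defined over the views $l$ and $m$. This is a large set of possible operators, but it depends on a finite number of parameters. It gives us the following class of kernels
$$K(x_i, x_j)_{lm} = \langle \Phi_l(x_i), \Phi_l A_{lm} \Phi_m^\top \Phi_m(x_j) \rangle = \langle \Phi_l^\top \Phi_l(x_i), A_{lm} \Phi_m^\top \Phi_m(x_j) \rangle = \langle k_l(x_i), A_{lm} k_m(x_j) \rangle, \tag{4}$$
where we have written $k_l(x_i) = (k_l(x_t, x_i))_{t=1}^n$. We note that this class is not in general separable or transformable. However, in the special case when it is possible to write $A_{ml} = A_m A_l$ the kernel is transformable.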
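It is easy to see that the $lm$-th block of the block kernel matrix $\mathbf{K}$ built from the matrix-valued kernel (4) can be written as $K_l A_{lm} K_m$, where $K_s = (k_s(x_i, x_j))_{i,j=1}^n$ for view $s$. The block kernel matrix $\mathbf{K} = (K(x_i, x_j))_{i,j=1}^n$ in this case has the form
$$\mathbf{K} = H A H, \tag{5}$$
where $H = \operatorname{blockdiag}(K_1, \ldots, K_v)$ (footnote 1: given a set of $n \times n$ matrices $K_1, \ldots, K_v$, $H = \operatorname{blockdiag}(K_1, \ldots, K_v)$ is the block diagonal matrix satisfying $H_{l,l} = K_l$ for all $l = 1, \ldots, v$), and the matrix $A = (A_{lm})_{l,m=1}^v \in \mathbb{R}^{nv \times nv}$ encodes pairwise similarities between all the views. Multiview metric learning then corresponds to simultaneously learning the metric $A$ and the classifier or regressor.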
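To make the block structure concrete, the following sketch assembles a multiview block kernel matrix under the assumption that it factorizes as H A H, with H the block-diagonal matrix of one-view Gram matrices and A an nv-by-nv metric. The linear one-view kernels and the random positive definite `A` are placeholders chosen only for illustration.

```python
import numpy as np

def block_diag(mats):
    """Place square matrices on the diagonal of a larger zero matrix."""
    total = sum(m.shape[0] for m in mats)
    H = np.zeros((total, total))
    start = 0
    for m in mats:
        size = m.shape[0]
        H[start:start + size, start:start + size] = m
        start += size
    return H

def multiview_gram(view_kernels, A):
    """Block kernel matrix H A H; block (l, m) equals K_l A_lm K_m."""
    H = block_diag(view_kernels)
    return H @ A @ H

rng = np.random.default_rng(0)
n, v = 4, 2
views = [rng.normal(size=(n, 3)), rng.normal(size=(n, 2))]   # toy data, one array per view
view_kernels = [Xv @ Xv.T for Xv in views]                   # linear one-view kernels
B = rng.normal(size=(n * v, n * v))
A = B @ B.T + np.eye(n * v)                                  # hypothetical positive definite metric
G = multiview_gram(view_kernels, A)
```

With `A` set to the identity, the off-diagonal blocks of `G` vanish and only one-view terms remain, which corresponds to an MKL-like kernel.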
From this framework, with suitable choices of $A$, we can recover the cross-covariance multiview kernel of [14], or for example an MKL-like multiview kernel containing only one-view kernels.
3.2 Algorithm
Using the vectorvalued representer theorem, the multiview learning problem (2) becomes
We set to be the square loss function and assume the operator to be known. We choose it to be weighted sum of the outputs: giving us . Let be the output vector . The previous optimization problem can now be written as
where is the block kernel matrix associated to the matrixvalued kernel (3) and is a vector containing weights for combining the final result.
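The weighted-sum combination can be illustrated with a small numerical sketch. We assume here that the per-view outputs are stacked view by view into one vector of length nv and that the combination operator is the Kronecker product of the transposed weight vector with the n-by-n identity; both the stacking order and the toy numbers are assumptions made for the example.

```python
import numpy as np

n, v = 3, 2
w = np.array([0.25, 0.75])                      # hypothetical view weights
f_views = [np.array([1.0, 2.0, 3.0]),           # outputs of the function for view 1
           np.array([-1.0, 0.0, 1.0])]          # outputs of the function for view 2
f_stacked = np.concatenate(f_views)             # view-major stacking, length n * v

W = np.kron(w[None, :], np.eye(n))              # combination operator, shape (n, n * v)
combined = W @ f_stacked                        # one combined prediction per example
direct = sum(w_m * f_m for w_m, f_m in zip(w, f_views))
```

Both `combined` and `direct` hold the same per-example weighted sum of the view outputs; the operator form is what enters the matrix expressions of the optimization problem.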
Using (5), and considering an additional regularizer we formulate the multiview metric learning (MVML) optimization problem:
(6)  
Here we have restricted the block metric matrix $A$ to be positive definite and we penalize its complexity via the Frobenius norm.
Inspired by [8] we make a change of variable in order to obtain a solution. Using a mapping we obtain the equivalent learning problem:
(7)  
It is good to note that despite the misleading similarities between our work and that of [8], we use different mappings for solving our problems, which are also formulated differently. We also consider different classes of kernels as [8] considers only separable kernels.
Remark.
We use an alternating scheme to solve our problem. For fixed $A$, we arrive at the following solution for $g$:
(8) 
The solution of (7) for $A$ with fixed $g$ is obtained by gradient descent, where the update rule is given by
(9) 
where is the step size. Technical details of the derivations can be found in the supplementary material (Appendix A.1). It is important to note here that Equation (9) is obtained by solving the optimization problem (7) without considering the positivity constraint on $A$. Despite this, (when ) the obtained $A$ is symmetric and positive (compare to [13]), and hence the learned matrix-valued multiview kernel is valid.
If so desired, it is also possible to learn the weights $w$. For fixed $g$ and $A$ the solution for $w$ is
(10) 
where is filled columnwise from .
Our MVML algorithm thus iterates over solving $A$, $g$ and, if the weights are to be learned, $w$ (see Algorithm 1, version a). The complexity of the algorithm is $O(v^3 n^3)$, since it computes the inverse of the $nv \times nv$ matrix $A$, required for calculating $g$. We will show later how to reduce the computational complexity of our algorithm via Nyström approximation, while conserving the desirable information about the multiview problem.
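The alternating scheme can be sketched in a few lines under explicit assumptions. We assume that, after the change of variable, the objective takes the form ||y - (w^T kron I_n) H g||^2 + lambda g^T A^{-1} g + eta ||A||_F^2 with A invertible; the exact expressions of (7) to (9) should be taken from the original equations, and the function names, toy data and parameter values below are hypothetical. The weights are kept fixed and uniform.

```python
import numpy as np

def solve_g(H, A, w, y, lam):
    """Closed-form minimizer over g of ||y - (w^T kron I) H g||^2 + lam * g^T A^{-1} g."""
    n = y.shape[0]
    WH = np.kron(w[None, :], np.eye(n)) @ H
    return np.linalg.solve(WH.T @ WH + lam * np.linalg.inv(A), WH.T @ y)

def gradient_step_A(A, g, lam, eta, mu):
    """One gradient step over A on lam * g^T A^{-1} g + eta * ||A||_F^2 (A symmetric)."""
    Ainv_g = np.linalg.solve(A, g)
    grad = 2.0 * eta * A - lam * np.outer(Ainv_g, Ainv_g)
    return A - mu * grad

rng = np.random.default_rng(0)
n, v = 5, 2
H = np.zeros((n * v, n * v))
for l in range(v):                               # block-diagonal linear one-view kernels
    Xl = rng.normal(size=(n, 3))
    H[l * n:(l + 1) * n, l * n:(l + 1) * n] = Xl @ Xl.T
y = rng.normal(size=n)
w = np.full(v, 1.0 / v)                          # uniform weights, kept fixed
A = np.eye(n * v)                                # initial metric
lam, eta, mu = 0.1, 0.01, 0.5                    # chosen so that 2 * mu * eta < 1

for _ in range(10):                              # alternate between g and A
    g = solve_g(H, A, w, y, lam)
    A = gradient_step_A(A, g, lam, eta, mu)
```

In this sketch the update of `A` is a positive multiple of the previous iterate plus a rank-one positive semidefinite term whenever 2 * mu * eta < 1, so the iterate stays symmetric positive definite without an explicit projection.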
Figure 1: Simple two-view dataset and its transformations. Left: original data, where one of the views is completely generated from the other by a linear transformation (a shear mapping followed by a rotation); left middle: MKL transformation; right middle: MVML transformation; right: OKL transformation. MVML shows a linear separation of the classes (blue/pale red) of the views (circles/triangles), while MKL and OKL do not.
3.3 Illustration
We illustrate with simple toy data the effects of learning both within-view and between-view metrics. We compare our method, MVML, to MKL, which considers only within-view dependencies, and to output kernel learning (OKL) [8, 9], where separable kernels are learnt. We generated an extremely simple dataset of two classes and two views in $\mathbb{R}^2$, allowing for visualization and understanding of the way the methods perform classification with multiview data. The second view in the dataset is completely generated from the first, through a linear transformation (a shear mapping followed by a rotation). The generated data and the transformations arising from applying the algorithms are shown in Figure 1. The space for the transformed data is $\mathbb{R}^2$ since we used linear kernels for simplicity. Our MVML is the only method giving a linear separation of the two classes. This means that it groups the data points based on their class, not their view, and thus is able to construct a good approximation of the initial data transformations by which we generated the second view.
3.4 Rademacher complexity bound
We now provide a generalization analysis of the MVML algorithm using Rademacher complexities [2]. The notion of Rademacher complexity has been generalized to vector-valued hypothesis spaces [18, 27, 24]. Previous work has analyzed the case where the matrix-valued kernel is fixed prior to learning, while our analysis considers the kernel learning problem. It provides a Rademacher bound for our algorithm when both the vector-valued function and the metric between views are learnt. We start by recalling that the feature map associated to the matrix-valued kernel $K$ is the mapping $\Gamma : \mathcal{X} \to \mathcal{L}(\mathcal{Y}, \mathcal{H})$, where $\mathcal{X}$ is the input space, $\mathcal{Y} = \mathbb{R}^v$, and $\mathcal{L}(\mathcal{Y}, \mathcal{H})$ is the set of bounded linear operators from $\mathcal{Y}$ to $\mathcal{H}$ (see, e.g., [19, 7] for more details). It is known that $K(x, z) = \Gamma(x)^* \Gamma(z)$. We denote by $\Gamma_A$ the feature map associated to our multiview kernel (Equation 4). The hypothesis class of MVML is
with and is a regularization parameter. Let be an iid family of vectors of independent Rademacher variables where . The empirical Rademacher complexity of the vectorvalued class is the function defined as
Theorem 1.
The empirical Rademacher complexity of can be upper bounded as follows:
where , and is the Gram matrix computed from the training set with the kernel defined on the view . For kernels such that , we have
The proof of the theorem can be found in the supplementary material (Appendix A.2). Using well-known results [22, chapter 10], this bound on the Rademacher complexity can be used to obtain a generalization bound for our algorithm. It is worth mentioning that in our multiview setting the matrix-valued kernel is computed from the product of the kernel matrices defined over the views. This is why our assumption is on the trace of the square of the kernel matrices . It is more restrictive than the usual one in the one-view setting (), but is satisfied in some cases, for example for diagonally dominant kernel matrices [25]. Whether our Rademacher bound could be obtained under a much less restrictive assumption on the kernels over the views is left for future work.
3.5 Block-sparsity and efficient implementation via block-Nyström approximation
In this section we consider a variant of our formulation (6) which allows block-sparse solutions for the metric matrix $A$, and further show how to reduce the complexity of the required computations for our algorithms.
Block-sparsity
We formulate a second optimization problem to study the effect of sparsity over $A$. Instead of having, for example, an $\ell_1$-norm regularizer over the whole matrix, we consider sparsity on a group level so that whole blocks corresponding to pairs of views are set to zero. Intuitively, the block-sparse result will give insight as to which views are interesting and worth taking into account in learning. For example, by tuning the parameter controlling the sparsity level one could derive, in some sense, an order of importance for the views and their combinations. The convex optimization problem is as follows
(11) 
where we have a regularizer over set of groups we consider for sparsity. In our multiview setting these groups correspond to combinations of views; e.g. with three views the matrix would consist of six groups:
We note that when we speak of combinations of views we include both blocks of the matrix that this combination corresponds to. Using this group regularization, in essence, allows us to have viewsparsity in our multiview kernel matrix.
To solve this optimization problem we introduce the same mapping as before, and obtain the same solution for $g$ and $w$. However, (11) does not have an obvious closed-form solution for $A$, so it is solved with a proximal gradient method, the update rule being
(12)  
where is the step size, , and .
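The proximal step for a group-sparsity regularizer can be illustrated with the standard block soft-thresholding operator. This is a generic sketch of the proximal operator of a sum of group Frobenius norms, with one group per view pair containing both symmetric blocks; it is not claimed to reproduce update (12) exactly, and the threshold `tau` and the toy matrix are hypothetical.

```python
import numpy as np

def block_soft_threshold(A, n, v, tau):
    """Shrink each view-pair group {(l, m), (m, l)} by its joint Frobenius norm."""
    out = np.zeros_like(A)
    for l in range(v):
        for m in range(l, v):
            sl = slice(l * n, (l + 1) * n)
            sm = slice(m * n, (m + 1) * n)
            sq_norm = np.sum(A[sl, sm] ** 2)
            if m != l:                      # off-diagonal groups contain two blocks
                sq_norm += np.sum(A[sm, sl] ** 2)
            norm = np.sqrt(sq_norm)
            scale = max(0.0, 1.0 - tau / norm) if norm > 0 else 0.0
            out[sl, sm] = scale * A[sl, sm]
            out[sm, sl] = scale * A[sm, sl]
    return out

n, v = 2, 2
A = np.block([[3.0 * np.eye(n), 0.1 * np.ones((n, n))],
              [0.1 * np.ones((n, n)), 4.0 * np.eye(n)]])
A_sparse = block_soft_threshold(A, n, v, tau=1.0)
```

In this toy case the weak between-view group falls below the threshold and is set to zero as a whole, while the within-view blocks are only shrunk; with three views the loops visit six groups.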
We note that even if we begin the iteration with a positive definite (pd) matrix, the next iterate is not guaranteed to always be pd, and this is the reason for omitting the positivity constraint in the formulation of the sparse problem (Equation 11). Nevertheless, all block-diagonal results are pd, and so are other results if certain conditions hold. In experiments we have observed that the solution is positive semidefinite. The full derivation of the proximal algorithm and notes about the positiveness of $A$ are in the supplementary material (Appendix A.1).
Nyström approximation
As a way to reduce the complexity of the required computations we propose using Nyström approximation on each oneview kernel matrix. In Nyström approximation method [30], a (scalarvalued) kernel matrix is divided in four blocks,
and is approximated by , where and . Denote as the number of rows of chosen to build . This scheme gives a lowrank approximation of by sampling examples, and only the last block, , will be approximated.
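For readers unfamiliar with the method, the standard Nyström construction can be sketched as follows. The notation is ours: `Q` denotes the first `p` columns of the kernel matrix, `W` its top-left `p`-by-`p` block, and the approximation is Q W^+ Q^T with W^+ a pseudo-inverse. The Gaussian kernel, the toy data and the cutoff `rcond` are illustrative assumptions.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Scalar Gaussian Gram matrix with entries exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def nystrom(K, p, rcond=1e-10):
    """Low-rank approximation of K built from its first p rows and columns."""
    Q = K[:, :p]                     # n x p block of sampled columns
    W = K[:p, :p]                    # p x p block on the sampled examples
    return Q @ np.linalg.pinv(W, rcond=rcond) @ Q.T

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
K = gaussian_gram(X)
K_hat = nystrom(K, p=3)              # rank at most 3
K_full = nystrom(K, p=6)             # sampling everything recovers K
```

The rows and columns indexed by the sampled examples are reproduced exactly, so the approximation error is confined to the remaining bottom-right block.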
We could approximate the block kernel matrix directly by applying the Nyström approximation, but this would have the effect of removing the block structure in the kernel matrix and consequently the useful multiview information might be lost. Instead, we proceed in a way that is consistent with the multiview problem and approximate each kernel matrix defined over one view as . The goodness of approximation is based on the chosen. Before performing the approximation a random ordering of the samples is calculated. We note that in our multiview setting we have to impose the same ordering over all the views. We introduce the Nyström approximation to all our singleview kernels and define . We can now approximate our multiview kernel (5) as
where we have written . Using this scheme, we obtain a blockwise Nyström approximation of that preserves the multiview structure of the kernel matrix while allowing substantial computational gains.
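The block-wise scheme can be sketched under the following assumptions, which use our own notation: each one-view kernel is approximated as U_l U_l^T with U_l = Q_l (W_l^+)^{1/2}, the factors are gathered in a block-diagonal matrix U, and the multiview kernel H A H is approximated by U (U^T A U) U^T. The Gaussian kernels, the random metric `A` and the eigenvalue cutoff are placeholders.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def nystrom_factor(K, p, tol=1e-10):
    """Return U such that U U^T equals the Nystrom approximation Q W^+ Q^T."""
    Q, W = K[:, :p], K[:p, :p]
    evals, evecs = np.linalg.eigh(W)
    keep = evals > tol * evals.max()             # discard numerically null directions
    return Q @ (evecs[:, keep] / np.sqrt(evals[keep]))

def block_diag(blocks):
    """Block-diagonal matrix from possibly rectangular blocks."""
    out = np.zeros((sum(b.shape[0] for b in blocks), sum(b.shape[1] for b in blocks)))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out

rng = np.random.default_rng(0)
n, v, p = 6, 2, 3
# The same sample ordering is used in every view, so the first p rows match across views.
view_kernels = [gaussian_gram(rng.normal(size=(n, 2))) for _ in range(v)]
U = block_diag([nystrom_factor(K, p) for K in view_kernels])   # shape (n v, p v)
B = rng.normal(size=(n * v, n * v))
A = B @ B.T + np.eye(n * v)                                     # hypothetical metric
A_tilde = U.T @ A @ U                                           # reduced (p v) x (p v) metric
G_approx = U @ A_tilde @ U.T
```

The reduced metric `A_tilde` has pv rows instead of nv, which is where the computational saving comes from, while the block-diagonal `U` keeps the views separated.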
We introduce the Nyström approximation into (6) and (11) and write resulting in
(13)  
and
(14)  
We note that the optimization problems are not strictly equivalent to the ones before; namely we impose the Frobenius norm regularization over rather than over . The obtained solution for (13) will again satisfy the positivity condition when . For the sparse solution the positivity is unfortunately not always guaranteed, but is achieved if certain conditions hold.
We solve the problems as before, and obtain:
(15) 
(16) 
and
(17) 
Here is filled columnwise from . For our blocksparse method with we get update rule
(18)  
where and is the step size.
We follow the same algorithm as before for calculating the solution; now over and (Algorithm 1, version b). The complexity is now of order rather than , where is the number of samples chosen for the Nyström approximation in each block. From the obtained solution it is possible to calculate the original and if needed.
To also reduce the complexity of predicting with our multiview kernel framework, our blockwise Nyström approximation is used again on the test kernel matrices computed with the test examples. Let us recall that for each of our singleview kernels, we have an approximation . We choose to be first columns of the matrix , and define the approximation for the test kernel to be
In such an approximation, the error is in the last columns of . We gain in complexity, as if forced to use the test kernel as is, we would need to calculate from in operations.
4 Experiments
Here we evaluate the proposed multiview metric learning (MVML) method on real-world datasets and compare it to relevant methods. The chosen datasets are "pure" multiview datasets, that is to say, the view division arises naturally from the data.
We perform two sets of experiments with two goals. First, we evaluate our method in a regression setting with a large range of Nyström approximation levels in order to understand the effect it has on our algorithm. Secondly, we compare MVML to relevant state-of-the-art methods in classification. In both cases, we also use non-multiview methods to justify the multiview approach. The methods we use in addition to our own are:

MVML_Cov and MVML_I: we use preset kernels in our framework. MVML_Cov uses the kernel from [14], and MVML_I refers to the case where we have only one-view kernel matrices on the diagonal of the multiview kernel. (Footnote 2: code for MVML is available at https://lives.lif.univmrs.fr/?page_id=12)

lpMKL: an algorithm for learning weights for the MKL kernel [16]. We apply it to kernel regression.

MLKR [29]: an algorithm for metric learning in the kernel setting.

KRR and SVM: we use kernel ridge regression and support vector machines with one view, as well as in early fusion (ef) and late fusion (lf), in order to validate the benefits of using multiview methods.
We perform our experiments with Python, but for OKL and MLKR we use the MATLAB codes provided by the authors (footnote 3: https://www.cs.cornell.edu/kilian/code/code.html and https://github.com/cciliber/matMTL). In MVML we set the weights uniformly to $1/v$. For all the datasets we use Gaussian kernels, $k(x, z) = \exp\left(-\frac{1}{2\sigma^2} \|x - z\|^2\right)$.
Table 1: Classification accuracies on Flower17 and uWaveGesture (mean ± deviation); the percentage in parentheses is the Nyström approximation level.

METHOD         | MVML         | MVMLsp.      | MVML_Cov     | MVML_I       | lpMKL        | OKL          | MLKR
Flower17 (6%)  | 75.98 ± 2.62 | 75.71 ± 2.48 | 75.71 ± 2.19 | 76.03 ± 2.36 | 75.54 ± 2.61 | 68.73 ± 1.95 | 63.82 ± 2.51
Flower17 (12%) | 77.89 ± 2.41 | 77.43 ± 2.44 | 77.30 ± 2.36 | 78.36 ± 2.52 | 77.87 ± 2.52 | 75.19 ± 1.97 | 64.41 ± 2.41
Flower17 (24%) | 78.60 ± 1.41 | 78.60 ± 1.36 | 79.00 ± 1.75 | 79.19 ± 1.51 | 78.75 ± 1.58 | 76.76 ± 1.62 | 65.44 ± 1.36
uWaveG. (6%)   | 92.67 ± 0.21 | 92.68 ± 0.17 | 92.34 ± 0.20 | 92.34 ± 0.19 | 92.34 ± 0.18 | 70.09 ± 1.07 | 71.09 ± 0.94
uWaveG. (12%)  | 93.03 ± 0.11 | 92.86 ± 0.26 | 92.53 ± 0.18 | 92.59 ± 0.13 | 92.48 ± 0.21 | 74.07 ± 0.26 | 80.22 ± 0.38
uWaveG. (24%)  | 92.59 ± 0.99 | 93.26 ± 0.15 | 92.66 ± 0.05 | 93.10 ± 0.11 | 92.85 ± 0.13 | 76.65 ± 0.33 | 86.38 ± 0.31

METHOD         | efSVM        | lfSVM        | 1 view SVM
Flower17 (6%)  | -            | 15.32 ± 1.94 | 11.59 ± 1.54
Flower17 (12%) | -            | 23.82 ± 2.38 | 15.74 ± 1.54
Flower17 (24%) | -            | 38.24 ± 2.31 | 22.79 ± 0.79
uWaveG. (6%)   | 80.00 ± 0.74 | 71.24 ± 0.41 | 56.54 ± 0.38
uWaveG. (12%)  | 82.29 ± 0.63 | 72.53 ± 0.16 | 57.50 ± 0.17
uWaveG. (24%)  | 84.07 ± 0.23 | 72.99 ± 0.06 | 58.01 ± 0.05

4.1 Effect of Nyström approximation
For our first experiment we consider the SARCOS dataset (footnote 4: http://www.gaussianprocess.org/gpml/data), where the task is to map a 21-dimensional input space (7 joint positions, 7 joint velocities, 7 joint accelerations) to the corresponding 7 joint torques. Here we present results for the first task.
The results with various levels of Nyström approximation, averaged over four approximations, from 1% to 100% of the data are shown in Figure 2. Regularization parameters were cross-validated over values e and e. The kernel parameter was fixed to be 1/number of features as a trade-off between overfitting and underfitting. We used only 1000 of the available 44484 data samples in training (all 4449 in testing) to make it feasible to show the effect of approximating the matrices on all levels. We note that using more data samples with a moderate approximation level yields a lower error than presented here: for example, with 2000 training samples and a Nyström approximation level of 8% we obtain an error of 0.3915. However, the main goal of our experiment was to see how our algorithm behaves with various Nyström approximation levels, and because of the high complexity of our algorithm trained on the full dataset without approximation we performed this experiment with a low number of samples.
The lowest error was obtained with our MVML-sparse algorithm at the 8% Nyström approximation level. All the multiview results seem to benefit from using the approximation. Indeed, approximating the kernel matrices can be seen as a form of regularization, and our results reflect that [23]. Overall our MVML learning methods have a much higher computational cost with large Nyström parameters, as can be seen from Figure 2, rightmost plot. However, with the smaller approximation levels with which the methods are intended to be used, the computing time is competitive.
4.2 Classification results
In our classification experiments we use two real-world multiview datasets: Flower17 (footnote 5: http://www.robots.ox.ac.uk/vgg/data/flowers/17; 7 views, 17 classes, 80 samples per class) and uWaveGesture (footnote 6: http://www.cs.ucr.edu/eamonn/time_series_data; 3 views, 8 classes, 896 data samples for training and 3582 samples for testing). We set the kernel parameter to be the mean of distances, $\sigma = \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \|x_i - x_j\|$. The regularization parameters were obtained by cross-validation over values e and e. The results are averaged over four approximations.
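A small sketch of this kernel parameter choice follows. It assumes that the mean is taken over all n^2 ordered pairs (including the zero self-distances) and that the Gaussian kernel is parametrized as exp(-||x - z||^2 / (2 sigma^2)); both conventions, as well as the function names, are assumptions made for the example.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.sqrt(d2)

def gaussian_gram_mean_sigma(X):
    """Gaussian Gram matrix whose bandwidth is the mean pairwise distance."""
    D = pairwise_distances(X)
    sigma = D.mean()                 # mean over all n^2 pairs (assumed convention)
    return np.exp(-D ** 2 / (2.0 * sigma ** 2)), sigma

X = np.array([[0.0], [3.0]])         # two toy points at distance 3
K, sigma = gaussian_gram_mean_sigma(X)
```

In the multiview setting this would be applied once per view, so that each one-view kernel gets a bandwidth adapted to the scale of its own features.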
We adopted the one-vs-all approach for multiclass classification. The results are displayed in Table 1. The MVML results are always notably better than the SVM results, or the results obtained with OKL or MLKR. Compared to MVML, the OKL and MLKR accuracies decrease more with low approximation levels. We can see that all MVML methods perform very similarly; sometimes the best result is obtained with a fixed multiview kernel, sometimes when $A$ is learned.
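The one-vs-all reduction can be sketched as follows: each class gets its own binary target vector with entries +1 and -1, one predictor is trained per class, and the predicted class is the one with the largest score. The helper names and the score values are hypothetical; in practice the scores would come from the learned multiview predictors.

```python
import numpy as np

def one_vs_all_targets(labels, classes):
    """Matrix of +1/-1 targets with one column per class."""
    return np.where(labels[:, None] == classes[None, :], 1.0, -1.0)

def one_vs_all_predict(scores, classes):
    """Pick, for each example, the class whose binary predictor scores highest."""
    return classes[np.argmax(scores, axis=1)]

classes = np.array([0, 1, 2])
labels = np.array([2, 0, 1, 2])
Y = one_vs_all_targets(labels, classes)          # training targets, shape (4, 3)

scores = np.array([[-0.9, -0.2, 0.7],            # hypothetical per-class outputs
                   [0.4, -0.5, -0.8],
                   [-0.3, 0.1, -0.6],
                   [0.2, -0.4, 0.3]])
predicted = one_vs_all_predict(scores, classes)
```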
As an example of our sparse output with MVML we note that running the algorithm with Flower17 dataset with 12% approximation often resulted in a spd matrix as in Figure 3. Indeed the resulting sparsity is very interesting and tells us about importance of the views and their interactions.
5 Conclusion
We have introduced a general class of matrix-valued multiview kernels for which we have presented two methods for simultaneously learning a multiview function and a metric in vector-valued kernel spaces. We provided iterative algorithms for the two resulting optimization problems, and have been able to significantly lower the high computational cost associated with kernel methods by introducing a block-wise Nyström approximation. We have illustrated the feasibility of our approach on a simple dataset which reflects the objective of learning the within-view and between-view correlation metrics. The performance of our approach was illustrated with experiments on real multiview datasets by comparing our method to standard multiview approaches, as well as to methods for metric learning and kernel learning. Our sparse method is especially promising in the sense that it could give us information about the importance of the views. It would be interesting to investigate the applicability of our framework in problems involving missing data in views, as well as the generalization properties with the Nyström approximation. We would also like to continue investigating the theoretical properties of our sparse algorithm in order to prove the positiveness of the learned metric matrix that we observed experimentally.
Acknowledgements
We thank the anonymous reviewers for their relevant and helpful comments. This work was funded by the Lives project (ANR-15-CE23-0026).
References
 [1] Mauricio A. Alvarez, Lorenzo Rosasco, Neil D Lawrence, et al. Kernels for vectorvalued functions: A review. Foundations and Trends® in Machine Learning, 4(3):195–266, 2012.
 [2] Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

 [3] Aurélien Bellet, Amaury Habrard, and Marc Sebban. Metric learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 9(1):1–151, 2015.
 [4] Avrim B. Blum and Tom M. Mitchell. Combining labeled and unlabeled data with co-training. In 11th Annual Conference on Computational Learning Theory (COLT), pages 92–100, 1998.
 [5] Céline Brouard, Marie Szafranski, and Florence d’Alché-Buc. Input output kernel regression: Supervised and semi-supervised structured output prediction with operator-valued kernels. Journal of Machine Learning Research, 17(176):1–48, 2016.
 [6] Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, and Yiming Ying. Universal multitask kernels. Journal of Machine Learning Research, 9(Jul):1615–1646, 2008.
 [7] Claudio Carmeli, Ernesto De Vito, Alessandro Toigo, and Veronica Umanita. Vector valued reproducing kernel hilbert spaces and universality. Analysis and Applications, 08(01):19–61, 2010.
 [8] Carlo Ciliberto, Youssef Mroueh, Tomaso Poggio, and Lorenzo Rosasco. Convex learning of multiple tasks and their structure. In International Conference in Machine Learning (ICML), 2015.
 [9] Francesco Dinuzzo, Cheng S. Ong, Gianluigi Pillonetto, and Peter V Gehler. Learning output kernels with block coordinate descent. In International Conference on Machine Learning (ICML), pages 49–56, 2011.
 [10] Theodoros Evgeniou, Charles A. Micchelli, and Massimiliano Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.
 [11] Mehmet Gönen and Ethem Alpaydın. Multiple kernel learning algorithms. Journal of machine learning research, 12(Jul):2211–2268, 2011.
 [12] Matthias Günther and Lutz Klotz. Schur’s theorem for a block hadamard product. Linear Algebra and its Applications, 437(3):948–956, 2012.
 [13] Pratik Jawanpuria, Maksim Lapin, Matthias Hein, and Bernt Schiele. Efficient output kernel learning for multiple tasks. In Advances in Neural Information Processing Systems (NIPS), pages 1189–1197, 2015.
 [14] Hachem Kadri, Stéphane Ayache, Cécile Capponi, Sokol Koço, FrançoisXavier Dupé, and Emilie Morvant. The multitask learning view of multimodal data. In Asian Conference in Machine Learning (ACML), pages 261–276, 2013.
 [15] Hachem Kadri, Emmanuel Duflos, Philippe Preux, Stéphane Canu, Alain Rakotomamonjy, and Julien Audiffren. Operatorvalued kernels for learning from functional response data. Journal of Machine Learning Research, 16:1–54, 2016.
 [16] Marius Kloft, Ulf Brefeld, Sören Sonnenburg, and Alexander Zien. Lpnorm multiple kernel learning. Journal of Machine Learning Research, 12(Mar):953–997, 2011.

 [17] Néhémy Lim, Florence d’Alché-Buc, Cédric Auliac, and George Michailidis. Operator-valued kernel-based vector autoregressive models for network inference. Machine Learning, 99(3):489–513, 2015.
 [18] Andreas Maurer. The Rademacher complexity of linear transformation classes. In International Conference on Computational Learning Theory (COLT), pages 65–78, 2006.
 [19] Charles A. Micchelli and Massimiliano Pontil. On learning vectorvalued functions. Neural Computation, 17:177–204, 2005.
 [20] Ha Quang Minh, Loris Bazzani, and Vittorio Murino. A unifying framework for vectorvalued manifold regularization and multiview learning. In International Conference in Machine Learning (ICML), 2013.
 [21] Ha Quang Minh, Loris Bazzani, and Vittorio Murino. A unifying framework in vectorvalued reproducing kernel hilbert spaces for manifold regularization and coregularized multiview learning. Journal of Machine Learning Research, 17(25):1–72, 2016.
 [22] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2012.
 [23] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems (NIPS), pages 1657–1665, 2015.

 [24] Maxime Sangnier, Olivier Fercoq, and Florence d’Alché-Buc. Joint quantile regression in vector-valued RKHSs. In Advances in Neural Information Processing Systems (NIPS), pages 3693–3701, 2016.
 [25] Bernhard Schölkopf, Jason Weston, Eleazar Eskin, Christina Leslie, and William Stafford Noble. A kernel approach for learning from almost orthogonal patterns. In European Conference on Machine Learning, pages 511–528. Springer, 2002.

 [26] Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. A co-regularization approach to semi-supervised learning with multiple views. In Proceedings of ICML Workshop on Learning with Multiple Views, pages 74–79, 2005.
 [27] Vikas Sindhwani, Minh Ha Quang, and Aurélie C. Lozano. Scalable matrix-valued kernel learning for high-dimensional nonlinear multivariate regression and Granger causality. In Uncertainty in Artificial Intelligence (UAI), 2012.
 [28] Vikas Sindhwani and David S. Rosenberg. An RKHS for multi-view learning and manifold co-regularization. In International Conference on Machine Learning (ICML), pages 976–983, 2008.
 [29] Kilian Q. Weinberger and Gerald Tesauro. Metric learning for kernel regression. In Artificial Intelligence and Statistics, pages 612–619, 2007.
 [30] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems (NIPS), pages 682–688, 2001.
 [31] Chang Xu, Dacheng Tao, and Chao Xu. A survey on multi-view learning. arXiv preprint arXiv:1304.5634, 2013.
 [32] Yu Zhang and Dit-Yan Yeung. A convex formulation for learning task relationships in multi-task learning. In Uncertainty in Artificial Intelligence (UAI), 2012.
Appendix A Appendix
A.1 MVML optimization
Here we go through the derivations of the solutions A, D and w for our optimization problem. The derivations are presented for the case without the Nyström approximation; the derivations with the Nyström approximation proceed in exactly the same way.
Solving for g and w
Let us first focus on the case where A and w are fixed and we solve for g. We calculate the derivative of the expression in Equation (7):
By setting this to zero we obtain the solution
As for w, when A and g are fixed we only need to consider optimizing
(19) 
If we denote that is equal to reshaping by taking the elements of the vector and arranging them onto the columns of , we obtain the following form:
(20) 
One can easily see by taking the derivative and setting it to zero that the solution for this is
(21) 
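The reshaping argument can be checked numerically. The sketch below is illustrative only: it assumes the data-fit term has the form of a squared error between y and (wᵀ ⊗ I_n) z, where z stacks the n predictions of each of the v views into one vector of length nv. The names `z`, `Z`, `n`, `v` and `solve_weights` are hypothetical and not taken from the paper. Under that assumption, arranging z column by column into an n × v matrix Z turns the problem into ordinary least squares in w.

```python
import numpy as np

def solve_weights(z, y, n, v):
    """Least-squares weights w for the reshaped problem min_w ||y - Z w||^2.

    z : stacked vector of length n*v (view 1 first, then view 2, ...);
        this stacking order is an assumption made for the illustration.
    """
    # Column l of Z holds the n entries of z that belong to view l.
    Z = z.reshape((n, v), order="F")
    # lstsq returns the minimizer, which equals (Z^T Z)^{-1} Z^T y
    # whenever Z has full column rank.
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return w, Z

rng = np.random.default_rng(0)
n, v = 6, 3
z = rng.standard_normal(n * v)
w_true = np.array([0.5, 0.3, 0.2])
y = z.reshape((n, v), order="F") @ w_true  # noiseless targets

w_hat, Z = solve_weights(z, y, n, v)

# The Kronecker form and the reshaped form give the same predictions.
kron_pred = np.kron(w_hat[None, :], np.eye(n)) @ z
```

Using `np.linalg.lstsq` rather than forming the inverse of ZᵀZ explicitly is a numerical-stability choice of this sketch, not something prescribed by the derivation.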
Solving for A in (6)
When we consider g (and w) to be fixed in the MVML framework (6), for A we have the following minimization problem:
Differentiating this with respect to A gives us
Thus the gradient descent step will be
when moving in the direction of the negative gradient with a given step size.
Solving for A in (3.5)
To solve for A in equation (3.5) we use proximal minimization. Let us recall the optimization problem after the change of variable:
and denote
and
for the two terms in our optimization problem that contain the matrix A.
Without going into the detailed theory of proximal operators and proximal minimization, we remark that the proximal minimization update takes the form
It is well known that in the traditional group-lasso setting the proximal operator is
where the argument is a vector and $(\cdot)_+$ denotes the maximum of zero and the value inside the brackets. In our case we are solving for a matrix, but since the Frobenius norm of a matrix coincides with the 2-norm of its vectorization, we can use this exact same operator. Thus we get as the proximal update
where
We can see from the update formula and the derivative that if is a positive matrix, the update without block-multiplication, , will be positive, too. This is unfortunately not enough to guarantee the general positivity of . However, we note that it is indeed positive if it is block-diagonal, and in general whenever a matrix of the multipliers
is positive, then is, too (see [12] for reference; this is a block-wise Hadamard product where the blocks commute).
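The group-lasso proximal operator mentioned above is the standard block soft-thresholding operator, which scales a vector x by max(0, 1 − t/‖x‖₂) for a threshold t. The sketch below applies it block by block to a matrix, using the Frobenius norm of each block. It is only an illustration: the equal-sized square block partition, the threshold value, and the function names are assumptions, and the step-size-dependent threshold and gradient step of the actual update are not reproduced here.

```python
import numpy as np

def prox_group_lasso(x, threshold):
    """Block soft-thresholding: proximal operator of threshold * ||x||_2.

    For a 2-D block, np.linalg.norm defaults to the Frobenius norm,
    which equals the 2-norm of the vectorized block.
    """
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x, dtype=float)
    return max(0.0, 1.0 - threshold / norm) * x

def prox_blockwise(A, block_size, threshold):
    """Apply the operator to each block_size x block_size block of A.

    Assumes (for illustration) that A splits evenly into square blocks.
    """
    out = np.empty_like(A, dtype=float)
    for i in range(0, A.shape[0], block_size):
        for j in range(0, A.shape[1], block_size):
            block = A[i:i + block_size, j:j + block_size]
            out[i:i + block_size, j:j + block_size] = prox_group_lasso(block, threshold)
    return out

# Two views, 2 x 2 blocks: strong diagonal blocks, weak off-diagonal blocks.
A = np.array([[3.0, 0.0, 0.1, 0.0],
              [0.0, 4.0, 0.0, 0.1],
              [0.1, 0.0, 6.0, 0.0],
              [0.0, 0.1, 0.0, 8.0]])
P = prox_blockwise(A, block_size=2, threshold=1.0)
```

In this toy input the off-diagonal blocks have Frobenius norm below the threshold and are set exactly to zero, while the diagonal blocks are only shrunk; this is the block-sparsity effect that the group-lasso penalty is meant to induce.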
A.2 Proof of Theorem 1
Theorem 1.
Let be a vector-valued RKHS associated with the multi-view kernel defined by Equation 3.1. Consider the hypothesis class , with . The empirical Rademacher complexity of can be upper bounded as follows:
where , and is the Gram matrix computed from the training set with the kernel defined on the view . For kernels such that , we have
Proof. We start by recalling that the feature map associated with the operator-valued kernel is the mapping , where is the input space, , and is the set of bounded linear operators from to (see, e.g., [19, 7] for more details). It is known that . We denote by the feature map associated with our multi-view kernel (Equation 3.1). We also define the matrix
Here (1) and (3) are obtained with the reproducing property, (2) and (4) with the Cauchy-Schwarz inequality, and (5) with Jensen’s inequality. The last equality follows from the fact that . For kernels that satisfy , , we obtain that