Multi-view Machines

06/03/2015 ∙ by Bokai Cao, et al. ∙ Microsoft ∙ University of Illinois at Chicago

For a learning task, data can usually be collected from different sources or be represented from multiple views. For example, laboratory results from different medical examinations are available for disease diagnosis, and each of them can only reflect the health state of a person from a particular aspect/view. Therefore, different views provide complementary information for learning tasks. An effective integration of the multi-view information is expected to facilitate the learning performance. In this paper, we propose a general predictor, named multi-view machines (MVMs), that can effectively include all the possible interactions between features from multiple views. A joint factorization is embedded for the full-order interaction parameters which allows parameter estimation under sparsity. Moreover, MVMs can work in conjunction with different loss functions for a variety of machine learning tasks. A stochastic gradient descent method is presented to learn the MVM model. We further illustrate the advantages of MVMs through comparison with other methods for multi-view classification, including support vector machines (SVMs), support tensor machines (STMs) and factorization machines (FMs).




1 Introduction

In the era of big data, information is available not only in great volume but also in multiple representations/views from a variety of sources or feature subsets. Generally, different views provide complementary information for learning tasks. Thus, multi-view learning can facilitate the learning process and is prevalent in a wide range of application domains. For example, to reach an accurate disease diagnosis, one should consider laboratory results from different medical examinations, including clinical, imaging, immunologic, serologic and cognitive measures. For advertising on the web, it is critical to estimate the probability that displaying an ad to a specific user who searches for a query will lead to a click. This process involves three entities: users, ads, and queries. An effective integration of the features describing these different entities is directly related to a precise targeting of the advertising system.

One of the key challenges of multi-view learning is to model the interactions between different views, wherein complementary information is contained. Conventionally, multiple kernel learning algorithms combine kernels that naturally correspond to different views to improve the learning performance [5]. Basically, the coefficients are learned based on the usefulness/informativeness of the associated views, and thus the correlations are considered at the view-level. These approaches, however, fail to explicitly explore the correlations between features. In contrast to modeling on views, another direction for modeling multi-view data is to directly consider the abundant correlations between features from different views.

In this paper, we propose a novel model for multi-view learning, called multi-view machines (MVMs). The main advantages of MVMs are outlined as follows:

  • MVMs include all the possible interactions between features from multiple views, ranging from the first-order interactions (i.e., contributions of single features) to the highest order interactions (i.e., contributions of combinations of features from each view).

  • MVMs jointly factorize the interaction parameters in different orders to allow parameter estimation under sparsity.

  • MVMs are a general predictor that can work with different loss functions (e.g., square error, hinge loss, logit loss) for a variety of machine learning tasks.

2 Multi-view Classification

We first state the problem of multi-view classification and introduce the notation. Table 1 lists some basic symbols that will be used throughout the paper.

Suppose each instance has representations in $m$ different views, i.e., $\mathbf{x} = (\mathbf{x}^{(1)}; \mathbf{x}^{(2)}; \ldots; \mathbf{x}^{(m)})$, where $\mathbf{x}^{(p)} \in \mathbb{R}^{I_p}$ and $I_p$ is the dimensionality of the $p$-th view. Let $d = \sum_{p=1}^{m} I_p$, so $\mathbf{x} \in \mathbb{R}^{d}$. Considering the problem of click-through rate (CTR) prediction for advertising display, for example, an instance corresponds to an impression which involves a user, an ad, and a query. Therefore, if $\mathbf{x}$ is an impression, $\mathbf{x}^{(1)}$ contains information of the user profile, $\mathbf{x}^{(2)}$ is associated with the ad information, and $\mathbf{x}^{(3)}$ is the description from the query aspect. The result of an impression is click or non-click.

Given a training set with $n$ labeled instances represented from $m$ views: $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, in which $\mathbf{x}_i = (\mathbf{x}_i^{(1)}; \ldots; \mathbf{x}_i^{(m)})$ and $y_i \in \{1, -1\}$ is the class label of the $i$-th instance. For the CTR prediction problem, $y_i = 1$ denotes a click and $y_i = -1$ denotes a non-click in an impression. The task of multi-view classification is to learn a function $f$ that correctly predicts the label of a test instance.
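As a concrete illustration of this representation, the following minimal NumPy sketch builds one multi-view instance for the CTR example; the view dimensionalities and the random features are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Illustrative toy dimensions for the three CTR views (user, ad, query).
I = [4, 3, 2]        # I_p: dimensionality of each view; m = 3 views
m = len(I)
d = sum(I)           # total dimensionality d = I_1 + ... + I_m

rng = np.random.default_rng(0)
views = [rng.standard_normal(I_p) for I_p in I]  # x^(1), ..., x^(m)
x = np.concatenate(views)                        # full multi-view instance x
y = 1                                            # label: +1 click, -1 non-click

assert x.shape == (d,)
```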

Symbol | Definition and Description
$a$ | each lowercase letter represents a scalar
$\mathbf{a}$ | each boldface lowercase letter represents a vector
$\mathbf{A}$ | each boldface capital letter represents a matrix
$\mathcal{A}$ | each calligraphic letter represents a tensor, set or space
$\langle \cdot, \cdot \rangle$ | denotes inner product
$\otimes$ | denotes tensor product or outer product
$\times_p$ | denotes mode-$p$ product
$|\cdot|$ | denotes absolute value
$\|\cdot\|$ | denotes (Frobenius) norm of a vector, matrix or tensor
Table 1: Symbols.

In addition, we introduce the concept of tensors, which are higher order arrays that generalize vectors (first-order tensors) and matrices (second-order tensors), and whose elements are indexed by more than two indexes. We state the definitions of the tensor product and the mode-$p$ product, which will be used to formulate our proposed model.

Definition 2.1 (Tensor Product or Outer Product)

The tensor product of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_m}$ and another tensor $\mathcal{Y} \in \mathbb{R}^{J_1 \times \cdots \times J_l}$ is defined by

$(\mathcal{X} \otimes \mathcal{Y})_{i_1 \cdots i_m j_1 \cdots j_l} = x_{i_1 \cdots i_m} \, y_{j_1 \cdots j_l}$

for all index values.

Definition 2.2 (Mode-$p$ Product)

The mode-$p$ product of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_m}$ and a matrix $\mathbf{A} \in \mathbb{R}^{J \times I_p}$ is defined by

$(\mathcal{X} \times_p \mathbf{A})_{i_1 \cdots i_{p-1} \, j \, i_{p+1} \cdots i_m} = \sum_{i_p=1}^{I_p} x_{i_1 \cdots i_m} \, a_{j i_p}$

for all index values.
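The two definitions above can be sketched in NumPy; `np.multiply.outer` and `np.tensordot` are standard routines, and the axis-reordering convention below is one common choice (an illustrative assumption, not the paper's implementation).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3, 4))   # a third-order tensor
Y = rng.standard_normal((5, 6))      # a second-order tensor (matrix)

# Tensor (outer) product: (X ⊗ Y)_{i1 i2 i3 j1 j2} = X_{i1 i2 i3} * Y_{j1 j2}
T = np.multiply.outer(X, Y)
assert T.shape == (2, 3, 4, 5, 6)
assert np.isclose(T[1, 2, 3, 4, 5], X[1, 2, 3] * Y[4, 5])

# Mode-p product along the second mode (I_2 = 3) with A ∈ R^{J × I_2}:
A = rng.standard_normal((7, 3))
M = np.tensordot(X, A, axes=([1], [1]))  # sums over i_2
M = np.moveaxis(M, -1, 1)                # put the new mode back in position 2
assert M.shape == (2, 7, 4)
# element check: M[i1, j, i3] = sum_{i2} X[i1, i2, i3] * A[j, i2]
assert np.isclose(M[0, 1, 2], sum(X[0, i2, 2] * A[1, i2] for i2 in range(3)))
```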

3 Multi-view Machine Model

3.1 Model Formulation

The key challenge of multi-view classification is to model the interactions between features from different views, wherein complementary information is contained. Here, we consider nesting all interactions up to the $m$th-order between $m$ views:

$f(\mathbf{x}) = \sum_{p=1}^{m} \sum_{i_p=1}^{I_p} w^{(p)}_{i_p} x^{(p)}_{i_p} + \sum_{p=1}^{m-1} \sum_{q=p+1}^{m} \sum_{i_p=1}^{I_p} \sum_{i_q=1}^{I_q} w^{(p,q)}_{i_p i_q} x^{(p)}_{i_p} x^{(q)}_{i_q} + \cdots + \sum_{i_1=1}^{I_1} \cdots \sum_{i_m=1}^{I_m} w_{i_1 \cdots i_m} \prod_{p=1}^{m} x^{(p)}_{i_p} \qquad (3)$
Let us add an extra feature with constant value 1 to the feature vector of each view, i.e., $\mathbf{z}^{(p)} = (\mathbf{x}^{(p)}; 1)$. Then, Eq. (3) can be effectively rewritten as:

$f(\mathbf{x}) = \sum_{i_1=1}^{I_1+1} \cdots \sum_{i_m=1}^{I_m+1} w_{i_1 \cdots i_m} \prod_{p=1}^{m} z^{(p)}_{i_p} \qquad (4)$

where $z^{(p)}_{I_p+1} = 1$ and the $w_{i_1 \cdots i_m}$ are the entries of an interaction tensor $\mathcal{W} \in \mathbb{R}^{(I_1+1) \times \cdots \times (I_m+1)}$. For $w_{i_1 \cdots i_m}$ with some indexes satisfying $i_p = I_p + 1$, it encodes a lower order interaction between the views whose $i_p \leq I_p$. Hereinafter, let $w^{(1)}_{i_1}$ denote $w_{i_1 \cdots i_m}$ where only $i_1 \leq I_1$ and $i_p = I_p + 1$ for $p \neq 1$, and let $w^{(1,2)}_{i_1 i_2}$ denote $w_{i_1 \cdots i_m}$ where $i_1 \leq I_1$, $i_2 \leq I_2$ and $i_p = I_p + 1$ for the other views, etc.

Figure 1: CP factorization. The third-order ($m = 3$) tensor $\mathcal{W}$ is approximated by $K$ rank-one tensors. The $k$-th factor tensor is the tensor product of three vectors, i.e., $\mathbf{a}^{(1)}_k \otimes \mathbf{a}^{(2)}_k \otimes \mathbf{a}^{(3)}_k$.

The number of parameters in Eq. (4) is $\prod_{p=1}^{m} (I_p + 1)$, which can make the model prone to overfitting and ineffective on sparse data. Therefore, we assume that the effect of interactions has a low rank $K$, and the $m$th-order tensor $\mathcal{W}$ can be factorized into $K$ factors:

$\mathcal{W} = \mathcal{I} \times_1 \mathbf{A}^{(1)} \times_2 \mathbf{A}^{(2)} \times_3 \cdots \times_m \mathbf{A}^{(m)} \qquad (5)$

where $\mathbf{A}^{(p)} \in \mathbb{R}^{(I_p+1) \times K}$, and $\mathcal{I} \in \mathbb{R}^{K \times \cdots \times K}$ is the identity tensor, i.e., $\mathcal{I}_{k_1 \cdots k_m} = \delta(k_1 = k_2 = \cdots = k_m)$. Basically, it is a CANDECOMP/PARAFAC (CP) factorization [4] as shown in Figure 1, with element-wise notation $w_{i_1 \cdots i_m} = \sum_{k=1}^{K} \prod_{p=1}^{m} a^{(p)}_{i_p k}$. The number of model parameters is reduced to $K \sum_{p=1}^{m} (I_p + 1) = K(d + m)$. It transforms Eq. (4) into:

$f(\mathbf{x}) = \sum_{i_1=1}^{I_1+1} \cdots \sum_{i_m=1}^{I_m+1} \left( \sum_{k=1}^{K} \prod_{p=1}^{m} a^{(p)}_{i_p k} \right) \prod_{p=1}^{m} z^{(p)}_{i_p} \qquad (6)$
Figure 2: Multi-view machines. All the interactions of different orders between multiple views are modeled in a single tensor and share the same set of latent factors.

We name this model multi-view machines (MVMs). As shown in Figure 2, the full-order interactions between multiple views are modeled in a single tensor, and they are factorized collectively. The model parameters that have to be estimated are:

$\Theta = \{ \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(m)} \}, \quad \mathbf{A}^{(p)} \in \mathbb{R}^{(I_p+1) \times K} \qquad (7)$

where the $i_p$-th row $\mathbf{a}^{(p)}_{i_p}$ within $\mathbf{A}^{(p)}$ describes the $i_p$-th feature in the $p$-th view with $K$ factors. Let the last row $\mathbf{a}^{(p)}_{I_p+1}$ denote the bias factor from the $p$-th view, since it is always combined with $z^{(p)}_{I_p+1} = 1$ in Eq. (6). Hence,

$\sum_{k=1}^{K} \prod_{p=1}^{m} a^{(p)}_{(I_p+1) k} \qquad (8)$

is the global bias, denoted as $w_0$ hereinafter.
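The CP factorization can be checked numerically with a small NumPy sketch: the full interaction tensor is rebuilt from its factor matrices and compared, entry by entry, against the element-wise formula. The dimensions below are illustrative assumptions.

```python
import numpy as np

# Toy sizes: m = 3 views, K = 2 factors; dims are the augmented sizes (I_p + 1).
rng = np.random.default_rng(0)
K = 2
dims = [5, 4, 3]
A = [rng.standard_normal((Ip1, K)) for Ip1 in dims]   # factor matrices A^(p)

# Reconstruct W as a sum of K rank-one tensors (CP factorization).
W = np.zeros(dims)
for k in range(K):
    W += np.multiply.outer(np.multiply.outer(A[0][:, k], A[1][:, k]), A[2][:, k])

# Element-wise check: w_{i1 i2 i3} = sum_k a^(1)_{i1,k} a^(2)_{i2,k} a^(3)_{i3,k}
i1, i2, i3 = 1, 2, 0
w = sum(A[0][i1, k] * A[1][i2, k] * A[2][i3, k] for k in range(K))
assert np.isclose(W[i1, i2, i3], w)

# Parameter count drops from prod(I_p + 1) to K * sum(I_p + 1).
assert int(np.prod(dims)) == 60 and K * sum(dims) == 24
```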

Moreover, MVMs are flexible in the order of interactions of interest. That is to say, when there are too many views available for a learning task and interactions between some of them are obviously physically meaningless, or when very high order interactions are intuitively uninterpretable, it is not desirable to include these potentially redundant interactions in the model. In such scenarios, one can (1) partition the views into (overlapping) groups, (2) construct multiple MVMs on these view groups where the full-order interactions within each group are included, and (3) implement a coupled matrix/tensor factorization [3]. This implementation excludes the cross-group interactions. Although MVMs are feasible with any order of interactions, that is outside the scope of this paper; our focus is on investigating how to effectively explore the full-order interactions within a given set of views.

3.2 Time Complexity

Next, we show how to make MVMs applicable from a computational point of view. The straightforward time complexity of Eq. (6) is $O(Km \prod_{p=1}^{m} (I_p + 1))$. However, we observe that there is no model parameter which directly depends on the interactions between variables (e.g., a parameter indexed by a pair $(i_p, i_q)$), due to the factorization of the interactions. Therefore, the time complexity can be largely reduced.

Lemma 3.1

The model equation of MVMs can be computed in linear time $O(K(d + m))$.

The interactions in Eq. (6) can be reformulated as:

$f(\mathbf{x}) = \sum_{k=1}^{K} \prod_{p=1}^{m} \left( \sum_{i_p=1}^{I_p+1} a^{(p)}_{i_p k} z^{(p)}_{i_p} \right) \qquad (9)$

This equation has only linear complexity in both $K$ and $d$. Thus, its time complexity is $O(K(d + m))$, which is of the same order as the number of parameters in the model.
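The equivalence between the naive evaluation of Eq. (6) and the linear-time product-of-sums form in Eq. (9) can be verified numerically. The sketch below uses illustrative toy sizes with $m = 3$ views.

```python
import numpy as np

rng = np.random.default_rng(0)
K, I = 3, [4, 3, 2]
A = [rng.standard_normal((Ip + 1, K)) for Ip in I]         # factors A^(p)
z = [np.append(rng.standard_normal(Ip), 1.0) for Ip in I]  # augmented views z^(p)

# Naive Eq. (6): sum over all index triples of w_{i1 i2 i3} * z1 * z2 * z3.
naive = 0.0
for i1 in range(I[0] + 1):
    for i2 in range(I[1] + 1):
        for i3 in range(I[2] + 1):
            w = np.dot(A[0][i1] * A[1][i2], A[2][i3])  # sum_k of factor products
            naive += w * z[0][i1] * z[1][i2] * z[2][i3]

# Linear-time Eq. (9): f(x) = sum_k prod_p ( sum_{i_p} a^(p)_{i_p,k} z^(p)_{i_p} ).
s = np.ones(K)
for p in range(3):
    s *= A[p].T @ z[p]    # the per-view, per-factor sums
f = s.sum()

assert np.isclose(naive, f)
```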

4 Learning Multi-view Machines

To learn the parameters in MVMs, we consider the following regularization framework:

$\min_{\Theta} \; \sum_{i=1}^{n} \ell(f(\mathbf{x}_i), y_i) + \lambda \, \Omega(\Theta) \qquad (10)$

where $\Theta$ represents all the model parameters, $\ell$ is the loss function, $\Omega$ is the regularization term, and $\lambda$ is the trade-off between the empirical loss and the risk of overfitting.

Importantly, MVMs can be used to perform a variety of machine learning tasks, depending on the choice of the loss function. For example, to conduct regression, the square error is a popular choice:

$\ell(f(\mathbf{x}), y) = (f(\mathbf{x}) - y)^2 \qquad (11)$

and for classification problems, we can use the logit loss:

$\ell(f(\mathbf{x}), y) = \ln(1 + \exp(-y f(\mathbf{x}))) \qquad (12)$

or the hinge loss:

$\ell(f(\mathbf{x}), y) = \max(0, 1 - y f(\mathbf{x})) \qquad (13)$
The regularization term is chosen based on our prior knowledge about the model parameters. Typically, we can apply the L2-norm:

$\Omega(\Theta) = \sum_{p=1}^{m} \|\mathbf{A}^{(p)}\|_F^2 \qquad (14)$

or the (smoothed) L1-norm:

$\Omega(\Theta) = \sum_{p=1}^{m} \sum_{i_p=1}^{I_p+1} \sum_{k=1}^{K} \sqrt{(a^{(p)}_{i_p k})^2 + \epsilon} \qquad (15)$

where $\epsilon$ is a very small number to make the L1-norm term differentiable.
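The losses in Eqs. (11)-(13) and the two regularizers can be written as small Python functions; this is a minimal sketch, with `f` the model output and `y` in $\{1, -1\}$.

```python
import numpy as np

def square_error(f, y):                   # Eq. (11), for regression
    return (f - y) ** 2

def logit_loss(f, y):                     # Eq. (12), for classification
    return np.log(1.0 + np.exp(-y * f))

def hinge_loss(f, y):                     # Eq. (13), for classification
    return max(0.0, 1.0 - y * f)

def l2_reg(A_list):                       # Eq. (14): sum of squared Frobenius norms
    return sum(np.sum(A ** 2) for A in A_list)

def l1_reg(A_list, eps=1e-8):             # Eq. (15): smoothed L1, differentiable at 0
    return sum(np.sum(np.sqrt(A ** 2 + eps)) for A in A_list)

assert square_error(0.5, 1) == 0.25
assert hinge_loss(2.0, 1) == 0.0 and hinge_loss(0.0, 1) == 1.0
```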

The model parameters can be learned efficiently by alternating least squares (ALS), stochastic gradient descent (SGD), L-BFGS, etc., for a variety of loss functions, including square error, hinge loss and logit loss. From Eq. (9), the gradient of the MVM model is:

$\frac{\partial f(\mathbf{x})}{\partial a^{(p)}_{i_p k}} = z^{(p)}_{i_p} \prod_{q \neq p} \left( \sum_{i_q=1}^{I_q+1} a^{(q)}_{i_q k} z^{(q)}_{i_q} \right) \qquad (16)$

It validates that MVMs possess the multilinearity property, because the gradient along $a^{(p)}_{i_p k}$ is independent of the value of $a^{(p)}_{i_p k}$ itself.

Note that in Eq. (16), the sums $\sum_{i_q=1}^{I_q+1} a^{(q)}_{i_q k} z^{(q)}_{i_q}$ can be precomputed and reused for updating the $k$-th factor of all the features. Hence, with these per-view sums (and their products over views) precomputed, each gradient can be computed in constant time. In an iteration, including the precomputation time, all the parameters can be updated in $O(K(d + m))$. This can be even further reduced under sparsity, where most of the elements in $\mathbf{x}$ (and thus $\mathbf{z}$) are 0, so the sums only have to be computed over the non-zero elements.
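The gradient formula in Eq. (16) can be validated against a central finite difference; since the model is multilinear in each parameter, the two should agree up to floating-point roundoff. Toy sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K, I = 2, [3, 2, 2]
A = [rng.standard_normal((Ip + 1, K)) for Ip in I]
z = [np.append(rng.standard_normal(Ip), 1.0) for Ip in I]

def f(A):
    # Eq. (9): product of per-view, per-factor sums, summed over k.
    s = np.ones(K)
    for p in range(3):
        s *= A[p].T @ z[p]
    return s.sum()

# Analytic gradient from Eq. (16) for one parameter a^(p)_{i_p, k}.
p, ip, k = 0, 1, 0
analytic = z[p][ip] * np.prod([A[q].T @ z[q] for q in range(3) if q != p],
                              axis=0)[k]

# Central finite difference.
eps = 1e-6
A_plus = [a.copy() for a in A];  A_plus[p][ip, k] += eps
A_minus = [a.copy() for a in A]; A_minus[p][ip, k] -= eps
numeric = (f(A_plus) - f(A_minus)) / (2 * eps)

assert np.isclose(analytic, numeric, atol=1e-5)
```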

It is straightforward to embed Eq. (16) into the gradients of the loss functions, e.g., Eqs. (11)-(13), for direct optimization via the chain rule:

$\frac{\partial \ell(f(\mathbf{x}), y)}{\partial a^{(p)}_{i_p k}} = \frac{\partial \ell(f(\mathbf{x}), y)}{\partial f(\mathbf{x})} \cdot \frac{\partial f(\mathbf{x})}{\partial a^{(p)}_{i_p k}}$

where $\partial \ell / \partial f(\mathbf{x})$ is $2(f(\mathbf{x}) - y)$ for the square error, $-y / (1 + \exp(y f(\mathbf{x})))$ for the logit loss, and $-y$ if $y f(\mathbf{x}) < 1$ (0 otherwise) for the hinge loss.

Moreover, the gradients of the regularization terms can be derived:

$\frac{\partial \Omega(\Theta)}{\partial a^{(p)}_{i_p k}} = 2 a^{(p)}_{i_p k} \qquad (20)$

for the L2-norm in Eq. (14), and

$\frac{\partial \Omega(\Theta)}{\partial a^{(p)}_{i_p k}} = \frac{a^{(p)}_{i_p k}}{\sqrt{(a^{(p)}_{i_p k})^2 + \epsilon}} \qquad (21)$

for the L1-norm in Eq. (15).
The SGD optimization method for MVMs is summarized in Algorithm 1, where the model parameters are first initialized from a zero-mean normal distribution with standard deviation $\sigma$, and the gradients in line 8 can be computed according to Eqs. (11)-(13) and Eqs. (20)-(21). Moreover, rather than specifying a learning rate $\eta$ beforehand, we can use a line search to determine it in the optimization process. The regularization parameter $\lambda$ can be searched on a held-out validation set. Considering the number of factors $K$, the performance can usually be improved with a larger $K$, at the cost of more parameters which can make the learning much harder in terms of both runtime and memory [10].

0:  Training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, number of factors $K$, regularization parameter $\lambda$, learning rate $\eta$, standard deviation $\sigma$
0:  Model parameters $\Theta = \{\mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(m)}\}$
1:  Initialize $\mathbf{A}^{(p)} \sim \mathcal{N}(0, \sigma^2)$ for $p = 1, \ldots, m$
2:  repeat
3:     for each instance $(\mathbf{x}, y)$ in the training data do
4:        for $p = 1, \ldots, m$ do
5:           for $i_p = 1, \ldots, I_p + 1$ do
6:              if $z^{(p)}_{i_p} \neq 0$ then
7:                 for $k = 1, \ldots, K$ do
8:                    $a^{(p)}_{i_p k} \leftarrow a^{(p)}_{i_p k} - \eta \left( \frac{\partial \ell(f(\mathbf{x}), y)}{\partial a^{(p)}_{i_p k}} + \lambda \frac{\partial \Omega(\Theta)}{\partial a^{(p)}_{i_p k}} \right)$
9:                 end for
10:              end if
11:           end for
12:        end for
13:     end for
14:  until convergence
Algorithm 1 Stochastic Gradient Descent for MVMs
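A compact end-to-end sketch of Algorithm 1 with logit loss and L2 regularization follows. All sizes, hyperparameters, and the synthetic data generator are illustrative assumptions; a fixed number of epochs stands in for the convergence test, and the per-view sums are precomputed once per instance (so they are slightly stale after the first view's update, which is acceptable for SGD).

```python
import numpy as np

rng = np.random.default_rng(0)
m, K, I, n = 3, 4, [5, 4, 3], 200
lam, eta, sigma = 1e-4, 0.02, 0.1

# Factor matrices A^(p) ~ N(0, sigma^2); the extra row is the bias factor
# paired with the constant feature z^(p)_{I_p+1} = 1.
A = [rng.normal(0.0, sigma, (Ip + 1, K)) for Ip in I]

# Toy training set: labels from a random linear model on the augmented features.
X = [[np.append(rng.standard_normal(Ip), 1.0) for Ip in I] for _ in range(n)]
truth = rng.standard_normal(sum(I) + m)
y = [1.0 if truth @ np.concatenate(z) > 0 else -1.0 for z in X]

def predict(A, z):
    s = np.ones(K)
    for p in range(m):
        s *= A[p].T @ z[p]          # per-view, per-factor sums (Eq. 9)
    return s.sum()

def mean_loss(A):
    return float(np.mean([np.logaddexp(0.0, -yi * predict(A, z))
                          for z, yi in zip(X, y)]))

loss_before = mean_loss(A)
for _ in range(50):                  # "repeat ... until convergence" (fixed here)
    for z, yi in zip(X, y):
        sums = [A[p].T @ z[p] for p in range(m)]   # precomputed per instance
        fx = np.prod(sums, axis=0).sum()
        dloss = -yi / (1.0 + np.exp(yi * fx))      # logit-loss derivative
        for p in range(m):
            others = np.prod([sums[q] for q in range(m) if q != p], axis=0)
            A[p] -= eta * (dloss * np.outer(z[p], others) + lam * 2.0 * A[p])
loss_after = mean_loss(A)
```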

5 Related Work

In this section, we discuss and compare our proposed MVM model with other methods (and extensions) for multi-view classification, including support vector machines (SVMs), support tensor machines (STMs) and factorization machines (FMs).

Figure 3: Related work (and extensions) on modeling the interactions between multiple views. In general, the linear SVM model is limited to the first-order interactions; the STM model explores only the highest order interactions; in spite of including all the interactions in different orders, the FM model is not sufficiently factorized compared to our proposed MVM model.

5.1 SVM Model

Vapnik introduced support vector machines (SVMs) [9] based on the maximum-margin hyperplane. Essentially, SVMs integrate the hinge loss and the L2-norm regularization. The decision function with a linear kernel is¹:

$f(\mathbf{x}) = \mathbf{w}^{\top} \mathbf{x} + b \qquad (22)$

¹The sign function is omitted, because the analysis and conclusions easily extend to other generalized linear models, e.g., logistic regression.
In the multi-view setting, $\mathbf{x}$ is simply a concatenation of features from different views, i.e., $\mathbf{x} = (\mathbf{x}^{(1)}; \ldots; \mathbf{x}^{(m)})$, as shown in Figure 3. Thus, Eq. (22) is equivalent to:

$f(\mathbf{x}) = \sum_{p=1}^{m} (\mathbf{w}^{(p)})^{\top} \mathbf{x}^{(p)} + b \qquad (23)$

Obviously, no interactions between views are explored in Eq. (23). By restricting $w_{i_1 \cdots i_m} = 0$ whenever two or more of its indexes satisfy $i_p \leq I_p$ in Eq. (4), i.e., removing the factorization and the higher order interactions from MVMs, we obtain the linear SVMs:

$f(\mathbf{x}) = w_0 + \sum_{p=1}^{m} \sum_{i_p=1}^{I_p} w^{(p)}_{i_p} x^{(p)}_{i_p} \qquad (24)$
Through the employment of a nonlinear kernel, SVMs can implicitly project data from the feature space into a more complex high-dimensional space, which allows them to model higher order interactions between features. However, as discussed in [6], all interaction parameters of nonlinear SVMs are completely independent. In contrast, the interaction parameters of MVMs are collectively factorized, and thus dependencies exist when interactions share the same feature.

For nonlinear SVMs, there must be enough instances where $x_i \neq 0$ and $x_j \neq 0$ to reliably estimate the second-order interaction parameter $w_{ij}$. The instances with either $x_i = 0$ or $x_j = 0$ cannot be used for estimating $w_{ij}$. That is to say, on a sparse dataset where there are too few or even no such cases for some higher order interactions, nonlinear SVMs are likely to degenerate into linear SVMs.

The factorization of interactions in Eq. (5) benefits MVMs for parameter estimation under sparsity, since the latent factors of a feature can be learned from any instance in which that feature is non-zero. This allows a second-order interaction $w_{ij}$ to be approximated from instances whose $x_i \neq 0$ or $x_j \neq 0$, rather than only from instances whose $x_i \neq 0$ and $x_j \neq 0$. Therefore, the interaction parameters in MVMs can be effectively learned without direct observations of such interactions in a training set of sparse data.

5.2 STM Model

Cao et al. investigated multi-view classification by modeling interactions between views as a tensor, i.e., $\mathcal{X} = \mathbf{x}^{(1)} \otimes \cdots \otimes \mathbf{x}^{(m)}$ [2], and solved the problem in the framework of support tensor machines (STMs) [8]. Basically, as shown in Figure 3, only the highest order interactions are explored:

$f(\mathbf{x}) = \langle \mathcal{W}, \mathcal{X} \rangle \qquad (25)$

where $\mathcal{W} = \mathbf{w}^{(1)} \otimes \cdots \otimes \mathbf{w}^{(m)}$, i.e., a rank-one decomposition of the weight tensor [2].

However, estimating a lower order interaction (e.g., a pairwise one) reliably is easier than estimating a higher order one, and lower order interactions can usually explain the data sufficiently [7, 1]. Thus, it is critical to include the lower order interactions in MVMs. Moreover, instead of a rank-one decomposition, we apply a higher rank decomposition of $\mathcal{W}$ to capture more latent factors and thereby achieve a better approximation of the original interaction parameters.

5.3 FM Model

Rendle introduced factorization machines (FMs) [6] that combine the advantages of SVMs with factorization models. The model equation of a second-order FM is as follows:

$f(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j \qquad (26)$

where $\mathbf{v}_i \in \mathbb{R}^{K}$ and $\langle \mathbf{v}_i, \mathbf{v}_j \rangle = \sum_{k=1}^{K} v_{ik} v_{jk}$.

However, the pairwise interactions between all the features are included in FMs without consideration of the view segmentation. In the multi-view setting, there can be redundant correlations between features within the same view which are thereby unworthy of consideration. The coupled group lasso model proposed in [10] is essentially an application of second-order FMs to multi-view classification. To achieve this purpose, we can simply modify Eq. (26) as:

$f(\mathbf{x}) = w_0 + \sum_{p=1}^{m} \sum_{i_p=1}^{I_p} w^{(p)}_{i_p} x^{(p)}_{i_p} + \sum_{p=1}^{m-1} \sum_{q=p+1}^{m} \sum_{i_p=1}^{I_p} \sum_{i_q=1}^{I_q} \langle \mathbf{v}^{(p)}_{i_p}, \mathbf{v}^{(q)}_{i_q} \rangle x^{(p)}_{i_p} x^{(q)}_{i_q} \qquad (27)$

The pairwise interaction parameter $\langle \mathbf{v}^{(p)}_{i_p}, \mathbf{v}^{(q)}_{i_q} \rangle$ in Eq. (27) indicates that $\mathbf{v}^{(p)}_{i_p}$ can be learned from any instance whose $x^{(p)}_{i_p} \neq 0$ together with some non-zero feature in another view (sharing that feature's latent factor), which makes FMs more robust under sparsity than SVMs, where only instances with both $x^{(p)}_{i_p} \neq 0$ and $x^{(q)}_{i_q} \neq 0$ can be used to learn $w^{(p,q)}_{i_p i_q}$.
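The multi-view second-order FM of Eq. (27) can be sketched directly in NumPy; the loop form and a vectorized `einsum` form below should agree exactly. Toy dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K, I = 3, [4, 3, 2]
m = len(I)
w0 = rng.standard_normal()
w = [rng.standard_normal(Ip) for Ip in I]        # first-order weights w^(p)
V = [rng.standard_normal((Ip, K)) for Ip in I]   # latent factors v^(p)_{i_p}
x = [rng.standard_normal(Ip) for Ip in I]

# Eq. (27): bias + first-order terms + cross-view pairwise interactions only.
f = w0 + sum(np.dot(w[p], x[p]) for p in range(m))
for p in range(m - 1):
    for q in range(p + 1, m):
        # sum_{i_p, i_q} <v^(p)_{i_p}, v^(q)_{i_q}> x^(p)_{i_p} x^(q)_{i_q}
        f += (x[p] @ V[p]) @ (V[q].T @ x[q])

# Equivalent einsum form of each cross-view block, as a sanity check.
f2 = w0 + sum(np.dot(w[p], x[p]) for p in range(m))
for p in range(m - 1):
    for q in range(p + 1, m):
        f2 += np.einsum('ik,jk,i,j->', V[p], V[q], x[p], x[q])

assert np.isclose(f, f2)
```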

The main difference between FMs and MVMs is that the interaction parameters in different orders are completely independent in FMs, e.g., the first-order interaction $w^{(p)}_{i_p}$ and the second-order interaction $\langle \mathbf{v}^{(p)}_{i_p}, \mathbf{v}^{(q)}_{i_q} \rangle$ in Eq. (27). On the contrary, in MVMs, all the orders of interactions share the same set of latent factors, e.g., $\mathbf{A}^{(p)}$ in Eq. (6). For example, the combination of $\mathbf{a}^{(1)}_{i_1}$ and the bias factors from the other views, i.e., $\sum_{k=1}^{K} a^{(1)}_{i_1 k} \prod_{p=2}^{m} a^{(p)}_{(I_p+1) k}$, approximates the first-order interaction $w^{(1)}_{i_1}$. Similarly, we can obtain the second-order interaction $w^{(1,2)}_{i_1 i_2}$ by combining $\mathbf{a}^{(1)}_{i_1}$, $\mathbf{a}^{(2)}_{i_2}$ and the other bias factors.

Such a difference is more significant for higher order FMs, as shown in Figure 3. Assuming the same number of factors $K$ in each order of interactions, the number of parameters to be estimated in an $m$th-order FM is approximately $d + (m-1)Kd$, which can be much larger than the $K(d + m)$ of MVMs when there are many views (i.e., a large $m$). Therefore, compared to MVMs, FMs are not fully factorized.

6 Conclusion

In this paper, we have proposed the multi-view machine (MVM) model and presented an efficient inference method based on stochastic gradient descent. In general, MVMs can be applied to a variety of supervised machine learning tasks, including classification and regression, and are particularly designed for data composed of features from multiple views, between which the interactions are effectively explored. In contrast to other models that explore only partial interactions or factorize the interactions of different orders separately, MVMs jointly factorize the full-order interactions and thereby benefit parameter estimation under sparsity.


  • [1] Yuanzhe Cai, Miao Zhang, Dijun Luo, Chris Ding, and Sharma Chakravarthy. Low-order tensor decompositions for social tagging recommendation. In WSDM, pages 695–704. ACM, 2011.
  • [2] Bokai Cao, Lifang He, Xiangnan Kong, Philip S. Yu, Zhifeng Hao, and Ann B. Ragin. Tensor-based multi-view feature selection with applications to brain diseases. In ICDM, pages 40–49. IEEE, 2014.
  • [3] Liangjie Hong, Aziz S Doumith, and Brian D Davison. Co-factorization machines: modeling user interests and predicting individual decisions in twitter. In WSDM, pages 557–566. ACM, 2013.
  • [4] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
  • [5] Gert RG Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I Jordan. Learning the kernel matrix with semidefinite programming. The Journal of Machine Learning Research, 5:27–72, 2004.
  • [6] Steffen Rendle. Factorization machines. In ICDM, pages 995–1000. IEEE, 2010.
  • [7] Steffen Rendle and Lars Schmidt-Thieme. Pairwise interaction tensor factorization for personalized tag recommendation. In WSDM, pages 81–90. ACM, 2010.
  • [8] Dacheng Tao, Xuelong Li, Weiming Hu, Stephen Maybank, and Xindong Wu. Supervised tensor learning. In ICDM, 8 pp. IEEE, 2005.
  • [9] Vladimir Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 2000.
  • [10] Ling Yan, Wu-jun Li, Gui-Rong Xue, and Dingyi Han. Coupled group lasso for web-scale CTR prediction in display advertising. In ICML, pages 802–810, 2014.