With the rapid development of Internet communication tools in the past few decades, many different types of data vectors have become easily obtainable, advancing the development of multi-view data analysis methods (Sun, 2013; Zhao et al., 2017). Different types of vectors are called “views”, and their dimensions may differ from view to view. Typical examples are data vectors of images, text tags, and user attributes available in Social Networking Services (SNS). However, we cannot apply standard data analysis methods, such as clustering, to multi-view data vectors, because data vectors from different views, say, images and text tags, cannot be directly compared with each other. In this paper, we work on multi-view feature learning for transforming the data vectors from all the views into new vector representations called “feature vectors” in a shared Euclidean subspace.
One of the best-known approaches to multi-view feature learning is Canonical Correlation Analysis (Hotelling, 1936, CCA) for two views, and Multiset CCA (Kettenring, 1971, MCCA) for many views. CCA considers pairs of related data vectors $\{(x^{(1)}_i, x^{(2)}_i)\}_{i=1}^{n} \subset \mathbb{R}^{p_1} \times \mathbb{R}^{p_2}$. For instance, $x^{(1)}_i \in \mathbb{R}^{p_1}$ may represent an image, and $x^{(2)}_i \in \mathbb{R}^{p_2}$ may represent a text tag. Their dimensions $p_1$ and $p_2$ may be different. CCA finds linear transformation matrices $A^{(1)}, A^{(2)}$ so that the sum of inner products
$$\sum_{i=1}^{n} \langle A^{(1)\top} x^{(1)}_i, A^{(2)\top} x^{(2)}_i \rangle$$
is maximized under a variance constraint. The obtained linear transformations compute feature vectors $y^{(d)}_i := A^{(d)\top} x^{(d)}_i \in \mathbb{R}^{K}$ ($d = 1, 2$), where $K$ is the dimension of the shared space of feature vectors from the two views. However, linear transformations may not capture the underlying structure of real-world datasets because of their simplicity.
To enhance the expressiveness of the transformations, CCA has been extended to non-linear settings such as Kernel CCA and Deep CCA (Andrew et al., 2013, DCCA), which incorporate kernel methods and neural networks into CCA, respectively. These methods show drastic improvements in performance in face recognition (Zheng et al., 2006) and image-text embedding (Yan and Mikolajczyk, 2015). However, these CCA-based approaches are limited to multi-view data vectors with one-to-one correspondence across views.
Real-world datasets often have more complex association structures among the data vectors, so that the whole dataset is interpreted as a large graph with data vectors as nodes and associations as links. For example, associations between images and their tags may be many-to-many relationships, in the sense that each image has multiple associated tags and each tag has multiple associated images. The weight $w_{ij} \ge 0$, which we call the “link weight”, is defined to represent the strength of association between data vectors $x^{(1)}_i$ and $x^{(2)}_j$ ($i = 1, \ldots, n_1$; $j = 1, \ldots, n_2$). The numbers of data vectors in the two views, $n_1$ and $n_2$, may be different.
To fully utilize the complex associations represented by $\{w_{ij}\}$, Cross-view Graph Embedding (Huang et al., 2012, CvGE) and its extension to more than three views, called Cross-Domain Matching Correlation Analysis (Shimodaira, 2016, CDMCA), have recently been proposed by extending CCA to many-to-many settings. 2-view CDMCA (CvGE) obtains linear transformation matrices $A^{(1)}, A^{(2)}$ so that the weighted sum $\sum_{i=1}^{n_1} \sum_{j=1}^{n_2} w_{ij} \langle A^{(1)\top} x^{(1)}_i, A^{(2)\top} x^{(2)}_j \rangle$ is maximized under a variance constraint.
CDMCA includes various existing multi-view and graph-embedding methods as special cases. For instance, 2-view CDMCA obviously includes CCA as the special case $w_{ij} = \delta_{ij}$, where $\delta_{ij}$ is Kronecker’s delta. By considering the 1-view setting, where $x_i$ is a 1-view data vector and $w_{ij}$ is a link weight between $x_i$ and $x_j$, CDMCA reduces to Locality Preserving Projections (He and Niyogi, 2004; Yan et al., 2007, LPP). LPP in turn reduces to Spectral Graph Embedding (Chung, 1997; Belkin and Niyogi, 2001, SGE), which is interpreted as “0-view” graph embedding, by letting $x_i$ be the 1-hot vector with 1 at the $i$-th entry and 0 otherwise.
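The reduction of 2-view CDMCA to CCA can be checked numerically. The following is a minimal sketch, assuming the objective is the weighted sum of inner products of linearly transformed vectors; the function names and the toy data are hypothetical and only serve the illustration, and the variance constraint is ignored.

```python
def inner(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def transform(A, x):
    """Linear feature map x -> A^T x, with A given as a p x K list of rows."""
    return [sum(A[r][k] * x[r] for r in range(len(x))) for k in range(len(A[0]))]


def cdmca_objective(W, X1, X2, A1, A2):
    """Weighted sum of inner products over all cross-view pairs (i, j)."""
    Y1 = [transform(A1, x) for x in X1]
    Y2 = [transform(A2, x) for x in X2]
    return sum(W[i][j] * inner(Y1[i], Y2[j])
               for i in range(len(Y1)) for j in range(len(Y2)))


def cca_objective(X1, X2, A1, A2):
    """Sum of inner products over the one-to-one pairs (i, i) only."""
    return sum(inner(transform(A1, x), transform(A2, y)) for x, y in zip(X1, X2))


# Hypothetical toy data: 3 paired vectors with p1 = 3, p2 = 2 and K = 2.
X1 = [[1.0, 2.0, 0.5], [0.0, -1.0, 2.0], [3.0, 1.0, 1.0]]
X2 = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]]
A1 = [[1.0, 0.0], [0.5, -1.0], [0.0, 2.0]]
A2 = [[2.0, 1.0], [0.0, 1.0]]

# Link weights given by Kronecker's delta: only pair (i, i) is associated.
n = len(X1)
W_delta = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

print(cdmca_objective(W_delta, X1, X2, A1, A2), cca_objective(X1, X2, A1, A2))
```

With any other weight matrix, the same `cdmca_objective` also sums over many-to-many associations, which is what CCA cannot express.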
Although CDMCA performs well in word embeddings (Oshikiri et al., 2016) and image embeddings (Fukui et al., 2016), its expressiveness is still limited due to its linearity. There has been a need for a framework that can deal with many-to-many associations and non-linear transformations simultaneously. Therefore, in this paper, we propose a non-linear framework for multi-view feature learning with many-to-many associations. We name the framework Probabilistic Multi-view Graph Embedding (PMvGE). Since PMvGE generalizes CDMCA to the non-linear setting, PMvGE can be regarded as a generalization of various existing multi-view methods as well.
PMvGE is built on a simple observation: many existing approaches to feature learning consider the inner product similarity of feature vectors. For instance, the objective function of CDMCA is the weighted sum of the inner products of the feature vectors in the shared space. Turning our eyes to recent 1-view feature learning, Graph Convolutional Network (Kipf and Welling, 2017), GraphSAGE (Hamilton et al., 2017), and Inductive DeepWalk (Dai et al., 2018) assume that the inner product of feature vectors approximates the link weight $w_{ij}$.
Inspired by these existing studies, for $D$-view feature learning, PMvGE transforms the data vectors from each view by neural networks so that a function of the inner product of the resulting feature vectors approximates the link weight $w_{ij}$ for every pair of nodes. We introduce a parametric model of the conditional probability of the link weights given the data vectors, and thus PMvGE is a non-linear and probabilistic extension of CDMCA. This leads to very efficient computation of the Maximum Likelihood Estimator (MLE) with minibatch SGD.
Our contribution in this paper is summarized as follows:
We show in Section 3 that PMvGE generalizes various existing multi-view methods, at least approximately, by considering the Maximum Likelihood Estimator (MLE) with a novel probabilistic model.
We show in Section 4 that PMvGE with large-scale datasets can be efficiently computed by minibatch SGD.
We prove that PMvGE, although very simple, learns a wide class of similarity measures across views. By combining Mercer’s theorem and the universal approximation theorem, we prove in Section 5.1 that the inner product of feature vectors can approximate arbitrary continuous positive-definite similarity measures via sufficiently large neural networks. We also prove in Section 5.2 that the MLE will actually learn the correct parameter value for a sufficiently large number of data vectors.
2 Related works
0-view feature learning: There are several graph embedding methods related to PMvGE that do not use data vectors. We call them 0-view feature learning methods. Given a graph, Spectral Graph Embedding (Chung, 1997; Belkin and Niyogi, 2001, SGE) obtains feature vectors of nodes by considering the adjacency matrix. However, SGE requires time-consuming eigenvector computation due to its variance constraint. LINE (Tang et al., 2015), which is very similar to 0-view PMvGE, reduces the time complexity of SGE by proposing a probabilistic model so that no constraint is required. DeepWalk (Perozzi et al., 2014) and node2vec (Grover and Leskovec, 2016) obtain feature vectors of nodes by applying the skip-gram model (Mikolov et al., 2013) to node sequences computed by random walks over the given graph, while PMvGE directly considers the likelihood function.
1-view feature learning: Locality Preserving Projections (He and Niyogi, 2004; Yan et al., 2007, LPP) incorporates a linear transformation of data vectors into SGE. For introducing non-linear transformations, the graph neural network (Scarselli et al., 2009, GNN) defines a graph-based convolutional neural network. ChebNet (Defferrard et al., 2016) and Graph Convolutional Network (GCN) (Kipf and Welling, 2017) reduce the time complexity of GNN by approximating the convolution. These GNN-based approaches are highly expressive but not inductive. A method is called inductive if it can compute feature vectors for newly obtained vectors that are not included in the training set. GraphSAGE (Hamilton et al., 2017) and Inductive DeepWalk (Dai et al., 2018, IDW) are inductive, as is our proposal PMvGE, but the probabilistic models of these methods are different.
The Factorization Machine (FM) incorporates higher-order products of multi-view data vectors into linear regression. It can be used for link prediction across views. If only the second-order terms in FM are considered, FM is approximately the same as PMvGE with linear transformations. However, FM does not include PMvGE with neural networks.
As another line of study, the Stochastic Block Model (Holland et al., 1983; Nowicki and Snijders, 2001, SBM) is a well-known probabilistic model of graphs, whose links are generated with probabilities depending on the cluster memberships of nodes. SBM assumes a cluster structure of nodes, while our model does not.
3 Proposed Model and Its Parameter Estimation
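We consider an undirected graph consisting of $n$ nodes $\{v_i\}_{i=1}^{n}$ and link weights $\{w_{ij}\}_{i,j=1}^{n}$ satisfying $w_{ij} = w_{ji} \ge 0$ for all $i, j$, and $w_{ii} = 0$. Let $D$ be the number of views. For $D$-view feature learning, node $v_i$ belongs to one of the $D$ views, which we denote as $d_i \in \{1, \ldots, D\}$. The data vector representing the attributes (or side-information) at node $v_i$ is denoted as $x_i \in \mathbb{R}^{p_{d_i}}$ for view $d_i$ with dimension $p_{d_i}$. For 0-view feature learning, we formally let $D = 1$ and use the 1-hot vector $x_i = e_i \in \{0,1\}^{n}$. We assume that we obtain $\{w_{ij}\}_{i,j=1}^{n}$ and $\{(x_i, d_i)\}_{i=1}^{n}$ as observations. By taking $w_{ij}$ as a random variable, we consider a parametric model of the conditional probability of $w_{ij}$ given the data vectors. In Section 3.2, we consider the probability model of $w_{ij}$ with the conditional expected value
$$\mu_{ij} = E\left(w_{ij} \mid \{(x_i, d_i)\}_{i=1}^{n}\right)$$
for all $1 \le i < j \le n$. In Section 3.3, we then define PMvGE by specifying the functional form of $\mu_{ij}$ via feature vectors.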
3.2 Probabilistic model
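For deriving our probabilistic model, we first consider a random graph model with $n$ fixed nodes. At each time-point $t = 1, 2, \ldots, T$, an unordered node pair $(v_i, v_j)$ is chosen randomly with probability
$$P(e_t = (v_i, v_j)) = \frac{\mu_{ij}}{\sum_{1 \le i' < j' \le n} \mu_{i'j'}},$$
where $1 \le i < j \le n$ and $e_t$ represents the undirected link at time $t$. The parameters $\mu_{ij} \ge 0$ are interpreted as unnormalized probabilities of node pairs. We allow the same pair to be sampled several times. Given independent observations $e_1, \ldots, e_T$, we consider the number of links generated between $v_i$ and $v_j$ as
$$w_{ij} = w_{ji} := \#\{t \in \{1, \ldots, T\} : e_t = (v_i, v_j)\}.$$
The conditional probability of $\{w_{ij}\}_{1 \le i < j \le n}$ given $T$ follows a multinomial distribution. Assuming that $T$ obeys the Poisson distribution with mean $\sum_{1 \le i < j \le n} \mu_{ij}$, the probability of $\{w_{ij}\}_{1 \le i < j \le n}$ follows
$$P\left(\{w_{ij}\}_{1 \le i < j \le n}\right) = \prod_{1 \le i < j \le n} p(w_{ij}; \mu_{ij}),$$
where $p(w; \mu) = \mu^{w} e^{-\mu} / w!$ is the probability function of the Poisson distribution with mean $\mu$. Thus $w_{ij}$ follows the Poisson distribution independently as
$$w_{ij} \overset{\text{indep.}}{\sim} \mathrm{Po}(\mu_{ij}) \qquad (1)$$
for all $1 \le i < j \le n$. Although $w_{ij}$ should be a non-negative integer as an outcome of the Poisson distribution, our likelihood computation allows $w_{ij}$ to take any nonnegative real value.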
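The Poisson model for the link weights can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the names `poisson_log_pmf` and `log_likelihood` are hypothetical, and the use of the log-gamma function to extend the factorial to real-valued weights is an assumption (that term does not depend on the means, so it does not affect the maximizer).

```python
import math


def poisson_log_pmf(w, mu):
    """Log of mu^w * exp(-mu) / Gamma(w + 1) for w >= 0 and mu > 0.

    Assumption: lgamma(w + 1) replaces log(w!) so that w may be any
    nonnegative real number, not only an integer.
    """
    return w * math.log(mu) - mu - math.lgamma(w + 1.0)


def log_likelihood(W, MU):
    """Sum of independent Poisson log-probabilities over unordered pairs i < j."""
    n = len(W)
    return sum(poisson_log_pmf(W[i][j], MU[i][j])
               for i in range(n) for j in range(i + 1, n))


# Hypothetical symmetric link weights and means for n = 3 nodes.
W = [[0.0, 2.0, 0.0],
     [2.0, 0.0, 1.5],
     [0.0, 1.5, 0.0]]
MU = [[0.0, 1.8, 0.3],
      [1.8, 0.0, 1.2],
      [0.3, 1.2, 0.0]]

print(log_likelihood(W, MU))
```

For a single pair, the log-probability is maximized at a mean equal to the observed weight, which is the usual Poisson behaviour.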
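Our probabilistic model (1) includes the Stochastic Block Model (SBM) as a special case. Letting $c_i \in \{1, \ldots, C\}$ denote the cluster to which node $v_i$ belongs, SBM specifies
$$\mu_{ij} = \beta^{(c_i, c_j)}, \qquad (2)$$
where $\beta^{(c, c')} \ge 0$ is the parameter regulating the number of links whose end-points belong to clusters $c$ and $c'$. Note that SBM is interpreted as a 1-view method with a 1-hot vector indicating the cluster membership.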
3.3 Proposed model (PMvGE)
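Inspired by the existing feature learning methods reviewed above, we propose a novel model for the parameter $\mu_{ij}$ in eq. (1) by using the inner-product similarity as
$$\mu_{ij}(\alpha, \psi) := \alpha^{(d_i, d_j)} \exp\left( \left\langle f^{(d_i)}_{\psi}(x_i), f^{(d_j)}_{\psi}(x_j) \right\rangle \right). \qquad (3)$$
Here $\alpha = (\alpha^{(d,e)})$ is a symmetric $D \times D$ parameter matrix ($\alpha^{(d,e)} = \alpha^{(e,d)} \ge 0$) for regulating the sparseness of $\{w_{ij}\}$. For $y, y' \in \mathbb{R}^{K}$, $\langle y, y' \rangle = y^{\top} y'$ is simply the inner product in Euclidean space.
The functions $f^{(d)}_{\psi} : \mathbb{R}^{p_d} \to \mathbb{R}^{K}$, $d = 1, \ldots, D$, specify the non-linear transformations from data vectors to feature vectors. The parameter vector $\psi$ may take the form of a collection of matrices. These transformations can be trained by maximizing the likelihood for the probabilistic model (1) as shown in Section 3.5. PMvGE computes feature vectors $y_i := f^{(d_i)}_{\psi}(x_i)$ through maximum likelihood estimation of the transformations.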
PMvGE associates nodes and with the probability specified by the similarity between their feature vectors in the shared space. Nodes with similar feature vectors will share many links.
We consider the following neural network model for the transformation function, while any functional form can be accepted for PMvGE. Only the inner product of feature vectors is considered for measuring the similarity in PMvGE. We prove in Theorem 5.1 that the inner product with neural networks approximates a wide class of similarity measures.
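A neural network (NN) with 3 layers is defined as
$$f^{(d)}_{\psi}(x) := \sigma\left( \psi^{(d)\top}_{2} \, \sigma\left( \psi^{(d)\top}_{1} x \right) \right), \qquad (4)$$
where $x \in \mathbb{R}^{p_d}$ is the data vector, and $\psi^{(d)}_{1} \in \mathbb{R}^{p_d \times m_d}$, $\psi^{(d)}_{2} \in \mathbb{R}^{m_d \times K}$ are parameter matrices. The neural network size is specified by $m_d$ and $K$. Each element of
$\sigma$ is a user-specified activation function. The number of layers can be arbitrarily large, so the model includes deep NNs (DNN).
The NN model reduces to the linear model
$$f^{(d)}_{\psi}(x) = \psi^{(d)\top} x \qquad (5)$$
by applying the identity activation $\sigma(z) = z$, where $\psi^{(d)} \in \mathbb{R}^{p_d \times K}$ is a parameter matrix.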
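The way view-specific networks map inputs of different dimensions into one shared space can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the exponential link between the inner product and the Poisson mean, the tanh activation on both layers, and all names and toy values are assumptions made for the example.

```python
import math
import random


def matvec_t(P, x):
    """Compute P^T x for a matrix P stored as a list of rows."""
    return [sum(P[r][c] * x[r] for r in range(len(x))) for c in range(len(P[0]))]


def nn_features(x, P1, P2, act=math.tanh):
    """Two weight matrices with an element-wise activation after each."""
    hidden = [act(h) for h in matvec_t(P1, x)]
    return [act(o) for o in matvec_t(P2, hidden)]


def link_mean(alpha, y_i, y_j):
    """Assumed mean of a link weight: alpha * exp(<y_i, y_j>)."""
    return alpha * math.exp(sum(a * b for a, b in zip(y_i, y_j)))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
K, m = 2, 4  # shared feature dimension and number of hidden units (toy sizes)
params = {
    1: (random_matrix(3, m, rng), random_matrix(m, K, rng)),  # view 1: 3-dim inputs
    2: (random_matrix(5, m, rng), random_matrix(m, K, rng)),  # view 2: 5-dim inputs
}

x_img = [0.2, -1.0, 0.5]            # hypothetical view-1 data vector
x_tag = [1.0, 0.0, 0.0, 1.0, 0.0]   # hypothetical view-2 data vector
y_img = nn_features(x_img, *params[1])
y_tag = nn_features(x_tag, *params[2])
mu = link_mean(0.5, y_img, y_tag)
print(y_img, y_tag, mu)
```

Because both feature vectors live in the same $K$-dimensional space, their inner product is defined even though the raw inputs have different dimensions.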
3.4 Link weights across some view pairs may be missing
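Link weights across all the view pairs may not be available in practice. So we consider the set of unordered view pairs
$$\mathcal{D} := \{(d, e) : \text{link weights between view-}d \text{ and view-}e \text{ are observed}\},$$
and we formally set $w_{ij} = 0$ for the missing pairs $(d_i, d_j) \notin \mathcal{D}$. For example, $\mathcal{D} = \{(1,2)\}$ for $D = 2$ indicates that link weights across view-1 and view-2 are observed while link weights within view-1 or view-2 are missing. We should notice the distinction between setting $w_{ij} = 0$ for missing link weights and observing $w_{ij} = 0$ without missing, because these two cases give different likelihood functions.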
3.5 Maximum Likelihood Estimator
Under the Poisson model (1), the log-likelihood function of the link weights is, up to an additive constant that does not depend on the parameters,
$$\ell_n := \sum_{(i,j) \in \mathcal{I}_n} w_{ij} \log \mu_{ij} - \sum_{(i,j) \in \mathcal{I}_n} \mu_{ij}, \qquad (6)$$
where $\mu_{ij}$ is the Poisson mean of the link weight $w_{ij}$ and $\mathcal{I}_n$ denotes the set of node pairs $i < j$ whose view pair is observed. The Maximum Likelihood Estimator (MLE) is the maximizer of (6) with respect to the model parameters.
3.6 PMvGE approximately generalizes various methods for multi-view learning
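SBM (2) is 1-view PMvGE with the 1-hot cluster membership vector as the data vector. Consider the linear model (5) whose parameter matrix collects one $K$-dimensional feature vector for each cluster. Then the inner products of these feature vectors can specify the SBM parameters for sufficiently large $K$, and represent a low-dimensional structure of SBM for smaller $K$. SBM is also interpreted as PMvGE with as many views as clusters, by letting the view of each node be its cluster and letting the transformations be constantly zero, so that the inner product vanishes. Then the model is equivalent to SBM.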
CDMCA corresponds to PMvGE with the linear model (5): the first half of (6) then reduces, up to a term that does not depend on the transformation matrices, to the weighted sum of inner products of feature vectors (7), which is the objective function of CDMCA. CDMCA computes the transformation matrices by maximizing this objective function under a quadratic constraint.
The above observation intuitively explains why PMvGE approximates CDMCA; this is more formally discussed in Appendix C. The quadratic constraint is required for preventing the maximizer of (7) from diverging in CDMCA. However, our likelihood approach does not require it, because the last half of (6) serves as a regularization term.
3.7 PMvGE represents neural network classifiers of multiple classes
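Let $(x_i, c_i) \in \mathbb{R}^{p} \times \{1, \ldots, C\}$, $i = 1, \ldots, n$, be the training data for the classification problem of $C$ classes. The FFNN classifier with softmax function (Bishop, 2006) is defined as
$$h(x) := \arg\max_{j = 1, \ldots, C} \frac{\exp(f_j(x))}{\sum_{j'=1}^{C} \exp(f_{j'}(x))}, \qquad (8)$$
where $f(x) = (f_1(x), \ldots, f_C(x)) \in \mathbb{R}^{C}$ is a multi-valued neural network, and $x$ is classified into the class $h(x)$. This classifier is equivalent to
$$h(x) = \arg\max_{j = 1, \ldots, C} \frac{\exp(\langle f(x), e_j \rangle)}{\sum_{j'=1}^{C} \exp(\langle f(x), e_{j'} \rangle)}, \qquad (9)$$
where $e_j \in \{0,1\}^{C}$ is the 1-hot vector with 1 at the $j$-th entry and 0 otherwise.
The classifier (9) can be interpreted as PMvGE with $D = 2$ and $(1,2)$ as the only observed view pair, as follows. For view-1, the transformation is $f$ and the inputs are $x_1, \ldots, x_n$. For view-2, the transformation is the identity map and the inputs are $e_1, \ldots, e_C$. We set the link weight to 1 between $x_i$ and $e_{c_i}$, and to 0 otherwise.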
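The equivalence between the softmax classifier and the inner-product form with 1-hot vectors can be verified directly. The sketch below uses hypothetical network outputs `scores` in place of a trained network; the function names are illustrative only.

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of scores."""
    shift = max(z)
    e = [math.exp(v - shift) for v in z]
    s = sum(e)
    return [v / s for v in e]


def one_hot(j, C):
    """1-hot vector of length C with 1 at entry j (0-based here)."""
    return [1.0 if k == j else 0.0 for k in range(C)]


def class_probs_inner_product(scores):
    """Class probabilities written as exp(<f(x), e_j>) normalized over j."""
    C = len(scores)
    sims = [math.exp(sum(a * b for a, b in zip(scores, one_hot(j, C))))
            for j in range(C)]
    total = sum(sims)
    return [s / total for s in sims]


scores = [0.3, -1.2, 2.0]  # hypothetical outputs f(x) of a 3-class network
p_softmax = softmax(scores)
p_inner = class_probs_inner_product(scores)
pred_softmax = max(range(len(scores)), key=lambda j: p_softmax[j])
pred_inner = max(range(len(scores)), key=lambda j: p_inner[j])
print(pred_softmax, pred_inner)
```

The inner product with a 1-hot vector simply selects one coordinate of the network output, which is why the two forms coincide.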
In this section, we present an efficient way of optimizing the parameters for maximizing the objective function defined in eq. (6). We alternately optimize the two parameters: the transformation parameter $\psi$ and the view-pair parameter matrix $\alpha$. An efficient update of $\psi$ with minibatch SGD is considered in Section 4.1, and an update of $\alpha$ by solving an estimating equation is considered in Section 4.2. We iterate these two steps for maximizing the objective function.
4.1 Update of $\psi$
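We update $\psi$ using the gradient of the log-likelihood (6) while fixing the current value of $\alpha$. Since the link weights $\{w_{ij}\}$ may be sparse in practice, the computational cost of minibatch SGD (Goodfellow et al., 2016) for the first half of (6) is expected to be reduced by considering the sum over only the set of node pairs with positive link weights. On the other hand, every observed node pair contributes a positive term to the last half of (6), so we consider the sum over node pairs uniformly resampled from the set of all observed node pairs.
We make two minibatch sets by picking node pairs from the set of positively weighted pairs and from the set of all observed pairs, respectively, so that their sizes are determined by two user-specified constants.
These user-specified constants are usually called the “minibatch size” and the “negative sampling rate”. We sequentially update the minibatch for computing the gradient of the minibatch objective (10), which consists of the first half of (6) summed over the first set minus the last half of (6) summed over the second set and multiplied by a tuning parameter $\tau > 0$,
with respect to $\psi$. By utilizing the gradient, the parameter $\psi$ can be sequentially updated by SGD. Eq. (10) approximates (6) if the node pairs are uniformly resampled and $\tau$ is chosen to compensate for the subsampling; however, a smaller $\tau$ may make this algorithm more stable in some cases.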
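The sampling scheme of Section 4.1 can be sketched as follows. The exact relation between the two set sizes and the user-specified constants is an assumption here (`m` positively weighted pairs and `r` times as many uniformly resampled pairs), as are the function names; the sketch evaluates the minibatch objective only and leaves the gradient step to an automatic differentiation library.

```python
import math
import random


def sample_minibatch(pos_pairs, all_pairs, m, r, rng):
    """Pick m pairs with positive weight and r times as many uniform pairs.

    Assumption: this sizing convention is one possible reading of
    'minibatch size' m and 'negative sampling rate' r.
    """
    pos = rng.sample(pos_pairs, min(m, len(pos_pairs)))
    neg = [rng.choice(all_pairs) for _ in range(r * len(pos))]
    return pos, neg


def minibatch_objective(pos, neg, weight, mean, tau=1.0):
    """First half over positive pairs minus tau times last half over resampled pairs."""
    first = sum(weight[p] * math.log(mean(p)) for p in pos)
    second = sum(mean(p) for p in neg)
    return first - tau * second


# Hypothetical toy graph with n = 5 nodes and three observed links.
n = 5
all_pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
weight = {(0, 1): 1.0, (1, 2): 1.0, (3, 4): 1.0}
pos_pairs = sorted(weight)

rng = random.Random(0)
pos, neg = sample_minibatch(pos_pairs, all_pairs, m=2, r=3, rng=rng)
# A constant mean of 2.0 stands in for the model's mean function.
value = minibatch_objective(pos, neg, weight, mean=lambda p: 2.0)
print(pos, neg, value)
```

The cost of one evaluation depends only on the two minibatch sizes, not on the total number of node pairs.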
4.2 Update of $\alpha$
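Let $\hat{\psi}$ represent the current parameter value of $\psi$. By solving the estimating equation obtained by setting the derivative of the log-likelihood (6) with respect to $\alpha$ to zero, under the symmetry and non-negativity constraints on $\alpha$, we obtain a local maximizer of (6). However, the computation of this local maximizer requires a sum over all observed node pairs. To reduce the high computational cost, we update $\alpha$ by (11), which is defined on the minibatch.
The update (11) is computed separately for each view pair, using only the minibatch node pairs whose end-points belong to that view pair. Note that the view pair is unordered.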
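A closed-form minibatch update for the view-pair parameters can be sketched under the assumption that the Poisson mean factorizes as the view-pair parameter times the exponential of the feature inner product. Setting the derivative of the minibatch objective to zero then gives a ratio of sums per unordered view pair. The names below are hypothetical, and this is one way to realize the estimating equation rather than a transcription of eq. (11).

```python
import math
from collections import defaultdict


def view_pair(view, i, j):
    """Unordered view pair of nodes i and j, stored as a sorted tuple."""
    return tuple(sorted((view[i], view[j])))


def update_alpha(pos, neg, weight, sim, view, tau=1.0):
    """Per view pair: (sum of weights) / (tau * sum of exp(similarity)).

    This maximizes sum_pos w * log(alpha * e^s) - tau * sum_neg alpha * e^s
    over alpha, with the feature similarities s held fixed.
    """
    num = defaultdict(float)
    den = defaultdict(float)
    for i, j in pos:
        num[view_pair(view, i, j)] += weight[(i, j)]
    for i, j in neg:
        den[view_pair(view, i, j)] += math.exp(sim(i, j))
    return {k: num[k] / (tau * den[k]) for k in den}


# Hypothetical toy setting: nodes 0, 1 in view 1 and nodes 2, 3 in view 2.
view = {0: 1, 1: 1, 2: 2, 3: 2}
pos = [(0, 2), (1, 3)]
weight = {(0, 2): 1.0, (1, 3): 2.0}
neg = [(0, 2), (0, 3), (1, 2)]
alpha = update_alpha(pos, neg, weight, sim=lambda i, j: 0.0, view=view)
print(alpha)
```

Sorting the two view labels makes the key independent of the order of the end-points, which matches the unordered view pair of the text.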
4.3 Computational cost
PMvGE requires operations for each minibatch iteration. It is efficiently computed even if the number of data vectors is very large.
5 PMvGE Learns Arbitrary Similarity Measure
Two theoretical results are shown here, indicating that PMvGE with sufficiently large neural networks learns an arbitrary similarity measure given sufficiently many data vectors. In Section 5.1, we prove that an arbitrary similarity measure can be approximated by the inner product in the shared space with sufficiently large neural networks. In Section 5.2, we prove that the MLE of PMvGE converges to the true parameter value, i.e., the consistency of MLE, in some sense as the number of data vectors increases.
5.1 Inner product of NNs approximates a wide class of similarity measures across views
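Theorem 5.1. Let $f^{(d)}_{*}$, $d = 1, \ldots, D$, be continuous functions on compact domains and $g_{*}$ be a symmetric, continuous, and positive-definite kernel function on their ranges. The activation function $\sigma$ is ReLU or an activation function which is non-constant, continuous, bounded, and monotonically increasing. Then, for arbitrary $\varepsilon > 0$, by specifying a sufficiently large feature dimension $K$ and a sufficiently large number of hidden units, there exist neural network parameters such that
$$\left| g_{*}\left( f^{(d)}_{*}(x), f^{(e)}_{*}(x') \right) - \left\langle f^{(d)}_{\psi}(x), f^{(e)}_{\psi}(x') \right\rangle \right| < \varepsilon$$
for all $x$, $x'$ in the domains and all $d, e \in \{1, \ldots, D\}$, where $f^{(d)}_{\psi}$, $d = 1, \ldots, D$, are two-layer neural networks and $\sigma$ is applied element-wise.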
Proof of the theorem is given in Appendix A.
In the special case of a single view with the identity function $f_{*}(x) = x$, Theorem 5.1 corresponds to Mercer’s theorem (Mercer, 1909; Courant and Hilbert, 1989) of kernel methods, which states that an arbitrary positive-definite kernel can be expressed as the inner product of high-dimensional feature maps. While Mercer’s theorem indicates only the existence of such feature maps, Theorem 5.1 also states that the feature maps can be approximated by neural networks.
Illustrative example We define with , and , . For 2-dim visualization in Fig. 2 with , let us define and its approximation by the inner product of neural networks with hidden units. If and are sufficiently large, approximates very well as suggested by Theorem 5.1.
5.2 MLE converges to the true parameter value
We have shown the universal approximation theorem of similarity measure in Theorem 5.1. However, it only states that the good approximation is achieved if we properly tune the parameters of neural networks. Here we argue that MLE of Section 3.5 will actually achieve the good approximation if we have sufficiently many data vectors. The technical details of the argument are given in Appendix B.
Let $\theta$ denote the vector of free parameters in $(\alpha, \psi)$, and $\ell_n(\theta)$ be the log-likelihood function (6). We assume that the optimization algorithm in Section 4 successfully computes the MLE $\hat{\theta}_n$ that maximizes $\ell_n(\theta)$. Here we ignore the difficulty of global optimization, while we may only get a local maximum in practice. We also assume that PMvGE is correctly specified; there exists a parameter value $\theta_{*}$ so that the parametric model represents the true probability distribution.
Then, we would like to claim that $\hat{\theta}_n$ converges to the true parameter value $\theta_{*}$ in the limit of $n \to \infty$, the property called the consistency of MLE. However, we have to pay careful attention to the fact that PMvGE is not a standard setting in the sense that (i) the link weights form correlated samples instead of i.i.d. observations, and (ii) the model is not identifiable, with infinitely many equivalent parameter values; for example, there are rotational degrees of freedom in the shared space so that
$$\langle y, y' \rangle = \langle O y, O y' \rangle$$
with any orthogonal matrix $O$ in $\mathbb{R}^{K \times K}$. We then consider the set of equivalent parameters $\hat{\Theta}_n := \{\theta : \ell_n(\theta) = \ell_n(\hat{\theta}_n)\}$. Theorem B.2 states that, as $n \to \infty$, $\hat{\Theta}_n$ converges to $\Theta_{*}$, the set of values equivalent to $\theta_{*}$.
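The rotational non-identifiability mentioned in Section 5.2 is easy to see numerically: rotating all feature vectors by the same orthogonal matrix leaves every inner product, and hence the likelihood, unchanged. The sketch below uses a 2-dimensional rotation and hypothetical feature vectors.

```python
import math


def inner(u, v):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))


def rotate(y, theta):
    """Apply the 2 x 2 rotation (an orthogonal matrix) by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * y[0] - s * y[1], s * y[0] + c * y[1]]


y_i, y_j = [0.7, -0.2], [1.5, 0.4]  # hypothetical feature vectors, K = 2
before = inner(y_i, y_j)
after = inner(rotate(y_i, 0.9), rotate(y_j, 0.9))
print(before, after)
```

The feature vectors themselves change under the rotation, so only the equivalence class of parameters can be recovered from the data.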
6 Real Data Analysis
6.1 Experiments on Citation dataset (1-view)
Dataset: We use Cora dataset (Sen et al., 2008) of citation network with 2,708 nodes and 5,278 ordered edges. Each node represents a document, which has -dimensional (bag-of-words) data vector and a class label of 7 classes. Each directed edge represents citation from a document to another document . We set the link weight as by ignoring the direction, and otherwise. There is no cross or self-citation. We divide the dataset into training set consisting of nodes () with their edges, and test set consisting of remaining nodes () with their edges. We utilize of the training set for validation. Hyper-parameters are tuned by utilizing the validation set.
We compare PMvGE with Stochastic Block Model (Holland et al., 1983, SBM), ISOMAP (Tenenbaum et al., 2000), Locally Linear Embedding (Roweis and Saul, 2000, LLE), Spectral Graph Embedding (Belkin and Niyogi, 2001, SGE), Multi Dimensional Scaling (Kruskal, 1964, MDS), DeepWalk (Perozzi et al., 2014), and GraphSAGE (Hamilton et al., 2017).
NN for PMvGE: 2-layer fully-connected network, which consists of 3,000 tanh hidden units and 1,000 tanh output units, is used. The network is trained by Adam (Kingma and Ba, 2015)
with batch normalization. The learning rate is starting fromand attenuated by for every 100 iterations. Negative sampling rate and minibatch size are set as and , respectively, and the number of iterations is .
For each method, parameters are tuned on validation sets. Especially, the dimension of feature vectors is selected from .
Label classification (Task 1): We classify the documents into 7 classes using logistic regression with the feature vector as input and the class label as output. We utilize LibLinear (Fan et al., 2008) for the implementation.
Clustering (Task 2): We cluster the documents using $k$-means clustering of the feature vectors. The number of clusters is set as .
The quality of classification is evaluated by classification accuracy in Task 1, and by Normalized Mutual Information (NMI) in Task 2. Sample averages and standard deviations over 10 experiments are shown in Table 2. In experiment (A), we apply the methods to both the training set and the test set, and evaluate them on the test set. In (B), we apply the methods to only the training set, and evaluate them on the test set. SGE, MDS, and DeepWalk are not inductive, and they cannot be applied to unseen data vectors in (B). PMvGE outperforms the other methods in both experiments.
6.2 Experiments on AwA dataset (2-view)
Dataset: We use Animal with Attributes (AwA) dataset (Lampert et al., 2009) with 30,475 images for view-1 and 85 attributes for view-2. We prepared dimensional DeCAF data vector (Donahue et al., 2014) for each image, and dimensional GloVe (Pennington et al., 2014) data vector for each attribute. Each image is associated with some attributes. We set for the associated pairs between the two views, and otherwise. In addition to the attributes, each image has a class label of 50 classes. We resampled 50 images from each of 50 classes; in total, 2500 images. In each experiment, we split the 2500 images into 1500 training images and 100 test images. A validation set of 300 images is sampled from the training images.
We compare PMvGE with CCA, DCCA (Andrew et al., 2013), SGE, DeepWalk, and GraphSAGE.
NN for PMvGE:
Each view uses a 2-layer fully-connected network, which consists of
2000 tanh hidden units and 100 tanh output units.
The dimension of the feature vector is .
Adam is used for optimization with Batch normalization and Dropout ().
Minibatch size, learning rate, and momentum are tuned on the validation set.
We monitor the score on the validation set for early stopping.
For each method, parameters are tuned on validation sets. Especially, the dimension of feature vectors is selected from .
Link prediction (Task 3):
For each query image, we rank attributes according to the cosine similarity of feature vectors across views. An attribute is regarded as correct if it is associated with the query image.
The quality of the ranked list of attributes is measured by Average Precision (AP) score in Task 3.
Sample averages and standard deviations over 10 experiments are shown in Table 2.
In experiment (A), we apply methods to both training set and test set, and evaluate them by test set.
The whole training set is used for validation.
In experiment (B), we apply methods to only the training set, and evaluate them by test set.
20% of the training set is used for validation.
PMvGE outperforms the other methods including DCCA.
While DeepWalk shows good performance in experiment (A), DeepWalk and SGE cannot be applied to unseen data vectors in (B).
Unlike SGE and DeepWalk which only consider the associations, 1-view feature learning methods such as GraphSAGE cannot be applied to this AwA dataset since the dimension of data vectors is different depending on the view. So we do not perform 1-view methods in Task 3.
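The ranking-based evaluation of Task 3 can be sketched as follows. The text does not spell out which variant of Average Precision is used, so the standard definition (mean of the precision values at the ranks of the correct items) is assumed; the feature vectors and names below are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def average_precision(query, candidates, relevant):
    """Rank candidates by cosine similarity to the query and compute AP."""
    order = sorted(range(len(candidates)),
                   key=lambda j: cosine(query, candidates[j]), reverse=True)
    hits, total = 0, 0.0
    for rank, j in enumerate(order, start=1):
        if j in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical feature vectors in the shared space (K = 2).
image_feature = [1.0, 0.0]
attribute_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
# Attributes 0 and 1 are the ones associated with the query image.
ap = average_precision(image_feature, attribute_features, relevant={0, 1})
print(ap)
```

In this toy case the correct attributes land at ranks 1 and 3, so the score is the mean of 1 and 2/3.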
Locality of each-view is preserved through neural networks: To see whether the locality of input is preserved through neural networks in PMvGE, we computed the Spearman’s rank correlation coefficient between and for view- data vectors in AwA dataset (). For DeCAF (view-1) and GloVe (view-2) inputs, the values are and , respectively. This result indicates that the feature vectors of PMvGE preserves the similarities of the data vectors fairly well.
Table 2: Results of Task 1 (label classification), Task 2 (clustering), and Task 3 (link prediction) for the compared methods.
We presented a simple probabilistic framework for multi-view learning with many-to-many associations. We name the framework Probabilistic Multi-view Graph Embedding (PMvGE). Various existing methods are approximately included in PMvGE. We gave a theoretical justification and a practical estimation algorithm for PMvGE. Experiments on real-world datasets showed that PMvGE outperforms existing methods.
Appendix A Proof of Theorem 5.1
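Since $g_{*}$ is a positive definite kernel on a compact set, it follows from Mercer’s theorem that there exist positive eigenvalues $\{\lambda_k\}_{k=1}^{\infty}$
and continuous eigenfunctions $\{\phi_k\}_{k=1}^{\infty}$ such that
$$g_{*}(y, y') = \sum_{k=1}^{\infty} \lambda_k \phi_k(y) \phi_k(y'),$$
where the convergence is absolute and uniform (Minh et al., 2006). The uniform convergence implies that for any $\varepsilon_1 > 0$ there exists $K_0 \in \mathbb{N}$ such that
$$\sup_{y, y'} \left| g_{*}(y, y') - \sum_{k=1}^{K} \lambda_k \phi_k(y) \phi_k(y') \right| < \varepsilon_1 \quad \text{for all } K \ge K_0.$$
This means $g_{*}(y, y') \approx \langle \Phi_K(y), \Phi_K(y') \rangle$ for a feature map $\Phi_K(y) := \left( \sqrt{\lambda_1}\, \phi_1(y), \ldots, \sqrt{\lambda_K}\, \phi_K(y) \right)$.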
We fix and consider approximation of below. Since are continuous functions on a compact set, there exists such that
Let us write the neural networks as , where , , , are two-layer neural networks with hidden units. Since are continuous functions, it follows from the universal approximation theorem (Cybenko, 1989; Telgarsky, 2017) that for any , there exists such that
for . Therefore, for all , we have