## 1 Introduction

When observations consist of multiple views or modalities of the same underlying source of variation, a learning algorithm should efficiently account for the complementary information to alleviate learning difficulty (chaudhuri2009mvClusteringCCA) and improve accuracy. A well-established method for two-view analysis is canonical correlation analysis (CCA) (hotelling1992CCA), a classical subspace learning technique that extracts the common information between two multivariate random variables by projecting them onto a subspace. CCA, as a standard model for unsupervised two-view learning, has been used in a broad range of tasks such as dimensionality reduction, visualization, and time series analysis (xia2014robust).

The goal of representation learning is to capture the essence of data and extract its natural features. Such features can be categories or cluster memberships. In multi-view data, the relationship between different views should be leveraged by the representation learning algorithm to enhance feature extraction. Learning representations in real-world applications, where the data is typically high-dimensional with complex structure, poses significant challenges and necessitates flexible and expressive yet scalable models such as deep generative neural networks.

It has been shown in (chaudhuri2009mvClusteringCCA) that by projecting multi-view data onto low-dimensional subspaces using CCA, cluster memberships can be recovered under a weak separation condition, resulting in easier clustering in the subspace. Nevertheless, CCA exhibits poor generalization when trained on small training sets; therefore, (klami2007localDependentComponents; klami2013bayesianCCA) adopt a Bayesian approach to a probabilistic interpretation of CCA. However, real applications involve nonlinear subspaces where more than two views are available. Recently, deep learning has received renewed interest as a standard approach for building highly expressive models for increasingly complex datasets, and several deep learning based approaches have been successfully extended to multi-view learning (ngiam2011multimodalDL; andrew2013DCCA; wang2015deepMV; wang2017acousticVCCA; abavisani2018deep).

##### Main contributions
In this work, we present a modified formulation of probabilistic CCA that enables an extension to deep generative models; generalizing the approach to an arbitrary number of views is also discussed. In this approach, linear probabilistic layers are extended to deep generative multi-view networks that capture the variations of the views by a *shared latent representation*, which describes most of the variability (essence) of the multi-view data, and a set of *view-specific factors*. Variational inference provides a powerful tool for scaling probabilistic models to complex problems and large-scale datasets (rezende2014stochastic; kingma2013VAE; rezende_shakir2015_variationalInfWithNormalizingFlows). Hence, to design a scalable training algorithm, we follow variational inference principles, taking into account the probabilistic CCA formulation, to obtain an approximate posterior for the latent linear multi-view layer. Empirical studies confirm that the proposed deep generative multi-view model can efficiently integrate the relationship between multiple views to alleviate learning difficulty, which is the goal of multi-view learning approaches (chaudhuri2009mvClusteringCCA).

##### Notation and Definitions

Throughout the paper, bold lowercase variables denote vectors (*e.g.* $\mathbf{x}$) or vector-valued random variables, bold uppercase variables are used for matrices (*e.g.* $\mathbf{X}$) or matrix-valued random variables, and unbold lowercase variables are scalars (*e.g.* $x$) or scalar random variables. The transpose of a matrix is denoted $\mathbf{X}^\top$, and $\mathbf{e}_j$ is the standard basis vector with a 1 at the $j$-th position. There are $M$ views in total, and subscripts are intended to identify view-specific variables (*e.g.* $\mathbf{x}_i$ for view $i$), which is different from an element of a vector, which is also specified by a subscript (*e.g.* $x_j$). The difference should be clear from context.

## 2 Probabilistic CCA

Canonical correlation analysis (CCA) (hotelling1992CCA) is a classical subspace learning method that extracts information from the cross-correlation between two variables. Let $\mathbf{x}_1 \in \mathbb{R}^{d_1}$ and $\mathbf{x}_2 \in \mathbb{R}^{d_2}$ be a pair of random vectors corresponding to two different views. CCA linearly projects these onto a $d$-dimensional subspace as $\mathbf{z}_1 = \mathbf{U}^\top \mathbf{x}_1$ and $\mathbf{z}_2 = \mathbf{V}^\top \mathbf{x}_2$, where $\mathbf{U} \in \mathbb{R}^{d_1 \times d}$ and $\mathbf{V} \in \mathbb{R}^{d_2 \times d}$, such that each pair of components is maximally correlated when the indices match and uncorrelated otherwise. Let $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_{ii}$ be the mean and covariance matrix of $\mathbf{x}_i$, $i = 1, 2$, and $\boldsymbol{\Sigma}_{12}$ their cross-covariance. Then CCA can be formulated as the optimization problem

$$\max_{\mathbf{U},\,\mathbf{V}} \; \operatorname{tr}\big(\mathbf{U}^\top \boldsymbol{\Sigma}_{12} \mathbf{V}\big) \quad \text{s.t.} \quad \mathbf{U}^\top \boldsymbol{\Sigma}_{11} \mathbf{U} = \mathbf{V}^\top \boldsymbol{\Sigma}_{22} \mathbf{V} = \mathbf{I}_d \tag{1}$$

Given $(\tilde{\mathbf{u}}_j, \tilde{\mathbf{v}}_j)$ as the pairs of left and right singular vectors corresponding to the $d$ largest singular values $p_1 \geq \dots \geq p_d$ of the correlation matrix $\boldsymbol{\Sigma}_{11}^{-1/2} \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1/2}$, the solution to the CCA problem is given by $(\mathbf{u}_j, \mathbf{v}_j) = (\boldsymbol{\Sigma}_{11}^{-1/2} \tilde{\mathbf{u}}_j,\; \boldsymbol{\Sigma}_{22}^{-1/2} \tilde{\mathbf{v}}_j)$, where the $(\mathbf{u}_j, \mathbf{v}_j)$, also called the canonical pairs of directions, form the columns of $\mathbf{U}$ and $\mathbf{V}$, and $\mathbf{P}_d = \mathrm{diag}(p_1, \dots, p_d)$ is the diagonal matrix of canonical correlations.

bachPCCA and browne1979IBFA proposed a probabilistic generative interpretation of the classical CCA problem that reveals the shared latent representation explicitly. An extension of their results to a more flexible model can be expressed as follows.
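As a concrete illustration, this SVD-based solution can be sketched in a few lines of numpy. This is a minimal sketch with our own function and variable names; a small ridge term is added to the covariance estimates purely for numerical stability.

```python
import numpy as np

def cca(X1, X2, d, reg=1e-6):
    """Classical CCA: whiten each view, then SVD the cross-covariance.

    X1: (n, p1), X2: (n, p2). Returns canonical direction matrices
    U (p1, d), V (p2, d) and the canonical correlations p (d,)."""
    X1 = X1 - X1.mean(axis=0)
    X2 = X2 - X2.mean(axis=0)
    n = X1.shape[0]
    S11 = X1.T @ X1 / n + reg * np.eye(X1.shape[1])   # view-1 covariance
    S22 = X2.T @ X2 / n + reg * np.eye(X2.shape[1])   # view-2 covariance
    S12 = X1.T @ X2 / n                               # cross-covariance

    def inv_sqrt(S):
        # symmetric inverse square root via eigendecomposition
        w, Q = np.linalg.eigh(S)
        return Q @ np.diag(w ** -0.5) @ Q.T

    K = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)           # correlation matrix
    Ut, p, Vt = np.linalg.svd(K)
    U = inv_sqrt(S11) @ Ut[:, :d]                     # canonical directions
    V = inv_sqrt(S22) @ Vt[:d].T
    return U, V, p[:d]
```

On two views generated from a common source, the leading canonical correlation approaches one while all correlations remain in [0, 1].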

###### Theorem 1

Assume the probabilistic generative model for the graphical model in Figure 1 as:

$$\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}_z, \mathbf{I}_d), \qquad \mathbf{x}_i \mid \mathbf{z} \sim \mathcal{N}(\mathbf{W}_i \mathbf{z} + \boldsymbol{\mu}_i, \boldsymbol{\Psi}_i), \quad i = 1, 2, \tag{2}$$

where $\mathbf{z} \in \mathbb{R}^d$ is the shared latent representation. The maximum likelihood estimate of the parameters of this model can be expressed in terms of the canonical correlation directions as

$$\hat{\mathbf{W}}_i = \boldsymbol{\Sigma}_{ii}\, \mathbf{U}_{id}\, \mathbf{M}_i, \quad i = 1, 2, \tag{3}$$

where $\mathbf{U}_{id}$ has the first $d$ canonical directions of view $i$ as its columns, $\mathbf{M}_1, \mathbf{M}_2 \in \mathbb{R}^{d \times d}$ are arbitrary matrices with spectral norms smaller than one that satisfy $\mathbf{M}_1 \mathbf{M}_2^\top = \mathbf{P}_d$, and the residual error terms can be defined as $\hat{\boldsymbol{\Psi}}_1 = \boldsymbol{\Sigma}_{11} - \hat{\mathbf{W}}_1 \hat{\mathbf{W}}_1^\top$ and $\hat{\boldsymbol{\Psi}}_2 = \boldsymbol{\Sigma}_{22} - \hat{\mathbf{W}}_2 \hat{\mathbf{W}}_2^\top$. This probabilistic graphical model induces conditional independence of $\mathbf{x}_1$ and $\mathbf{x}_2$ given $\mathbf{z}$. The parameter $\boldsymbol{\mu}_z$ is not identifiable by maximum likelihood.

Proof: See Appendix A.
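As an illustration of Theorem 1, the following sketch constructs probabilistic CCA parameters from sample covariances in the style of the Bach–Jordan solution, using the particular choice $\mathbf{M}_1 = \mathbf{M}_2 = \mathbf{P}_d^{1/2}$ (one valid choice among many; names are our own).

```python
import numpy as np

def pcca_params(S11, S22, S12, d):
    """Probabilistic CCA parameters from sample covariances (a sketch in the
    style of Bach & Jordan), using the choice M1 = M2 = P_d^(1/2)."""
    def inv_sqrt(S):
        w, Q = np.linalg.eigh(S)
        return Q @ np.diag(w ** -0.5) @ Q.T

    R11, R22 = inv_sqrt(S11), inv_sqrt(S22)
    Ut, p, Vt = np.linalg.svd(R11 @ S12 @ R22)   # SVD of correlation matrix
    U1 = R11 @ Ut[:, :d]                         # canonical directions, view 1
    U2 = R22 @ Vt[:d].T                          # canonical directions, view 2
    M = np.diag(np.sqrt(p[:d]))                  # M1 = M2, so M1 @ M2.T = P_d
    W1, W2 = S11 @ U1 @ M, S22 @ U2 @ M          # loading matrices
    Psi1 = S11 - W1 @ W1.T                       # residual covariances
    Psi2 = S22 - W2 @ W2.T
    return W1, W2, Psi1, Psi2
```

When $d$ equals the rank of the cross-covariance, the implied model cross-covariance $\mathbf{W}_1 \mathbf{W}_2^\top$ recovers $\boldsymbol{\Sigma}_{12}$ exactly, which gives a quick sanity check on the construction.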

In contrast to the results in (bachPCCA), where $\boldsymbol{\mu}_z = \mathbf{0}$, here we introduce $\boldsymbol{\mu}_z$ as an extra degree of freedom. We will see in the following the important role of this parameter in optimizing the variational bound on the likelihood, and also in the inference of the shared representation of deep probabilistic CCA. We will also derive an analytical form to identify it based on the parameters of the probabilistic multi-view layer, and develop an identification method for the arbitrary matrices $\mathbf{M}_i$.

### 2.1 Generalization to arbitrary number of views

As an extension of probabilistic CCA to an arbitrary number of views, (archambeau2009sparse) proposed a general probabilistic model as follows:

$$\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}_z, \mathbf{I}), \qquad \mathbf{h}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad \mathbf{x}_i = \mathbf{W}_i \mathbf{z} + \mathbf{V}_i \mathbf{h}_i + \boldsymbol{\mu}_i + \boldsymbol{\epsilon}_i, \quad i = 1, \dots, M, \tag{4}$$

where $\boldsymbol{\mu}_i$ and $\boldsymbol{\epsilon}_i$ are the view-specific offsets and residual errors, respectively. This model can also be viewed as multibattery factor analysis (MBFA) (klami2014GroupFactorAnalysis; browne1980MBFA) in the statistics literature, which describes the statistical dependence between all the views by a single latent vector, $\mathbf{z}$, and explains away the view-specific variations by factors private to each view. Moreover, restricting to a single view, if the prior on the view-specific factor is a multivariate independent Gaussian, this model includes probabilistic factor analysis as a special case, and it further reduces to probabilistic PCA if the prior is isotropic as well. archambeau2009sparse followed a Bayesian approach to the linear generative model (4) and proposed a variational Expectation-Maximization algorithm to estimate the model parameters.
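To make the generative process of (4) concrete, the following sketch draws samples from a toy multi-view model with a shared factor and view-specific factors, assuming standard Gaussian priors on the latent factors. Dimensions and names are illustrative, not the paper's.

```python
import numpy as np

def sample_mbfa(W, V, mu, noise_std, n, seed=0):
    """Draw n samples per view from a linear multi-view generative model:
    x_i = W_i z + V_i h_i + mu_i + eps_i, with standard Gaussian z and h_i.
    W, V, mu, noise_std are per-view lists."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, W[0].shape[1]))        # shared latent factor
    views = []
    for Wi, Vi, mi, si in zip(W, V, mu, noise_std):
        h = rng.standard_normal((n, Vi.shape[1]))      # view-specific factor
        eps = si * rng.standard_normal((n, Wi.shape[0]))
        views.append(z @ Wi.T + h @ Vi.T + mi + eps)   # observed view
    return views, z
```

Because the view-specific factors and noise are independent across views, the cross-covariance between two views is $\mathbf{W}_1 \mathbf{W}_2^\top$, which an empirical check on the samples confirms.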

A reformulation for the parameters of this general model inspired by the maximum likelihood solutions of probabilistic CCA in Theorem 1 is presented in Appendix B.

Although constraining the observation models to the classical linear model (2) offers closed-form inference for the latent variable(s) and efficient training algorithms, the resulting capacity is very limited for modeling increasingly complex data distributions.
On the other hand, the generative descriptions of the probabilistic models in general, and the probabilistic multi-view models (2) and (4) in particular, can be extended naturally as the building blocks of more complex hierarchical models (klami2013bayesianCCA).
A well-established method to increase the capacity and improve the expressiveness of such models is
using deep neural networks to capture nonlinear and complex structures in data distribution,
therefore one can append deep generative networks on top of the linear probabilistic model to obtain a
combined model denoted in this work as *deep probabilistic CCA* or *deep probabilistic multi-view* network in general.
A graphical representation of this model is depicted in Figure 1.
Let $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_M\}$ denote the collection of observations of all views and $\mathbf{Z} = \{\mathbf{z}, \mathbf{h}_1, \dots, \mathbf{h}_M\}$ be the collection of the shared latent representation and the latent variables corresponding to each view. The latent linear probabilistic CCA layer of the form presented in (2) models the linear cross-correlation between all latent variables in the latent space, while the nonlinear generative observation networks, also called decoders in the context of variational auto-encoders, are responsible for expressing the complex variations of each view. The observation models are described by deep neural networks with the set of model parameters $\boldsymbol{\theta}$.
In the following, an approximate variational inference approach is presented for training of this deep generative multi-view model.

### 2.2 Variational inference

To obtain the maximum likelihood estimate of the model parameters, it is desirable to maximize the marginal data log-likelihood averaged over the dataset. This objective requires marginalization over all the latent variables, which entails the expectation of the likelihood function over the prior distribution of the set of latent variables, $p(\mathbf{Z})$. The marginalization is typically intractable for flexible models; hence, one workaround is to follow the variational inference principle (jordan1999intro_variational) by introducing an approximate posterior distribution (also known as a variational inference network in the context of amortized variational inference) and maximizing a resulting variational lower bound on the marginal log-likelihood. This approach has recently attained renewed interest, has been studied extensively, and is considered a default, flexible statistical inference method. The approximate posterior is often modeled by deep NNs with parameters $\boldsymbol{\phi}$. Therefore, we obtain the variational lower bound (rezende2014stochastic)

$$\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{Z} \mid \mathbf{X})}\big[\log p_{\boldsymbol{\theta}}(\mathbf{X} \mid \mathbf{Z})\big] - \mathrm{KL}\big(q_{\boldsymbol{\phi}}(\mathbf{Z} \mid \mathbf{X}) \,\|\, p(\mathbf{Z})\big) \tag{5}$$

This bound, also known as the *evidence lower bound (ELBO)*, can be decomposed into two main terms. The first is the expectation of the log-likelihood function, known as the negative reconstruction error. The conditional independence structure of the deep generative multimodal model implies that the likelihood function can be factored, hence the negative reconstruction error can be expressed as a sum over the views. Although these expectations do not typically admit a closed analytical form, they can be approximated by Monte Carlo estimation, drawing random samples from the approximate posterior for each data point; this, indeed, leads to the Monte Carlo approximation of the gradient of the expected log-likelihood required for stochastic gradient descent training (rezende2014stochastic).

The second term in the ELBO is the KL divergence between the approximate posterior and the prior distribution of the latent variables, which acts as a regularizer injecting prior knowledge about the latent variables into the learning algorithm. Taking into account the conditional independence of the latent variables induced by the probabilistic graphical model of the latent linear layer (2), the approximate posterior of the set of latent variables can be factorized; therefore, the KL divergence term can be decomposed as

$$\mathrm{KL}\big(q_{\boldsymbol{\phi}}(\mathbf{Z} \mid \mathbf{X}) \,\|\, p(\mathbf{Z})\big) = \mathrm{KL}\big(q_{\boldsymbol{\phi}}(\mathbf{z} \mid \mathbf{X}) \,\|\, p(\mathbf{z})\big) + \sum_{i=1}^{M} \mathrm{KL}\big(q_{\boldsymbol{\phi}}(\mathbf{h}_i \mid \mathbf{x}_i) \,\|\, p(\mathbf{h}_i)\big) \tag{6}$$

More details can be found in Appendix C.
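As a minimal numerical illustration of the bound in (5), consider a toy single-view linear-Gaussian model for which the exact marginal likelihood is available in closed form; a Monte Carlo estimate of the ELBO should never exceed it (up to sampling noise). All names are illustrative, not the paper's.

```python
import numpy as np

def gauss_logpdf(x, mu, var):
    # log density of independent Gaussians, summed over the last axis
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(axis=-1)

def elbo(x, q_mu, q_sigma, W, obs_var, n_samples=20000, seed=0):
    """Monte Carlo ELBO for a toy linear-Gaussian model:
    p(z) = N(0, I), p(x|z) = N(Wz, obs_var * I), q(z|x) = N(q_mu, diag(q_sigma^2))."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_samples, q_mu.size))
    z = q_mu + q_sigma * eps                           # reparameterization
    recon = gauss_logpdf(x, z @ W.T, obs_var).mean()   # E_q[log p(x|z)]
    kl = 0.5 * np.sum(q_sigma ** 2 + q_mu ** 2 - 1.0 - np.log(q_sigma ** 2))
    return recon - kl                                  # reconstruction - KL
```

For this model the exact log-evidence is a Gaussian log-density with covariance $WW^\top + \sigma^2 I$, so the gap between it and the ELBO equals the KL divergence between the approximate and true posteriors.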

We model the variational approximate posteriors by joint multivariate Gaussian distributions with marginal densities $q_{\boldsymbol{\phi}}(\mathbf{z}_i \mid \mathbf{x}_i) = \mathcal{N}(\boldsymbol{\mu}_{\phi_i}, \boldsymbol{\Sigma}_{\phi_i})$ that, for simplicity, are assumed to be elementwise independent for each view, so the covariance matrices $\boldsymbol{\Sigma}_{\phi_i}$ are diagonal, and with the cross-correlation specified by the canonical correlation matrix $\boldsymbol{\Lambda}$. The parameters of these variational posteriors are specified by separate deep neural networks, also called encoders. In this model, a set of encoders is used to output the view-specific moments $(\boldsymbol{\mu}_{\phi_i}, \boldsymbol{\Sigma}_{\phi_i})$, and an extra encoder network describes the cross-correlation $\boldsymbol{\Lambda}$; depending on the application, its input can be either one (or a subset) of the views, when only one (or a subset) of the views is available at test time, *e.g.* in the multi-view setting, or the concatenation of all the views, *e.g.* in the multi-modal setting. Altogether, the inference model is parameterized by $\boldsymbol{\phi}$. Having obtained the moments of the approximate posteriors, we can obtain the canonical directions and subsequently the parameters of the probabilistic CCA model, according to the results presented in Theorem 1. It is worth noting that the diagonal choices for the covariance matrices significantly simplify the algebraic operations, making the SVD computation and matrix inversion required in Theorem 1 trivial.

These types of simplifying assumptions on the approximate posteriors have also been used in various deep variational inference models (rezende2014stochastic; kingma2013VAE). Although the representation power of such a linear latent model is limited, the use of sufficiently flexible deep generative models, which can explain away the complex nonlinear structures in the data, can justify these choices. Consequently, one can also readily verify that the canonical pairs of directions are proportional to the standard basis vectors $\mathbf{e}_j$.

We assume isotropic multivariate Gaussian priors on the latent variables, and since the approximate posteriors are specified by Gaussian distributions, as explained above, the KL divergence terms can be computed in closed form (kingma2013VAE). In the following, we provide an analytical approach to optimally identify the mean of the shared latent variable, $\boldsymbol{\mu}_z$, from the parameters of the model and derive the optimal solution for it.
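For reference, the closed-form KL divergence between a diagonal Gaussian posterior and a standard Gaussian prior, used for each term above, is a standard identity (the function name is our own):

```python
import numpy as np

def kl_to_std_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dims."""
    return 0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - np.log(sigma ** 2))
```

The divergence vanishes exactly when the posterior matches the prior and grows as the mean or variance departs from it.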

###### Lemma 1

I) Rewriting the KL divergences with respect to the terms depending on the means of the latent factors gives rise to the following optimization problem

(7)

where the constant collects the terms not depending on the means. Solving this optimization problem results in the optimal minimizer

(8)

Having obtained the optimal $\boldsymbol{\mu}_z$, one can subsequently compute the means of the view-specific factors.

II) Given similar prior distributions on all view-specific factors, the solutions that minimize the KL divergence term are specified up to the square root of a matrix, *i.e.* a matrix $\mathbf{B}$ such that $\mathbf{B}\mathbf{B}^\top$ equals the target matrix, together with an arbitrary rotation matrix $\mathbf{R}$.

Proof: See Appendix C for the proof.

According to the inference network, the optimal $\boldsymbol{\mu}_z$ obtained by (8) is a function of all the views, and can be viewed as a type of data fusion in the deep feature space, making it an appropriate choice for the multi-modal setting. On the other hand, in the multi-view setting we are interested in a solution that depends only on the primary view available at test time. To deal with this, we can solve a revised version of the optimization problem (7) by ignoring the terms that depend on the non-primary views, which leads to the minimizer

(9)

##### Remark

As an alternative approach in the multi-view setting, one can train the model using the optimal inference based on both views in equation (8) while using the primary-view-based estimate in (9) at test time; however, our empirical studies showed that using the same inference as in (9) at both training and test time offers a richer shared representation, resulting in slightly better performance on the downstream tasks.

##### Remark

Another possible approach is to treat the mean of the shared latent variable as an extra parameter that is directly inferred by a deep NN, but this requires more NN layers to train, and in practice we found this approach less efficient than the proposed optimal procedure.

We further assume that the rotation matrix $\mathbf{R}$ is the identity in the solutions to the probabilistic linear models (3), while leaving it to the deep generative network to approximate the rotation. Specifically, in our neural network architecture we select a fully connected layer as the first layer of the decoder to exactly mimic the rotation matrix.

In summary, the encoders together with the parameterizations of the model in (3) provide a variational inference network for the parameters of the latent probabilistic multi-view model as nonlinear functions of the observations.

##### Drawing samples from the latent variables:

Given the variational parameters of the latent probabilistic CCA model, one can draw samples of the latent factors from the approximate posteriors using a differentiable transformation based on the reparameterization trick (kingma2019introduction_VAE), and generate latent representations that are fed into the decoders to produce samples in the observation space. This procedure produces latent samples that satisfy the conditional independence structure of the probabilistic CCA while being cross-correlated as specified by the variational canonical correlations, and the reconstruction error term is then estimated from these samples.
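A minimal sketch of this sampling step for a pair of views: each coordinate pair keeps its diagonal Gaussian marginals while attaining a prescribed per-coordinate correlation. This particular construction is our own illustration, not necessarily the exact transformation used in the paper.

```python
import numpy as np

def sample_latent_pair(mu1, s1, mu2, s2, lam, n, seed=0):
    """Reparameterized sampling of two correlated diagonal-Gaussian latents:
    marginals N(mu_i, diag(s_i^2)), coordinate-wise correlation lam in (-1, 1)."""
    rng = np.random.default_rng(seed)
    e1 = rng.standard_normal((n, mu1.size))
    e2 = rng.standard_normal((n, mu2.size))
    z1 = mu1 + s1 * e1                                         # first latent
    z2 = mu2 + s2 * (lam * e1 + np.sqrt(1.0 - lam ** 2) * e2)  # correlated pair
    return z1, z2
```

Since `lam * e1 + sqrt(1 - lam**2) * e2` has unit variance, the marginal of the second latent is unchanged while its correlation with the first is exactly `lam`, coordinate by coordinate.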

### 2.3 Related work

To capture nonlinearity in multi-view data, several kernel-based methods have been proposed (Hardoon_2004_KCCA; Bach_2003_KIC). Kernel-based methods in general require large memory to store a massive amount of training data for use at test time, and kernel CCA in particular requires an eigenvalue decomposition that is computationally expensive for large datasets. To overcome this issue, kernel approximation techniques based on random sampling of the training data were proposed in (williams2001nystromKCCA) and (lopez2014randomizedKCCA). Moreover, probabilistic nonlinear multi-view learning was considered in (shon2006learning; damianou2012manifold). As an alternative, deep neural networks (DNNs) offer powerful parametric models that can be trained on large pools of data using recent advances in stochastic optimization. In the multi-view setting, a deep auto-encoder model, called SplitAE, was designed in (ngiam2011multimodalDL), in which an encoder maps the primary view to the latent representation and two decoders are trained so that the reconstruction errors of both views are minimized. On the other hand, classical CCA was extended to deep CCA (DCCA) in (andrew2013DCCA) by replacing the linear transformations of both views with two deep nonlinear NNs and then learning the model parameters by maximizing the cross-correlation between the nonlinear projections. DCCA was then extended to the deep CCA autoencoder (DCCAE) in

(wang2015deepMV), which leverages autoencoders to additionally reconstruct the inputs, hence introducing extra reconstruction error terms into the objective function. While DCCAE could improve representation learning over DCCA, empirical studies showed that it tends to ignore the added reconstruction error terms, resulting in poorly reconstructed views (wang2015deepMV). The training algorithms of such classical CCA-based methods require a sufficiently large batch size to approximate the covariance matrices and the gradients. Moreover, they do not naturally provide an inference model to estimate the shared latent factor, do not enable generating samples from the model in the input space, and are restricted to the two-view setting. In contrast, the reconstruction error terms appear naturally in the objective function of variational inference, the ELBO, so they play a fundamental role in training; richer decoders and reconstructions are therefore expected with the proposed variational autoencoders. Furthermore, stochastic backpropagation with small mini-batches has proven to be a standard and scalable technique for training deep variational autoencoders (rezende2014stochastic). Moreover, the probabilistic multi-view model enables enforcing desired structures such as sparsity (archambeau2009sparse) by adopting a broader range of exponential family distributions for the priors and approximate posteriors on the latent factors, a property that is not immediately apparent in the classical CCA-based variants.

More recently, a variational two-view autoencoder was proposed in (wang2017acousticVCCA; wang2016_deepVCCA) that in principle offers a generative two-view model with a shared representation or shared + view-specific factors. Despite the name of the method, these works fail to draw connections between the proposed two-view models and the canonical directions and the probabilistic CCA interpretation of (bachPCCA), and the inference is not customized beyond black-box variational inference, which can explain why they perform worse than DCCAE in some experimental studies. Moreover, in these models the shared latent representation contributes equally to both views, so these variational two-view methods can be viewed as special cases of the more generic model proposed here; hence, they are expected to offer lower flexibility in modeling two-view datasets compared to deep probabilistic CCA.

##### Remark

Normalizing flows are a technique for specifying a flexible and arbitrarily complex distribution by applying a sequence of invertible transformations to a simple base distribution (rezende_shakir2015_variationalInfWithNormalizingFlows). In the deep probabilistic multi-view model, we can also obtain a more complex approximate posterior by applying a rich normalizing flow, such as those in (karami2019invertibleConF), to the Gaussian-distributed latent variables generated by the latent linear probabilistic multi-view layer. By reducing the gap between the true posterior and its approximation, this technique is expected to provide a more expressive generative model for complex multi-view applications, hence serving as a potential candidate for future studies.

## 3 Experiments

We empirically evaluate the representation learning performance of the proposed method and compare it against several well-established baseline algorithms in two scenarios: I) all views are available during training but only a single view (the primary view) is available at test time, namely the multi-view setting; and II) all views are available at both training and test time, namely the multi-modal setting.

### 3.1 Multi-view experiments

| Method | Error | NMI | ACC |
| --- | --- | --- | --- |
| Linear CCA | 19.6 | 56.0 | 72.9 |
| SplitAE | 11.9 | 69.0 | 64.0 |
| KCCA | 5.1 | 87.3 | 94.7 |
| DCCA | 2.9 | 92.0 | 97.0 |
| DCCAE | 2.2 | 93.4 | 97.5 |
| VCCA | 3.0 | - | - |
| VCCA-private | 2.4 | - | - |
| VPCCA | 1.9 | 94.8 | 98.1 |

Table 1: Results in the multi-view setting, *i.e.* only the primary view is available at test time, so the estimate of equation (9) is used. The results of the variational PCCA method are averaged over 3 trials, while the results of the baseline methods are from (wang2015deepMV; wang2016_deepVCCA). The baseline methods are Linear CCA: linear single-layer CCA; DCCA: deep CCA (andrew2013DCCA); KCCA: randomized kernel CCA approximation with Gaussian RBF kernels and random Fourier features (lopez2014randomizedKCCA); DCCAE: deep CCA autoencoder (wang2015deepMV); VCCA: multi-view variational auto-encoder (wang2016_deepVCCA); VCCA-private: shared-private multi-view variational auto-encoder (wang2016_deepVCCA).

##### Experimental design:

For the experimental study, we used the two-view noisy MNIST dataset of (wang2015deepMV) and (wang2016_deepVCCA), created from MNIST handwritten digits, which consists of grayscale images of size 28 × 28 pixels with pixel values scaled to the range [0, 1]. The first view of the dataset is synthesized by rotating each image at an angle randomly sampled from a uniform distribution, while the second view is randomly sampled from the images with the same identity as the first view, but not necessarily the same image, and is then corrupted by random uniform noise, with the final values truncated to remain in the range [0, 1]. As a result of this procedure, the two views share only the identity (label) of the digit, and not the style of the handwriting, since they are arbitrary images from the same class. The training set is divided into training/validation subsets, and the performance is measured on the images in the test set.

To make a fair comparison, we used neural network architectures with the same capacity as those used in (wang2015deepMV) and (wang2016_deepVCCA). Accordingly, for the deep network models, all the inference networks and decoding networks are composed of 3 fully connected nonlinear hidden layers, with ReLU used as the nonlinearity for all hidden units. The first and second encoders specify the moments of the view-specific posteriors, where the variances are specified by a softplus function, and an extra encoder models the canonical correlations using the sigmoid function as the output gate. Independent Bernoulli distributions and independent Gaussian distributions are selected to specify the likelihood functions of the first and the second view, respectively, with the parameters of each view specified by its own decoder network; sigmoid functions are applied to the outputs to estimate the means of both views, while the variances of the Gaussian variables are specified by softplus functions. To prevent over-fitting, stochastic dropout (srivastava2014dropout) is applied to all layers as a regularization technique, and the ADAM optimizer (kingma2014adam) is adopted for training the parameters of the deep neural networks. The details of the experimental setup can be found in Appendix E.

To evaluate the learned representation, a discriminative task and a clustering task are examined on the shared latent variable. For the discriminative task, the one-versus-one linear SVM classification algorithm is applied to the shared representation. The parameters of the SVM algorithm are tuned using the validation set, and the classification error is measured on the test set. We have also performed the spectral clustering algorithm (vonLuxburg2007tutorialSpectralClustering) on the k-nearest-neighbor graph constructed from the shared representation. To comply with the experiments in (wang2015deepMV), the degree (number of neighbors) of the nodes is tuned using the validation set, and k-means is used as the last step to construct the final partitioning into clusters in the embedding space. The proposed deep probabilistic CCA is compared against the available multi-view methods in terms of the performance of these downstream tasks in Table 1; the results highlight that the proposed variational model significantly improves representation learning on multi-view datasets. 2D embeddings of the shared latent representations using t-SNE are also visualized in Appendix D, which qualitatively verifies that the learned features of images of different classes are well separated.
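The clustering pipeline described above (k-NN graph, normalized-Laplacian embedding, then k-means) can be sketched in plain numpy as follows. This is a simplified stand-in for the cited spectral clustering procedure, with a deterministic farthest-point initialization for the k-means step; all names are our own.

```python
import numpy as np

def spectral_clustering(X, n_clusters, k=10, n_iter=50):
    """Simplified spectral clustering: symmetrized k-NN graph, normalized
    Laplacian embedding, then a small k-means with farthest-point init."""
    n = X.shape[0]
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # pairwise sq. dists
    A = np.zeros((n, n))
    nbrs = np.argsort(D2, axis=1)[:, 1:k + 1]              # skip self
    A[np.arange(n)[:, None], nbrs] = 1.0
    A = np.maximum(A, A.T)                                 # symmetrize
    d = A.sum(axis=1)
    Dm = np.diag(d ** -0.5)
    L = np.eye(n) - Dm @ A @ Dm                            # normalized Laplacian
    _, vecs = np.linalg.eigh(L)
    E = vecs[:, :n_clusters]                               # smallest eigenvectors
    E = E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-12)
    centers = [E[0]]                                       # farthest-point init
    for _ in range(1, n_clusters):
        dist = np.min([((E - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(E[np.argmax(dist)])
    C = np.array(centers)
    for _ in range(n_iter):                                # plain k-means
        lbl = np.argmin(((E[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([E[lbl == c].mean(0) if np.any(lbl == c) else C[c]
                      for c in range(n_clusters)])
    return lbl
```

On well-separated groups of points the k-NN graph decomposes into components, and the Laplacian embedding maps each component to a distinct point, so the final k-means step recovers the partition.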

Repeating the experiments in the multi-modal setting (*i.e.* both views are available at test time) and using equation (8) to recover the mean of the shared latent variable significantly improves the performance of the downstream tasks, confirming the merit of the proposed algorithm in successfully integrating information from different modalities.

### 3.2 Multi-modal clustering

An important application of the proposed deep generative model is in clustering multi-modal datasets, which we evaluate in this set of experiments. Recently, deep multi-modal subspace clustering (abavisani2018multimodal) successfully extended the idea of deep subspace clustering (DSC) (ji2017deepSupspaceClustering) to multiple modalities. The key component of such approaches is a self-expressive layer applied to a non-linear mapping of the data obtained by deep auto-encoders, which represents the projection of each data point as a linear combination of the projections of the other data points. Although they offer significant improvements in clustering performance for datasets lying in non-linear subspaces, such methods require a self-representation coefficient matrix of size N × N, where N is the number of data points, which makes them prohibitively expensive for large datasets.

#### 3.2.1 Datasets

The clustering performance of the proposed method is evaluated on the following standard datasets. Samples from all modalities of these datasets are depicted in Figure 3.

| Method | Digits ACC | Digits NMI | Digits ARI | Yale-B ACC | Yale-B NMI | Yale-B ARI |
| --- | --- | --- | --- | --- | --- | --- |
| CMVFC | 47.6 | 73.56 | 38.12 | 66.84 | 72.03 | 40 |
| TM-MSC | 80.65 | 83.44 | 75.67 | 63.12 | 67.06 | 38.37 |
| MSSC | 81.65 | 85.33 | 77.36 | 80.3 | 82.78 | 50.18 |
| MLRR | 80.6 | 84.13 | 76.53 | 67.62 | 73.36 | 40.85 |
| KMSSC | 84.4 | 89.45 | 79.61 | 87.65 | 81.5 | 63.83 |
| KMLRR | 86.85 | 80.34 | 82.76 | 82.45 | 85.43 | 59.71 |
| DMSC | 95.15 | 92.09 | 90.22 | 99.22 | 98.89 | 98.38 |
| VPCCA | 98.78 | 96.72 | 97.35 | 99.71 | 99.52 | 99.19 |

Handwritten Digits: We chose two well-known handwritten digit datasets, MNIST (MNIST1998) and USPS (hull1994usps), which consist of grayscale digit images of size 28 × 28 pixels and 16 × 16 pixels, respectively. To make a multi-modal dataset, each digit image in the MNIST dataset is paired with an arbitrary sample of the same digit identity from the USPS dataset. This process guarantees that the images of the two modalities share only the identity (label) of the digit and not the style of the handwriting. The handwritten digit datasets are used for single-modal training and also for multi-modal subspace clustering.

Multi-modal Facial Components: We also evaluate the proposed method on the multi-modal facial dataset used in (abavisani2018deep), where the Extended Yale-B dataset (lee2005YaleB) was used as the base and 4 facial components were extracted by cropping the eyes, nose and mouth, forming 5 different modalities together with the whole face image. All modalities are resized to images of a common size. This dataset is composed of 64 frontal images of 38 individuals under different illuminations and is a standard dataset in subspace clustering studies. For this multi-modal data, we train the general deep probabilistic multi-view model that extends deep probabilistic CCA to an arbitrary number of views (4) (Appendix B).

##### Experimental design:

To make a fair comparison, in this set of experiments the encoders and decoders are built of neural networks with architectures similar to those used in (abavisani2018deep), except that our model does not require the self-expressive layer, a linear fully connected layer with a parameter matrix of N × N coefficients. This key advantage of the proposed model significantly reduces the total number of parameters, especially for large input sizes.

Accordingly, the encoders (inference networks) of all modalities are composed of convolutional NN (CNN) layers, while the decoders (observation networks) are built of transposed convolution layers. ReLU is used as the nonlinearity for all the hidden units of the deep networks. The encoders specify the moments of the view-specific posteriors, where the variances are modeled by a softplus function. An extra encoder network models the canonical correlations using the sigmoid function as the output gate. The observation likelihood functions of all the views are modeled by independent Bernoulli distributions with the mean parameters specified by the decoder networks; sigmoid functions are applied to estimate valid means of the distributions. To train the parameters of the deep generative model, we used ADAM optimization (kingma2014adam) with default hyper-parameters and a minibatch size of 200 data points. Details of the model architecture and experimental setup, together with more empirical results, are presented in Appendix E.

Clustering is performed on the shared latent factor using the spectral clustering algorithm (vonLuxburg2007tutorialSpectralClustering) on the k-nearest-neighbor graph. As the last step, spectral clustering discretizes the real-valued representation in the embedding space to extract the final partitioning. Clustering performance is measured using the clustering accuracy rate (ACC), normalized mutual information (NMI) (cai2005documentclustering) and adjusted Rand index (ARI) (rand1971metricARI) as performance metrics.
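For completeness, the NMI metric used above can be computed directly from its definition. This sketch uses arithmetic-mean normalization of the entropies, which may differ from the exact variant in (cai2005documentclustering); the function name is our own.

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """Normalized mutual information between two labelings,
    with arithmetic-mean normalization of the entropies."""
    a, b = np.asarray(labels_true), np.asarray(labels_pred)
    n = a.size
    ca, cb = np.unique(a), np.unique(b)
    # contingency table of joint label counts
    C = np.array([[np.sum((a == i) & (b == j)) for j in cb] for i in ca], float)
    pij = C / n
    pi = pij.sum(axis=1, keepdims=True)   # marginal of true labels
    pj = pij.sum(axis=0, keepdims=True)   # marginal of predicted labels
    nz = pij > 0
    mi = (pij[nz] * np.log(pij[nz] / (pi @ pj)[nz])).sum()

    def ent(p):
        p = p[p > 0]
        return -(p * np.log(p)).sum()

    denom = 0.5 * (ent(pi.ravel()) + ent(pj.ravel()))
    return mi / denom if denom > 0 else 1.0
```

NMI is invariant to permutations of the cluster labels, which is what makes it suitable for comparing a clustering against ground-truth classes.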

The clustering performance of the proposed method is compared against the well-established subspace clustering methods TM-MSC (zhang2015TMMSC), CMVFC (cao2015constrained), MSSC, MLRR, KMSSC, KMLRR (abavisani2018multimodal), and DMSC (abavisani2018deep), which serve as the baseline methods for the multi-modal setting. The results summarized in Table 3 show that the proposed deep generative model sets a new state of the art, which highlights that the proposed method can efficiently leverage the extra modalities and extract the common underlying information among them, namely the cluster memberships.

## 4 Conclusion

In this work, we developed a simple yet powerful tool for multi-view learning based on the probabilistic interpretation of CCA. A deep generative probabilistic model for multi-view data was studied. We showed that, by following the theoretical formulation of the linear probabilistic CCA model in conjunction with variational inference principles for deep generative networks, we can obtain a scalable end-to-end learning algorithm for multi-view data. Experimental results have shown that this model efficiently integrates the relationship between multiple views to obtain a more powerful representation, which achieved state-of-the-art performance on several downstream tasks. These results suggest that the proposed method is a proper way of extending variational inference to deep probabilistic multi-view learning.

## References

## Appendix A Proof of theorem 1

The marginal mean and covariance matrix of the joint views under the linear probabilistic model (2) are and , where we define and ; therefore, similar to the proof in (bachPCCA), the negative log-likelihood of the data can be written as

Maximizing with respect to results in a maximum , and the negative log-likelihood reduces to

The rest of the proof follows immediately along the lines of the proof in (bachPCCA).

## Appendix B Generalization of probabilistic CCA

As an extension of probabilistic CCA to an arbitrary number of views, (archambeau2009sparse) proposed the following general probabilistic model:

(10)

This model can also be viewed as multibattery factor analysis (MBFA) (browne1980MBFA; klami2014GroupFactorAnalysis), which describes the dependency between all the views by a single latent vector, , and explains away the view-specific variations by factors private to each view. Moreover, when limited to a single view, this model also includes probabilistic factor analysis and probabilistic PCA as special cases, if the view-specific dimensions are independent or isotropic, respectively (archambeau2009sparse).
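In standard notation (the symbols below are our assumption, following the conventions of archambeau2009sparse and klami2014GroupFactorAnalysis), an MBFA-style model for views $x_m$, $m = 1, \dots, M$, takes the form

```latex
z \sim \mathcal{N}(0, I), \qquad
z_m \sim \mathcal{N}(0, I), \qquad
x_m \mid z, z_m \sim \mathcal{N}\!\left(W_m z + B_m z_m + \mu_m,\; \sigma_m^2 I\right),
```

where the shared factor $z$ couples all views and the private factors $z_m$ absorb view-specific variation.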

Explaining the view-specific variations by the variance matrices for each view, we can represent the probabilistic multi-view model as

(11)

where the latent factor captures the common variations between all the views, hence describing the essence of the multi-view data. On the other hand, since the cluster memberships can be considered common information across all the views, this shared underlying representation is well suited for subspace clustering in the multi-view setting. Let be the mean and covariance matrices of . Inspired by the maximum likelihood solution of probabilistic CCA in Theorem 1 and (bachPCCA), which reformulates the parameter estimates of the probabilistic model in terms of the classical CCA solutions, we propose the following system of equations for the parameters of the probabilistic multi-view model

(12)

where is diagonal with entries and are composed of orthonormal vectors . To simplify the model, we assume . The equations in (B) reduce to the maximum likelihood estimates of PCCA for views; hence, they can be viewed as an extension of PCCA to settings with more than two views. Defining the correlation matrix as , the equations in (B) imply that and are formed by the singular values and singular vectors of the correlation matrix, respectively, *i.e.*, ; therefore, analogous to the ML solution of PCCA, and can be interpreted as the matrices of canonical correlations and canonical directions. This also implies that all pairs of views share the same correlation matrix.

In Section 2.2, a closed-form solution to infer based on the other variational statistics of the model is presented. We also provide simple treatments for and . As a consequence, given the first- and second-order moments of the views together with the diagonal matrix of canonical correlations , one can infer the rest of the parameters of the multi-view generative model in (B). This, in fact, simplifies the variational inference network to learning just a compact set of parameters.

It is worth noting that, although the parametrization of the model is based on a single shared latent factor (and a single correlation matrix to explain the relationship between all the views), the contribution of the shared factor to a given view is controlled by the factor loading, which is, in turn, a function of and , a view-specific parameter. Therefore, the shared factor does not influence the views equally; instead, its effect on each view varies with the strength of its projection, which results in dissimilar cross-covariances for different pairs of views and hence offers the flexibility to model uneven dependencies between the views.

## Appendix C Some proofs of section 2.2

### c.1 Proof of additive property of KL (2.2)

Conditional independence of the latent variables induced by the probabilistic graphical model of latent linear layer (2) implies that the approximate posterior of the set of latent variables can be factorized as

(13)

In addition, assuming independent prior distributions on the latent variables, *i.e.*, , leads to
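Under such a factorization of both the approximate posterior and the prior, the KL divergence decomposes additively over the latent variables (the variable names below are our assumption, since the original symbols are not reproduced here):

```latex
\mathrm{KL}\big(q(z, \epsilon_{1:M}) \,\|\, p(z, \epsilon_{1:M})\big)
  = \mathrm{KL}\big(q(z) \,\|\, p(z)\big)
  + \sum_{m=1}^{M} \mathrm{KL}\big(q(\epsilon_m) \,\|\, p(\epsilon_m)\big)
```

This follows directly because the log-density ratio of two product distributions is a sum of per-factor log-density ratios.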

### c.2 Proof of Lemma 1

Assuming isotropic multivariate Gaussian priors on the latent variables, , and specifying the approximate posteriors as Gaussian-distributed vectors with diagonal covariances results in closed-form expressions for the KL divergence terms (kingma2013VAE) as
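For a diagonal Gaussian posterior against a standard-normal prior, this closed form is the familiar one from (kingma2013VAE); a short numpy check, cross-validating it against the general multivariate Gaussian KL formula (the dimensions and sampled parameters are illustrative):

```python
import numpy as np

def kl_diag_std_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ), closed form (kingma2013VAE)."""
    return 0.5 * np.sum(var + mu**2 - 1.0 - np.log(var))

def kl_full_gaussian(mu0, S0, mu1, S1):
    """General KL between multivariate Gaussians, used as a cross-check."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff
                  - d + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

rng = np.random.default_rng(0)
mu = rng.standard_normal(5)
var = rng.uniform(0.5, 2.0, 5)

kl_a = kl_diag_std_normal(mu, var)
kl_b = kl_full_gaussian(mu, np.diag(var), np.zeros(5), np.eye(5))
```

The two computations agree to numerical precision, and the KL vanishes exactly when the posterior equals the prior.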

Splitting the terms in the KL divergence into those depending on the mean variables and the remaining ones yields

Now, solving this constrained optimization problem using the method of Lagrange multipliers leads to the optimal minimizer

This provides an analytical approach to optimally recover from the parameters of the model.

## Appendix D 2-D embedding

Figure 4 depicts the 2D t-SNE embeddings of the shared latent representations for the multi-view and multi-modal settings, respectively.
They verify that the representations of images from different classes are well separated in the shared latent space.

## Appendix E Model architecture and training procedure

### e.1 Two-view noisy MNIST experiments

The parameters of each algorithm are tuned through cross-validation with a grid search over the variance of the shared representation and an equal variance for the residual errors in the range . Results are averaged over 3 trials.

The dimensionality of the shared representation was and the dimensionality of the latent factors were .

Weight decay of 0.0001 was applied as the regularization for all the parameters of NNs.

### e.2 Multi-modal clustering experiments

digits: The dimensionality of the shared representation was and the dimensionality of the latent factors were .

The parameters of each algorithm are tuned through cross-validation with a grid search over the variance of the shared representation and an equal variance for the residual errors in the range .

Yale-B facial components: The dimensionality of the shared representation was and the dimensionality of the latent factors were .

The parameters of each algorithm are tuned through cross-validation with a grid search over the variance of the shared representation and an equal variance for the residual errors in the range .

In both experiments, a weight decay of 0.0001 was applied as the regularization for all parameters of the NNs.
