I. Introduction
In many real-world problems, more than one set of features, referred to as views of the data, is available. For example, a web page can be represented by its text, images, and metadata. Multiple views can improve the performance of many learning tasks because each view can provide information complementary to the others, and learning from all views can maximally exploit the available information. In particular, multi-view dimension reduction has proven effective for learning from high-dimensional multi-view data [1] such as image and text in image processing [2], speech and video in speech processing [3], and multilingual texts in language processing [4].
Compared to traditional multi-view dimension reduction, multi-view dimension reduction using deep networks has achieved state-of-the-art results [5]. Projections are learned that map all views to a common feature space where view information is retained and fused. The new space can enable or improve learning algorithms that are inapplicable or inferior in multiple high-dimensional spaces. However, low-dimensional representations learned in this way use no labeled data and are thus not sufficiently discriminative for end tasks such as classification and clustering. Discriminative multi-view dimension reduction based on CCA [6], topic models [7], and the information bottleneck [8] can learn representations that not only unify different views through dimensionality reduction but also discriminate between classes. However, as the networks become deeper, more parameters need to be learned, and a larger amount of labeled data is required, which is not readily available in many applications. The high cost of obtaining labeled data, along with the growing size of unlabeled data, has driven the development of semi-supervised learning, which combines labeled and unlabeled data to mitigate this issue. However, there is still no semi-supervised deep discriminative method for multi-view dimension reduction.
We propose MDNN (Multi-view Discriminative Neural Network) for this purpose, using only a small amount of labeled data and a large amount of unlabeled data. MDNN maximizes between-class separation and minimizes within-class variation, leveraging label information for discriminativeness. MDNN consists of a pair of parallel neural networks coupled by a shared layer on top of their last layers (see Fig. 1). The model is trained jointly to find view-specific nonlinear transformations, which are then used to project samples into the common space. MDNN not only projects paired instances from different views close together in the same space (maximal correlation), but also projects instances from different classes far from each other (inter-class separation) and instances with the same class label close to each other (minimal intra-class variation).
To the best of our knowledge, MDNN is the first deep semi-supervised representation learning method for multi-view problems that has all of the following properties in a single unified model: (i) it yields a discriminative feature representation; (ii) it uses the complementary information of other views to exploit the information in unlabeled data; and (iii) it achieves the above using a large amount of unlabeled data to help learning with only a small amount of labeled data. We evaluate MDNN on four multi-view datasets, namely Noisy MNIST, WebKB, FOX, and CNN, and compare it to state-of-the-art baselines. The proposed model is evaluated in a cross-view learning setting, where two views are available during training but only one of them is available at test time. Experimental results demonstrate that the proposed algorithm outperforms all the baselines in terms of accuracy, especially when only a limited number of labeled samples is available. The remainder of this paper is organized as follows. In Section II, an overview of related previous work is given. The proposed algorithm is discussed in detail in Section III, and experiments are presented in Section IV. Section V concludes the paper.
II. Previous Works
In Table I, the capabilities of different multi-view dimension reduction models are compared. The proposed algorithm (MDNN) is the only one that enjoys all of them. CCA is a well-known dimension reduction technique for data with two views [9, 10, 2, 11, 3]. It finds two linear transformations that project the views to a common feature space so that the correlation between the views is maximized. However, CCA lacks the nonlinearity needed to model nonlinear data. Kernel CCA (KCCA) extends CCA to find nonlinear projections for both views [12]. KCCA requires the training data during testing and does not scale easily to large datasets. More recently, several deep neural network (DNN) based algorithms have been proposed for nonlinear feature representation learning in multi-view problems [13, 14].
A deep model for CCA estimation, referred to as deep CCA (DCCA), has also been proposed [15, 16]. Like CCA, DCCA is a parametric approach that scales to large datasets; like KCCA, it can model nonlinearity in the data. Nonetheless, linear CCA, KCCA, and deep CCA are unsupervised feature learning techniques: they cannot exploit labels available (if any) during representation learning. The learned low-dimensional representations thus lack the class discriminativeness that is critical to the success of end tasks such as classification and clustering. Discriminative representation learning from a single view using deep networks or topic models has also been explored [8, 7], but learning a discriminative representation from multi-view data can require more labeled data as the number of views and network layers increases. While semi-supervised techniques for deep learning have been explored [17, 18] to use a large amount of unlabeled data to mitigate the lack of labeled data, semi-supervised discriminative multi-view learning has not been studied and is the focus of this paper.

Maximizing between-class separation while minimizing within-class variation has been widely used in many learning algorithms, such as Fisher's Linear Discriminant Analysis (FLDA) [19]. However, FLDA is a linear technique, and although kernel-based versions of LDA (KLDA) have been proposed [12], they suffer from drawbacks similar to those of CCA and KCCA, such as poor scalability and fixed kernels. A version of LDA based on neural networks has also been introduced [20]. However, all these studies work with only a single view and do not benefit from the noise robustness of CCA-based techniques, which results from maximizing the correlation between views.
III. The Proposed Algorithm
The schematic representation of the proposed model for two views is shown in Fig. 1. MDNN comprises two deep neural networks (one for each view) coupled by a shared layer (the inter-view layer); more networks can be coupled to handle more views. Both networks are trained jointly to find view-specific nonlinear transformations that map the input views to a common feature space. The inter-view layer encourages correlation between the views and is responsible for exploiting the information in both views of labeled and unlabeled data: all views of a single instance are projected as near to each other as possible. Moreover, two objectives are imposed on the output layer of each view independently to make the new space discriminative. This is achieved by maximizing intra-view discrimination using the labeled data: instances of the same class in one view are mapped close together, whereas instances of different classes are mapped far apart. Such properties make all views of each instance highly correlated and instances of different classes easily separable.
These two parts of the model work in a joint manner to learn the desired representation from all the labeled and unlabeled data, and each can be considered as a regularizer for the other during the subspace learning. We train our model with backpropagation to learn two nonlinear transformations through optimizing the introduced objective functions. After training, the network is employed to map multiple views of data to a common lowdimensional space, where classifiers can be trained.
The purpose of using an independent network for each view is to learn low-level view-specific representations according to the properties of each view. Thus, the architecture of each network, such as the type or number of layers, can be adjusted to the view's properties. In addition, representations obtained from the higher levels of the networks are more likely to reveal the views' statistical properties than the original inputs [21].
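As an illustrative sketch (ours, not the authors' implementation), the two parallel branches can be written as plain feed-forward networks with ReLU hidden layers and a linear output layer; the layer sizes, seeds, and function names below are assumptions:

```python
import numpy as np

def init_mlp(sizes, seed):
    """He-initialized weights and zero biases for each layer."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / n), np.zeros((m, 1)))
            for n, m in zip(sizes[:-1], sizes[1:])]

def forward(params, X):
    """ReLU hidden layers, linear last layer.
    X: (d, m) batch of one view; returns a (k, m) embedding."""
    H = X
    for i, (W, b) in enumerate(params):
        H = W @ H + b
        if i < len(params) - 1:
            H = np.maximum(H, 0.0)     # ReLU on hidden layers only
    return H

# one branch per view; architectures may differ per view's properties
net1 = init_mlp([784, 256, 256, 10], seed=0)   # hypothetical sizes
net2 = init_mlp([784, 256, 256, 10], seed=1)
X1 = np.random.default_rng(3).random((784, 32))
X2 = np.random.default_rng(4).random((784, 32))
H1, H2 = forward(net1, X1), forward(net2, X2)
print(H1.shape, H2.shape)   # both views land in the same 10-dim space
```

The two embeddings `H1` and `H2` are what the shared layer and the objectives of Section III-B operate on.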
III-A. Deep Model Definition
For a two-view problem, the training set is represented as $X = \{(x_i^1, x_i^2)\}_{i=1}^{N}$, where $(x_i^1, x_i^2)$ is a training sample with views $x_i^1 \in \mathbb{R}^{d_1}$ and $x_i^2 \in \mathbb{R}^{d_2}$ of dimensions $d_1$ and $d_2$, respectively. $N$ is the total number of training pairs, consisting of labeled and unlabeled pairs. The label set for the labeled samples is denoted by $Y$.
We aim to learn two nonlinear view-specific functions $f_1(\cdot;\theta_1)$ and $f_2(\cdot;\theta_2)$ that map the given paired views to a common embedding space. Slightly abusing the notation, inputs to the first layers of the networks for the two views are batches of $m$ samples denoted by $X_1$ and $X_2$, and the hidden representations output by the last layers right before the shared layer are denoted by $H_1 = f_1(X_1;\theta_1)$ and $H_2 = f_2(X_2;\theta_2)$. Parameters $\theta_1$ and $\theta_2$ are the parameters of the two networks, respectively (see Fig. 2).

III-B. Objective Function
To learn a discriminative representation for more effective classification, we define the objective function

$$\max_{\theta_1, \theta_2}\; f_c\!\left(H_1, H_2\right) + \alpha \left( f_d\!\left(H_1\right) + f_d\!\left(H_2\right) \right) - \lambda \left( \lVert \theta_1 \rVert_2^2 + \lVert \theta_2 \rVert_2^2 \right) \quad (1)$$

where the function $f_c$ maximizes the inter-view correlation between the samples in the new space and the functions $f_d$ encourage discriminative subspaces. The term with regularization parameter $\lambda$ is added to regularize the networks. The parameter $\alpha$ specifies the trade-off between the importance of the inter-view correlation and intra-view discrimination properties in the new space.
We define the function $f_c$ based on CCA, which maps multiple views of samples into a new space where the paired views of each sample are highly correlated under linear transformations. It has been shown that the orthogonality of the learned dimensions is critical to effective representations of multiple views [6]. Considering the outputs of the two branches of MDNN as two sets of variables, denoted by $H_1$ and $H_2$, CCA maximizes their correlation
$$\max_{u, v}\; \operatorname{corr}\!\left(u^{\top} H_1,\, v^{\top} H_2\right) = \max_{u, v}\; \frac{u^{\top} \Sigma_{12}\, v}{\sqrt{u^{\top} \Sigma_{11}\, u \;\; v^{\top} \Sigma_{22}\, v}} \quad (2)$$

where $\Sigma_{12}$ is the covariance matrix of $H_1$ and $H_2$:
$$\Sigma_{12} = \frac{1}{m-1}\, \bar{H}_1 \bar{H}_2^{\top} \quad (3)$$

where $\bar{H}_1$ and $\bar{H}_2$ are the centered matrices of $H_1$ and $H_2$, respectively.
Vectors $u$ and $v$ are the two linear transformation vectors that map $H_1$ and $H_2$ to a maximally correlated new space. Since the correlation is invariant to the scaling of the transformation vectors $u$ and $v$, the objective can be written as a constrained optimization problem as follows

$$\max_{u, v}\; u^{\top} \Sigma_{12}\, v \quad \text{s.t.} \quad u^{\top} \Sigma_{11}\, u = v^{\top} \Sigma_{22}\, v = 1 \quad (4)$$
We need to find further transformation vectors whose projections are uncorrelated with the previous ones. The constrained problem for all transformation vectors is

$$\max_{U, V}\; \operatorname{tr}\!\left(U^{\top} \Sigma_{12}\, V\right) \quad \text{s.t.} \quad U^{\top} \Sigma_{11}\, U = V^{\top} \Sigma_{22}\, V = I \quad (5)$$
where the matrices $U$ and $V$ contain the transformation vectors as columns. Note that there are several ways to solve such optimization problems. It is shown in [22] that the sum of the largest singular values of

$$T = \Sigma_{11}^{-1/2}\, \Sigma_{12}\, \Sigma_{22}^{-1/2}$$

gives the maximal value of (5), and the corresponding singular vectors are the optimal projection directions. The sum of all singular values can be estimated by the Frobenius norm of $T$ as follows:

$$f_c = \lVert T \rVert_F = \sqrt{\operatorname{tr}\!\left(T^{\top} T\right)} \quad (6)$$
All covariance matrices in (5) are regularized by a small positive constant $r$ to ensure that the matrices are positive definite:

$$\Sigma_{11} = \frac{1}{m-1}\, \bar{H}_1 \bar{H}_1^{\top} + rI, \qquad \Sigma_{22} = \frac{1}{m-1}\, \bar{H}_2 \bar{H}_2^{\top} + rI \quad (7)$$
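As a concrete illustration (ours, not the authors' code), the regularized correlation objective of (6) and (7) can be computed from a minibatch of two-view network outputs as follows; the array names and the value of `r` are assumptions:

```python
import numpy as np

def cca_correlation(H1, H2, r=1e-4):
    """Estimate f_c = ||T||_F for T = S11^{-1/2} S12 S22^{-1/2},
    with regularized covariance estimates from a minibatch.
    H1, H2: (k, m) arrays of network outputs (features x samples)."""
    m = H1.shape[1]
    H1c = H1 - H1.mean(axis=1, keepdims=True)   # center each feature
    H2c = H2 - H2.mean(axis=1, keepdims=True)
    S12 = H1c @ H2c.T / (m - 1)
    S11 = H1c @ H1c.T / (m - 1) + r * np.eye(H1.shape[0])  # regularized, as in (7)
    S22 = H2c @ H2c.T / (m - 1) + r * np.eye(H2.shape[0])

    def inv_sqrt(S):
        # inverse matrix square root of a symmetric PD matrix
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return np.linalg.norm(T, 'fro')             # Frobenius-norm surrogate (6)

rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 200))
H1 = Z + 0.1 * rng.standard_normal((5, 200))   # two noisy views of the same signal
H2 = Z + 0.1 * rng.standard_normal((5, 200))
print(cca_correlation(H1, H2))   # close to sqrt(5) for 5 highly correlated dims
```

Since each canonical correlation is at most 1, the value is bounded above by the square root of the embedding dimension, which the toy example with two noisy copies of the same signal nearly attains.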
We define the function $f_d$ based on the two criteria of inter-class separation and intra-class variation to learn transformations that lead to a discriminative feature space. Inter-class separation measures how far apart instances from different classes are; intra-class variation measures how close instances from the same class are to each other. Generally, intra-class variation should be minimized while inter-class separation should be maximized to obtain a discriminative feature space.
The intra-class criterion (also referred to as the within-class scatter matrix) for a set of labeled samples $\{h_i^v\}$ from view $v$ is defined as

$$S_w^v = \sum_{c=1}^{C} \sum_{i:\, y_i = c} \left(h_i^v - \mu_c^v\right)\left(h_i^v - \mu_c^v\right)^{\top} \quad (8)$$

where $C$ is the number of classes and $h_i^v$ is the $i$-th instance of view $v$ in the new space. The vector $\mu_c^v$ denotes the mean of the samples from class $c$ for view $v$.
The inter-class criterion (also referred to as the between-class scatter matrix) for the same set of samples can be defined as follows

$$S_b^v = \sum_{c=1}^{C} N_c \left(\mu_c^v - \mu^v\right)\left(\mu_c^v - \mu^v\right)^{\top}$$

where $N_c$ is the number of labeled samples from class $c$ and $\mu^v$ is the mean of all labeled samples for view $v$. These two criteria can be merged into a single optimization problem as:

$$f_d^v = \operatorname{tr}\!\left( \left(S_w^v + \epsilon I\right)^{-1} S_b^v \right)$$

where $f_d^v$ measures the discriminativeness of the learned space for the labeled samples of view $v$. The parameter $\epsilon$ is a regularization constant that increases the stability of the inverse operation. Maximizing $f_d^v$ leads to simultaneously maximizing the between-class separation and minimizing the within-class variation, yielding a discriminative feature space.
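A minimal sketch of this trace-ratio criterion (our illustration; the function name, `eps` value, and toy data are assumptions) shows that tightly clustered, well-separated classes score higher than loosely clustered ones:

```python
import numpy as np

def discriminative_criterion(H, y, eps=1e-3):
    """f_d = tr((S_w + eps*I)^{-1} S_b) for labeled outputs of one view.
    H: (k, n) projected labeled samples; y: (n,) integer class labels."""
    k, n = H.shape
    mu = H.mean(axis=1, keepdims=True)                   # overall mean
    Sw = np.zeros((k, k))
    Sb = np.zeros((k, k))
    for c in np.unique(y):
        Hc = H[:, y == c]
        mu_c = Hc.mean(axis=1, keepdims=True)            # class mean
        D = Hc - mu_c
        Sw += D @ D.T                                    # within-class scatter (8)
        Sb += Hc.shape[1] * (mu_c - mu) @ (mu_c - mu).T  # between-class scatter
    return np.trace(np.linalg.solve(Sw + eps * np.eye(k), Sb))

rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)
tight = np.concatenate([rng.normal(0, 0.1, (2, 50)), rng.normal(3, 0.1, (2, 50))], axis=1)
loose = np.concatenate([rng.normal(0, 2.0, (2, 50)), rng.normal(3, 2.0, (2, 50))], axis=1)
print(discriminative_criterion(tight, y) > discriminative_criterion(loose, y))
```

The same class means at distance 3 give a much larger criterion when the within-class spread is small, which is exactly the behavior the objective rewards.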
III-C. Optimization
To optimize the objective function, we find the optimal values of all parameters of both networks, i.e., $\theta_1$ and $\theta_2$, using stochastic gradient descent (SGD). To use SGD, we split the samples into labeled and unlabeled minibatches. In labeled batches, labeled samples from each class are present in proportion to their ratio in the whole dataset.
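The proportional labeled-batch construction described above can be sketched as follows; this is our illustration, and the function name and round-robin dealing strategy are assumptions rather than the authors' exact procedure:

```python
import numpy as np

def stratified_label_batches(y, batch_size, seed=0):
    """Split labeled indices into minibatches where each class appears
    roughly in proportion to its share of the labeled data."""
    rng = np.random.default_rng(seed)
    n_batches = int(np.ceil(len(y) / batch_size))
    batches = [[] for _ in range(n_batches)]
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        rng.shuffle(idx)
        # deal this class's shuffled indices round-robin across batches
        for j, i in enumerate(idx):
            batches[j % n_batches].append(i)
    return [np.array(b) for b in batches]

y = np.array([0] * 60 + [1] * 30 + [2] * 10)
for b in stratified_label_batches(y, batch_size=25):
    print(np.bincount(y[b], minlength=3))   # each batch keeps roughly 60/30/10%
```

Every labeled sample lands in exactly one batch, and each batch preserves the 60/30/10 class ratio up to rounding.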
We estimate the gradient of the objective with respect to the network outputs $H_1$ and $H_2$ so that the backpropagation technique can be used; backpropagation then computes the remaining gradients needed to update the network parameters $\theta_1$ and $\theta_2$.
If the singular value decomposition of the matrix $T$ in $f_c$ is $T = U D V^{\top}$, then, following [15], the gradient of $f_c$ with respect to $H_1$ can be estimated as

$$\frac{\partial f_c}{\partial H_1} = \frac{1}{m-1}\left( 2\,\nabla_{11}\, \bar{H}_1 + \nabla_{12}\, \bar{H}_2 \right) \quad (9)$$

where $\nabla_{12} = \Sigma_{11}^{-1/2}\, U V^{\top}\, \Sigma_{22}^{-1/2}$, $\nabla_{11} = -\frac{1}{2}\, \Sigma_{11}^{-1/2}\, U D U^{\top}\, \Sigma_{11}^{-1/2}$, and $m$ denotes the total number of samples in the batch. A similar expression holds for the gradient with respect to $H_2$. More detail on calculating this gradient can be found in [15].
Calculating the gradient of $f_d$ is not trivial. Similar variants of $f_d$ have been investigated in other papers [23, 20]; in most cases, the optimization problem is tackled by reformulating it as a generalized eigendecomposition problem. We avoid such a reformulation because, in our experiments, it increased the training instability of the neural networks. We therefore optimize $f_d$ directly, following [24].
In the following derivations, we write $S_w$ and $S_b$ for $S_w^v + \epsilon I$ and $S_b^v$ of view $v$, respectively. If sample $h_i$ belongs to class $c$, then the gradients of the scatter matrices with respect to the $k$-th component of $h_i$ are given by (10) and (11):

$$\frac{\partial S_w}{\partial (h_i)_k} = e_k \left(h_i - \mu_c\right)^{\top} + \left(h_i - \mu_c\right) e_k^{\top} \quad (10)$$

$$\frac{\partial S_b}{\partial (h_i)_k} = e_k \left(\mu_c - \mu\right)^{\top} + \left(\mu_c - \mu\right) e_k^{\top} \quad (11)$$

where $e_k$ is the $k$-th standard basis vector and $\mu$ is the mean of all labeled samples. By the chain rule, the gradient of the discriminative objective function is

$$\frac{\partial f_d}{\partial (h_i)_k} = \operatorname{tr}\!\left( S_w^{-1}\, \frac{\partial S_b}{\partial (h_i)_k} \right) + \operatorname{tr}\!\left( \frac{\partial S_w^{-1}}{\partial (h_i)_k}\, S_b \right) \quad (12)$$

As we have the following [25]

$$\frac{\partial S_w^{-1}}{\partial (h_i)_k} = -\, S_w^{-1}\, \frac{\partial S_w}{\partial (h_i)_k}\, S_w^{-1} \quad (13)$$

we can rewrite the gradient by using (10) and (11) as

$$\frac{\partial f_d}{\partial h_i} = 2\, S_w^{-1} \left(\mu_c - \mu\right) - 2\, M \left(h_i - \mu_c\right) \quad (14)$$

where

$$M = S_w^{-1}\, S_b\, S_w^{-1} \quad (15)$$
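As a sanity check (ours, not from the paper), a closed-form gradient of the criterion $\operatorname{tr}(S_w^{-1} S_b)$ in the form of (14) and (15) can be compared against a finite-difference estimate; all function names and the toy data are illustrative:

```python
import numpy as np

def scatter(H, y, eps=1e-3):
    """Regularized within-class and between-class scatter matrices."""
    k = H.shape[0]
    mu = H.mean(axis=1, keepdims=True)
    Sw, Sb = eps * np.eye(k), np.zeros((k, k))
    for c in np.unique(y):
        Hc = H[:, y == c]
        mu_c = Hc.mean(axis=1, keepdims=True)
        Sw += (Hc - mu_c) @ (Hc - mu_c).T
        Sb += Hc.shape[1] * (mu_c - mu) @ (mu_c - mu).T
    return Sw, Sb

def grad_fd(H, y, i, eps=1e-3):
    """Closed-form gradient of tr(Sw^{-1} Sb) w.r.t. sample i, as in (14)."""
    Sw, Sb = scatter(H, y, eps)
    Swi = np.linalg.inv(Sw)
    M = Swi @ Sb @ Swi                                    # as in (15)
    c = y[i]
    mu = H.mean(axis=1, keepdims=True)
    mu_c = H[:, y == c].mean(axis=1, keepdims=True)
    return (2 * Swi @ (mu_c - mu) - 2 * M @ (H[:, [i]] - mu_c)).ravel()

# central finite differences on one sample agree with the closed form
rng = np.random.default_rng(0)
H = rng.standard_normal((3, 12))
y = np.repeat([0, 1, 2], 4)
f = lambda A: np.trace(np.linalg.inv(scatter(A, y)[0]) @ scatter(A, y)[1])
num = np.zeros(3)
for k in range(3):
    Hp, Hm = H.copy(), H.copy()
    Hp[k, 5] += 1e-6
    Hm[k, 5] -= 1e-6
    num[k] = (f(Hp) - f(Hm)) / 2e-6
print(np.allclose(num, grad_fd(H, y, 5), atol=1e-4))
```

The agreement holds because the dependence of the class and overall means on a single sample cancels in the scatter-matrix derivatives.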
More detail can be found in [24]. The gradient of the total objective function is used to train both networks simultaneously with the backpropagation algorithm. It is necessary to train the model with minibatches because the objective function is defined on the properties of the whole space, not of a single instance; at each step, we therefore need a batch of samples to optimize the objective function.
IV. Experiments
In this section, we present the experimental evaluation and analysis of MDNN. All experiments are performed in a cross-view classification setting; however, the approach can be extended to other tasks such as cross-modal image and text retrieval [26].
IV-A. Datasets
We evaluate the proposed algorithm on the following four datasets. A summary of the datasets is presented in Table II.
Noisy MNIST: Noisy MNIST is a noisy version of the well-known MNIST dataset of handwritten digit images. Following [6], a two-view version of MNIST was created for evaluating multi-view problems by rotating the images and adding random noise. Each image was rotated by an angle drawn from a uniform distribution over a fixed range, and the resulting images were used as the first view. For each image, another image from the same class was selected at random as the second view, and uniform noise samples were added to each of its pixels. The noisy MNIST dataset contains 70K grayscale images of digits 0 to 9; a 60,000/10,000 train/test split is used in the experiments. Two examples of this dataset are shown in Fig. 3.

Web Knowledge Base (WebKB) (http://vikas.sindhwani.org/manifoldregularization.html): This is a collection of 1,051 web documents crawled from four universities [27]. The data has two classes: course and non-course web pages. Each document has two views: 1) the textual content of the web page, and 2) the anchor text on the links pointing to the web page.
CNN and FOX: These two datasets were crawled from CNN and FOX web news [28]. The category information extracted from their RSS feeds is used as the class label. Each instance is represented in two views: a text view and an image view. Titles, abstracts, and text body contents constitute the text view (view 1), and the image associated with the article constitutes the image view (view 2). All text is stemmed with the Porter stemmer, and $\ell_2$-normalized TF-IDF vectors are used as text features. Processed data samples in the CNN and FOX datasets have 1,143 and 7,980 features, respectively. Seven groups of color features and five textural features are used as image features [28], resulting in 996 features for both datasets.
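The $\ell_2$-normalized TF-IDF featurization of the text views can be sketched as follows; this is our illustration of the standard transform, not the authors' exact pipeline, and the smoothing convention is an assumption:

```python
import numpy as np

def l2_tfidf(counts):
    """counts: (n_docs, n_terms) raw term counts.
    Returns l2-normalized TF-IDF rows (idf = log(n_docs/df) + 1)."""
    n_docs = counts.shape[0]
    df = np.count_nonzero(counts > 0, axis=0)         # document frequency
    idf = np.log(n_docs / np.maximum(df, 1)) + 1.0
    tfidf = counts * idf
    norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
    return tfidf / np.maximum(norms, 1e-12)           # unit-length documents

X = np.array([[3, 0, 1], [0, 2, 0], [1, 1, 1]], dtype=float)
F = l2_tfidf(X)
print(np.linalg.norm(F, axis=1))   # each row has unit l2 norm
```

Unit-length rows make the inner product of two documents equal to their cosine similarity, which is convenient for the linear models used later in the experiments.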
TABLE II: Summary of the datasets (number of instances, features, and classes for Noisy MNIST, WebKB, FOX, and CNN).
TABLE III: Classification accuracy (%) of MDNN, Deep CCA, Deep LDA, Linear CCA, LDA, and Kernel CCA on Noisy MNIST and WebKB for different numbers of labeled samples.
TABLE IV: Classification accuracy (%) of MDNN, Deep CCA, Deep LDA, Linear CCA, LDA, and Kernel CCA on FOX and CNN for different numbers of labeled samples.
IV-B. Baselines
We compare the performance of MDNN with several state-of-the-art algorithms for multi-view representation learning. Among methods that do not use deep neural networks, we compare MDNN to linear Canonical Correlation Analysis (CCA) and Kernel CCA (KCCA), the most commonly used techniques for representation learning in multi-view problems [1, 6]. Although CCA finds only linear transformations, it is still widely used because of its speed and simplicity. As traditional kernel CCA is not scalable, we use the FKCCA method [29], which approximates exact KCCA.
Among the DNN-based approaches, deep CCA (DCCA) [15] is used in the experiments because it performs better than other DNN-based algorithms [6]. None of these CCA-based techniques uses label information; all are categorized as unsupervised feature reduction techniques.
Two approaches that consider label information are also selected as baselines: Linear Discriminant Analysis (LDA) [30] and its neural network variant, Deep LDA [20]. These approaches are not designed for multi-view problems, but they are selected because they use labeled data to learn the new representation. They are therefore applied to the primary view only and cannot use inter-view relations between views.
IV-C. Experimental Settings
We perform all experiments in the cross-view learning setting used in [1, 6, 31] for evaluating representation learning techniques in multi-view problems. In this setting, all views are available during representation learning, but one (view 2) is missing during testing. All methods in the experiments use both the primary and complementary views during training to learn a common feature space. After the representation is learned, the primary view is mapped to the new space, and a linear Support Vector Machine (SVM) classifier [32] is trained on the new representation to evaluate it on a classification task. We emphasize that the aim of this paper is to present a new representation for the multi-view setting; we therefore selected a linear SVM instead of a more complicated classifier so that the effectiveness of the representation learning can be evaluated more accurately. All parameters are selected to obtain the best performance through cross-validation. All neural network based models are trained for a fixed number of epochs, and all samples are distributed randomly over the batches in proportion to their class sizes.
The regularization parameter $r$ is selected by cross-validation from a fixed set of candidate values for all datasets, and the trade-off parameter $\alpha$ is selected for each dataset separately. The regularization constant $\epsilon$ is set to a small fixed value. The representation size is set to the number of classes for each dataset except WebKB. These sizes may not give the best possible performance for MDNN, but they are set to the number of classes for all models to allow a fair comparison among all techniques. The parameter $C$ of the SVM is also selected by cross-validation.
Networks with the same architecture, consisting of 3 hidden layers with the same number of hidden nodes, are used for both views. The only exception is the network for WebKB, which has 2 hidden layers instead of 3. ReLU activation is used for all layers except the last one, which has a linear activation. The number of hidden nodes is selected separately for the noisy MNIST, WebKB, FOX, and CNN datasets. We use a variant of SGD called Adam [33] to optimize the neural networks; all the parameters of Adam are set as its paper recommends. The architectures of the networks for all the neural network based models, including deep CCA, Deep LDA, and MDNN, are the same to ensure fair comparisons.

IV-D. Performance Evaluation
We evaluate the effectiveness of the new representations learned by MDNN on cross-view classification tasks. All results are reported for the primary view, which is the only view available at test time. The classification accuracies of the different methods on the noisy MNIST, WebKB, FOX, and CNN datasets are reported in Tables III and IV. Results are reported for different numbers of labeled samples to show the effectiveness of the proposed algorithm in semi-supervised settings. The column labeled 'All' indicates the case where label information is available for all samples. The best performance for each case is shown in bold. As can be observed, MDNN outperforms all the other baselines in most cases.
The differences in MDNN's accuracy relative to the others are more significant when fewer labeled samples are available. This shows the effectiveness of the proposed algorithm in exploiting label information, which helps the model in semi-supervised settings. Note that none of the baseline approaches can exploit labeled and unlabeled data together.
The experiments also demonstrate that MDNN shows superior results even in supervised settings where all data are labeled. This indicates that combining inter-view correlation and intra-view discrimination can be effective even when label information is available for all samples.
MDNN demonstrates better accuracy than deep CCA because it considers both label information and cross-view correlation when finding the projections, while deep CCA ignores the available label information. MDNN produces more discriminative features by incorporating label information into the learning of the mappings: the simultaneous optimization of inter-class separation, intra-class variation, and cross-view correlation makes the new representations more discriminative and therefore easier to classify.
Kernel CCA shows better results than MDNN in some cases on noisy MNIST. This may be due to the simplicity of the noisy MNIST dataset: as can be seen, just a few labeled samples are enough to obtain good results on it.
IV-E. Model Analysis
We investigate the influence of the main parameter of MDNN, the size of the new representation, on the classification task. In Fig. 4, the accuracy of MDNN on all datasets is plotted for various sizes of the learned space. As can be seen, good results are achieved with a small representation size; there is no need to learn a high-dimensional space, and a simple classification algorithm such as a linear SVM can classify the samples in the new space efficiently. This shows the representation learning power of MDNN: it can make working on high-dimensional data feasible for algorithms that cannot handle such data efficiently.

Additionally, an unnecessarily large output dimension can hurt performance. For most datasets, a hidden output size close to the number of classes is a good choice; unnecessarily large embedding sizes may reduce performance, possibly as a result of noisy dimensions in the higher-dimensional space.
IV-F. Subspace Analysis
In this section, the new subspace learned by MDNN is investigated and compared with the original feature space. 4,000 training samples with the new representation are selected randomly and visualized in a 2-dimensional space in Fig. 5, using the t-distributed stochastic neighbor embedding (t-SNE) algorithm [34]. t-SNE is an unsupervised technique mostly used for visualizing features in a low-dimensional space; it learns a mapping from the given feature space to a new space in which the similarity of samples is preserved as much as possible, so samples that are close or similar in the source space are likely to be close in the new space. It is evident that MDNN produces a more discriminative space than the original feature space; it learns a better representation owing to its exploitation of label information.
V. Conclusion
We have proposed a semi-supervised deep neural network model, called MDNN, to learn discriminative representations for multi-view problems when labels for some instances are not available. To achieve this, the proposed model maximizes between-class separation and minimizes within-class variation to make the new space discriminative. It also maximizes the correlation between all views to exploit inter-view information as well as the information in unlabeled data. Our model is thus capable of exploiting both the labeled and the unlabeled data in a unified learning process.
To the best of our knowledge, the proposed MDNN is the first deep network model that learns a common subspace with such properties for semi-supervised multi-view problems. The experimental results demonstrated the effectiveness of MDNN in learning discriminative feature spaces and in benefiting from the information present in the unlabeled data.
References
[1] C. Xu, D. Tao, and C. Xu, “A survey on multi-view learning,” arXiv preprint arXiv:1304.5634, 2013.
[2] K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan, “Multi-view clustering via canonical correlation analysis,” in Proceedings of the 26th International Conference on Machine Learning, ICML’09, 2009, pp. 1–8.
[3] R. Arora and K. Livescu, “Multi-view CCA-based acoustic features for phonetic recognition across speakers and domains,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’13, 2013, pp. 7135–7139.
[4] M. Faruqui and C. Dyer, “Improving vector space word representations using multilingual correlation,” in Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 2014, pp. 462–471.
 [5] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[6] W. Wang, R. Arora, K. Livescu, and J. Bilmes, “On deep multi-view representation learning,” in Proceedings of the 32nd International Conference on Machine Learning, ICML’15, 2015, pp. 1083–1092.
[7] J. Zhu, A. Ahmed, and E. P. Xing, “MedLDA: Maximum margin supervised topic models for regression and classification,” in Proceedings of the International Conference on Machine Learning, ICML’09, 2009.
[8] C. Xu, D. Tao, and C. Xu, “Large-margin multi-view information bottleneck,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
 [9] D. R. Hardoon, S. Szedmak, and J. ShaweTaylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural computation, vol. 16, no. 12, pp. 2639–2664, 2004.
[10] P. Dhillon, D. P. Foster, and L. H. Ungar, “Multi-view learning of word embeddings via CCA,” in Annual Conference on Advances in Neural Information Processing Systems, NIPS’11, 2011, pp. 199–207.
[11] D. P. Foster, S. M. Kakade, and T. Zhang, “Multi-view dimensionality reduction via canonical correlation analysis,” Toyota Technological Institute, Chicago, Illinois, Tech. Rep. TTI-TR-2008-4, 2008.
[12] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, K.-R. Müller, Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, “Fisher discriminant analysis with kernels,” in Neural Networks for Signal Processing, 1999, pp. 41–48.
 [13] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in Proceedings of the 28th international conference on machine learning, ICML’11, 2011, pp. 689–696.
 [14] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
 [15] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, “Deep canonical correlation analysis,” in Proceedings of the 30th International Conference on Machine Learning, ICML’13, 2013, pp. 1247–1255.
 [16] A. Benton, H. Khayrallah, B. Gujral, D. A. Reisinger, S. Zhang, and R. Arora, “Deep generalized canonical correlation analysis,” arXiv preprint arXiv:1702.02519, 2017.
[17] J. Zhang, G. Tian, Y. Mu, and W. Fan, “Supervised deep learning with auxiliary networks,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’14, 2014.
[18] A. Ororbia II, C. L. Giles, and D. Reitter, “Learning a deep hybrid model for semi-supervised text classification,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
[19] M. Sugiyama, “Local Fisher discriminant analysis for supervised dimensionality reduction,” in Proceedings of the 23rd International Conference on Machine Learning, ICML’06, 2006, pp. 905–912.
 [20] M. Dorfer, R. Kelz, and G. Widmer, “Deep linear discriminant analysis,” in Proceedings of International Conference on Learning Representation, ICLR’16, 2016.
 [21] N. Srivastava and R. Salakhutdinov, “Learning representations for multimodal data with deep belief nets,” in International Conference on Machine Learning Workshop, ICML’12, 2012.
 [22] J. F. Hair, W. C. Black, B. J. Babin, R. E. Anderson, R. L. Tatham et al., Multivariate data analysis. Pearson Prentice Hall Upper Saddle River, NJ, 2006, vol. 6.
[23] L. Wu, C. Shen, and A. van den Hengel, “Deep linear discriminant analysis on Fisher networks: A hybrid architecture for person re-identification,” Pattern Recognition, vol. 65, pp. 238–250, 2017.
[24] A. Stuhlsatz, J. Lippel, and T. Zielke, “Feature extraction with deep neural networks by a generalized discriminant analysis,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 4, pp. 596–608, 2012.
[25] K. B. Petersen, M. S. Pedersen et al., “The matrix cookbook,” Technical University of Denmark, vol. 7, p. 15, 2008.
[26] K. Wang, Q. Yin, W. Wang, S. Wu, and L. Wang, “A comprehensive survey on cross-modal retrieval,” CoRR, vol. abs/1607.06215, 2016.
[27] V. Sindhwani, P. Niyogi, and M. Belkin, “Beyond the point cloud: from transductive to semi-supervised learning,” in Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005, pp. 824–831.
[28] M. Qian and C. Zhai, “Unsupervised feature selection for multi-view clustering on text-image web news data,” in Proceedings of the 23rd ACM International Conference on Information and Knowledge Management. ACM, 2014, pp. 1963–1966.
[29] D. Lopez-Paz, S. Sra, A. Smola, Z. Ghahramani, and B. Schölkopf, “Randomized nonlinear component analysis,” arXiv preprint arXiv:1402.0119, 2014.
 [30] A. J. Izenman, “Linear discriminant analysis,” in Modern multivariate statistical techniques. Springer, 2013, pp. 237–280.
[31] W. Wang and K. Livescu, “Large-scale approximate kernel canonical correlation analysis,” CoRR, vol. abs/1511.04773, 2015. [Online]. Available: http://arxiv.org/abs/1511.04773
 [32] J. A. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural processing letters, vol. 9, no. 3, pp. 293–300, 1999.
 [33] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[34] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.