Classification is a fundamental technique of machine learning and has been broadly applied to image classification [1, 2, 3], image reconstruction, cross-modal retrieval, gaze tracking, visual tracking [7, 8], multi-view boosting, multi-view biological data,
and so on. Typical classification methods include Logistic Regression, K-Nearest Neighbor, Decision Tree, and Support Vector Machines (SVM), where the kernel trick is an important innovation of SVM: it not only allows a convergent convex optimization method to learn a nonlinear model but also makes the implementation of kernel functions (such as the Radial Basis Function (RBF)) highly efficient. Unfortunately, the computational cost of kernel methods is prohibitive for large datasets. Deep learning methods have therefore been proposed to overcome this limitation. For example, it has been demonstrated that the performance of neural networks is better than that of an SVM with the RBF kernel on the MNIST dataset.
Recently, with the rapid development of data mining techniques, data in many scientific analysis tasks are often collected in different ways, such as through various sensors, since a single way usually cannot comprehensively describe the entire information of a data instance. In this case, different kinds of features of each data instance can be considered as different views of this instance, where each view may capture a specific facet of the instance. Importantly, the combination of multiple views can provide a more comprehensive description of the instance. Obviously, multi-view features can represent an instance more sufficiently than single-view features. As an important branch of multi-view learning, multi-view classification [13, 14, 15] uses multiple distinct representations of data and models a multi-view learning framework to perform the classification task.
Canonical Correlation Analysis (CCA) and Kernel CCA (KCCA) [17, 18, 19] have shown their ability to effectively model the relationship between two or more views. However, they still have some limitations in capturing high-level associations between different views. To be specific, CCA ignores the nonlinearities in multi-view data, and KCCA may suffer from the effect of small data when data acquisition in one or more modalities is expensive or otherwise limited. Therefore, early efforts investigated a neural network implementation of CCA that maximizes the correlation between the outputs of the networks for each view, and formulated a nonlinear CCA method using three feedforward neural networks, where the first network maximizes the correlation between two canonical variates while the remaining two networks map from the canonical variates back to the original two sets of variables. However, these CCA-based methods were proposed many years ago and leave room for improvement.
Inspired by the recent successes of deep neural networks [22, 23], correlation can be naturally applied to multi-view neural network learning to learn deep and abstract multi-view interactive information. The deep neural network extension of CCA (Deep CCA) learns representations of two views using multiple stacked layers and maximizes the correlation of the two representations. Later, a deep canonically correlated auto-encoder (DCCAE) was proposed by combining the advantages of both Deep CCA and the auto-encoder. Specifically, DCCAE consists of two auto-encoders and optimizes the combination of the canonical correlation between the two learned bottleneck representations and the reconstruction errors of the auto-encoders. Similar to the principle of DCCAE, a correspondence auto-encoder (Corr-AE) has been proposed by constructing correlations between the hidden representations of two unimodal deep auto-encoders. It has also been suggested to exploit the cross weights between the representations of views for gradually learning interactions of the modalities (views) in a multi-modal deep auto-encoder network. Theoretical analysis shows that considering these interactions at a high level provides more intra-modality (intra-view) information.
In addition, a number of multi-view analysis methods [28, 29] have been proposed. Based on Fisher Discriminant Analysis (FDA), both regularized two-view FDA and its kernel extension can be cast as equivalent disciplined convex optimization problems. Then, Multi-view Fisher Discriminant Analysis (MFDA) was introduced, which learns classifiers in multiple views by minimizing the variance of the data along the projection while maximizing the distance between the average outputs for classes over all the views. However, MFDA can only be used for binary classification problems. Multi-view Discriminant Analysis (MvDA) was subsequently proposed, which seeks a discriminant common space by maximizing the between-class and minimizing the within-class variations across all the views. Later, based on bilinear models and the general graph embedding framework, Generalized Multi-view Analysis (GMA) was introduced. As an example of GMA, Generalized Multi-view Linear Discriminant Analysis (GMLDA) finds a set of projection directions in each view that tries to separate different contents' class means and unify different views of the same class in the common subspace.
Based on the above analyses, although CCA-based deep neural networks can learn multi-view interactive information and DA-based methods can consider discriminative information, there is no unified framework that simultaneously embeds the intra-view information, the cross-view interactive information, and a reliable multi-view fusion strategy.
To tackle this issue, we propose a novel multi-view learning framework named MvNNBiIn, which integrates both the intra-view information and the cross-view multi-dimension bilinear interactive information for each view, and designs a multi-view selective loss fusion. Specifically, (a) the multiple intra-view information, coming from different faceted representations, ensures the diversity and complementarity among different views to enhance multi-view learning. (b) The cross-view multi-dimension bilinear interactive information is dynamically learned from different bilinear similarities, where each bilinear similarity is calculated between intra-view representations via a bilinear function. Therefore, the multi-dimension bilinear interactive information comprehensively models the relationships between views by learning different metric matrices. (c) The multi-view selective loss fusion calculates multiple losses for multiple views and fuses them in an adaptive weighting way with a selective strategy. This selective strategy can choose several discriminative views that are beneficial to the decision in multi-view classification, by tuning the sparseness of the weight vector.
It is worth mentioning that we have developed a preliminary work named deep embedded complementary and interactive information for multi-view classification (denoted as MvNNcor). In this paper, our proposed MvNNBiIn method extends and improves the MvNNcor method significantly. Their major differences are summarized as follows. On one hand, MvNNcor utilizes the cross-correlations between attributes of multiple representations to generate interactive information. In contrast, our MvNNBiIn models a novel cross-view multi-dimension bilinear interactive information which consists of different bilinear similarities for each view with respect to another view, where each bilinear similarity is generated from the intra-view information through the bilinear function. On the other hand, MvNNcor uses a multi-view fusion strategy to integrate multiple views. By contrast, our MvNNBiIn designs a novel view ensemble mechanism to select the more discriminative views that are beneficial to multi-view classification.
We summarize the main contributions of the proposed MvNNBiIn method as follows.
We propose a unified framework, which seamlessly embeds intra-view information, cross-view multi-dimension bilinear interactive information, and a novel view ensemble mechanism, to make a decision during the optimization and improve the classification performance.
We model the cross-view interactive information by capturing the multi-dimension bilinear interactive information which is calculated by simultaneously learning multiple metric matrices via the bilinear function between views. It comprehensively models the relationships between different views.
We develop a new view ensemble mechanism which not only selects some discriminative views but also fuses them via an adaptive weighting method. The selective strategy can ensure that the selected views are beneficial to the multi-view classification.
We perform extensive experiments on several publicly available datasets to demonstrate the effectiveness of our model.
The rest of this paper is organized as follows. The preliminary knowledge is introduced in Section II. We formulate the framework of MvNNBiIn in Section III. Section IV briefly provides a tractable optimization method for MvNNBiIn. Section V evaluates MvNNBiIn on several public datasets, followed by some theoretical and empirical analyses of the experiments. The conclusion and future work are given in Section VI.
II Preliminary Knowledge
In this section, we briefly review several multi-view learning methods from the perspectives of CCA-based methods and MvDA method.
II-A CCA-based Methods
CCA is popular for its capability of modeling the relationship between two or more sets of variables. CCA computes a shared embedding of two or more sets of variables by maximizing the correlations among the variables across these sets. CCA has been widely used in multi-view learning tasks to generate low-dimensional representations [35, 36]. Improved generalization performance has been witnessed in areas including dimensionality reduction, clustering [38, 39], regression [40, 41], word embeddings [42, 43, 44], and discriminant learning [45, 46].
Supposing that $X_v \in \mathbb{R}^{d_v \times n}$ is a data matrix for the $v$-th view ($v = 1, 2$), CCA tends to find the linear projections $w_1$ and $w_2$ which make the instances from the two data matrices maximally correlated in the projected space. Therefore, CCA is modeled as the following constrained optimization problem,
$$\max_{w_1, w_2} \; w_1^\top X_1 X_2^\top w_2 \quad \text{s.t.} \quad w_1^\top X_1 X_1^\top w_1 = w_2^\top X_2 X_2^\top w_2 = 1,$$
where $X_1$ and $X_2$ are centralized.
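As a concrete sketch, this two-view CCA problem has the classical closed-form solution obtained by whitening each view's covariance and taking an SVD. The snippet below is an illustrative NumPy implementation under our own assumptions (the function name, a small ridge `reg` for numerical stability, and $d \times n$ data matrices with columns as samples), not the paper's code:

```python
import numpy as np

def linear_cca(X1, X2, k=2, reg=1e-6):
    """Classical two-view CCA: whiten per-view covariances, then SVD."""
    n = X1.shape[1]
    X1 = X1 - X1.mean(axis=1, keepdims=True)          # centralize each view
    X2 = X2 - X2.mean(axis=1, keepdims=True)
    S11 = X1 @ X1.T / (n - 1) + reg * np.eye(X1.shape[0])
    S22 = X2 @ X2.T / (n - 1) + reg * np.eye(X2.shape[0])
    S12 = X1 @ X2.T / (n - 1)

    def inv_sqrt(S):                                   # S^{-1/2} via eigendecomposition
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    # singular values of S11^{-1/2} S12 S22^{-1/2} are the canonical correlations
    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    U, s, Vt = np.linalg.svd(T)
    W1 = inv_sqrt(S11) @ U[:, :k]                      # projections for view 1
    W2 = inv_sqrt(S22) @ Vt[:k].T                      # projections for view 2
    return W1, W2, s[:k]
```

When one view is (nearly) a linear function of the other, the top canonical correlation approaches 1.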
KCCA is a kernel extension of CCA for pursuing maximally correlated nonlinear projections. Let $K_v$ denote the centered kernel matrix such that $K_v = H \bar{K}_v H$, where $\bar{K}_v(i, j) = k(x_i^{(v)}, x_j^{(v)})$ for a kernel function $k$, $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is a centering matrix, and $\mathbf{1}$ denotes a vector of all ones. Therefore, KCCA is formulated as the following optimization problem,
$$\max_{\alpha_1, \alpha_2} \; \alpha_1^\top K_1 K_2 \alpha_2 \quad \text{s.t.} \quad \alpha_1^\top K_1^2 \alpha_1 = \alpha_2^\top K_2^2 \alpha_2 = 1,$$
where $\alpha_1$ and $\alpha_2$ are the coefficient vectors optimized by KCCA.
As a DNN extension of CCA, DCCA utilizes two DNNs $f_1$ and $f_2$ to extract nonlinear features for each view, and then maximizes the canonical correlation between $f_1(X_1)$ and $f_2(X_2)$. That is,
$$\max \; \frac{u^\top \Sigma_{12} v}{\sqrt{u^\top \Sigma_{11} u}\,\sqrt{v^\top \Sigma_{22} v}},$$
where $u$ and $v$ are the CCA directions that project the DNN outputs, $\Sigma_{12}$ is the sample cross-covariance of $f_1(X_1)$ and $f_2(X_2)$, $\Sigma_{11}$ and $\Sigma_{22}$ are the corresponding sample covariances regularized as $\Sigma_{vv} + r_v I$, and $r_1, r_2 > 0$ are the regularization parameters for the sample covariance estimation.
Inspired by both CCA and reconstruction-based objectives, DCCAE constructs a model consisting of two auto-encoders and optimizes the combination of the canonical correlation between the learned bottleneck representations and the reconstruction errors of the auto-encoders. That is,
$$\min \; -\mathrm{corr}\big(f_1(X_1), f_2(X_2)\big) + \lambda \sum_{i=1}^{n} \Big( \|x_{1i} - g_1(f_1(x_{1i}))\|^2 + \|x_{2i} - g_2(f_2(x_{2i}))\|^2 \Big),$$
where $g_1$ and $g_2$ are the reconstruction networks for each view, and $\lambda > 0$ is the trade-off parameter.
II-B MvDA Method
MvDA attempts to find $v$ linear transforms that project the samples from $v$ views to one discriminant common space, respectively, where the between-class variation is maximized while the within-class variation is minimized. Let $x_{ijk}$ denote the $k$-th sample from the $j$-th view of the $i$-th class, where $c$ denotes the number of classes and $n_{ij}$ is the number of samples from the $j$-th view of the $i$-th class.

The samples from the $v$ views are then projected to the same common space by using view-specific linear transforms $W_1, \ldots, W_v$. The projected results are denoted as $y_{ijk} = W_j^\top x_{ijk}$. In the common space, according to our goal, the between-class variation from all views should be maximized while the within-class variation from all views should be minimized. Therefore, the objective is formulated as a generalized Rayleigh quotient,
$$\max_{W_1, \ldots, W_v} \; \frac{\operatorname{tr}(S_B)}{\operatorname{tr}(S_W)},$$
where the within-class scatter matrix $S_W$ and the between-class scatter matrix $S_B$ are computed as below,
$$S_W = \sum_{i=1}^{c} \sum_{j=1}^{v} \sum_{k=1}^{n_{ij}} (y_{ijk} - \mu_i)(y_{ijk} - \mu_i)^\top, \qquad S_B = \sum_{i=1}^{c} n_i (\mu_i - \mu)(\mu_i - \mu)^\top,$$
where $n_i$ denotes the number of samples of the $i$-th class in all views, $n$ is the number of samples from all the classes and all the views, $\mu_i$ is the mean of all the samples of the $i$-th class over all the views in the common space, and $\mu$ is the mean of all the samples over all views in the common space. That is,
$$\mu_i = \frac{1}{n_i} \sum_{j=1}^{v} \sum_{k=1}^{n_{ij}} y_{ijk}, \qquad \mu = \frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{v} \sum_{k=1}^{n_{ij}} y_{ijk}.$$
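To make the scatter definitions concrete, the snippet below computes $S_W$ and $S_B$ from already-projected samples pooled over all views. It is a NumPy sketch under our own naming (the view-specific transforms are assumed to have been applied beforehand); a useful sanity check is the identity $S_W + S_B = S_T$, the total scatter:

```python
import numpy as np

def mvda_scatter(views, labels):
    """Within-class (S_W) and between-class (S_B) scatter in the common
    space: `views` is a list of d x n matrices of already-projected samples
    (one per view); `labels` holds the shared class labels of the n samples."""
    Y = np.hstack(views)                          # pool samples from all views
    y = np.concatenate([labels] * len(views))     # repeat labels per view
    mu = Y.mean(axis=1, keepdims=True)            # overall mean over all views
    d = Y.shape[0]
    SW, SB = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Yc = Y[:, y == c]
        mc = Yc.mean(axis=1, keepdims=True)       # class mean over all views
        SW += (Yc - mc) @ (Yc - mc).T             # within-class variation
        SB += Yc.shape[1] * (mc - mu) @ (mc - mu).T   # between-class variation
    return SW, SB
```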
III The Proposed Method
In this section, we present the architecture of our MvNNBiIn model, depicted in Figure 1, which consists of four parts: the intra-view extraction networks, the multi-dimension bilinear interactive modules between views, the combination of intra-view and cross-view information, and the multi-view selective loss fusion strategy.
III-A Various Intra-view Information Extraction
Given an instance $x$, we utilize $v$ views $\{x_1, \ldots, x_v\}$ to denote its $v$ kinds of visual features, which ensures the diverse and complementary information during multi-view learning. Defining a set of neural networks $\{f_1, \ldots, f_v\}$, each $f_i$ projects $x_i$ into $z_i$, which captures the high-level intra-view information for the $i$-th view, that is,
$$z_i = f_i(x_i),$$
where $f_i$ is a neural network with $L$ layers,
$$h_i^{(l)} = \sigma\big(W_i^{(l)} h_i^{(l-1)} + b_i^{(l)}\big), \quad l = 1, \ldots, L,$$
where $W_i^{(l)}$ denotes the weight matrix, $b_i^{(l)}$ denotes the bias vector, $h_i^{(l)}$ is the output of the $l$-th layer, and $\sigma$ is the activation function applied component-wise. Specifically, $h_i^{(0)} = x_i$ and $z_i = h_i^{(L)}$. It is worth mentioning that $f_1, \ldots, f_v$ are trained coordinatively by solving the optimization problem shown in subsection III-D, where the parameters of $f_i$ are not shared with those of $f_j$ ($j \neq i$). Although we utilize a multi-layer perceptron as $f_i$, it can also be replaced with any deterministic neural network with input and output layers.
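A minimal sketch of one intra-view network as a plain MLP follows; the ReLU activation, weight shapes, and names are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def intra_view_net(x_i, weights, biases):
    """Forward pass h = sigma(W h + b) over L layers, mapping the i-th
    view feature x_i to its high-level intra-view representation z_i."""
    h = x_i
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)
    return h
```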
III-B Multi-dimension Bilinear Interactive Information
Inspired by metric learning, the multi-dimension bilinear interactive information is proposed to construct second-order feature interactions on the various intra-view information. To be specific, given the intra-view representations $z_i$ of the $i$-th view and $z_j$ of the $j$-th view, the multi-dimension bilinear interactive information between the two views can be formulated via the bilinear function, that is,
$$s_m(z_i, z_j) = z_i^\top M_m z_j + b_m, \quad m = 1, \ldots, M,$$
where $M_m$ is the $m$-th metric matrix and $b_m$ is the bias. There are $M$ metric matrices to be learned simultaneously. Intuitively, Figure 2 shows the construction of the multi-dimension bilinear interactive information of each pair of views.

For the $i$-th view, we define a set containing the view pairs relative to the $i$-th view, where each view pair is undirected. The bilinear interactive information of the $i$-th view relative to the other views is then collected into a set, where the $i$-th set contains $v-1$ view pairs and there are $v$ such sets for the $v$ views.
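The bilinear interaction between two intra-view representations can be sketched as follows: each learned metric matrix yields one bilinear similarity $z_i^\top M_m z_j + b_m$, and stacking the $M$ similarities gives the interaction vector for the view pair (names and shapes are our assumptions):

```python
import numpy as np

def bilinear_interaction(z_i, z_j, metric_matrices, biases):
    """Multi-dimension bilinear interactive information for the pair (i, j):
    one bilinear similarity z_i^T M_m z_j + b_m per metric matrix M_m."""
    return np.array([z_i @ M @ z_j + b
                     for M, b in zip(metric_matrices, biases)])
```

With $M_m = I$ and $b_m = 0$, the similarity reduces to the plain inner product of the two representations; learning the metric matrices generalizes this to multiple weighted second-order interactions.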
III-C Combination and Prediction of Each View
According to the above two subsections, we combine the intra-view information and the multi-dimension bilinear interactive information of each view as follows,
where Con denotes the concatenation operation.
After that, the combined representation is passed through the prediction network $g$ to obtain the prediction of each view, where $C$ is the number of classes and $g$ produces a distribution over the $C$ possible classes for each view. $g$ is a neural network with $L'$ layers, that is,
$$h^{(l)} = \sigma\big(W^{(l)} h^{(l-1)} + b^{(l)}\big), \quad l = 1, \ldots, L',$$
where $W^{(l)}$ denotes the weight matrix, $b^{(l)}$ denotes the bias vector, $h^{(l)}$ is the output of the $l$-th layer, and $\sigma$ is the activation function applied component-wise. In particular, the input of $g$ is the combined representation of a view, and its output is the class distribution of that view. Here, the parameters of $g$ are shared among views.
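For illustration, the shared prediction network can be sketched as one ReLU hidden layer followed by a softmax over the $C$ classes; the layer sizes and names below are our assumptions, not the paper's exact configuration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                  # shift for numerical stability
    return e / e.sum()

def predict_view(combined, W1, b1, W2, b2):
    """Map one view's concatenated intra-view + interactive vector to a
    distribution over the C classes (shared across views)."""
    h = np.maximum(W1 @ combined + b1, 0.0)  # ReLU hidden layer
    return softmax(W2 @ h + b2)
```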
III-D Multi-view Selective Loss Fusion Strategy
In this subsection, we design a novel view ensemble mechanism, which calculates multiple losses for multiple views and fuses them in an adaptive weighting way with a selective strategy. This selective strategy can choose several discriminative views that are beneficial to the multi-view classification, by tuning the sparseness of the weight vector. That is described as,

where $w_i$ is the weight for each view, $y$ denotes the common label information of an instance for all the views, and $\mathcal{L}_i$ is the cross-entropy loss of the $i$-th view. The power exponent parameter $r$ of the weight controls the weight distribution of different views flexibly and avoids the trivial solution of the weight vector during the classification. The $\ell_0$-norm constraint is used to constrain the sparseness of the weight vector $w$, where $\|w\|_0$ denotes the number of nonzero elements in $w$. Crucially, the $\ell_0$-norm constraint is able to capture the global relationship among different views and to achieve view-wise sparsity, so that selecting a few discriminative views can improve the performance during multi-view classification. Intuitively, the architecture of our proposed MvNNBiIn method is shown in Figure 1.
We utilize the alternate optimization method to update the network parameters and the view-weight vector $w$, respectively. For convenience, a single symbol denotes the bilinear function set and Con denotes the concatenation operation in Figure 3.
IV-A Updating the Networks and the Bilinear Function Set
We fix the view-weight vector $w$ and update the intra-view networks, the bilinear function set, and the prediction network. Figure 3 briefly shows the gradient computations of our proposed MvNNBiIn method.
The descriptions of six publicly available datasets, i.e., Caltech101, Caltech20, AWA, NUSOBJ, Reuters, and SUN. Some abbreviations are defined as follows, Fea: public Feature, Dim: Dimensionality, WM: Wavelet Moments, CENT: CENTRIST, CH: Color Histogram, LSS: Local Self-Similarity, PHOG: Pyramid HOG, RGSIFT: Color SIFT, CM: block-wise Color Moments, CORR: color Correlogram, EDH: Edge Direction Histogram, WT: Wavelet Texture, GEOMAP: Geometric Map.
We learn a non-negative normalized weight for each view and assign higher weights to more discriminative views. Therefore, we fix the network parameters and update the weight vector $w$ by solving the optimization problem (16).
To efficiently minimize problem (16), we define a permutation that sorts the elements of the loss vector into ascending order. Based on equation (18), we select the first $k$ smallest elements and optimize their corresponding weights, while setting the remaining weights to zero. Therefore, problem (16) is equivalent to the following problem by absorbing the $\ell_0$-norm constraint into the objective function,
Through the Lagrangian Multiplier method, the Lagrangian function of problem (19) is:
where $\lambda$ is the Lagrangian multiplier. Taking the derivatives of the Lagrangian function with respect to each weight $w_i$ and $\lambda$, respectively, and setting them to zero, we obtain
$$r w_i^{r-1} \mathcal{L}_i = \lambda \;\; (i \in S), \qquad \sum_{i \in S} w_i = 1,$$
where $S$ is the index set of the selected views and $k = |S| \leq v$ is the sparsity of $w$.
To sum up, the optimal solution of problem (16) can be calculated by,
$$w_i = \begin{cases} \mathcal{L}_i^{1/(1-r)} \Big/ \sum_{j \in S} \mathcal{L}_j^{1/(1-r)}, & i \in S, \\ 0, & \text{otherwise}, \end{cases}$$
where $S$ denotes the index set of the $k$ views with the smallest losses.
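The selective weight update can be sketched in a few lines: keep the $k$ views with the smallest losses, zero out the rest, and weight the kept views by the closed-form solution of $\min \sum_v w_v^r \mathcal{L}_v$ subject to $\sum_v w_v = 1$ with $r > 1$, i.e., $w_v \propto \mathcal{L}_v^{1/(1-r)}$ (the function name and defaults below are our assumptions):

```python
import numpy as np

def update_view_weights(losses, r=3.0, k=2):
    """Selective weight update: keep the k smallest-loss views, set the rest
    to zero, and weight kept views proportionally to L_v^{1/(1-r)}."""
    losses = np.asarray(losses, dtype=float)
    w = np.zeros_like(losses)
    keep = np.argsort(losses)[:k]               # the k most discriminative views
    scores = losses[keep] ** (1.0 / (1.0 - r))  # smaller loss -> larger weight
    w[keep] = scores / scores.sum()             # normalize over selected views
    return w
```

Smaller losses receive larger weights, and increasing $r$ flattens the distribution over the selected views.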
Therefore, the overall scheme of MvNNBiIn can be summarized as follows. Firstly, the intra-view networks are used to learn the complementary and diverse intra-view information from the multi-view representations. Secondly, the intra-view information is used for training the bilinear function set, which captures the cross-view interactive information. Thirdly, the intra-view and cross-view interactive information for each view is integrated by a concatenation operation, and the concatenation is then fed into the prediction network to calculate the cross-entropy loss. Finally, with the trained networks and bilinear functions, the weight distribution can be obtained by solving problem (16), which helps infer the label during the multi-view classification.
In this section, we evaluate the performance of our proposed MvNNBiIn method on six publicly available datasets.
We follow many state-of-the-art multi-view methods, which treat different kinds of pre-extracted feature vectors for an image as different views to make the experimental comparisons fair. Different kinds of pre-extracted features present different facets of the images, such as color, texture, shape, and geographic information, yet the above information is not intuitive. We describe all the publicly available datasets in TABLE I.
Specifically, Caltech101 [47, 48] and Caltech20  datasets consist of images of objects belonging to 102 (101 classes plus one background clutter class) and 20 classes, respectively, and each dataset is described as 6 views. AWA (Animals with Attributes)  dataset provides 6 kinds of features (6 views) for attribute base classification. NUSOBJ dataset is a subset of NUS-WIDE , which describes each object image using 5 types of low-level features (5 views). Reuters  dataset is used for document categorization and written in 5 different languages (5 views). SUN is a subset of SUN397 (Scene Categorization Benchmark) [52, 53] and utilizes 3 kinds of public features matrices (3 views) to represent each image.
Referring to the previous works [24, 54], we split each dataset into three parts, that is, 70% samples for training, 20% samples for validating, and 10% samples for testing. We utilize the classification accuracy (e.g., Top@1 accuracy and Top@5 accuracy) to evaluate the performance of all the methods and report the final results in TABLEs II, III, and IV.
V-B Experimental Settings
V-B1 Comparison Methods
We first compare our MvNNBiIn method with several state-of-the-art multi-view methods, including SVMcon, DCCA, DCCAE, DeepLDA, MvDA, MvDN, and MvNNcor, in Table III. In particular, SVMcon is a baseline that concatenates all the views and feeds them into the SVM classifier. DCCA and DCCAE belong to the CCA-based methods taking two views as input. DeepLDA, MvDA, and MvDN are Discriminant Analysis based methods, where DeepLDA inputs the concatenation of all the views and feeds it into a deep neural network followed by LDA; MvDA inputs multiple views with the same dimensionality; MvDN is the nonlinear version of LDA, which uses deep neural networks to replace the linear transformations.
What’s more, Table IV demonstrates the effectiveness of three important parts of our proposed MvNNBiIn, i.e., intra-view information, cross-view bilinear interactive information, and a novel view ensemble mechanism, respectively. The highest performance is obtained when all the parts are available while the performance is lower when any part is absent.
V-B2 Parameters Setup
DeepLDA is a fully connected neural network consisting of three hidden layers, i.e., 400, 200, and 300 units equipped with the ReLU activation function. In DCCAE, there is a feature extraction network and a reconstruction network for each view, where each network is a fully connected network consisting of three hidden layers, i.e., 400, 200, and 300 units equipped with the ReLU activation function, followed by a linear output layer (for the feature extraction network / the reconstruction network). The capacities of the above networks are the same as those of their counterparts in DCCA and MvDN.
In our MvNNBiIn, two kinds of networks and the bilinear function set need to be learned. Each intra-view network is a fully connected network consisting of two hidden layers (i.e., 400 and 200 units equipped with the ReLU activation function). The prediction network consists of 200 input units and 300 hidden units equipped with the ReLU activation function, followed by a linear output layer. Each bilinear function outputs the bilinear interactive units for a pair of views; that is, the inputs of each bilinear function are the outputs of the two corresponding intra-view networks, and the input of the prediction network is concatenated from the outputs of the intra-view network and the bilinear functions. The capacities of the above networks are the same as those of their counterparts in the ablation studies shown in Table IV. In MvNNcor, three kinds of networks need to be learned, where the first two are the same as their counterparts in MvNNBiIn, and the third consists of 200 input units and 200 hidden units with the ReLU activation function.
In this paper, all the networks are optimized by Adam with batch normalization, where the learning rate is 10 and the batch size is 64. In addition, we vary the power exponent and the sparsity, respectively, to explore the influence of their different values on the classification accuracy. Based on the optimal values, we can learn the optimal model that achieves the highest classification accuracy. The results are shown in Figure 4, where the Caltech101, Caltech20, AWA, NUSOBJ, Reuters, and SUN datasets achieve the best performance under their respective optimal settings.
Besides, to explore the proper combination proportion of the intra-view information and the multi-dimension bilinear interactive information for each view, we investigate the dimensionality of the output of the bilinear function, setting it as 50, 100, 200, and 400, respectively, and the results are reported in Table II. It can be seen that the classification accuracy is the highest at the optimal setting, which shows that the combination proportion also has an impact on the classification performance.
V-C Experimental Results
Tables III and IV show the classification performance of all the methods, where Table III reports the experimental results of several recent methods and Table IV provides the results of ablation experiments.
Firstly, compared with the single-view methods, i.e., SVMcon and DeepLDA, our MvNNBiIn consistently outperforms them on all the datasets. For example, MvNNBiIn achieves 28.661% and 30.913% improvements, respectively, on the Caltech101 dataset. That is because the concatenation of all the views may reduce the interpretability of different views and ignores the cross-view interactive information during the multi-view classification. We also compare our MvNNBiIn with MvDA; the classification accuracy of MvNNBiIn is better on all the datasets. For instance, our MvNNBiIn obtains a 31.362% improvement on the Caltech101 dataset, since the linear transformations of MvDA cannot deal well with some subtle but important structures in some challenging scenarios. Compared to MvDN, our MvNNBiIn achieves a 6.348% improvement on the Caltech101 dataset due to embedding the multi-dimension bilinear interactive information between different views.
Secondly, we compare our MvNNBiIn with the CCA-based methods, i.e., DCCA and DCCAE, and the results are reported in Table III. It can be seen that our MvNNBiIn performs better than DCCA and DCCAE, since these two methods are limited to the double-view input and unable to capture more diverse and complementary information from more views. For example, compared to DCCA and DCCAE, our MvNNBiIn achieves 28.639% and 35.832% improvements on the AWA dataset, respectively.
Thirdly, as the extension of MvNNcor, our MvNNBiIn achieves better performance on almost all the datasets. For example, compared with MvNNcor on the AWA dataset, MvNNBiIn achieves a 1.629% improvement. On the one hand, the cross-view multi-dimension bilinear interactive information of our MvNNBiIn better captures the interactive information between different views. On the other hand, our MvNNBiIn designs a novel view ensemble mechanism which can select the more discriminative views and is thus more beneficial to the multi-view classification.
Besides, in the ablation studies of our MvNNBiIn shown in Table IV, taking the NUSOBJ dataset as an example, we achieve 2.682%, 1.163%, 15.060%, 3.790%, 3.474%, 0.856%, and 0.399% improvements compared with seven ablated variants of our framework, each of which contains only a subset of the modules. These results successively demonstrate the effectiveness of integrating the intra-view information and the multi-dimension bilinear interactive information between views, as well as the adaptive weighting multi-view loss fusion with the selective strategy.
Moreover, Figure 5 shows the view-weights of each dataset learned by our MvNNBiIn, where the x-axis denotes the indices of the different views and the y-axis denotes the weight of each view. A higher weight indicates that the view provides more valuable information and makes a larger contribution.
Comparison results of our MvNNBiIn and several deep convolutional neural network architectures on image datasets Caltech101, NUSOBJ, and SUN397.
|Method||Iterations||Batch size||Learning rate|
Actually, our MvNNBiIn is a generic framework that can improve the multi-view classification using not only the handcrafted features (such as HOG, LBP, or SURF) but also the deep model-learned features.
We apply four popular CNNs (including AlexNet , GoogLeNet , VGGNet-16 , and ResNet-101 ) on three publicly available image datasets (i.e., Caltech101, NUSOBJ, and SUN397), respectively, to generate the CNN feature representations including four transferred CNN features and four fine-tuned CNN features. Then, we compare our proposed MvNNBiIn method with the single-view CNN feature-based methods to demonstrate the superiority of our multi-view learning framework.
To be specific, for the transferred CNN feature-based methods, four off-the-shelf CNN models including VGGNet-16, ResNet-101, AlexNet, and GoogLeNet are first adopted as general feature extractors to extract CNN features and then linear one-versus-all SVMs (=0.001) are used for classification. For the fine-tuned CNN feature-based methods, we fine-tune the aforementioned four CNN models on the training datasets to extract better CNN features and then adopt linear one-versus-all SVMs (=0.001) for classification. For our proposed MvNNBiIn method, we regard four transferred CNN features as four views for each image and apply them into MvNNBiIn to perform the multi-view classification. Similarly, our proposed MvNNBiIn method is also performed on four fine-tuned CNN features.
The experimental results are shown in Table V, and the experimental settings of the fine-tuned CNN models are shown in Tables VI, VII, and VIII. It can be seen that our proposed MvNNBiIn method outperforms the single-view CNN feature-based methods on both transferred and fine-tuned CNN features, and on average achieves 7.577% (AlexNet), 5.551% (GoogLeNet), 3.087% (ResNet-101), and 6.212% (VGGNet-16) improvements on the Caltech101 dataset. These results demonstrate the superiority of our proposed multi-view learning framework.
In this paper, we propose a novel multi-view learning framework denoted as MvNNBiIn which seamlessly embeds various intra-view information and cross-view multi-dimension bilinear interactive information as well as introducing a new view ensemble mechanism to jointly make decisions during the multi-view classification. Extensive experiments on several publicly available datasets demonstrate the effectiveness of our proposed MvNNBiIn method. Furthermore, we demonstrate the superiority of multi-view learning using the CNN feature representations, which provides a novel idea of fusing outputs of different deterministic neural networks in further work.
-  G. Cheng, J. Han, and X. Lu, “Remote sensing image scene classification: Benchmark and state of the art,” Proceedings of the IEEE, vol. 105, no. 10, pp. 1865–1883, 2017.
-  J. Han, X. Yao, G. Cheng, X. Feng, and D. Xu, “P-cnn: Part-based convolutional neural networks for fine-grained visual categorization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, DOI: 10.1109/TPAMI.2019.2933510.
-  J. Xu, F. Nie, and J. Han, “Feature selection via scaling factor integrated multi-class support vector machines,” in International Joint Conference on Artificial Intelligence, 2017.
-  C. Du, C. Du, L. Huang, and H. He, “Reconstructing perceived images from human brain activities with bayesian deep multiview learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 8, pp. 2310–2323, 2018.
-  Y. Yu, S. Tang, K. Aizawa, and A. Aizawa, “Category-based deep cca for fine-grained venue discovery from multimodal data,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 4, pp. 1250–1258, 2018.
-  D. Lian, L. Hu, W. Luo, Y. Xu, L. Duan, J. Yu, and S. Gao, “Multiview multitask gaze estimation with deep convolutional neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 10, pp. 3010–3023, 2018.
-  S. Zhang, X. Yu, Y. Sui, S. Zhao, and L. Zhang, “Object tracking with multi-view support vector machines,” IEEE Transactions on Multimedia, vol. 17, no. 3, pp. 265–278, 2015.
-  G. Cheng, J. Han, P. Zhou, and D. Xu, “Learning rotation-invariant and fisher discriminative convolutional neural networks for object detection,” IEEE Transactions on Image Processing, vol. 28, no. 1, pp. 265–278, 2019.
-  J. Peng, A. J. Aved, G. Seetharaman, and K. Palaniappan, “Multiview boosting with information propagation for classification,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 3, pp. 657–669, 2017.
-  C.-M. Feng, Y. Xu, J.-X. Liu, Y.-L. Gao, and C.-H. Zheng, “Supervised discriminative sparse pca for com-characteristic gene selection and tumor classification on multiview biological data,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 10, pp. 2926–2937, 2019.
-  G. E. Hinton, S. Osindero, and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
-  J. Zhao, X. Xie, X. Xu, and S. Sun, “Multi-view learning overview: Recent progress and new challenges,” Information Fusion, vol. 38, pp. 43–54, 2017.
-  J. Xu, J. Han, F. Nie, and X. Li, “Multi-view scaling support vector machines for classification and feature selection,” IEEE Transactions on Knowledge and Data Engineering, DOI: 10.1109/TKDE.2019.2904256.
-  X. Xie and S. Sun, “Multi-view support vector machines with the consensus and complementarity information,” IEEE Transactions on Knowledge and Data Engineering, DOI: 10.1109/TKDE.2019.2933511.
-  J. Tang, Y. Tian, P. Zhang, and X. Liu, “Multiview privileged support vector machines,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 8, pp. 3463–3477, 2017.
-  H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, no. 3/4, pp. 321–377, 1936.
-  F. R. Bach and M. I. Jordan, “Kernel independent component analysis,” Journal of Machine Learning Research, vol. 3, no. 1, pp. 1–48, 2002.
-  D. R. Hardoon, S. Szedmak, and J. Shawetaylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural Computation, vol. 16, no. 12, pp. 2639–2664, 2004.
-  S. Sun, “A survey of multi-view machine learning,” Neural Computing and Applications, vol. 23, no. 7-8, pp. 2031–2038, 2013.
-  P. L. Lai and C. Fyfe, “A neural implementation of canonical correlation analysis,” Neural Networks, vol. 12, no. 10, pp. 1391–1397, 1999.
-  W. W. Hsieh, “Nonlinear canonical correlation analysis by neural networks,” Neural Networks, vol. 13, no. 10, pp. 1095–1105, 2000.
-  R. Salakhutdinov and G. E. Hinton, “Deep boltzmann machines,” in International Conference on Artificial Intelligence and Statistics, 2009.
-  Y. Bengio, A. Courville, and P. Vincent, “Representation learning: a review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
-  G. Andrew, R. Arora, J. Bilmes, and K. Livescu, “Deep canonical correlation analysis,” in International Conference on Machine Learning, 2013.
-  W. Wang, R. Arora, K. Livescu, and J. A. Bilmes, “On deep multi-view representation learning,” in International Conference on Machine Learning, 2015.
-  F. Feng, X. Wang, and R. Li, “Cross-modal retrieval with correspondence autoencoder,” in ACM International Conference on Multimedia, 2014.
-  S. Rastegar, M. S. Baghshah, H. R. Rabiee, and S. M. Shojaee, “Mdl-cw: A multimodal deep learning framework with cross weights,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  T. Diethe, D. R. Hardoon, and J. Shawe-Taylor, “Constructing nonlinear discriminants from multiple data views,” in European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2010.
-  M. Kan, S. Shan, H. Zhang, S. Lao, and X. Chen, “Multi-view discriminant analysis,” in European Conference on Computer Vision, 2012.
-  S. Mika, A. Smola, and B. Scholkopf, “An improved training algorithm for kernel fisher discriminants,” in International Conference on Artificial Intelligence and Statistics, 2001.
-  J. B. Tenenbaum and W. T. Freeman, “Separating style and content with bilinear models,” Neural Computation, vol. 12, no. 6, pp. 1247–1283, 2000.
-  S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, “Graph embedding and extensions: a general framework for dimensionality reduction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 40–51, 2007.
-  A. Sharma, A. Kumar, H. Daume, and D. W. Jacobs, “Generalized multiview analysis: A discriminative latent space,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012.
-  J. Xu, W. Li, X. Liu, D. Zhang, J. Liu, and J. Han, “Deep embedded complementary and interactive information for multi-view classification,” in AAAI Conference on Artificial Intelligence, 2020.
-  N. Rasiwasia, J. C. Pereira, E. Coviello, G. Doyle, G. R. G. Lanckriet, R. Levy, and N. Vasconcelos, “A new approach to cross-modal multimedia retrieval,” in International Conference on Multimedia, 2010.
-  S. Liang, S. Ji, and J. Ye, “A least squares formulation for canonical correlation analysis,” in International Conference on Machine Learning, 2008.
-  H. Avron, C. Boutsidis, S. Toledo, and A. Zouzias, “Efficient dimensionality reduction for canonical correlation analysis,” in International Conference on Machine Learning, 2013.
-  M. B. Blaschko and C. H. Lampert, “Correlational spectral clustering,” in IEEE Conference on Computer Vision and Pattern Recognition, 2008.
-  K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan, “Multi-view clustering via canonical correlation analysis,” in International Conference on Machine Learning, 2009.
-  S. M. Kakade and D. P. Foster, “Multi-view regression via canonical correlation analysis,” in International Conference on Computational Learning Theory, 2007.
-  B. Mcwilliams, D. Balduzzi, and J. M. Buhmann, “Correlated random features for fast semi-supervised learning,” in Advances in Neural Information Processing Systems, 2013.
-  P. S. Dhillon, D. Foster, and L. Ungar, “Multi-view learning of word embeddings via cca,” in Advances in Neural Information Processing Systems, 2011.
-  P. S. Dhillon, J. Rodu, D. P. Foster, and L. H. Ungar, “Using cca to improve cca: A new spectral method for estimating vector models of words,” in International Conference on Machine Learning, 2012.
-  Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, “A multi-view embedding space for modeling internet images, tags, and their semantics,” International Journal of Computer Vision, vol. 106, no. 2, pp. 210–233, 2014.
-  T.-K. Kim, J. Kittler, and R. Cipolla, “Discriminative learning and recognition of image set classes using canonical correlations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1005–1018, 2007.
-  Y. Su, Y. Fu, X. Gao, and Q. Tian, “Discriminant learning through multiple principal angles for visual recognition,” IEEE Transactions on Image Processing, vol. 21, no. 3, pp. 1381–1390, 2012.
-  L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” Computer Vision and Image Understanding, vol. 106, no. 1, pp. 59–70, 2007.
-  Y. Li, F. Nie, H. Huang, and J. Huang, “Large-scale multi-view spectral clustering via bipartite graph,” in AAAI Conference on Artificial Intelligence, 2015.
-  C. H. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object classes by between-class attribute transfer,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009.
-  T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “Nus-wide: a real-world web image database from national university of singapore,” in ACM International Conference on Image and Video Retrieval, 2009.
-  M. Amini, N. Usunier, and C. Goutte, “Learning from multiple partially observed views-an application to multilingual text categorization,” in Advances in Neural Information Processing Systems, 2009.
-  J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010.
-  J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, “Sun database: Exploring a large collection of scene categories,” International Journal of Computer Vision, vol. 119, no. 1, pp. 3–22, 2016.
-  W. Wang, R. Arora, K. Livescu, and J. Bilmes, “On deep multi-view representation learning,” in International Conference on Machine Learning, 2015.
-  M. Dorfer, R. Kelz, and G. Widmer, “Deep linear discriminant analysis,” in International Conference on Learning Representations, 2016.
-  M. Kan, S. Shan, H. Zhang, S. Lao, and X. Chen, “Multi-view discriminant analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pp. 188–194, 2016.
-  M. Kan, S. Shan, and X. Chen, “Multi-view deep network for cross-view classification,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.