I Introduction
Classification is a fundamental machine learning technique that has been broadly applied in image classification [1, 2, 3], image reconstruction [4], cross-modal retrieval [5], gaze tracking [6], visual tracking [7, 8], multiview boosting [9], multiview biological data [10], and so on. Typical classification methods include Logistic Regression, K-Nearest Neighbor, Decision Tree, and Support Vector Machines (SVM). The kernel trick is an important innovation of SVM, since it not only allows a nonlinear model to be learned with a convergent convex optimization method but also makes the implementation of kernel functions (such as the Radial Basis Function (RBF)) highly efficient. Unfortunately, the computational cost of kernel methods is too high for large datasets. Deep learning methods have therefore been proposed to overcome the limitations of kernel methods. For example, [11] demonstrates that Neural Networks perform better than SVM with the RBF kernel on the MNIST dataset.
Recently, with the rapid development of data mining techniques, data in many scientific data analysis tasks are often collected in different ways, such as through various sensors, because a single way usually cannot comprehensively describe all the information of a data instance. In this case, the different kinds of features of each data instance can be considered as different views of this instance, where each view may capture a specific facet of it. Importantly, the combination of multiple views can provide a more comprehensive description of the instance, so multiview features can represent an instance more fully than single-view features. As an important branch of multiview learning [12], multiview classification [13, 14, 15] uses multiple distinct representations of data and models a multiview learning framework to perform the classification task.
Canonical Correlation Analysis (CCA) [16] and Kernel CCA (KCCA) [17, 18, 19] can effectively model the relationship between two or more views. However, they still have limitations in capturing high-level associations between different views. To be specific, CCA ignores the nonlinearities in multiview data, and KCCA may suffer from the effect of small data when data acquisition in one or more modalities is expensive or otherwise limited. Therefore, among the early efforts, [20] investigates a neural network implementation of CCA that maximizes the correlation between the outputs of the networks for each view. [21] formulates a nonlinear CCA method using three feedforward neural networks, where the first network maximizes the correlation between two canonical variates, while the remaining two networks map from the canonical variates back to the original two sets of variables. However, these CCA-based methods were proposed many years ago and leave room for improvement.
Inspired by the recent successes of deep neural networks [22, 23], correlation can be naturally applied to multiview neural network learning to learn deep and abstract multiview interactive information. The deep neural network extension of CCA (Deep CCA) [24] learns representations of two views by using multiple stacked layers and maximizes the correlation of the two representations. Later, [25] proposes a deep canonically correlated autoencoder (DCCAE) by combining the advantages of both Deep CCA and the autoencoder. Specifically, DCCAE consists of two autoencoders and optimizes the combination of the canonical correlation between the two learned bottleneck representations and the reconstruction errors of the autoencoders. Similar in principle to DCCAE, [26] proposes a correspondence autoencoder (CorrAE) by constructing correlations between the hidden representations of two unimodal deep autoencoders.
[27] suggests exploiting the cross weights between the representations of views to gradually learn the interactions of the modalities (views) in a multimodal deep autoencoder network. The theoretical analysis in [27] shows that considering these interactions from a high level provides more intramodality (intraview) information.
In addition, a number of multiview analysis methods [28, 29] have been proposed. Based on Fisher Discriminant Analysis (FDA) [30], both regularized two-view FDA and its kernel extension can be cast as equivalent disciplined convex optimization problems. Then, [28] introduces Multiview Fisher Discriminant Analysis (MFDA), which learns classifiers in multiple views by minimizing the variance of the data along the projection while maximizing the distance between the average outputs for the classes over all the views. However, MFDA can only be used for binary classification problems. [29] proposes Multiview Discriminant Analysis (MvDA), which seeks a discriminant common space by maximizing the between-class variations and minimizing the within-class variations across all the views. Later, based on bilinear models [31] and the general graph embedding framework [32], [33] introduces Generalized Multiview Analysis (GMA). As an example of GMA, Generalized Multiview Linear Discriminant Analysis (GMLDA) finds a set of projection directions in each view that tries to separate the class means of different contents and unify different views of the same class in the common subspace.
Based on the above analyses, although CCA-based deep neural networks can learn multiview interactive information and DA-based methods can consider discriminative information, there is no unified framework that simultaneously embeds the intraview information, the crossview interactive information, and a reliable multiview fusion strategy.
To tackle this issue, we propose a novel multiview learning framework named MvNNBiIn, which integrates both the intraview information and the crossview multidimension bilinear interactive information for each view and designs a multiview selective loss fusion. Specifically, (a) the intraview information, coming from differently faceted representations, ensures diversity and complementarity among the views to enhance multiview learning. (b) The crossview multidimension bilinear interactive information is learned dynamically from different bilinear similarities, each of which is calculated between the intraview information of two views via the bilinear function. The multidimension bilinear interactive information therefore comprehensively models the relationships between views by learning different metric matrices. (c) The multiview selective loss fusion calculates one loss per view and fuses these losses by adaptive weighting with a selective strategy. By tuning the sparseness of the weight vector, this selective strategy chooses several discriminative views that are beneficial to the decision in multiview classification.
It is worth mentioning that we have developed a preliminary work [34], named deep embedded complementary and interactive information for multiview classification (denoted as MvNNcor). In this paper, our proposed MvNNBiIn method significantly extends and improves MvNNcor. Their major differences are summarized as follows. On the one hand, MvNNcor utilizes the cross-correlations between attributes of multiple representations to generate interactive information. In contrast, our MvNNBiIn models a novel crossview multidimension bilinear interactive information, which consists of different bilinear similarities for each view with respect to another view. Each bilinear similarity is computed between the intraview information of two views via the bilinear function. On the other hand, MvNNcor uses a multiview fusion strategy to integrate multiple views, whereas our MvNNBiIn designs a novel view ensemble mechanism to select the more discriminative views that are beneficial to multiview classification.
We summarize the main contributions of the proposed MvNNBiIn method as follows.

We propose a unified framework, which seamlessly embeds intraview information, crossview multidimension bilinear interactive information, and a novel view ensemble mechanism, to make a decision during the optimization and improve the classification performance.

We model the crossview interactive information by capturing the multidimension bilinear interactive information which is calculated by simultaneously learning multiple metric matrices via the bilinear function between views. It comprehensively models the relationships between different views.

We develop a new view ensemble mechanism which not only selects some discriminative views but also fuses them via an adaptive weighting method. The selective strategy can ensure that the selected views are beneficial to the multiview classification.

We perform extensive experiments on several publicly available datasets to demonstrate the effectiveness of our model.
The rest of this paper is organized as follows. Section II introduces the preliminary knowledge. We formulate the framework of MvNNBiIn in Section III. Section IV briefly presents a tractable optimization method for MvNNBiIn. Section V evaluates MvNNBiIn on several public datasets, followed by some theoretical and empirical analyses of the experiments. Section VI gives the conclusion and future work.
II Preliminary Knowledge
In this section, we briefly review several multiview learning methods from the perspectives of CCA-based methods and the MvDA method.
II-A CCA-based Methods
CCA [16] is popular for its capability of modeling the relationship between two or more sets of variables. CCA computes a shared embedding of two or more sets of variables by maximizing the correlations among the variables across these sets. CCA has been widely used in multiview learning tasks to generate low-dimensional representations [35, 36]. Improved generalization performance has been witnessed in areas including dimensionality reduction [37], clustering [38, 39], regression [40, 41], word embeddings [42, 43, 44], and discriminant learning [45, 46].
Supposing that is a data matrix for the th view, CCA tends to find the linear projections which make the instances from two data matrices maximally correlated in the projected space. Therefore, CCA is modeled as the following constrained optimization problem,
(1) 
where are centralized.
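As a small illustration of the quantity that CCA maximizes, the following sketch centers two toy views, projects each onto a given direction, and computes the correlation of the two projections. The function names, the toy data, and the projection directions are assumptions made for illustration; an actual CCA solver would search for the directions instead of taking them as given.

```python
import math

def center(rows):
    """Subtract the per-feature mean so that each column has zero mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def project(rows, w):
    """Project every sample (row) onto the direction w."""
    return [sum(x * wi for x, wi in zip(row, w)) for row in rows]

def correlation(a, b):
    """Correlation of two already-centered 1-D projections."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den

# Toy data (hypothetical): four samples, two views with 2 and 3 features.
view1 = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]]
view2 = [[2.0, 5.0, 1.0], [4.0, 5.0, 0.0], [6.0, 5.0, 1.0], [8.0, 5.0, 0.0]]

w1 = [1.0, 0.0]        # assumed projection direction for view 1
w2 = [1.0, 0.0, 0.0]   # assumed projection direction for view 2

p1 = project(center(view1), w1)
p2 = project(center(view2), w2)
rho = correlation(p1, p2)
```

With these directions the two projections are proportional to each other, so the correlation reaches its maximum value; other directions would generally give a smaller value.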
KCCA is a kernel extension of CCA for pursuing maximally correlated nonlinear projections. Let denote the kernel matrix such that , where , , . is a centering matrix and denotes a vector of all ones. Therefore, KCCA is formulated as the following optimization problem,
(2) 
where and are the coefficient vectors optimized by KCCA.
As a DNN extension of CCA, DCCA utilizes two DNNs and to extract nonlinear features for each view, and then maximizes the canonical correlation between and . That is,
(3) 
where and are the CCA directions that project the DNN outputs, and
are the regularization parameters for the sample covariance estimation.
Inspired by both CCA and reconstruction-based objectives, DCCAE builds a model consisting of two autoencoders and optimizes the combination of the canonical correlation between the learned representations and the reconstruction errors of the autoencoders. That is,
(4) 
where and are the reconstruction networks for each view, and is the tradeoff parameter.
II-B MvDA Method
MvDA attempts to find linear transforms that project the samples from views to one discriminant common space, respectively, where the betweenclass variation is maximized while the withinclass variation is minimized. Defined as the samples from the th view, where is the th sample from the th view of the th class of dimension and denotes the number of classes, and is the number of samples from the th view of the th class.
The samples from views are then projected to the same common space by using viewspecific linear transforms. The projected results are denoted as . In the common space, according to our goal, the betweenclass variation from all views should be maximized while the withinclass variation from all views should be minimized. Therefore, the objective is formulated as a generalized Rayleigh quotient,
(5) 
where the withinclass scatter matrix and the betweenclass scatter matrix are computed as below,
(6) 
where denotes the number of samples of the th class in all views, is the number of samples from all the classes and all the views, is the mean of all the samples of the th class over all the views in the common space, is the mean of all the samples over all views in the common space. That is,
(7) 
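The role of the two scatter terms can be illustrated with a minimal sketch in which the common space is one-dimensional, so that both scatter matrices reduce to scalars and the generalized Rayleigh quotient becomes a plain ratio. The function name and the toy projections are assumptions for illustration; the view-specific transforms are taken as already applied.

```python
def scatter_ratio(projected):
    """Between-class over within-class scatter in a 1-D common space.

    `projected` maps each class label to the list of projected samples of
    that class, pooled over all views.
    """
    all_samples = [y for ys in projected.values() for y in ys]
    mu = sum(all_samples) / len(all_samples)      # mean over all classes and views
    s_within = 0.0
    s_between = 0.0
    for ys in projected.values():
        mu_i = sum(ys) / len(ys)                  # class mean over all views
        s_within += sum((y - mu_i) ** 2 for y in ys)
        s_between += len(ys) * (mu_i - mu) ** 2   # weighted by the class size
    return s_between / s_within

# Hypothetical projections: two classes, two views, two samples per view.
projected = {
    0: [0.0, 1.0, 0.0, 1.0],
    1: [4.0, 5.0, 4.0, 5.0],
}
ratio = scatter_ratio(projected)
```

A larger ratio means that the class means are far apart relative to the spread inside each class, which is the situation the MvDA objective favors.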
III The Proposed Method
In this section, we propose the architecture of our model MvNNBiIn, depicted in Figure 1, which consists of four parts: various intraview extracted networks, multidimension bilinear interactive modules between views, combination of intraview and crossview information, and the multiview selective loss fusion strategy.
III-A Various Intraview Information Extraction
Given an instance , we utilize views to denote its kinds of visual features, which ensures the diverse and complementary information during the multiview learning. Defining a set of neural networks , each projects from into , which captures the highlevel intraview information for the th view, that is,
(8) 
where and is a neural network with layers,
(9) 
where denotes the weight matrix and
denotes the bias vector, and
is the output of the th layer, and is the activation function applied componentwise. Specifically,
, , , , and . It is worth mentioning that are trained jointly by solving the optimization problem shown in Subsection III-D, where the parameters of are not shared with those of . Although we utilize a multilayer perceptron as
, it can also be replaced by any deterministic neural network with input and output layers.
III-B Multidimension Bilinear Interactive Information
Inspired by metric learning, the multidimension bilinear interactive information is proposed to construct second-order feature interactions on the various intraview information. To be specific, given the th view and the th view , the multidimension bilinear interactive information between different views can be formulated via the bilinear function , that is,
(10) 
where , , , and is the bias. There are multiple metric matrices (i.e., ) to be learned simultaneously. Intuitively, Figure 2 shows the construction of the multidimension bilinear interactive information of each pair of views.
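A minimal sketch of such a bilinear function is given below, assuming that each output dimension is the bilinear form of the two intraview representations under its own metric matrix plus a bias term. The function name, argument names, and toy values are hypothetical; in PyTorch, a layer of this form is available as `torch.nn.Bilinear`.

```python
def bilinear_interaction(h_v, h_u, matrices, biases):
    """Return one bilinear similarity h_v^T A_k h_u + b_k per metric matrix A_k."""
    out = []
    for A, b in zip(matrices, biases):
        # A has len(h_v) rows and len(h_u) columns.
        A_h_u = [sum(a * x for a, x in zip(row, h_u)) for row in A]
        out.append(sum(x * y for x, y in zip(h_v, A_h_u)) + b)
    return out

# Hypothetical intraview representations of two views.
h_v = [1.0, 2.0]
h_u = [3.0, 4.0]

# Two metric matrices give a two-dimensional interactive representation.
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
z = bilinear_interaction(h_v, h_u, [identity, swap], [0.0, 0.5])
```

Each metric matrix measures the similarity between the two views in a different way, so stacking several of them yields a multidimensional description of the same view pair.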
For the th view, we define as a set containing the view pairs relative to the th view, that is,
(11) 
where and each view pair is undirected. For the th view, the bilinear interactive information of the th view relative to other views is collected into a set ,
(12) 
where the th set contains view pairs for the th view and there are sets for views.
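The bookkeeping of view pairs can be sketched as follows, under the assumption that every view is paired with each of the other views, so that each set holds one pair fewer than the number of views. The function name and the zero-based view indices are hypothetical.

```python
def view_pair_sets(num_views):
    """For each view v, list the pairs (v, u) formed with every other view u."""
    return {
        v: [(v, u) for u in range(num_views) if u != v]
        for v in range(num_views)
    }

pairs = view_pair_sets(3)
# pairs[0] == [(0, 1), (0, 2)]: view 0 interacts with views 1 and 2.
```

Because a view pair is undirected, `(0, 1)` and `(1, 0)` denote the same pair; it simply appears once in the set of each of its two views.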
III-C Combination and Prediction of Each View
According to the above two subsections, we combine the intraview information and multidimension bilinear interactive information of the th view as follows,
(13) 
where Con denotes the concatenation operation.
After that, is passed through to obtain the prediction of the th view, that is,
(14) 
where is the number of classes and produces a distribution over the possible classes for each view. is a neural network with layers, that is,
(15) 
where denotes the weight matrix and denotes the bias vector, and is the output of the th layer, and is the activation function applied componentwise. In particular, , , , , and . Here, the parameters of are shared among views.
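The combination and prediction steps for one view can be sketched as follows. For brevity the prediction network is reduced to a single linear layer, and a softmax is assumed as the way of producing the class distribution; both are simplifications for illustration, and all names and values are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw class scores into a probability distribution."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_view(intra, interactions, weight, bias):
    """Concatenate intraview and interactive information, then classify."""
    features = list(intra)
    for z in interactions:                # one interactive vector per view pair
        features.extend(z)
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weight, bias)]
    return softmax(logits)

intra = [1.0, -1.0]                       # intraview information of this view
interactions = [[0.5], [2.0]]             # interactions with two other views
weight = [[1.0, 0.0, 0.0, 0.0],           # one row per class (3 classes)
          [0.0, 0.0, 0.0, 1.0],
          [0.0, 0.0, 0.0, 0.0]]
bias = [0.0, 0.0, 0.0]
probs = predict_view(intra, interactions, weight, bias)
```

Since the parameters of the prediction network are shared among views, the same `weight` and `bias` would be reused for every view; only the concatenated input differs.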
III-D Multiview Selective Loss Fusion Strategy
In this subsection, we design a novel view ensemble mechanism, which calculates multiple losses for multiple views and fuses them by adaptive weighting with a selective strategy. By tuning the sparseness of the weight vector, this selective strategy can choose several discriminative views that are beneficial to multiview classification. It is described as,
(16) 
where
(17) 
where is the weight for each view, denotes the common label information of an instance for all the views, and is the cross-entropy loss of the th view. is the power exponent parameter of the weight for the th view, which flexibly controls the weight distribution over different views and avoids the trivial solution of during the classification. is used to constrain the sparseness of the weight vector , where denotes the number of nonzero elements in . Crucially, the norm constraint captures the global relationship among different views and achieves viewwise sparsity, so that a few discriminative views are selected to improve the performance of multiview classification. Intuitively, the architecture of our proposed method MvNNBiIn is shown in Figure 1.
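The fused loss can be sketched as below, under the assumption that it is the sum over views of each view weight raised to the power exponent times the cross-entropy loss of that view, with a sparse, nonnegative weight vector that sums to one. The function names, the toy predictions, and the chosen weights are hypothetical.

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy loss of one predicted distribution for the true label."""
    return -math.log(probs[label])

def fused_loss(view_probs, label, weights, gamma):
    """Weighted fusion of per-view losses (assumed form: sum of w**gamma * loss)."""
    losses = [cross_entropy(p, label) for p in view_probs]
    total = sum((w ** gamma) * l for w, l in zip(weights, losses))
    return total, losses

# Hypothetical predictions of three views for one instance with true label 0.
view_probs = [[0.5, 0.5], [0.25, 0.75], [0.9, 0.1]]
weights = [0.5, 0.0, 0.5]     # sparse weights: the second view is not selected
total, losses = fused_loss(view_probs, label=0, weights=weights, gamma=2.0)
```

A view with zero weight contributes nothing to the fused loss, which is how sparsity in the weight vector acts as view selection.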
IV Optimization
We utilize the alternating optimization method to update the parameters and the view weight , respectively. For convenience, denotes the bilinear function set and is used to denote the concatenation operation in Figure 3.
IV-A Update the Network Parameters
We fix the view weights and update the parameters of the intraview networks, the bilinear function set, and the prediction network, which are trained using Adam with batch normalization and the autograd package in PyTorch. Figure 3 briefly shows the gradient computations of our proposed MvNNBiIn method.

| Views | Caltech101 Fea | Caltech101 Dim | Caltech20 Fea | Caltech20 Dim | AWA Fea | AWA Dim | NUSOBJ Fea | NUSOBJ Dim | Reuters Fea | Reuters Dim | SUN Fea | SUN Dim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gabor | 48 | Gabor | 48 | CH | 2688 | CH | 64 | English | 21531 | GIST | 256 |
| 2 | WM | 40 | WM | 40 | LSS | 2000 | CM | 225 | French | 24892 | GEOMAP | 512 |
| 3 | CENT | 254 | CENT | 254 | PHOG | 252 | CORR | 144 | German | 34251 | TEXTON | 512 |
| 4 | HOG | 1984 | HOG | 1984 | SIFT | 2000 | EDH | 73 | Italian | 15506 | / | / |
| 5 | GIST | 512 | GIST | 512 | RGSIFT | 2000 | WT | 128 | Spanish | 11547 | / | / |
| 6 | LBP | 928 | LBP | 928 | SURF | 2000 | / | / | / | / | / | / |

| Statistic | Caltech101 | Caltech20 | AWA | NUSOBJ | Reuters | SUN |
|---|---|---|---|---|---|---|
| # Samples | 9914 | 2386 | 30475 | 30000 | 18758 | 99250 |
| # Classes | 102 | 20 | 50 | 31 | 6 | 397 |

TABLE I: The descriptions of six publicly available datasets, i.e., Caltech101, Caltech20, AWA, NUSOBJ, Reuters, and SUN. Some abbreviations are defined as follows, Fea: public Feature, Dim: Dimensionality, WM: Wavelet Moments, CENT: CENTRIST, CH: Color Histogram, LSS: Local Self-Similarity, PHOG: Pyramid HOG, RGSIFT: Color SIFT, CM: blockwise Color Moments, CORR: color Correlogram, EDH: Edge Direction Histogram, WT: Wavelet Texture, GEOMAP: Geometric Map.
IV-B Update the View Weights
We learn a nonnegative normalized weight for each view and assign higher weights to more discriminative views. Therefore, we fix the network parameters and update the view weights by solving the optimization problem (16).
To efficiently minimize the problem (16), we define a function on , that is,
(18) 
where is a permutation matrix which results in permuting elements of along the ascending order, i.e., . Based on equation (18), we select the first smallest elements and optimize their corresponding weights , meanwhile, setting the rest weights as zeros. Therefore, the problem (16) is equivalent to the following problem by absorbing the norm constraint into the objective function,
(19) 
Through the Lagrangian Multiplier method, the Lagrangian function of problem (19) is:
(20) 
where is the Lagrangian multiplier. Taking the derivatives of with respect to and , respectively, and setting them to zeros, there is,
(21) 
where and is the sparsity of .
To sum up, the optimal solution of problem (16) can be calculated by,
(22) 
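The weight update can be sketched as follows, assuming the objective is the sum over views of each weight raised to a power exponent greater than one times the (positive) loss of that view, subject to nonnegative weights that sum to one and a fixed number of nonzero entries. Under this assumption, setting the derivative of the Lagrangian to zero makes each selected weight proportional to the loss raised to the power 1/(1 - exponent). The function and argument names are hypothetical.

```python
def update_view_weights(losses, gamma, num_selected):
    """Closed-form weights for the num_selected views with the smallest losses."""
    if gamma <= 1.0:
        raise ValueError("gamma must be greater than 1 for this closed form")
    # Sort the view indices by ascending loss and keep the first num_selected.
    order = sorted(range(len(losses)), key=lambda v: losses[v])
    chosen = order[:num_selected]
    power = 1.0 / (1.0 - gamma)
    raw = {v: losses[v] ** power for v in chosen}   # losses are assumed positive
    norm = sum(raw.values())
    # Selected views get normalized weights; all other views get zero.
    return [raw[v] / norm if v in raw else 0.0 for v in range(len(losses))]

weights = update_view_weights([1.0, 2.0, 4.0], gamma=2.0, num_selected=2)
```

In this toy case the view with the largest loss is dropped and the two remaining views receive weights inversely proportional to their losses, so the view with the smaller loss contributes more.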
Therefore, the scheme for MvNNBiIn can be summarized as follows. Firstly, the intraview networks are used to learn the complementary and diverse intraview information from multiview representations. Secondly, the intraview information is used for training the bilinear function set, which captures the crossview interactive information. Thirdly, the intraview and crossview interactive information for each view is integrated by a concatenation operation, and the concatenation is then fed into the prediction network to calculate the cross-entropy loss. Finally, the intraview networks, the bilinear function set, and the prediction network are optimized by solving problem (16) and the weight distribution over views is obtained, which helps infer the label during multiview classification.
V Experiments
In this section, we evaluate the performance of our proposed MvNNBiIn method on six publicly available datasets.
V-A Datasets
We follow many state-of-the-art multiview methods, which treat different kinds of pre-extracted feature vectors of an image as different views, to make the experimental comparisons fair. Different kinds of pre-extracted features present different facets of the images, such as color, texture, shape, and geographic information, although this information is not intuitive. We describe all the publicly available datasets in TABLE I.
Specifically, the Caltech101 [47, 48] and Caltech20 [48] datasets consist of images of objects belonging to 102 classes (101 classes plus one background clutter class) and 20 classes, respectively, and each dataset is described by 6 views. The AWA (Animals with Attributes) [49] dataset provides 6 kinds of features (6 views) for attribute-based classification. The NUSOBJ dataset is a subset of NUSWIDE [50], which describes each object image using 5 types of low-level features (5 views). The Reuters [51] dataset is used for document categorization and is written in 5 different languages (5 views). SUN is a subset of SUN397 (Scene Categorization Benchmark) [52, 53] and utilizes 3 kinds of public feature matrices (3 views) to represent each image.
Referring to the previous works [24, 54], we split each dataset into three parts, that is, 70% of the samples for training, 20% for validation, and 10% for testing. We utilize the classification accuracy (e.g., Top@1 accuracy and Top@5 accuracy) to evaluate the performance of all the methods and report the final results in Tables II, III, and IV.
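The evaluation protocol can be sketched as follows: a random 70/20/10 split of the sample indices and a Top@k accuracy that counts a prediction as correct when the true label is among the k highest-scoring classes. The function names, the random seed, and the toy scores are assumptions for illustration.

```python
import random

def split_indices(n, seed=0):
    """Shuffle n sample indices and split them 70% / 20% / 10%."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = (7 * n) // 10
    n_val = (2 * n) // 10
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k top-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)

train_idx, val_idx, test_idx = split_indices(100)

# Hypothetical class scores for three test samples and their true labels.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

Top@k accuracy can only grow with k, which is why the Top@5 values are never lower than the Top@1 values for the same model.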
V-B Experimental Settings
V-B1 Comparison Methods
We first compare our MvNNBiIn method with several state-of-the-art multiview methods, including SVMcon, DCCA [24], DCCAE [54], DeepLDA [55], MvDA [56], MvDN [57], and MvNNcor [34], in Table III. In particular, SVMcon is a baseline that concatenates all the views and feeds the result into an SVM classifier. DCCA and DCCAE are CCA-based methods that take two views as input. DeepLDA, MvDA, and MvDN are Discriminant Analysis based methods: DeepLDA feeds the concatenation of all the views into a deep neural network followed by LDA; MvDA takes multiple views with the same dimensionality as input; MvDN is the nonlinear version of LDA, which uses deep neural networks in place of the linear transformations.
Moreover, Table IV demonstrates the effectiveness of the three important parts of our proposed MvNNBiIn, i.e., the intraview information, the crossview bilinear interactive information, and the novel view ensemble mechanism. The highest performance is obtained when all the parts are present, while the performance is lower when any part is absent.
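| Bilinear output dim. | Caltech101 Top@1 | Caltech101 Top@5 | Caltech20 Top@1 | Caltech20 Top@5 | AWA Top@1 | AWA Top@5 | NUSOBJ Top@1 | NUSOBJ Top@5 | Reuters Top@1 | Reuters Top@5 | SUN Top@1 | SUN Top@5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 | 76.228 | 87.277 | 97.397 | 98.958 | 48.210 | 74.977 | 51.529 | 84.142 | 89.116 | 99.838 | 43.770 | 71.069 |
| 100 | 75.781 | 88.058 | 97.397 | 98.958 | 47.428 | 74.740 | 51.529 | 84.774 | 89.224 | 99.731 | 46.200 | 72.742 |
| 200 | 76.562 | 89.063 | 98.958 | 99.473 | 49.316 | 75.521 | 51.529 | 84.843 | 90.194 | 99.838 | 47.036 | 73.458 |
| 400 | / | / | / | / | / | / | 51.496 | 84.441 | / | / | / | / |

TABLE II: Classification accuracy (%) for different dimensionalities of the output of the bilinear function.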
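| Method | Caltech101 | Caltech20 | AWA | NUSOBJ | Reuters | SUN |
|---|---|---|---|---|---|---|
| SVMcon | 47.901 | 83.827 | 31.044 | 42.719 | 88.180 | 38.200 |
| DeepLDA | 45.649 | 76.508 | 25.598 | 20.320 | 84.907 | / |
| MvDA | 45.200 | 76.276 | 9.788 | 11.457 | 78.831 | / |
| DCCA | 66.159 | 86.504 | 20.677 | 28.753 | 64.917 | 16.116 |
| DCCAE | 26.894 | 50.267 | 13.484 | 27.477 | 56.530 | / |
| MvDN | 70.214 | 94.833 | 42.359 | 47.241 | 88.339 | 40.538 |
| MvNNcor | 76.002 | 97.924 | 47.687 | 52.049 | 89.276 | 45.632 |
| MvNNBiIn | 76.562 | 98.958 | 49.316 | 51.529 | 90.194 | 47.036 |

TABLE III: Classification accuracy (%) of the compared methods on the six datasets.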
✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓  
✓  ✓  ✓  ✓  ✓  ✓  
✓  ✓  ✓  ✓  
✓  ✓  
✓  ✓  
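| Dataset | Config. 1 | Config. 2 | Config. 3 | Config. 4 | Config. 5 | Config. 6 | Config. 7 | Config. 8 (full model) |
|---|---|---|---|---|---|---|---|---|
| Caltech101 | 60.044 | 73.237 | 49.330 | 72.210 | 73.750 | 74.038 | 75.781 | 76.562 |
| Caltech20 | 94.792 | 96.096 | 82.292 | 97.917 | 96.615 | 97.917 | 98.438 | 98.958 |
| AWA | 37.341 | 43.424 | 36.003 | 41.862 | 45.427 | 46.953 | 48.893 | 49.316 |
| NUSOBJ | 48.847 | 50.366 | 36.469 | 47.739 | 48.055 | 50.673 | 51.130 | 51.529 |
| Reuters | 88.578 | 88.766 | 88.739 | 88.793 | 89.170 | 89.536 | 89.978 | 90.194 |
| SUN | 20.605 | 39.798 | 31.341 | 3.942 | 30.655 | 40.936 | 46.794 | 47.036 |

TABLE IV: Classification accuracy (%) in the ablation studies of MvNNBiIn.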
V-B2 Parameter Setup
Deep LDA is a fully connected neural network consisting of three hidden layers, i.e., 400, 200, and 300 units equipped with the ReLU activation function. In DCCAE, there is a feature extraction network and a reconstruction network for each view, where each network is a fully connected network consisting of three hidden layers, i.e., 400, 200, and 300 units equipped with the ReLU activation function, followed by a linear output of (for feature extraction network) / (for reconstruction network) units. The capacities of the above networks are the same as those of their counterparts in DCCA and MvDN.
In our MvNNBiIn, two kinds of networks and the bilinear function set need to be learned. Each is a fully connected network consisting of two hidden layers (i.e., 400 and 200 units equipped with the ReLU activation function). consists of 200 input units and 300 hidden units equipped with the ReLU activation function, followed by a linear output of units. Each is a bilinear function which outputs bilinear interactive units from each pair . That is, the inputs of are the outputs of and , and the input of is the concatenation of the outputs of and . The capacities of the above networks are the same as those of their counterparts in the ablation studies shown in Table IV. In MvNNcor, three kinds of networks need to be learned, where are the same as the counterparts of MvNNBiIn, and consists of 200 input units and 200 hidden units with the ReLU activation function.
In this paper, all the networks are optimized by Adam with batch normalization, where the learning rate is 10, , , and batch size is 64. In addition, we vary and , respectively, to explore the influence of different values of and on the classification accuracy. Based on the optimal and , we can learn the optimal model to achieve the highest classification accuracy. The results are shown in Figure 4 where Caltech101, Caltech20, AWA, NUSOBJ, Reuters, and SUN datasets can achieve the best performance when are set as , , , , , and , respectively.
Besides, to explore the proper combination proportion of the intraview information and the multidimension bilinear interactive information for each view, we investigate the dimensionality of the output of the bilinear function, setting it to 50, 100, 200, and 400, respectively, and report the results in Table II. It can be seen that the classification accuracy is the highest when the dimensionality is 200, which shows that the combination proportion also has an impact on the classification performance.
V-C Experimental Results
Tables III and IV show the classification performance of all the methods, where Table III reports the experimental results of several recent methods and Table IV provides the results of ablation experiments.
Firstly, compared with the single-view methods, i.e., SVMcon and DeepLDA, our MvNNBiIn consistently outperforms them on all the datasets. For example, MvNNBiIn achieves improvements of 28.661 and 30.913 percentage points, respectively, on the Caltech101 dataset. That is because the concatenation of all the views may reduce the interpretability of different views and ignores the crossview interactive information during multiview classification. We also compare our MvNNBiIn with MvDA, and the classification accuracy of our MvNNBiIn is better than that of MvDA on all the datasets. For instance, our MvNNBiIn obtains an improvement of 31.362 percentage points on the Caltech101 dataset, since the linear transformations of MvDA cannot deal well with some subtle but important structures in challenging scenarios. Compared to MvDN, our MvNNBiIn achieves an improvement of 6.348 percentage points on the Caltech101 dataset by embedding the multidimension bilinear interactive information between different views.
Secondly, we compare our MvNNBiIn with the CCA-based methods, i.e., DCCA and DCCAE, and the results are reported in Table III. It can be seen that our MvNNBiIn performs better than DCCA and DCCAE, since these two methods are limited to two-view input and are unable to capture more diverse and complementary information from more views. For example, compared to DCCA and DCCAE, our MvNNBiIn achieves improvements of 28.639 and 35.832 percentage points on the AWA dataset, respectively.
Thirdly, as an extension of MvNNcor, our MvNNBiIn achieves better performance on all the datasets except NUSOBJ, where MvNNcor is slightly higher (52.049 versus 51.529 in Table III). For example, compared with MvNNcor on the AWA dataset, MvNNBiIn achieves an improvement of 1.629 percentage points. One reason is that the crossview multidimension bilinear interactive information of our MvNNBiIn is better able to capture the interactive information between different views. Another reason is that our MvNNBiIn designs a novel view ensemble mechanism, which can select more discriminative views and is more beneficial to multiview classification.
Besides, in the ablation studies of our MvNNBiIn shown in Table IV, taking the NUSOBJ dataset as an example, we achieve improvements of 2.682, 1.163, 15.060, 3.790, 3.474, 0.856, and 0.399 percentage points over the seven variants of our framework that contain only some of the modules, in the column order of Table IV. These results demonstrate the effectiveness of integrating the intraview information and the multidimension bilinear interactive information between views as well as the adaptive weighting multiview loss fusion with the selective strategy.
Moreover, Figure 5 shows the view weights of each dataset learned by our MvNNBiIn, where the horizontal axis gives the indices of the different views and the vertical axis gives the weight of each view. A higher weight indicates that the view provides more valuable information and makes a larger contribution.
V-D Discussion

| Methods | Caltech101 Pretrained | Caltech101 Finetuned | NUSOBJ Pretrained | NUSOBJ Finetuned | SUN397 Pretrained | SUN397 Finetuned |
|---|---|---|---|---|---|---|
| AlexNet | 85.860 | 87.830 | 59.240 | 58.980 | 41.250 | 42.590 |
| GoogLeNet | 88.050 | 89.690 | 63.620 | 64.990 | 46.840 | 47.960 |
| ResNet101 | 90.240 | 92.430 | 69.690 | 70.260 | 55.320 | 55.600 |
| VGGNet16 | 86.180 | 90.240 | 64.720 | 67.570 | 48.190 | 50.380 |
| MvNNBiIn | 93.752 | 95.091 | 70.513 | 71.384 | 56.404 | 56.632 |

TABLE V: Comparison results of our MvNNBiIn and several deep convolutional neural network architectures on image datasets Caltech101, NUSOBJ, and SUN397.
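| Method | Iterations | Batch size | Learning rate | Optimizer |
|---|---|---|---|---|
| AlexNet | 50000 | 256 | 0.0001 | SGD |
| GoogLeNet | 50000 | 32 | 0.001 | SGD |
| ResNet101 | 50000 | 8 | 0.00001 | SGD |
| VGGNet16 | 50000 | 50 | 0.0001 | SGD |

TABLE VI: Experimental settings of the finetuned CNN models.

| Method | Iterations | Batch size | Learning rate | Optimizer |
|---|---|---|---|---|
| AlexNet | 15000 | 256 | 0.00001 | SGD |
| GoogLeNet | 30000 | 64 | 0.00001 | SGD |
| ResNet101 | 14000 | 8 | 0.00001 | SGD |
| VGGNet16 | 15000 | 50 | 0.00001 | SGD |

TABLE VII: Experimental settings of the finetuned CNN models.

| Method | Iterations | Batch size | Learning rate | Optimizer |
|---|---|---|---|---|
| AlexNet | 50000 | 256 | 0.00001 | SGD |
| GoogLeNet | 50000 | 64 | 0.00001 | SGD |
| ResNet101 | 50000 | 8 | 0.00001 | SGD |
| VGGNet16 | 50000 | 32 | 0.00001 | SGD |

TABLE VIII: Experimental settings of the finetuned CNN models.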
Actually, our MvNNBiIn is a generic framework that can improve the multiview classification using not only the handcrafted features (such as HOG, LBP, or SURF) but also the deep modellearned features.
We apply four popular CNNs (including AlexNet [58], GoogLeNet [59], VGGNet16 [60], and ResNet101 [61]) on three publicly available image datasets (i.e., Caltech101, NUSOBJ, and SUN397), respectively, to generate the CNN feature representations including four transferred CNN features and four finetuned CNN features. Then, we compare our proposed MvNNBiIn method with the singleview CNN featurebased methods to demonstrate the superiority of our multiview learning framework.
To be specific, for the transferred CNN featurebased methods, four offtheshelf CNN models including VGGNet16, ResNet101, AlexNet, and GoogLeNet are first adopted as general feature extractors to extract CNN features and then linear oneversusall SVMs (=0.001) are used for classification. For the finetuned CNN featurebased methods, we finetune the aforementioned four CNN models on the training datasets to extract better CNN features and then adopt linear oneversusall SVMs (=0.001) for classification. For our proposed MvNNBiIn method, we regard four transferred CNN features as four views for each image and apply them into MvNNBiIn to perform the multiview classification. Similarly, our proposed MvNNBiIn method is also performed on four finetuned CNN features.
The experimental results are shown in Table V, and the experimental settings of the fine-tuned CNN models are shown in Tables VI–VIII. The proposed MvNNBiIn method outperforms each of the single-view CNN feature-based methods on both transferred and fine-tuned CNN features; on the Caltech101 dataset, it achieves average improvements of 7.577% (AlexNet), 5.551% (GoogLeNet), 3.087% (ResNet101), and 6.212% (VGGNet16). These results demonstrate the superiority of the proposed multiview learning framework.
VI Conclusion
In this paper, we propose a novel multiview learning framework, denoted MvNNBiIn, which seamlessly embeds various intra-view information and cross-view multi-dimension bilinear interactive information, and introduces a new view ensemble mechanism to jointly make decisions during multiview classification. Extensive experiments on several publicly available datasets demonstrate the effectiveness of the proposed MvNNBiIn method. Furthermore, we demonstrate the superiority of multiview learning with CNN feature representations, which suggests fusing the outputs of different deterministic neural networks as a direction for future work.
References

 [1] G. Cheng, J. Han, and X. Lu, “Remote sensing image scene classification: Benchmark and state of the art,” Proceedings of the IEEE, vol. 105, no. 10, pp. 1865–1883, 2017.
 [2] J. Han, X. Yao, G. Cheng, X. Feng, and D. Xu, “P-CNN: Part-based convolutional neural networks for fine-grained visual categorization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, DOI: 10.1109/TPAMI.2019.2933510.
 [3] J. Xu, F. Nie, and J. Han, “Feature selection via scaling factor integrated multiclass support vector machines,” in International Joint Conference on Artificial Intelligence, 2017.
 [4] C. Du, C. Du, L. Huang, and H. He, “Reconstructing perceived images from human brain activities with bayesian deep multiview learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 8, pp. 2310–2323, 2018.
 [5] Y. Yu, S. Tang, K. Aizawa, and A. Aizawa, “Categorybased deep cca for finegrained venue discovery from multimodal data,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 4, pp. 1250–1258, 2018.
 [6] D. Lian, L. Hu, W. Luo, Y. Xu, L. Duan, J. Yu, and S. Gao, “Multiview multitask gaze estimation with deep convolutional neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 10, pp. 3010–3023, 2018.
 [7] S. Zhang, X. Yu, Y. Sui, S. Zhao, and L. Zhang, “Object tracking with multiview support vector machines,” IEEE Transactions on Multimedia, vol. 17, no. 3, pp. 265–278, 2015.
 [8] G. Cheng, J. Han, P. Zhou, and D. Xu, “Learning rotationinvariant and fisher discriminative convolutional neural networks for object detection,” IEEE Transactions on Image Processing, vol. 28, no. 1, pp. 265–278, 2019.
 [9] J. Peng, A. J. Aved, G. Seetharaman, and K. Palaniappan, “Multiview boosting with information propagation for classification,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 3, pp. 657–669, 2017.
 [10] C.M. Feng, Y. Xu, J.X. Liu, Y.L. Gao, and C.H. Zheng, “Supervised discriminative sparse pca for comcharacteristic gene selection and tumor classification on multiview biological data,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 10, pp. 2926–2937, 2019.
 [11] G. E. Hinton, S. Osindero, and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
 [12] J. Zhao, X. Xie, X. Xu, and S. Sun, “Multiview learning overview: Recent progress and new challenges,” Information Fusion, vol. 38, pp. 43–54, 2017.
 [13] J. Xu, J. Han, F. Nie, and X. Li, “Multiview scaling support vector machines for classification and feature selection,” IEEE Transactions on Knowledge and Data Engineering, DOI: 10.1109/TKDE.2019.2904256.
 [14] X. Xie and S. Sun, “Multiview support vector machines with the consensus and complementarity information,” IEEE Transactions on Knowledge and Data Engineering, DOI: 10.1109/TKDE.2019.2933511.
 [15] J. Tang, Y. Tian, P. Zhang, and X. Liu, “Multiview privileged support vector machines,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 8, pp. 3463–3477, 2017.
 [16] H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, no. 3/4, pp. 321–377, 1936.

 [17] F. R. Bach and M. I. Jordan, “Kernel independent component analysis,” Journal of Machine Learning Research, vol. 3, no. 1, pp. 1–48, 2002.
 [18] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural Computation, vol. 16, no. 12, pp. 2639–2664, 2004.
 [19] S. Sun, “A survey of multiview machine learning,” Neural Computing and Applications, vol. 23, no. 7–8, pp. 2031–2038, 2013.
 [20] P. L. Lai and C. Fyfe, “A neural implementation of canonical correlation analysis,” Neural Networks, vol. 12, no. 10, pp. 1391–1397, 1999.
 [21] W. W. Hsieh, “Nonlinear canonical correlation analysis by neural networks,” Neural Networks, vol. 13, no. 10, pp. 1095–1105, 2000.

 [22] R. Salakhutdinov and G. E. Hinton, “Deep boltzmann machines,” in International Conference on Artificial Intelligence and Statistics, 2009.
 [23] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: a review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
 [24] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, “Deep canonical correlation analysis,” in International Conference on Machine Learning, 2013.
 [25] W. Wang, R. Arora, K. Livescu, and J. A. Bilmes, “On deep multiview representation learning,” in International Conference on Machine Learning, 2015.

 [26] F. Feng, X. Wang, and R. Li, “Crossmodal retrieval with correspondence autoencoder,” in ACM International Conference on Multimedia, 2014.
 [27] S. Rastegar, M. S. Baghshah, H. R. Rabiee, and S. M. Shojaee, “Mdlcw: A multimodal deep learning framework with cross weights,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 [28] T. Diethe, D. R. Hardoon, and J. Shawe-Taylor, “Constructing nonlinear discriminants from multiple data views,” in European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2010.
 [29] M. Kan, S. Shan, H. Zhang, S. Lao, and X. Chen, “Multiview discriminant analysis,” in European Conference on Computer Vision, 2012.
 [30] S. Mika, A. Smola, and B. Scholkopf, “An improved training algorithm for kernel fisher discriminants,” in International Conference on Artificial Intelligence and Statistics, 2001.
 [31] J. B. Tenenbaum and W. T. Freeman, “Separating style and content with bilinear models,” Neural Computation, vol. 12, no. 6, pp. 1247–1283, 2000.
 [32] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, “Graph embedding and extensions: a general framework for dimensionality reduction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 40–51, 2007.
 [33] A. Sharma, A. Kumar, H. Daume, and D. W. Jacobs, “Generalized multiview analysis: A discriminative latent space,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012.
 [34] J. Xu, W. Li, X. Liu, D. Zhang, J. Liu, and J. Han, “Deep embedded complementary and interactive information for multiview classification,” in AAAI Conference on Artificial Intelligence, 2020.
 [35] N. Rasiwasia, J. C. Pereira, E. Coviello, G. Doyle, G. R. G. Lanckriet, R. Levy, and N. Vasconcelos, “A new approach to crossmodal multimedia retrieval,” in International Conference on Multimedia, 2010.
 [36] S. Liang, S. Ji, and J. Ye, “A least squares formulation for canonical correlation analysis,” in International Conference on Machine Learning, 2008.
 [37] H. Avron, C. Boutsidis, S. Toledo, and A. Zouzias, “Efficient dimensionality reduction for canonical correlation analysis,” in International Conference on Machine Learning, 2013.

 [38] M. B. Blaschko and C. H. Lampert, “Correlational spectral clustering,” in IEEE Conference on Computer Vision and Pattern Recognition, 2008.
 [39] K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan, “Multiview clustering via canonical correlation analysis,” in International Conference on Machine Learning, 2009.
 [40] S. M. Kakade and D. P. Foster, “Multiview regression via canonical correlation analysis,” in Conference on Learning Theory, 2007.

 [41] B. McWilliams, D. Balduzzi, and J. M. Buhmann, “Correlated random features for fast semisupervised learning,” in Advances in Neural Information Processing Systems, 2013.
 [42] P. S. Dhillon, D. Foster, and L. Ungar, “Multiview learning of word embeddings via cca,” in Advances in Neural Information Processing Systems, 2011.
 [43] P. S. Dhillon, J. Rodu, D. P. Foster, and L. H. Ungar, “Using cca to improve cca: A new spectral method for estimating vector models of words,” in International Conference on Machine Learning, 2012.
 [44] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, “A multiview embedding space for modeling internet images, tags, and their semantics,” International Journal of Computer Vision, vol. 106, no. 2, pp. 210–233, 2014.
 [45] T.-K. Kim, J. Kittler, and R. Cipolla, “Discriminative learning and recognition of image set classes using canonical correlations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1005–1018, 2007.
 [46] Y. Su, Y. Fu, X. Gao, and Q. Tian, “Discriminant learning through multiple principal angles for visual recognition,” IEEE Transactions on Image Processing, vol. 21, no. 3, pp. 1381–1390, 2012.
 [47] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” Computer Vision and Image Understanding, vol. 106, no. 1, pp. 59–70, 2007.
 [48] Y. Li, F. Nie, H. Huang, and J. Huang, “Largescale multiview spectral clustering via bipartite graph,” in AAAI Conference on Artificial Intelligence, 2015.
 [49] C. H. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object classes by betweenclass attribute transfer,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009.
 [50] T.S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “Nuswide: a realworld web image database from national university of singapore,” in ACM International Conference on Image and Video Retrieval, 2009.
 [51] M. Amini, N. Usunier, and C. Goutte, “Learning from multiple partially observed views - an application to multilingual text categorization,” in Advances in Neural Information Processing Systems, 2009.
 [52] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Largescale scene recognition from abbey to zoo,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010.
 [53] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, “Sun database: Exploring a large collection of scene categories,” International Journal of Computer Vision, vol. 119, no. 1, pp. 3–22, 2016.
 [54] W. Wang, R. Arora, K. Livescu, and J. Bilmes, “On deep multiview representation learning,” in International Conference on Machine Learning, 2015.
 [55] M. Dorfer, R. Kelz, and G. Widmer, “Deep linear discriminant analysis,” in International Conference on Learning Representations, 2016.
 [56] M. Kan, S. Shan, H. Zhang, S. Lao, and X. Chen, “Multiview discriminant analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pp. 188–194, 2016.
 [57] M. Kan, S. Shan, and X. Chen, “Multiview deep network for crossview classification,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.

 [58] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012.
 [59] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015.
 [60] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” in International Conference on Learning Representations, 2015.
 [61] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.