Embedded Deep Bilinear Interactive Information and Selective Fusion for Multi-view Learning

07/13/2020, by Jinglin Xu, et al.

As a concrete application of multi-view learning, multi-view classification improves traditional classification methods significantly by integrating various views optimally. Although most previous efforts have demonstrated the superiority of multi-view learning, it can be further improved by comprehensively embedding more powerful cross-view interactive information and adopting a more reliable multi-view fusion strategy. To fulfill this goal, we propose a novel multi-view learning framework that improves multi-view classification along the two aspects mentioned above. That is, we seamlessly embed various intra-view information, cross-view multi-dimension bilinear interactive information, and a new view ensemble mechanism into a unified framework, where the decision is made via joint optimization. In particular, we train different deep neural networks to learn various intra-view representations, and then dynamically learn multi-dimension bilinear interactive information from different bilinear similarities via the bilinear function between views. After that, we adaptively fuse the representations of multiple views by flexibly tuning the parameters of the view-weight, which not only avoids trivial weight solutions but also provides a new way to select a few discriminative views that are beneficial to the decision in multi-view classification. Extensive experiments on six publicly available datasets demonstrate the effectiveness of the proposed method.


I Introduction

Classification has been a fundamental technique of machine learning and broadly applied in image classification

[1, 2, 3], image reconstruction [4], cross-modal retrieval [5], gaze tracking [6], visual tracking [7, 8], multi-view boosting [9], multi-view biological data [10]

and so on. The typical classification methods mainly include Logistic Regression, K-Nearest Neighbor, Decision Tree, and Support Vector Machines (SVM), where the kernel trick is an important innovation of SVM since it not only allows a nonlinear model to be learned via convergent convex optimization but also makes the implementation of kernel functions (such as the Radial Basis Function (RBF)) highly efficient. Unfortunately, the computational cost of kernel methods becomes prohibitive on large datasets. Deep learning methods have therefore been proposed to overcome the limitations of kernel methods. For example,

[11] demonstrated that the performance of neural networks is better than that of an SVM with an RBF kernel on the MNIST dataset.

Recently, with the rapid development of data mining techniques, data in many scientific data analysis tasks are often collected in different ways, such as from various sensors, since a single source usually cannot comprehensively describe the entire information of a data instance. In this case, different kinds of features of each data instance can be considered as different views of this instance, where each view may capture a specific facet of this instance. Importantly, the combination of multiple views can provide a more comprehensive description of this instance. Compared with single-view features, multi-view features can therefore represent an instance more sufficiently. As an important branch of multi-view learning [12], multi-view classification [13, 14, 15] uses multiple distinct representations of data and models a multi-view learning framework to perform the classification task.

Canonical Correlation Analysis (CCA) [16] and Kernel CCA (KCCA) [17, 18, 19] have shown their ability to effectively model the relationship between two or more views. However, they still have some limitations in capturing high-level associations between different views. To be specific, CCA ignores the nonlinearities in multi-view data, and KCCA may suffer from the effect of small data when data acquisition in one or more modalities is expensive or otherwise limited. Therefore, in the early efforts, [20] investigates a neural network implementation of CCA and maximizes the correlation between the outputs of the networks for each view. [21] formulates a nonlinear CCA method using three feedforward neural networks, where the first network maximizes the correlation between two canonical variates, while the remaining two networks map from the canonical variates back to the original two sets of variables. However, the above CCA-based methods were proposed many years ago and still leave room for improvement.

Inspired by the recent successes of deep neural networks [22, 23], correlation can be naturally applied to multi-view neural network learning to learn deep and abstract multi-view interactive information. The deep neural network extension of CCA (Deep CCA) [24] learns representations of two views by using multiple stacked layers and maximizes the correlation of the two representations. Later, [25] proposes a deep canonically correlated auto-encoder (DCCAE) by combining the advantages of both Deep CCA and the auto-encoder. Specifically, DCCAE consists of two auto-encoders and optimizes the combination of the canonical correlation between the two learned bottleneck representations and the reconstruction errors of the auto-encoders. Similar to the principle of DCCAE, [26] proposes a correspondence auto-encoder (Corr-AE) via constructing correlations between the hidden representations of two unimodal deep auto-encoders.

[27] suggests exploiting the cross weights between the representations of views for gradually learning interactions of the modalities (views) in a multi-modal deep auto-encoder network. Theoretical analysis of [27] shows that considering these interactions from a high level provides more intra-modality (intra-view) information.

In addition, a number of multi-view analysis methods [28, 29] have been proposed. Based on Fisher Discriminant Analysis (FDA) [30], both regularized two-view FDA and its kernel extension can be cast as equivalent disciplined convex optimization problems. Then, [28] introduces Multi-view Fisher Discriminant Analysis (MFDA), which learns classifiers in multiple views by minimizing the variance of the data along the projection while maximizing the distance between the average outputs for classes over all the views. However, MFDA can only be used for binary classification problems.

[29] proposes Multi-view Discriminant Analysis (MvDA), which seeks a discriminant common space by maximizing the between-class and minimizing the within-class variations across all the views. Later, based on bilinear models [31] and the general graph embedding framework [32], [33] introduces Generalized Multi-view Analysis (GMA). As an example of GMA, Generalized Multi-view Linear Discriminant Analysis (GMLDA) finds a set of projection directions in each view that try to separate the class means of different contents and unify different views of the same class in the common subspace.

Based on the above analyses, although CCA-based deep neural networks can learn multi-view interactive information and DA-based methods can exploit discriminative information, there is still no unified framework that simultaneously embeds the intra-view information, the cross-view interactive information, and a reliable multi-view fusion strategy.

To tackle this issue, we propose a novel multi-view learning framework named MvNNBiIn, which integrates both the intra-view information and the cross-view multi-dimension bilinear interactive information for each view and designs a multi-view selective loss fusion. Specifically, (a) the multiple sources of intra-view information, coming from different faceted representations, ensure the diversity and complementarity among different views to enhance multi-view learning. (b) The cross-view multi-dimension bilinear interactive information is dynamically learned from different bilinear similarities, where each bilinear similarity is calculated between the intra-view information of two views via a bilinear function. Therefore, the multi-dimension bilinear interactive information comprehensively models the relationships between views by learning different metric matrices. (c) The multi-view selective loss fusion calculates multiple losses for multiple views and fuses them in an adaptive weighting way with a selective strategy. This selective strategy can choose several discriminative views that are beneficial to the multi-view classification, by tuning the sparseness of the weight vector.

It is worth mentioning that we have developed a preliminary work [34] named deep embedded complementary and interactive information for multi-view classification (denoted as MvNNcor). In this paper, our proposed MvNNBiIn method extends and improves the MvNNcor method significantly. Their major differences are summarized as follows. On the one hand, MvNNcor utilizes the cross-correlations between attributes of multiple representations to generate interactive information. In contrast, our MvNNBiIn models a novel cross-view multi-dimension bilinear interactive information, which consists of different bilinear similarities for each view with respect to another view, and each bilinear similarity is generated from the intra-view information via the bilinear function. On the other hand, MvNNcor uses a multi-view fusion strategy to integrate multiple views. By contrast, our MvNNBiIn designs a novel view ensemble mechanism to select the more discriminative views that are beneficial to the multi-view classification.

We summarize the main contributions of the proposed MvNNBiIn method as follows.

  • We propose a unified framework, which seamlessly embeds intra-view information, cross-view multi-dimension bilinear interactive information, and a novel view ensemble mechanism, to make a decision during the optimization and improve the classification performance.

  • We model the cross-view interactive information by capturing the multi-dimension bilinear interactive information which is calculated by simultaneously learning multiple metric matrices via the bilinear function between views. It comprehensively models the relationships between different views.

  • We develop a new view ensemble mechanism which not only selects some discriminative views but also fuses them via an adaptive weighting method. The selective strategy can ensure that the selected views are beneficial to the multi-view classification.

  • We perform extensive experiments on several publicly available datasets to demonstrate the effectiveness of our model.

The rest of this paper is organized as follows. The preliminary knowledge is introduced in Section II. We formulate the framework of MvNNBiIn in Section III. Section IV provides a tractable and efficient optimization method for MvNNBiIn. Section V evaluates MvNNBiIn on several public datasets, followed by some theoretical and empirical analyses of the experiments. The conclusion and future work are given in Section VI.

II Preliminary Knowledge

In this section, we briefly review several multi-view learning methods from the perspectives of CCA-based methods and the MvDA method.

II-A CCA-based Methods

CCA [16] is popular for its capability of modeling the relationship between two or more sets of variables. CCA computes a shared embedding of two or more sets of variables by maximizing the correlations among the variables across these sets. CCA has been widely used in multi-view learning tasks to generate low-dimensional representations [35, 36]. Improved generalization performance has been witnessed in areas including dimensionality reduction [37], clustering [38, 39], regression [40, 41], word embeddings [42, 43, 44], and discriminant learning [45, 46].

Supposing that $X_v \in \mathbb{R}^{d_v \times n}$ ($v = 1, 2$) is a data matrix for the $v$-th view, CCA tends to find the linear projections $w_1$ and $w_2$ which make the instances from the two data matrices maximally correlated in the projected space. Therefore, CCA is modeled as the following constrained optimization problem,

$$\max_{w_1, w_2}\ w_1^{\top} X_1 X_2^{\top} w_2 \quad \mathrm{s.t.}\ \ w_1^{\top} X_1 X_1^{\top} w_1 = 1,\ \ w_2^{\top} X_2 X_2^{\top} w_2 = 1, \tag{1}$$

where $X_1$ and $X_2$ are centralized.
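For readers who prefer code to algebra, the following is a minimal sketch (not the authors' implementation) of the classical solution to problem (1): whiten the per-view covariances, take the top singular pair of the whitened cross-covariance, and map back to the original coordinates. The function name, the small `reg` ridge term, and the data layout (each view as a $d_v \times n$ matrix) are illustrative assumptions.

```python
import numpy as np

def linear_cca(X1, X2, reg=1e-4):
    """Leading pair of CCA directions for problem (1); X1 (d1 x n), X2 (d2 x n)."""
    n = X1.shape[1]
    X1 = X1 - X1.mean(axis=1, keepdims=True)         # centralize each view
    X2 = X2 - X2.mean(axis=1, keepdims=True)
    S11 = X1 @ X1.T / n + reg * np.eye(X1.shape[0])  # (slightly regularized) covariances
    S22 = X2 @ X2.T / n + reg * np.eye(X2.shape[0])
    S12 = X1 @ X2.T / n                              # cross-covariance

    def inv_sqrt(S):                                 # S^{-1/2} via eigen-decomposition
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    U, _, Vt = np.linalg.svd(T)                      # singular pairs = canonical pairs
    w1 = inv_sqrt(S11) @ U[:, 0]                     # back-transform to original coordinates
    w2 = inv_sqrt(S22) @ Vt[0, :]
    return w1, w2
```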

KCCA is a kernel extension of CCA for pursuing maximally correlated nonlinear projections. Let $K_v \in \mathbb{R}^{n \times n}$ denote the kernel matrix of the $v$-th view such that $(K_v)_{ij} = \kappa_v(x_i^{(v)}, x_j^{(v)})$, and let $\widetilde{K}_v = H K_v H$ be its centered version, where $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$ is a centering matrix and $\mathbf{1}$ denotes a vector of all ones. Therefore, KCCA is formulated as the following optimization problem,

$$\max_{\alpha, \beta}\ \alpha^{\top} \widetilde{K}_1 \widetilde{K}_2 \beta \quad \mathrm{s.t.}\ \ \alpha^{\top} \widetilde{K}_1^{2} \alpha = 1,\ \ \beta^{\top} \widetilde{K}_2^{2} \beta = 1, \tag{2}$$

where $\alpha$ and $\beta$ are the coefficient vectors optimized by KCCA.

As a DNN extension of CCA, DCCA utilizes two DNNs $f_1$ and $f_2$ to extract nonlinear features for each view, and then maximizes the canonical correlation between $f_1(X_1)$ and $f_2(X_2)$. That is,

$$\max_{W_1, W_2, u, v}\ \frac{u^{\top} \Sigma_{12}\, v}{\sqrt{u^{\top} (\Sigma_{11} + r_1 I)\, u}\ \sqrt{v^{\top} (\Sigma_{22} + r_2 I)\, v}}, \tag{3}$$

where $W_1$ and $W_2$ denote the parameters of the two DNNs, $u$ and $v$ are the CCA directions that project the DNN outputs, $\Sigma_{12}$ ($\Sigma_{11}$, $\Sigma_{22}$) is the sample cross-covariance (covariance) of $f_1(X_1)$ and $f_2(X_2)$, and $r_1$ and $r_2$ are the regularization parameters for the sample covariance estimation.

Inspired by both CCA and reconstruction-based objectives, DCCAE constructs a model consisting of two auto-encoders and optimizes the combination of the canonical correlation between the learned representations and the reconstruction errors of the auto-encoders. That is,

$$\min_{W_1, W_2, U_1, U_2, u, v}\ -\frac{u^{\top} \Sigma_{12}\, v}{\sqrt{u^{\top} \Sigma_{11} u}\ \sqrt{v^{\top} \Sigma_{22} v}} + \frac{\lambda}{n} \sum_{i=1}^{n} \Big( \big\| x_i^{(1)} - p_1\big(f_1(x_i^{(1)})\big) \big\|^2 + \big\| x_i^{(2)} - p_2\big(f_2(x_i^{(2)})\big) \big\|^2 \Big), \tag{4}$$

where $p_1$ and $p_2$ are the reconstruction networks for each view, and $\lambda > 0$ is the trade-off parameter.

II-B MvDA Method

MvDA attempts to find $v$ linear transforms that project the samples from $v$ views to one discriminant common space, respectively, where the between-class variation is maximized while the within-class variation is minimized. Define $X_j = \{x_{ijk} \mid i = 1, \ldots, c;\ k = 1, \ldots, n_{ij}\}$ as the samples from the $j$-th view, where $x_{ijk}$ is the $k$-th sample from the $j$-th view of the $i$-th class of dimension $d_j$, $c$ denotes the number of classes, and $n_{ij}$ is the number of samples from the $j$-th view of the $i$-th class.

The samples from the $v$ views are then projected to the same common space by using view-specific linear transforms $w_1, \ldots, w_v$. The projected results are denoted as $y_{ijk} = w_j^{\top} x_{ijk}$. In the common space, according to our goal, the between-class variation from all views should be maximized while the within-class variation from all views should be minimized. Therefore, the objective is formulated as a generalized Rayleigh quotient,

$$(w_1^{*}, \ldots, w_v^{*}) = \arg\max_{w_1, \ldots, w_v} \frac{\operatorname{tr}(S_B^{y})}{\operatorname{tr}(S_W^{y})}, \tag{5}$$

where the within-class scatter matrix $S_W^{y}$ and the between-class scatter matrix $S_B^{y}$ are computed as below,

$$S_W^{y} = \sum_{i=1}^{c}\sum_{j=1}^{v}\sum_{k=1}^{n_{ij}} (y_{ijk} - \mu_i)(y_{ijk} - \mu_i)^{\top}, \qquad S_B^{y} = \sum_{i=1}^{c} n_i (\mu_i - \mu)(\mu_i - \mu)^{\top}, \tag{6}$$

where $n_i = \sum_{j=1}^{v} n_{ij}$ denotes the number of samples of the $i$-th class in all views, $n = \sum_{i=1}^{c} n_i$ is the number of samples from all the classes and all the views, $\mu_i$ is the mean of all the samples of the $i$-th class over all the views in the common space, and $\mu$ is the mean of all the samples over all views in the common space. That is,

$$\mu_i = \frac{1}{n_i}\sum_{j=1}^{v}\sum_{k=1}^{n_{ij}} y_{ijk}, \qquad \mu = \frac{1}{n}\sum_{i=1}^{c}\sum_{j=1}^{v}\sum_{k=1}^{n_{ij}} y_{ijk}. \tag{7}$$
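As a concrete illustration of Eq. (6), the snippet below computes the two scatter matrices from already-projected samples. It is only a sketch under an assumed data layout (a nested list indexed first by class and then by view), not the authors' code.

```python
import numpy as np

def mvda_scatters(Y):
    """Within- and between-class scatters of Eq. (6).
    Y[i][j]: array of shape (n_ij, dim) holding the projected samples y_ijk
    of class i from view j (assumed layout)."""
    all_samples = np.vstack([y for views in Y for y in views])
    mu = all_samples.mean(axis=0)                       # overall mean over classes and views
    dim = all_samples.shape[1]
    S_W = np.zeros((dim, dim))
    S_B = np.zeros((dim, dim))
    for views in Y:                                     # iterate over classes
        cls = np.vstack(views)                          # all samples of this class, all views
        mu_i = cls.mean(axis=0)                         # class mean over all views
        diff = cls - mu_i
        S_W += diff.T @ diff                            # within-class variation
        S_B += len(cls) * np.outer(mu_i - mu, mu_i - mu)  # between-class variation
    return S_W, S_B
```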
Fig. 1: The architecture of our proposed method MvNNBiIn. It consists of four parts: the intra-view extraction networks $\{f_v\}_{v=1}^{V}$, the cross-view bilinear interactive modules, the combination module of intra-view and cross-view information, and the multi-view selective loss fusion strategy.

III The Proposed Method

In this section, we present the architecture of our model MvNNBiIn, depicted in Figure 1, which consists of four parts: the various intra-view extraction networks, the multi-dimension bilinear interactive modules between views, the combination of intra-view and cross-view information, and the multi-view selective loss fusion strategy.

III-A Various Intra-view Information Extraction

Given an instance $x = \{x_v\}_{v=1}^{V}$, we utilize $V$ views to denote its $V$ kinds of visual features, which ensures the diverse and complementary information during the multi-view learning. Defining a set of neural networks $\{f_v\}_{v=1}^{V}$, each $f_v$ projects $x_v$ from $\mathbb{R}^{d_v}$ into $\mathbb{R}^{d_h}$, which captures the high-level intra-view information for the $v$-th view, that is,

$$h_v = f_v(x_v; \Theta_v), \quad v = 1, \ldots, V, \tag{8}$$

where $\Theta_v$ denotes the parameters of $f_v$, and $f_v$ is a neural network with $L$ layers,

$$a_v^{(l)} = \sigma\big( W_v^{(l)} a_v^{(l-1)} + b_v^{(l)} \big), \quad l = 1, \ldots, L, \tag{9}$$

where $W_v^{(l)}$ denotes the weight matrix, $b_v^{(l)}$ denotes the bias vector, $a_v^{(l)}$ is the output of the $l$-th layer, and $\sigma(\cdot)$ is the activation function applied component-wise. Specifically, $a_v^{(0)} = x_v$, $a_v^{(L)} = h_v$, and $\Theta_v = \{W_v^{(l)}, b_v^{(l)}\}_{l=1}^{L}$. It is worth mentioning that $\{f_v\}_{v=1}^{V}$ are trained coordinatively by solving the optimization problem shown in subsection III-D, where the parameters of $f_v$ are not shared with those of $f_{v'}$ for $v' \neq v$. Although we utilize a multi-layer perceptron as $f_v$, it also can be replaced with any deterministic neural network with input and output layers.
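A minimal PyTorch sketch of one intra-view network $f_v$ in the spirit of Eqs. (8)-(9) is given below. The hidden sizes follow the setup reported in Section V-B2 (400 and 200 units with ReLU), while the class name, variable names, and example view dimensions are illustrative assumptions rather than the authors' implementation.

```python
import torch.nn as nn

class IntraViewNet(nn.Module):
    """One intra-view extraction network f_v: a plain MLP mapping x_v to h_v."""
    def __init__(self, in_dim, hidden_dims=(400, 200)):
        super().__init__()
        layers, prev = [], in_dim
        for h in hidden_dims:
            layers += [nn.Linear(prev, h), nn.ReLU()]   # a^{(l)} = sigma(W^{(l)} a^{(l-1)} + b^{(l)})
            prev = h
        self.net = nn.Sequential(*layers)

    def forward(self, x_v):
        return self.net(x_v)                            # intra-view information h_v

# One un-shared network per view; the per-view input dimensions are illustrative.
view_dims = [48, 40, 254]
f = nn.ModuleList([IntraViewNet(d) for d in view_dims])
```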

III-B Multi-dimension Bilinear Interactive Information

Inspired by metric learning, the multi-dimension bilinear interactive information is proposed to construct second-order feature interactions on the various intra-view information. To be specific, given the $i$-th view $h_i$ and the $j$-th view $h_j$, the multi-dimension bilinear interactive information between the two views can be formulated via the bilinear function $\mathcal{B}_{ij}$, that is,

$$z_{ij} = \mathcal{B}_{ij}(h_i, h_j) = \big[\, h_i^{\top} M_{ij}^{(1)} h_j + b_{ij}^{(1)},\ \ldots,\ h_i^{\top} M_{ij}^{(d)} h_j + b_{ij}^{(d)} \,\big]^{\top}, \tag{10}$$

where $z_{ij} \in \mathbb{R}^{d}$, $M_{ij}^{(m)} \in \mathbb{R}^{d_h \times d_h}$ is the $m$-th metric matrix, and $b_{ij}^{(m)}$ is the bias. There are multiple metric matrices (i.e., $\{M_{ij}^{(m)}\}_{m=1}^{d}$) to be learned simultaneously. Intuitively, Figure 2 shows the construction of the multi-dimension bilinear interactive information of each pair of views.

Fig. 2: The construction of the multi-dimension bilinear interactive information of each pair of views. Taking the first view as an example, we calculate its multi-dimension bilinear interactive information relative to the second view by learning multiple metric matrices, i.e., $\{M_{12}^{(m)}\}_{m=1}^{d}$, and then obtain $z_{12}$.

For the $v$-th view, we define $\mathcal{P}_v$ as a set containing the view pairs relative to the $v$-th view, that is,

$$\mathcal{P}_v = \{(v, j) \mid j = 1, \ldots, V,\ j \neq v\}, \tag{11}$$

where $|\mathcal{P}_v| = V - 1$ and each view pair is undirected. For the $v$-th view, the bilinear interactive information of the $v$-th view relative to the other views is collected into a set $\mathcal{Z}_v$,

$$\mathcal{Z}_v = \{z_{vj} \mid (v, j) \in \mathcal{P}_v\}, \tag{12}$$

where the $v$-th set $\mathcal{Z}_v$ contains $V-1$ view pairs for the $v$-th view and there are $V$ such sets for the $V$ views.
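A convenient way to realize the bilinear function of Eq. (10) is `torch.nn.Bilinear`, which maintains one metric matrix per output dimension and a bias. The sketch below is an illustration under assumptions (the dimensions of 200, the module names, and the per-pair bookkeeping are ours, not the authors' code).

```python
import torch.nn as nn

class BilinearInteraction(nn.Module):
    """Bilinear function B_ij of Eq. (10): d metric matrices mapping (h_i, h_j) to z_ij in R^d."""
    def __init__(self, h_dim=200, d=200):
        super().__init__()
        self.bilinear = nn.Bilinear(h_dim, h_dim, d)    # z_ij[m] = h_i^T M^(m) h_j + b_m

    def forward(self, h_i, h_j):
        return self.bilinear(h_i, h_j)                  # shape (batch, d)

# One module per undirected view pair; here for V = 3 views (illustrative).
V = 3
pair_modules = nn.ModuleDict({
    f"{i}-{j}": BilinearInteraction() for i in range(V) for j in range(i + 1, V)
})
```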

III-C Combination and Prediction of Each View

According to the above two subsections, we combine the intra-view information and the multi-dimension bilinear interactive information of the $v$-th view as follows,

$$c_v = \mathrm{Con}\big(h_v, \{z_{vj}\}_{(v,j)\in\mathcal{P}_v}\big), \tag{13}$$

where Con denotes the concatenation operation.

After that, $c_v$ is passed through $g$ to obtain the prediction of the $v$-th view, that is,

$$\hat{y}_v = \mathrm{softmax}\big(g(c_v; \Theta_g)\big) \in \mathbb{R}^{C}, \tag{14}$$

where $C$ is the number of classes and $\hat{y}_v$ produces a distribution over the possible classes for each view. $g$ is a neural network with $L_g$ layers, that is,

$$u^{(l)} = \sigma\big( W_g^{(l)} u^{(l-1)} + b_g^{(l)} \big), \quad l = 1, \ldots, L_g, \tag{15}$$

where $W_g^{(l)}$ denotes the weight matrix, $b_g^{(l)}$ denotes the bias vector, $u^{(l)}$ is the output of the $l$-th layer, and $\sigma(\cdot)$ is the activation function applied component-wise. In particular, $u^{(0)} = c_v$, $u^{(L_g)} = \hat{y}_v$, and $\Theta_g = \{W_g^{(l)}, b_g^{(l)}\}_{l=1}^{L_g}$. Here, the parameters of $g$ are shared among the $V$ views.
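The per-view combination and prediction of Eqs. (13)-(14) can be sketched as below. The helper assumes the `IntraViewNet` outputs and the pair modules from the earlier sketches; the function name, the pairing keys, and the argument names are illustrative assumptions, and the softmax is left to the cross-entropy loss.

```python
import torch

def predict_per_view(h, pairs, g):
    """For each view v: concatenate h_v with its bilinear interactions z_vj (Eq. 13)
    and pass the result through the shared prediction network g (Eq. 14).
    h: list of per-view representations (batch, d_h); pairs: dict of BilinearInteraction
    modules keyed "i-j"; g: shared network producing class scores."""
    V = len(h)
    logits = []
    for v in range(V):
        z_v = []
        for j in range(V):
            if j == v:
                continue
            a, b = min(v, j), max(v, j)                 # undirected pair (a, b)
            z_v.append(pairs[f"{a}-{b}"](h[a], h[b]))
        c_v = torch.cat([h[v]] + z_v, dim=1)            # c_v = Con(h_v, {z_vj})
        logits.append(g(c_v))                           # class scores of view v
    return logits
```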

III-D Multi-view Selective Loss Fusion Strategy

In this subsection, we design a novel view ensemble mechanism, which calculates multiple losses for multiple views and fuses them in an adaptive weighting way with a selective strategy. This selective strategy can choose several discriminative views that are beneficial to the multi-view classification, by tuning the sparseness of the weight vector. That is described as,

$$\min_{w, \Theta}\ \sum_{v=1}^{V} w_v^{\gamma} \mathcal{L}_v \quad \mathrm{s.t.}\ \ \sum_{v=1}^{V} w_v = 1,\ \ w_v \geq 0,\ \ \|w\|_0 = k, \tag{16}$$

where

$$\mathcal{L}_v = -\sum_{c=1}^{C} y_c \log \hat{y}_{v,c}, \tag{17}$$

where $w_v$ is the weight for each view, $y$ denotes the common label information of an instance for all the views, and $\mathcal{L}_v$ is the cross-entropy loss of the $v$-th view. $\gamma$ is the power exponent parameter of the weight for the $v$-th view, which controls the weight distribution of different views flexibly and avoids the trivial solution of $w$ during the classification. $\|w\|_0 = k$ is used to constrain the sparseness of the weight vector $w$, where $\|w\|_0$ denotes the number of nonzero elements in $w$. Crucially, the $\ell_0$-norm constraint is able to capture the global relationship among different views and to achieve view-wise sparsity, so that selecting a few discriminative views can improve the performance during the multi-view classification. Intuitively, the architecture of our proposed method MvNNBiIn is shown in Figure 1.
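For fixed view weights, the fused objective of Eq. (16) reduces to a weighted sum of per-view cross-entropy losses. The sketch below shows this forward computation only; the weight vector itself is updated in closed form in Section IV-B, and the names `w`, `gamma`, and `logits_per_view` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fused_loss(logits_per_view, labels, w, gamma=2.0):
    """Weighted multi-view loss sum_v w_v^gamma * L_v (Eqs. 16-17) for fixed weights w.
    Returns the scalar training loss and the detached per-view losses (used to update w)."""
    losses = torch.stack([F.cross_entropy(logits, labels) for logits in logits_per_view])
    return torch.sum((w ** gamma) * losses), losses.detach()
```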

IV Optimization

We utilize the alternate optimization method to update the network parameters $\Theta = \big\{\{\Theta_v\}_{v=1}^{V}, \{\mathcal{B}_{ij}\}, \Theta_g\big\}$ and the view-weight $w$, respectively. For convenience, $\{\mathcal{B}_{ij}\}$ denotes the bilinear function set and Con is used to denote the concatenation operation in Figure 3.

IV-A Update $\{f_v\}_{v=1}^{V}$, $\{\mathcal{B}_{ij}\}$, and $g$

We fix $w$ and update $\{f_v\}_{v=1}^{V}$, $\{\mathcal{B}_{ij}\}$, and $g$, which are trained using Adam with batch normalization and the autograd package in PyTorch. Figure 3 briefly shows the gradient computations of our proposed MvNNBiIn method.

Fig. 3: The gradient computations of our proposed MvNNBiIn method during the multi-view classification.
Views Caltech101 Caltech20 AWA NUSOBJ Reuters SUN
Fea Dim Fea Dim Fea Dim Fea Dim Fea Dim Fea Dim
1 Gabor 48 Gabor 48 CH 2688 CH 64 English 21531 GIST 256
2 WM 40 WM 40 LSS 2000 CM 225 French 24892 GEOMAP 512
3 CENT 254 CENT 254 PHOG 252 CORR 144 German 34251 TEXTON 512
4 HOG 1984 HOG 1984 SIFT 2000 EDH 73 Italian 15506 / /
5 GIST 512 GIST 512 RGSIFT 2000 WT 128 Spanish 11547 / /
6 LBP 928 LBP 928 SURF 2000 / / / / / /
# Samples 9914 2386 30475 30000 18758 99250
# Classes 102 20 50 31 6 397
TABLE I: The descriptions of six publicly available datasets, i.e., Caltech101, Caltech20, AWA, NUSOBJ, Reuters, and SUN. Some abbreviations are defined as follows. Fea: public Feature, Dim: Dimensionality, WM: Wavelet Moments, CENT: CENTRIST, CH: Color Histogram, LSS: Local Self-Similarity, PHOG: Pyramid HOG, RGSIFT: Color SIFT, CM: block-wise Color Moments, CORR: color Correlogram, EDH: Edge Direction Histogram, WT: Wavelet Texture, GEOMAP: Geometric Map.

IV-B Update $w$

We learn the non-negative normalized weight $w_v$ for each view and assign a higher weight to the more discriminative view. Therefore, we fix the network parameters and update $w$ by solving the optimization problem (16).

To efficiently minimize problem (16), we define a sorting function on the loss vector $\mathcal{L} = (\mathcal{L}_1, \ldots, \mathcal{L}_V)^{\top}$, that is,

$$\phi(\mathcal{L}) = P\mathcal{L} = \big(\mathcal{L}_{\pi(1)}, \mathcal{L}_{\pi(2)}, \ldots, \mathcal{L}_{\pi(V)}\big)^{\top}, \tag{18}$$

where $P$ is a permutation matrix which permutes the elements of $\mathcal{L}$ along the ascending order, i.e., $\mathcal{L}_{\pi(1)} \leq \mathcal{L}_{\pi(2)} \leq \cdots \leq \mathcal{L}_{\pi(V)}$. Based on equation (18), we select the first $k$ smallest elements and optimize their corresponding weights $\{w_{\pi(1)}, \ldots, w_{\pi(k)}\}$, meanwhile setting the remaining weights to zero. Therefore, problem (16) is equivalent to the following problem by absorbing the $\ell_0$-norm constraint into the objective function,

$$\min_{w}\ \sum_{m=1}^{k} w_{\pi(m)}^{\gamma} \mathcal{L}_{\pi(m)} \quad \mathrm{s.t.}\ \ \sum_{m=1}^{k} w_{\pi(m)} = 1,\ \ w_{\pi(m)} \geq 0. \tag{19}$$

Through the Lagrangian multiplier method, the Lagrangian function of problem (19) is:

$$G(w, \lambda) = \sum_{m=1}^{k} w_{\pi(m)}^{\gamma} \mathcal{L}_{\pi(m)} - \lambda \Big( \sum_{m=1}^{k} w_{\pi(m)} - 1 \Big), \tag{20}$$

where $\lambda$ is the Lagrangian multiplier. Taking the derivatives of $G(w, \lambda)$ with respect to $w_{\pi(m)}$ and $\lambda$, respectively, and setting them to zeros, there is,

$$w_{\pi(m)} = \frac{\mathcal{L}_{\pi(m)}^{\frac{1}{1-\gamma}}}{\sum_{m'=1}^{k} \mathcal{L}_{\pi(m')}^{\frac{1}{1-\gamma}}}, \quad m = 1, \ldots, k, \tag{21}$$

where $1 \leq k \leq V$ and $k$ is the sparsity of $w$.

To sum up, the optimal solution of problem (16) can be calculated by,

$$w_v = \begin{cases} \dfrac{\mathcal{L}_v^{\frac{1}{1-\gamma}}}{\sum_{m=1}^{k} \mathcal{L}_{\pi(m)}^{\frac{1}{1-\gamma}}}, & \text{if } \mathcal{L}_v \text{ is among the } k \text{ smallest losses}, \\[2mm] 0, & \text{otherwise}. \end{cases} \tag{22}$$
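The closed-form update of Eqs. (18)-(22) amounts to sorting the per-view losses, keeping the $k$ smallest, and normalizing their inverse-power losses. A minimal sketch under assumed names follows; the example call at the end is purely illustrative.

```python
import torch

def update_view_weights(losses, k, gamma):
    """Closed-form view-weight update (Eqs. 18-22): keep the k views with the smallest
    losses, zero out the rest, and normalize as L_v^{1/(1-gamma)} / sum_u L_u^{1/(1-gamma)}."""
    w = torch.zeros_like(losses)
    idx = torch.argsort(losses)[:k]                 # indices of the k smallest losses
    kept = losses[idx] ** (1.0 / (1.0 - gamma))     # smaller loss -> larger weight (gamma > 1)
    w[idx] = kept / kept.sum()                      # normalize so the kept weights sum to one
    return w

# Example: per-view losses for 5 views, keep k = 3 views, gamma = 2.
w = update_view_weights(torch.tensor([0.9, 0.4, 1.2, 0.5, 0.7]), k=3, gamma=2.0)
```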

Therefore, the scheme for MvNNBiIn can be summarized as follows. Firstly, the networks $\{f_v\}_{v=1}^{V}$ are used to learn the complementary and diverse intra-view information from the multi-view representations. Secondly, the intra-view information is used for training the bilinear function set $\{\mathcal{B}_{ij}\}$, which captures the cross-view interactive information. Thirdly, the intra-view and cross-view interactive information for each view is integrated by a concatenation operation, and then the concatenation is fed into $g$ to calculate the cross-entropy loss. Finally, with the trained $\{f_v\}_{v=1}^{V}$, $\{\mathcal{B}_{ij}\}$, and $g$, the weight distribution $w$ can be obtained by solving problem (16), which contributes to inferring the label during the multi-view classification.

V Experiments

In this section, we evaluate the performance of our proposed MvNNBiIn method on six publicly available datasets.

V-A Datasets

We follow many state-of-the-art multi-view methods, which treat different kinds of pre-extracted feature vectors of an image as different views, to make the experimental comparisons fair. Different kinds of pre-extracted features present different facets of the images, such as color, texture, shape, and geographic information, even though such information is not intuitive from the raw images. We describe all the publicly available datasets in TABLE I.

Specifically, the Caltech101 [47, 48] and Caltech20 [48] datasets consist of images of objects belonging to 102 (101 classes plus one background clutter class) and 20 classes, respectively, and each dataset is described by 6 views. The AWA (Animals with Attributes) [49] dataset provides 6 kinds of features (6 views) for attribute-based classification. The NUSOBJ dataset is a subset of NUS-WIDE [50], which describes each object image using 5 types of low-level features (5 views). The Reuters [51] dataset is used for document categorization and is written in 5 different languages (5 views). SUN is a subset of SUN397 (Scene Categorization Benchmark) [52, 53] and utilizes 3 kinds of public feature matrices (3 views) to represent each image.

Referring to the previous works [24, 54], we split each dataset into three parts, that is, 70% of the samples for training, 20% for validation, and 10% for testing. We utilize the classification accuracy (i.e., Top@1 accuracy and Top@5 accuracy) to evaluate the performance of all the methods and report the final results in Tables II, III, and IV.

Fig. 4: The effect of sparseness on various views in our proposed method MvNNBiIn. $k$ denotes the number of nonzero elements in the learned weights $w$. Since different datasets contain different numbers of views, we explore $k$ over the available views on the Caltech101, Caltech20, and AWA datasets (6 views), on the NUSOBJ and Reuters datasets (5 views), and on the SUN dataset (3 views). $\gamma$ is the power exponent parameter of the weight, and we investigate $\gamma$ within dataset-specific ranges on the Caltech101, Caltech20, Reuters, AWA, NUSOBJ, and SUN datasets.

V-B Experimental Settings

V-B1 Comparison Methods

We first compare our MvNNBiIn method with several state-of-the-art multi-view methods, including SVMcon, DCCA [24], DCCAE [54], DeepLDA [55], MvDA [56], MvDN [57], and MvNNcor [34], in Table III. In particular, SVMcon is a baseline that concatenates all the views and feeds the result into an SVM classifier. DCCA and DCCAE belong to the CCA-based methods and take two views as input. DeepLDA, MvDA, and MvDN are Discriminant Analysis based methods, where DeepLDA takes the concatenation of all the views as input and feeds it into a deep neural network followed by LDA, MvDA takes multiple views with the same dimensionality as input, and MvDN is the nonlinear version of LDA, which uses deep neural networks to replace the linear transformations.

Moreover, Table IV demonstrates the effectiveness of the three important parts of our proposed MvNNBiIn, i.e., the intra-view information, the cross-view bilinear interactive information, and the novel view ensemble mechanism. The highest performance is obtained when all the parts are available, while the performance is lower when any part is absent.

d Caltech101 Caltech20 AWA NUSOBJ Reuters SUN
  Top@1 Top@5 Top@1 Top@5 Top@1 Top@5 Top@1 Top@5 Top@1 Top@5 Top@1 Top@5
50 76.228 87.277 97.397 98.958 48.210 74.977 51.529 84.142 89.116 99.838 43.770 71.069
100 75.781 88.058 97.397 98.958 47.428 74.740 51.529 84.774 89.224 99.731 46.200 72.742
200 76.562 89.063 98.958 99.473 49.316 75.521 51.529 84.843 90.194 99.838 47.036 73.458
400 / / / / / / 51.496 84.441 / / / /
TABLE II: The investigation of the dimensionality $d$ of the bilinear function on different datasets. $d$ is set as 50, 100, 200, and 400, respectively, which reflects the combination proportion of intra-view information and cross-view bilinear interactive information for each view. Due to CUDA running out of memory, the results for $d = 400$ cannot be obtained except on the NUSOBJ dataset.
Method Caltech101 Caltech20 AWA NUSOBJ Reuters SUN
SVMcon 47.901 83.827 31.044 42.719 88.180 38.200
DeepLDA 45.649 76.508 25.598 20.320 84.907 /
MvDA 45.200 76.276 9.788 11.457 78.831 /
DCCA 66.159 86.504 20.677 28.753 64.917 16.116
DCCAE 26.894 50.267 13.484 27.477 56.530 /
MvDN 70.214 94.833 42.359 47.241 88.339 40.538
MvNNcor 76.002 97.924 47.687 52.049 89.276 45.632
MvNNBiIn 76.562 98.958 49.316 51.529 90.194 47.036
TABLE III: Comparison results of MvNNBiIn and several state-of-the-art methods on all the datasets.
Caltech101 60.044 73.237 49.330 72.210 73.750 74.038 75.781 76.562
Caltech20 94.792 96.096 82.292 97.917 96.615 97.917 98.438 98.958
AWA 37.341 43.424 36.003 41.862 45.427 46.953 48.893 49.316
NUSOBJ 48.847 50.366 36.469 47.739 48.055 50.673 51.130 51.529
Reuters 88.578 88.766 88.739 88.793 89.170 89.536 89.978 90.194
SUN 20.605 39.798 31.341 3.942 30.655 40.936 46.794 47.036
TABLE IV: Ablation experiments of our proposed MvNNBiIn on all the datasets.

V-B2 Parameter Setup

Deep LDA is a fully connected neural network consisting of three hidden layers, i.e., 400, 200, and 300 units equipped with the ReLU activation function. In DCCAE, there is a feature extraction network and a reconstruction network for each view, where each network is a fully connected network consisting of three hidden layers, i.e., 400, 200, and 300 units equipped with the ReLU activation function, followed by a linear output layer for the feature extraction network and the reconstruction network, respectively. The capacities of the above networks are the same as those of their counterparts in DCCA and MvDN.

In our MvNNBiIn, two kinds of networks, $\{f_v\}_{v=1}^{V}$ and $g$, and the bilinear function set $\{\mathcal{B}_{ij}\}$ need to be learned. Each $f_v$ is a fully connected network consisting of two hidden layers (i.e., 400 and 200 units equipped with the ReLU activation function). $g$ consists of 200 input units and 300 hidden units equipped with the ReLU activation function, followed by a linear output layer. Each $\mathcal{B}_{ij}$ is a bilinear function which outputs the bilinear interactive units from each pair $(h_i, h_j)$. That is, the inputs of $\mathcal{B}_{ij}$ are the outputs of $f_i$ and $f_j$, and the input of $g$ is concatenated from the outputs of $f_v$ and $\{\mathcal{B}_{ij}\}$. The capacities of the above networks are the same as those of their counterparts in the ablation studies shown in Table IV. In MvNNcor, three kinds of networks need to be learned, where the first two are the same as their counterparts in MvNNBiIn, and the third consists of 200 input units and 200 hidden units with the ReLU activation function.

In this paper, all the networks are optimized by Adam with batch normalization, and the batch size is 64. In addition, we vary $k$ and $\gamma$, respectively, to explore the influence of different values of $k$ and $\gamma$ on the classification accuracy. Based on the optimal $k$ and $\gamma$, we can learn the optimal model to achieve the highest classification accuracy. The results are shown in Figure 4, where the Caltech101, Caltech20, AWA, NUSOBJ, Reuters, and SUN datasets achieve their best performance under dataset-specific settings of $(k, \gamma)$.

Besides, to explore the proper combination proportion of the intra-view information and the multi-dimension bilinear interactive information for each view, we investigate the dimensionality $d$ of the output of the bilinear function, i.e., setting $d$ as 50, 100, 200, and 400, respectively, and the results are reported in Table II. It can be seen that the classification accuracy is the highest when $d = 200$, which shows that the combination proportion also has an impact on the classification performance.

Fig. 5: The learned weights $w$ for different views through MvNNBiIn on the Caltech101/20, AWA, NUSOBJ, Reuters, and SUN datasets. The blue dashed boxes in the Caltech101, Caltech20, and NUSOBJ datasets denote that the corresponding view weights are zeros.

V-C Experimental Results

Tables III and IV show the classification performance of all the methods, where Table III reports the experimental results of several recent methods and Table IV provides the results of ablation experiments.

Firstly, compared with the single-view methods, i.e., SVMcon and DeepLDA, our MvNNBiIn consistently outperforms them on all the datasets. For example, MvNNBiIn achieves 28.661% and 30.913% improvements, respectively, on the Caltech101 dataset. That is because the concatenation of all the views may reduce the interpretability of different views and ignore the cross-view interactive information during the multi-view classification. We compare our MvNNBiIn with MvDA and the classification accuracy of our MvNNBiIn is better than that of MvDA on all the datasets. For instance, our MvNNBiIn obtains 31.362% improvements on the Caltech101 dataset, since the linear transformations of MvDA cannot deal well with some subtle but important structures in some challenging scenarios. Compared to MvDN, our MvNNBiIn achieves 6.348% improvements on the Caltech101 dataset due to embedding the multi-dimension bilinear interactive information between different views.

Secondly, we compare our MvNNBiIn with the CCA-based methods, i.e., DCCA and DCCAE, and the results are reported in Table III. It can be seen that our MvNNBiIn performs better than DCCA and DCCAE, since these two methods are limited to two-view input and unable to capture more diverse and complementary information from more views. For example, compared to DCCA and DCCAE, our MvNNBiIn achieves 28.639% and 35.832% improvements on the AWA dataset, respectively.

Thirdly, as the extension of MvNNcor, our MvNNBiIn achieves better performance on almost all the datasets. For example, compared with MvNNcor on the AWA dataset, MvNNBiIn achieves 1.629% improvements. On the one hand, the cross-view multi-dimension bilinear interactive information of our MvNNBiIn is better able to capture the interactive information between different views. On the other hand, our MvNNBiIn designs a novel view ensemble mechanism which can select the more discriminative views and is more beneficial to the multi-view classification.

Besides, in the ablation studies of our MvNNBiIn shown in Table IV, taking the NUSOBJ dataset as an example, our full framework achieves 2.682%, 1.163%, 15.060%, 3.790%, 3.474%, 0.856%, and 0.399% improvements compared with seven reduced variants that contain only some of its modules. These results successively demonstrate the effectiveness of integrating the intra-view information and the multi-dimension bilinear interactive information between views, as well as the adaptive weighting multi-view loss fusion with the selective strategy.

Moreover, Figure 5 shows the view-weights of each dataset learned by our MvNNBiIn, where the x-axis indicates the indices of different views and the y-axis denotes the weight of each view. A higher weight indicates that the corresponding view provides more valuable information and makes a larger contribution.

V-D Discussion

Methods Caltech101 NUSOBJ SUN397
Pre-trained Fine-tuned Pre-trained Fine-tuned Pre-trained Fine-tuned
AlexNet 85.860 87.830 59.240 58.980 41.250 42.590
GoogLeNet 88.050 89.690 63.620 64.990 46.840 47.960
ResNet-101 90.240 92.430 69.690 70.260 55.320 55.600
VGGNet-16 86.180 90.240 64.720 67.570 48.190 50.380
MvNNBiIn 93.752 95.091 70.513 71.384 56.404 56.632
TABLE V: Comparison results of our MvNNBiIn and several deep convolutional neural network architectures on the image datasets Caltech101, NUSOBJ, and SUN397.

Method Iterations Batch size Learning rate Optimizer
AlexNet 50000 256 0.0001 SGD
GoogLeNet 50000 32 0.001 SGD
ResNet-101 50000 8 0.00001 SGD
VGGNet-16 50000 50 0.0001 SGD
TABLE VI: The experimental settings of all the networks on Caltech101 dataset.
Method Iterations Batch size Learning rate Optimizer
AlexNet 15000 256 0.00001 SGD
GoogLeNet 30000 64 0.00001 SGD
ResNet-101 14000 8 0.00001 SGD
VGGNet-16 15000 50 0.00001 SGD
TABLE VII: The experimental settings of all the networks on NUSOBJ dataset.
Method Iterations Batch size Learning rate Optimizer
AlexNet 50000 256 0.00001 SGD
GoogLeNet 50000 64 0.00001 SGD
ResNet-101 50000 8 0.00001 SGD
VGGNet-16 50000 32 0.00001 SGD
TABLE VIII: The experimental settings of all the networks on SUN397 dataset.

Actually, our MvNNBiIn is a generic framework that can improve the multi-view classification using not only the handcrafted features (such as HOG, LBP, or SURF) but also the deep model-learned features.

We apply four popular CNNs (including AlexNet [58], GoogLeNet [59], VGGNet-16 [60], and ResNet-101 [61]) on three publicly available image datasets (i.e., Caltech101, NUSOBJ, and SUN397), respectively, to generate the CNN feature representations including four transferred CNN features and four fine-tuned CNN features. Then, we compare our proposed MvNNBiIn method with the single-view CNN feature-based methods to demonstrate the superiority of our multi-view learning framework.

To be specific, for the transferred CNN feature-based methods, four off-the-shelf CNN models including VGGNet-16, ResNet-101, AlexNet, and GoogLeNet are first adopted as general feature extractors to extract CNN features, and then linear one-versus-all SVMs (with the penalty parameter set to 0.001) are used for classification. For the fine-tuned CNN feature-based methods, we fine-tune the aforementioned four CNN models on the training datasets to extract better CNN features and then adopt linear one-versus-all SVMs (with the penalty parameter set to 0.001) for classification. For our proposed MvNNBiIn method, we regard the four transferred CNN features as four views for each image and feed them into MvNNBiIn to perform the multi-view classification. Similarly, our proposed MvNNBiIn method is also performed on the four fine-tuned CNN features.

The experimental results are shown in Table V and the experimental settings of the fine-tuned CNN models are shown in Tables VI–VIII. It can be seen that our proposed MvNNBiIn method outperforms the single-view CNN feature-based methods on both transferred and fine-tuned CNN features, and on average achieves 7.577% (AlexNet), 5.551% (GoogLeNet), 3.087% (ResNet-101), and 6.212% (VGGNet-16) improvements on the Caltech101 dataset. These results demonstrate the superiority of our proposed multi-view learning framework.

VI Conclusion

In this paper, we propose a novel multi-view learning framework, denoted as MvNNBiIn, which seamlessly embeds various intra-view information and cross-view multi-dimension bilinear interactive information, as well as a new view ensemble mechanism, to jointly make decisions during the multi-view classification. Extensive experiments on several publicly available datasets demonstrate the effectiveness of our proposed MvNNBiIn method. Furthermore, we demonstrate the superiority of multi-view learning using CNN feature representations, which provides a novel idea of fusing the outputs of different deterministic neural networks in future work.

References

  • [1] G. Cheng, J. Han, and X. Lu, “Remote sensing image scene classification: Benchmark and state of the art,” Proceedings of the IEEE, vol. 105, no. 10, pp. 1865–1883, 2017.
  • [2] J. Han, X. Yao, G. Cheng, X. Feng, and D. Xu, “P-cnn: Part-based convolutional neural networks for fine-grained visual categorization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, DOI: 10.1109/TPAMI.2019.2933510.
  • [3] J. Xu, F. Nie, and J. Han, “Feature selection via scaling factor integrated multi-class support vector machines,” in International Joint Conference on Artificial Intelligence, 2017.
  • [4] C. Du, C. Du, L. Huang, and H. He, “Reconstructing perceived images from human brain activities with bayesian deep multiview learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 8, pp. 2310–2323, 2018.
  • [5] Y. Yu, S. Tang, K. Aizawa, and A. Aizawa, “Category-based deep cca for fine-grained venue discovery from multimodal data,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 4, pp. 1250–1258, 2018.
  • [6] D. Lian, L. Hu, W. Luo, Y. Xu, L. Duan, J. Yu, and S. Gao, “Multiview multitask gaze estimation with deep convolutional neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 10, pp. 3010–3023, 2018.
  • [7] S. Zhang, X. Yu, Y. Sui, S. Zhao, and L. Zhang, “Object tracking with multi-view support vector machines,” IEEE Transactions on Multimedia, vol. 17, no. 3, pp. 265–278, 2015.
  • [8] G. Cheng, J. Han, P. Zhou, and D. Xu, “Learning rotation-invariant and fisher discriminative convolutional neural networks for object detection,” IEEE Transactions on Image Processing, vol. 28, no. 1, pp. 265–278, 2019.
  • [9] J. Peng, A. J. Aved, G. Seetharaman, and K. Palaniappan, “Multiview boosting with information propagation for classification,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 3, pp. 657–669, 2017.
  • [10] C.-M. Feng, Y. Xu, J.-X. Liu, Y.-L. Gao, and C.-H. Zheng, “Supervised discriminative sparse pca for com-characteristic gene selection and tumor classification on multiview biological data,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 10, pp. 2926–2937, 2019.
  • [11] G. E. Hinton, S. Osindero, and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
  • [12] J. Zhao, X. Xie, X. Xu, and S. Sun, “Multi-view learning overview: Recent progress and new challenges,” Information Fusion, vol. 38, pp. 43–54, 2017.
  • [13] J. Xu, J. Han, F. Nie, and X. Li, “Multi-view scaling support vector machines for classification and feature selection,” IEEE Transactions on Knowledge and Data Engineering, DOI: 10.1109/TKDE.2019.2904256.
  • [14] X. Xie and S. Sun, “Multi-view support vector machines with the consensus and complementarity information,” IEEE Transactions on Knowledge and Data Engineering, DOI: 10.1109/TKDE.2019.2933511.
  • [15] J. Tang, Y. Tian, P. Zhang, and X. Liu, “Multiview privileged support vector machines,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 8, pp. 3463–3477, 2017.
  • [16] H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, no. 3/4, pp. 321–377, 1936.
  • [17] F. R. Bach and M. I. Jordan, “Kernel independent component analysis,” Journal of Machine Learning Research, vol. 3, no. 1, pp. 1–48, 2002.
  • [18] D. R. Hardoon, S. Szedmak, and J. Shawetaylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural Computation, vol. 16, no. 12, pp. 2639–2664, 2004.
  • [19] S. Sun, “A survey of multi-view machine learning,” Neural Computing and Applications, vol. 23, no. 7-8, pp. 2031–2038, 2013.
  • [20] P. L. Lai and C. Fyfe, “A neural implementation of canonical correlation analysis,” Neural Networks, vol. 12, no. 10, pp. 1391–1397, 1999.
  • [21] W. W. Hsieh, “Nonlinear canonical correlation analysis by neural networks,” Neural Networks, vol. 13, no. 10, pp. 1095–1105, 2000.
  • [22] R. Salakhutdinov and G. E. Hinton, “Deep Boltzmann machines,” in International Conference on Artificial Intelligence and Statistics, 2009.
  • [23] B. Yoshua, C. Aaron, and V. Pascal, “Representation learning: a review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
  • [24] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, “Deep canonical correlation analysis,” in International Conference on Machine Learning, 2013.
  • [25] W. Wang, R. Arora, K. Livescu, and J. A. Bilmes, “On deep multi-view representation learning,” in International Conference on Machine Learning, 2015.
  • [26] F. Feng, X. Wang, and R. Li, “Cross-modal retrieval with correspondence autoencoder,” in ACM International Conference on Multimedia, 2014.
  • [27] S. Rastegar, M. S. Baghshah, H. R. Rabiee, and S. M. Shojaee, “Mdl-cw: A multimodal deep learning framework with cross weights,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [28] T. Diethe, D. R. Hardoon, and J. Shawe-Taylor, “Constructing nonlinear discriminants from multiple data views,” in European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2010.
  • [29] M. Kan, S. Shan, H. Zhang, S. Lao, and X. Chen, “Multi-view discriminant analysis,” in European Conference on Computer Vision, 2012.
  • [30] S. Mika, A. Smola, and B. Scholkopf, “An improved training algorithm for kernel fisher discriminants,” in International Conference on Artificial Intelligence and Statistics, 2001.
  • [31] J. B. Tenenbaum and W. T. Freeman, “Separating style and content with bilinear models,” Neural Computation, vol. 12, no. 6, pp. 1247–1283, 2000.
  • [32] Y. Shuicheng, X. Dong, Z. Benyu, Z. Hong-Jiang, Y. Qiang, and L. Stephen, “Graph embedding and extensions: a general framework for dimensionality reduction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 40–51, 2007.
  • [33] A. Sharma, A. Kumar, H. Daume, and D. W. Jacobs, “Generalized multiview analysis: A discriminative latent space,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012.
  • [34] J. Xu, W. Li, X. Liu, D. Zhang, J. Liu, and J. Han, “Deep embedded complementary and interactive information for multi-view classification,” in AAAI Conference on Artificial Intelligence, 2020.
  • [35] N. Rasiwasia, J. C. Pereira, E. Coviello, G. Doyle, G. R. G. Lanckriet, R. Levy, and N. Vasconcelos, “A new approach to cross-modal multimedia retrieval,” in International Conference on Multimedia, 2010.
  • [36] S. Liang, S. Ji, and J. Ye, “A least squares formulation for canonical correlation analysis,” in International Conference on Machine Learning, 2008.
  • [37] H. Avron, C. Boutsidis, S. Toledo, and A. Zouzias, “Efficient dimensionality reduction for canonical correlation analysis,” in International Conference on Machine Learning, 2013.
  • [38] M. B. Blaschko and C. H. Lampert, “Correlational spectral clustering,” in IEEE Conference on Computer Vision and Pattern Recognition, 2008.
  • [39] K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan, “Multi-view clustering via canonical correlation analysis,” in International Conference on Machine Learning, 2009.
  • [40] S. M. Kakade and D. P. Foster, “Multi-view regression via canonical correlation analysis,” in Conference on Learning Theory, 2007.
  • [41] B. McWilliams, D. Balduzzi, and J. M. Buhmann, “Correlated random features for fast semi-supervised learning,” in Advances in Neural Information Processing Systems, 2013.
  • [42] P. S. Dhillon, D. Foster, and L. Ungar, “Multi-view learning of word embeddings via cca,” in Advances in Neural Information Processing Systems, 2011.
  • [43] P. S. Dhillon, J. Rodu, D. P. Foster, and L. H. Ungar, “Using cca to improve cca: A new spectral method for estimating vector models of words,” in International Conference on Machine Learning, 2012.
  • [44] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, “A multi-view embedding space for modeling internet images, tags, and their semantics,” International Journal of Computer Vision, vol. 106, no. 2, pp. 210–233, 2014.
  • [45] K. Tae-Kyun, K. Josef, and C. Roberto, “Discriminative learning and recognition of image set classes using canonical correlations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1005–1018, 2007.
  • [46] S. Ya, F. Yun, G. Xinbo, and T. Qi, “Discriminant learning through multiple principal angles for visual recognition,” IEEE Transactions on Image Processing, vol. 21, no. 3, pp. 1381–1390, 2012.
  • [47] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” Computer Vision and Image Understanding, vol. 106, no. 1, pp. 59–70, 2007.
  • [48] Y. Li, F. Nie, H. Huang, and J. Huang, “Large-scale multi-view spectral clustering via bipartite graph,” in AAAI Conference on Artificial Intelligence, 2015.
  • [49] C. H. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object classes by between-class attribute transfer,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • [50] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “Nus-wide: a real-world web image database from national university of singapore,” in ACM International Conference on Image and Video Retrieval, 2009.
  • [51] M. Amini, N. Usunier, and C. Goutte, “Learning from multiple partially observed views-an application to multilingual text categorization,” in Advances in Neural Information Processing Systems, 2009.
  • [52] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010.
  • [53] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, “Sun database: Exploring a large collection of scene categories,” International Journal of Computer Vision, vol. 119, no. 1, pp. 3–22, 2016.
  • [54] W. Wang, R. Arora, K. Livescu, and J. Bilmes, “On deep multi-view representation learning,” in International Conference on Machine Learning, 2015.
  • [55] M. Dorfer, R. Kelz, and G. Widmer, “Deep linear discriminant analysis,” in International Conference on Learning Representations, 2016.
  • [56] M. Kan, S. Shan, H. Zhang, S. Lao, and X. Chen, “Multi-view discriminant analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pp. 188–194, 2016.
  • [57] M. Kan, S. Shan, and X. Chen, “Multi-view deep network for cross-view classification,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [58] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012.
  • [59] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • [60] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015.
  • [61] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.