Random Forest for Dissimilarity-based Multi-view Learning

07/16/2020 · Simon Bernard et al. · Université de Rouen · École de Technologie Supérieure (ÉTS)

Many classification problems are naturally multi-view, in the sense that their data are described through multiple heterogeneous descriptions. For such tasks, dissimilarity strategies are effective ways to make the different descriptions comparable and to easily merge them, by (i) building intermediate dissimilarity representations for each view and (ii) fusing these representations by averaging the dissimilarities over the views. In this work, we show that the Random Forest proximity measure can be used to build the dissimilarity representations, since this measure reflects similarities between features but also class membership. We then propose a Dynamic View Selection method to better combine the view-specific dissimilarity representations. This allows a decision to be taken, for each instance to predict, with only the most relevant views for that instance. Experiments are conducted on several real-world multi-view datasets, and show that the Dynamic View Selection offers a significant improvement in performance compared to the simple average combination and to two state-of-the-art static view combination methods.

1 Introduction

In many real-world pattern recognition problems, the available data are complex in that they cannot be described by a single numerical representation. This may be due to multiple sources of information, as for autonomous vehicles for example, where multiple sensors are jointly used to recognize the environment [7]. It may also be due to the use of several feature extractors, such as in image recognition tasks, often based on multiple families of features such as color, shape and texture descriptors [6].

Learning from these types of data is called multi-view learning and each modality/set of features is called a view. For this type of task, it is assumed that the views convey different types of information, each of which can contribute to the pattern recognition task. Therefore, the challenge is generally to carry out the learning task taking into account the complementarity of the views. However, the difficulty with this is that these views can be very different from each other in terms of dimension, nature and meaning, and therefore very difficult to compare or merge. In a recent work [6], we proposed to use dissimilarity strategies to overcome this issue. The idea is to use a dissimilarity measure to build intermediate representations from each view separately, and to merge them afterward. By describing the instances with their dissimilarities to other instances, the merging step becomes straightforward since the intermediate dissimilarity representations are fully comparable from one view to another.

For using dissimilarities in multi-view learning, two questions must be addressed: (i) how to measure and exploit the dissimilarity between instances for building the intermediate representation? and (ii) how to combine the view-specific dissimilarity representations for the final prediction?

In our preliminary work [6], the first question was addressed with Random Forest (RF) classifiers. RF are known to be versatile and accurate classifiers [2, 14], but they are also known to embed a (dis)similarity measure between instances [13]. The advantage of such a mechanism in comparison to traditional similarity measures is that it takes the classification/regression task into account when computing the similarities. For classification for example, the instances that belong to the same class are more likely to be similar according to this measure. Therefore, a RF trained on a view can be used to measure the dissimilarities between instances according to that view, and according to their class membership as well. This measure is used to build the per-view intermediate representations by calculating the dissimilarity of a given instance $\mathbf{x}$ to all the $N$ training instances. By doing so, $\mathbf{x}$ can be represented by a new feature vector of size $N$, or in other words in an $N$-dimensional space where each dimension is the dissimilarity to one of the training instances. This space is called the dissimilarity space [18, 8] and is used as the intermediate representation for each view.

As for the second question, we addressed the combination of the view-specific dissimilarity representations by computing the average dissimilarities over all the views. That is to say, for an instance $\mathbf{x}$, all the view-specific dissimilarity vectors are computed and averaged to obtain a final vector of size $N$. Each value in this vector is thus the average dissimilarity between $\mathbf{x}$ and one of the $N$ training instances over the $Q$ views. This is a simple, yet meaningful way to combine the information conveyed by each view. However, one could find it a little naive considering the true rationale behind multi-view learning. Indeed, even if the views are expected to be complementary to each other, they are likely to contribute to the final decision in very different ways. One view in particular may be less informative than another, and these contributions may even vary from one instance to predict to another. In that case, it is desirable to estimate this importance and to take it into account when merging the view-specific representations. This is the goal we pursue in the present work.

In a nutshell, our preliminary work [6] has validated the generic framework explained above, with the two following key steps: (i) building the dissimilarity space with the RF dissimilarity mechanism and (ii) combining the views afterward by averaging the dissimilarities. In the present work, we deepen the second step by investigating two methods to better combine the view-specific dissimilarities:

  1. combining the view-specific dissimilarities with a static weighted average, so that the views contribute differently to the final dissimilarity representation; in particular, we propose an original weight calculation method based on an analysis of the RF classifiers used to compute the view-specific dissimilarities;

  2. combining the view-specific dissimilarities with a dynamic combination, for which the views are solicited differently from one instance to predict to another; this dynamic combination is based on the definition of a region of competence for which the performance of the RF classifiers is assessed and used for a view selection step afterward.

The rest of this chapter is organized as follows. The Random Forest dissimilarity measure is firstly explained in Section 2. The way it is used for multi-view classification is detailed in Section 3. The different strategies for combining the dissimilarity representations are given in Section 4, along with our two proposals for static and dynamic view combinations. Finally, the experimental validation is presented in Section 5.

2 Random Forest Dissimilarity

To fully understand the way a RF classifier can be used to compute dissimilarities between instances, it is first necessary to understand how an RF is built and how it gives a prediction for each new instance.

2.1 Random Forest

In this work, the name "Random Forest" refers to Breiman's reference method [2]. Let us briefly recall its procedure for building a forest of Decision Trees from a training set $T$. First, $M$ bootstrap samples are built by random draw with replacement of $N$ instances, amongst the $N$ training instances available in $T$. Each of these bootstrap samples is then used to build one tree. During this induction phase, at each node of the tree, a splitting rule is designed by selecting a feature among $mtry$ features randomly drawn from the $m$ available features. The feature retained for the splitting rule at a given node is the one among these $mtry$ features that maximizes the splitting criterion. At last, the trees in RF classifiers are grown to their maximum depth, that is to say until all their terminal nodes (also called leaves) are pure. The resulting RF classifier is typically noted:

$H = \{ h_k, \; k = 1, \ldots, M \}$   (1)

where $h_k$ is the $k$th Random Tree of the forest, built using the mechanisms explained above [2, 1], and $M$ is the number of trees in the forest. Note however that there exist many other RF learning methods that differ from Breiman's method by the use of different randomization techniques for growing the trees [20].

For predicting the class of a given instance $\mathbf{x}$ with a Random Tree, $\mathbf{x}$ goes down the tree structure from its root to one of its leaves. The descending path followed by $\mathbf{x}$ is determined by successive tests on the values of its features, one per node along the path. The prediction is given by the leaf in which $\mathbf{x}$ has landed. More information about this procedure can be found in recently published RF reviews [23, 1, 20]. The key point here is that, if two test instances land in the same terminal node, they are likely to belong to the same class and they are also likely to share similarities in their feature vectors, since they have followed the same descending path. This is the main motivation behind using RF for measuring dissimilarities between instances.

Note that the final prediction of a RF classifier is usually obtained via majority voting over the component trees. Here again, there exist alternatives to majority voting [20], but the latter remains the most widely used as far as we know.
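As an illustration, a Breiman-style forest can be instantiated with scikit-learn as in the minimal sketch below (the training data X_train and y_train are assumed to be given; note that scikit-learn's forests average the trees' class probability estimates rather than taking a strict majority vote):

    from sklearn.ensemble import RandomForestClassifier

    # A Breiman-style forest: one bootstrap sample per tree, a random
    # subset of features evaluated at each node, trees grown until the
    # leaves are pure.
    forest = RandomForestClassifier(
        n_estimators=512,     # number M of Random Trees (the value used in Section 5)
        bootstrap=True,       # one bootstrap sample per tree
        max_features="sqrt",  # mtry = sqrt(m) features drawn at each node
        max_depth=None,       # fully grown trees
    ).fit(X_train, y_train)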

2.2 Using Random Forest for measuring dissimilarities

The RF dissimilarity measure is the opposite of the RF proximity (or similarity) measure defined in Breiman's work [2, 23, 6], the latter being noted $p_H$ in the following.

The RF dissimilarity measure $d_H$ is inferred from a RF classifier $H$, learned from $T$. Let us firstly define the dissimilarity measure inferred by a single Random Tree $h_k$, noted $d_k$: let $\mathcal{L}_k$ denote the set of leaves of $h_k$, and let $l_k(\mathbf{x})$ denote a function from the input domain $\mathcal{X}$ to $\mathcal{L}_k$, that returns the leaf of $h_k$ where $\mathbf{x}$ lands when one wants to predict its class. The dissimilarity measure $d_k$ is defined as in Equation 2: if two training instances $\mathbf{x}_i$ and $\mathbf{x}_j$ land in the same leaf of $h_k$, then the dissimilarity between both instances is set to $0$, else it is equal to $1$.

$d_k(\mathbf{x}_i, \mathbf{x}_j) = \begin{cases} 0 & \text{if } l_k(\mathbf{x}_i) = l_k(\mathbf{x}_j) \\ 1 & \text{otherwise} \end{cases}$   (2)

The $d_k$ measure is the strict opposite of the tree proximity measure $p_k$ [2, 23], i.e. $d_k = 1 - p_k$.

Now, the measure $d_H$ derived from the whole forest consists in calculating $d_k$ for each tree in the forest, and in averaging the resulting dissimilarity values over the $M$ trees, as follows:

$d_H(\mathbf{x}_i, \mathbf{x}_j) = \frac{1}{M} \sum_{k=1}^{M} d_k(\mathbf{x}_i, \mathbf{x}_j)$   (3)

Similarly to the way the predictions are given by a forest, the rationale is that the accuracy of the dissimilarity measure relies essentially on the averaging over a large number of trees. Moreover, this measure is a pairwise function that satisfies the reflexivity property ($d_H(\mathbf{x}_i, \mathbf{x}_i) = 0$), the non-negativity property ($d_H(\mathbf{x}_i, \mathbf{x}_j) \geq 0$) and the symmetry property ($d_H(\mathbf{x}_i, \mathbf{x}_j) = d_H(\mathbf{x}_j, \mathbf{x}_i)$). Note however that it does not satisfy the last two properties of distance functions, namely the definiteness property ($d_H(\mathbf{x}_i, \mathbf{x}_j) = 0$ does not imply $\mathbf{x}_i = \mathbf{x}_j$) and the triangle inequality ($d_H(\mathbf{x}_i, \mathbf{x}_k)$ is not necessarily less than or equal to $d_H(\mathbf{x}_i, \mathbf{x}_j) + d_H(\mathbf{x}_j, \mathbf{x}_k)$).
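In practice, the $d_H$ measure of Equation 3 can be computed for whole sets of instances at once from their leaf memberships. Below is a minimal Python/scikit-learn sketch (the helper name rf_dissimilarity is introduced here for illustration; apply() returns, for each instance, the index of the leaf it reaches in every tree):

    import numpy as np

    def rf_dissimilarity(forest, X_a, X_b):
        # Leaf indices, shape (n_instances, M): one leaf per tree.
        leaves_a = forest.apply(X_a)
        leaves_b = forest.apply(X_b)
        # d_H(x_i, x_j) = fraction of the M trees in which x_i and x_j
        # land in different leaves (Equation 3). Broadcasting builds the
        # full pairwise matrix; fine for moderate N, memory-hungry beyond.
        return (leaves_a[:, None, :] != leaves_b[None, :, :]).mean(axis=2)

    # D = rf_dissimilarity(forest, X_train, X_train)  # the N x N matrix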

As far as we know, only a few variants of this measure have been proposed in the literature [13, 5]. These variants differ from the measure explained above in the way they infer the dissimilarity value from a tree structure. The motivation is to design a finer way to measure the dissimilarity than the coarse binary value given in Equation 2. This binary value may seem intuitively too superficial to measure dissimilarities, especially considering that a tree structure can provide richer information about the way two instances are similar to each other.

The first variant [13] modifies the $d_k$ measure by using the path length from one leaf to another when two instances land in different leaf nodes. In this way, $d_k$ no longer takes its values in $\{0, 1\}$ but is computed as follows:

$d_k(\mathbf{x}_i, \mathbf{x}_j) = 1 - \exp\left( -w \cdot g_{ij} \right)$   (4)

where $g_{ij}$ is the number of tree branches between the two terminal nodes occupied by $\mathbf{x}_i$ and $\mathbf{x}_j$ in the $k$th tree of the forest, and where $w$ is a parameter that controls the influence of $g_{ij}$ in the computation. When $l_k(\mathbf{x}_i) = l_k(\mathbf{x}_j)$, $d_k$ is still equal to $0$, but in the opposite situation the resulting value lies in $]0, 1[$.

A second variant [5], noted RFD in the following, relies on a measure of instance hardness, namely the $k$-Disagreeing Neighbors ($kDN$) measure [21], which estimates the intrinsic difficulty of predicting an instance as follows:

$kDN(\mathbf{x}_i) = \frac{\left| \{ \mathbf{x}_j \in kNN(\mathbf{x}_i) : y_j \neq y_i \} \right|}{k}$   (5)

where $kNN(\mathbf{x}_i)$ is the set of the $k$ nearest neighbors of $\mathbf{x}_i$. This value is used for measuring the dissimilarity $d_{RFD}(\mathbf{x}, \mathbf{x}_i)$ between any instance $\mathbf{x}$ and any of the $N$ training instances $\mathbf{x}_i$, as follows:

$d_{RFD}(\mathbf{x}, \mathbf{x}_i) = \frac{1}{M} \sum_{k=1}^{M} \left( 1 - kDN_k(\mathbf{x}_i) \right) \cdot d_k(\mathbf{x}, \mathbf{x}_i)$   (6)

where $kDN_k(\mathbf{x}_i)$ is the $kDN$ measure computed in the subspace formed by the sole features used in the $k$th tree of the forest.

Any of these variants could be used to compute the dissimilarities in our framework. However, we choose to use the RFD variant in the following, since it has been shown to give very good results when used for building dissimilarity representations for multi-view learning [5].
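To make the instance hardness ingredient of RFD concrete, here is a minimal sketch of the $kDN$ measure of Equation 5 (the helper name kdn_hardness, the Euclidean neighborhoods and the default $k = 5$ are illustrative choices made here; in the RFD measure itself, $kDN_k$ is computed in the feature subspace of the $k$th tree):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def kdn_hardness(X, y, k=5):
        # kDN(x_i): fraction of the k nearest neighbors of x_i whose
        # class label differs from y_i (Equation 5).
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        _, idx = nn.kneighbors(X)       # the first neighbor is x_i itself
        neigh_labels = y[idx[:, 1:]]    # drop self, keep the k neighbors
        return (neigh_labels != y[:, None]).mean(axis=1)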

3 The dissimilarity representation for multi-view learning

3.1 The dissimilarity space

Among the different dissimilarity strategies for classification, the most popular is the dissimilarity representation approach [18]. It consists in using a set $R$ of $n$ reference instances to build a dissimilarity matrix. The elements of this matrix are the dissimilarities between the $N$ training instances in $T$ and the $n$ reference instances in $R$:

$D(T, R) = \begin{bmatrix} d(\mathbf{x}_1, \mathbf{p}_1) & \ldots & d(\mathbf{x}_1, \mathbf{p}_n) \\ \vdots & \ddots & \vdots \\ d(\mathbf{x}_N, \mathbf{p}_1) & \ldots & d(\mathbf{x}_N, \mathbf{p}_n) \end{bmatrix}$   (7)

where $d$ stands for a dissimilarity measure, $\mathbf{x}_i \in T$ are the training instances and $\mathbf{p}_j \in R$ are the reference instances. Even if $T$ and $R$ can be disjoint sets of instances, the most common choice is to take $R$ as a subset of $T$, or even as $T$ itself. In this work, for simplification purposes, and to avoid the selection of reference instances from $T$, we choose $R = T$. As a consequence, the dissimilarity matrix is always an $N \times N$ symmetric matrix.

Once such a square dissimilarity matrix is built, there exist two main ways to use it for classification: the embedding approach and the dissimilarity space approach [18]. The embedding approach consists in embedding the dissimilarity matrix in a Euclidean vector space such that the distances between the objects in this space are equal to the given dissimilarities. Such an exact embedding is possible for every symmetric dissimilarity matrix with zeros on the diagonal [18]. In practice, if the dissimilarity matrix can be transformed into a positive semi-definite (p.s.d.) similarity matrix, this can be done with kernel methods. This p.s.d. matrix is used as a pre-computed kernel, also called a kernel matrix. This method has been successfully applied with RFD along with SVM classifiers [15, 6].

The second approach, the dissimilarity space strategy, is more versatile and does not require the dissimilarity matrix to be transformed into a p.s.d. similarity matrix. It simply consists in using the dissimilarity matrix as a new training set. Indeed, each row of the matrix can be seen as the projection of a training instance into a dissimilarity space, where the $i$th dimension is the dissimilarity to the $i$th training instance. As a consequence, the matrix can be seen as the projection of the training set into this dissimilarity space, and can be fed afterward to any learning procedure. This method is much more straightforward than the embedding approach, as it can be used with any dissimilarity measure, regardless of its reflexivity or symmetry properties, and without transforming it into a p.s.d. similarity matrix.

In the following, the dissimilarity matrices built with the RFD measure are called RFD matrices, and are noted $D_{RFD}$ for short. It can be proven that the matrices derived from the initial RF proximity measure [2, 23] are p.s.d. and can be used as pre-computed kernels in SVM classifiers [6], following the embedding approach. However, the proof does not apply if the matrices are obtained with the RFD measure [5]. This is the main reason why we use the dissimilarity space strategy in this work, as it allows more flexibility.
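As an illustration of the dissimilarity space strategy, the sketch below simply treats the rows of an RFD-style matrix as feature vectors (it reuses the hypothetical rf_dissimilarity helper from Section 2; D_train, X_test and the fitted forest are assumed to be given):

    from sklearn.ensemble import RandomForestClassifier

    # D_train is the N x N dissimilarity matrix: row i is the projection
    # of the ith training instance into the N-dimensional dissimilarity space.
    clf = RandomForestClassifier(n_estimators=512).fit(D_train, y_train)

    # A test instance is projected the same way: its dissimilarities to
    # the N training instances form its new feature vector.
    D_test = rf_dissimilarity(forest, X_test, X_train)   # shape (n_test, N)
    y_pred = clf.predict(D_test)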

3.2 Using dissimilarity spaces for multi-view learning

In traditional supervised learning tasks, each instance is described by a single vector of $m$ features. For multi-view learning tasks, each instance is described by $Q$ different vectors. As a consequence, the task is to infer a model $h$:

$h : \mathcal{X}^{(1)} \times \mathcal{X}^{(2)} \times \ldots \times \mathcal{X}^{(Q)} \rightarrow \mathcal{Y}$   (8)

where the $\mathcal{X}^{(q)}$ are the $Q$ input domains, i.e. the views. These views are generally of different dimensions, noted $m_1$ to $m_Q$. For such learning tasks, the training set $T$ is actually made up of $Q$ training subsets:

$T = \left\{ T^{(1)}, T^{(2)}, \ldots, T^{(Q)} \right\}, \quad T^{(q)} = \left\{ (\mathbf{x}_1^{(q)}, y_1), \ldots, (\mathbf{x}_N^{(q)}, y_N) \right\}$   (9)

The key principle of the proposed framework is to compute the $Q$ RFD matrices from each of the $Q$ training subsets $T^{(q)}$. For that purpose, each $T^{(q)}$ is fed to the RF learning procedure, resulting in $Q$ RF classifiers noted $H^{(q)}$. The RFD measure is then used to compute the $Q$ RFD matrices $D_{RFD}^{(q)}$.

Once these RFD matrices are built, they have to be merged in order to build the joint dissimilarity matrix that will serve as a new training set for an additional learning phase. This additional learning phase can be realized with any learning algorithm, since the goal is to address the classification task. For simplicity and because they are as accurate as they are versatile, the same Random Forest method used to calculate the dissimilarities is also used in this final learning stage.

Regarding the merging step, which is the main focus of the present work, it can be straightforwardly done by a simple average of the $Q$ RFD matrices:

$D_{RFD} = \frac{1}{Q} \sum_{q=1}^{Q} D_{RFD}^{(q)}$   (10)

The whole RFD based multi-view learning procedure is summarized in Algorithm 1 and illustrated in Figure 1.

Input: $T^{(1)}, \ldots, T^{(Q)}$: the $Q$ training sets, each composed of $N$ instances
Input: $RF(.)$: the Breiman's RF learning procedure
Input: $d_{RFD}(.,.)$: the RFD dissimilarity measure
Output: $H^{(1)}, \ldots, H^{(Q)}$: $Q$ RF classifiers
Output: $H^{final}$: the final RF classifier
1 for $q = 1$ to $Q$ do
2        $H^{(q)} = RF(T^{(q)})$
         // Build the RFD matrix $D_{RFD}^{(q)}$:
3        forall $\mathbf{x}_i \in T^{(q)}$ do
4               forall $\mathbf{x}_j \in T^{(q)}$ do
5                      $D_{RFD}^{(q)}[i][j] = d_{RFD}(\mathbf{x}_i, \mathbf{x}_j)$
6               end forall
7        end forall
8 end for
// Build the average RFD matrix:
9 $D_{RFD} = \frac{1}{Q} \sum_{q=1}^{Q} D_{RFD}^{(q)}$
// Train the final classifier on $D_{RFD}$:
10 $H^{final} = RF(D_{RFD})$
Algorithm 1 The RFD multi-view learning procedure
Figure 1: The RFD framework for multi-view learning.
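As an illustration, here is a compact Python rendering of Algorithm 1, under the same assumptions as the previous sketches (views is a list of $Q$ arrays of shape (N, m_q) sharing the label vector y; the plain dissimilarity of Equation 3, via the hypothetical rf_dissimilarity helper, stands in for the full RFD measure):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def rfd_multiview_train(views, y, n_trees=512):
        forests, matrices = [], []
        for X_q in views:
            H_q = RandomForestClassifier(n_estimators=n_trees).fit(X_q, y)
            forests.append(H_q)
            matrices.append(rf_dissimilarity(H_q, X_q, X_q))  # N x N matrix of view q
        D_joint = np.mean(matrices, axis=0)   # Equation 10: average over the Q views
        H_final = RandomForestClassifier(n_estimators=n_trees).fit(D_joint, y)
        return forests, H_final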

As for the prediction phase, the procedure is very similar. For any new instance $\mathbf{x}$ to predict:

  1. Compute $d_{RFD}(\mathbf{x}, \mathbf{x}_i)$ for all the $N$ training instances, in each of the $Q$ views, to form $Q$ $N$-sized dissimilarity vectors. These vectors are the dissimilarity representations of $\mathbf{x}$ from each of the $Q$ views.

  2. Average these $Q$ vectors to form the $N$-sized vector that corresponds to the projection of $\mathbf{x}$ in the joint dissimilarity space.

  3. Predict the class of $\mathbf{x}$ with the final classifier trained on $D_{RFD}$, as sketched below.
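The corresponding prediction phase, under the same assumptions as the training sketch above:

    import numpy as np

    def rfd_multiview_predict(forests, H_final, views_train, views_test):
        # Steps 1 and 2: per-view dissimilarity representations of the
        # test instances, averaged into the joint representation.
        d_joint = np.mean(
            [rf_dissimilarity(H_q, X_te, X_tr)
             for H_q, X_tr, X_te in zip(forests, views_train, views_test)],
            axis=0)
        # Step 3: predict with the classifier trained on the joint matrix.
        return H_final.predict(d_joint)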

4 Combining views with weighted combinations

The average dissimilarity is a simple, yet meaningful way to merge the dissimilarity representations built from all the views. However, it intrinsically considers that all the views are equally relevant with regard to the task and that the resulting dissimilarities are as reliable as each other. This is likely to be wrong from our point of view. In multi-view learning problems, the different views are meant to be complementary in some ways, that is to say to convey different types of information regarding the classification task. These different types of information may not have the same contribution to the final predictions. That is the reason why it may be important to differentiate these contributions, for example with a weighted combination in which the weights would be defined according to the view reliability.

The calculation of these weights can be done following two paradigms: static weighting and dynamic weighting. The static weighting principle is to weight the views once and for all, with the assumption that the importance of each view is the same for all the instances to predict. The dynamic weighting principle, on the other hand, aims at setting different weights for each instance to predict, with the assumption that the contribution of each view to the final prediction is likely to differ from one instance to another.

4.1 Static combination

Given the $Q$ dissimilarity matrices $D_{RFD}^{(q)}$ built from the $Q$ different views, our goal is to find the best set of non-negative weights $\{w_1, \ldots, w_Q\}$, so that the joint dissimilarity matrix is:

$D_{RFD} = \sum_{q=1}^{Q} w_q D_{RFD}^{(q)}$   (11)

where $w_q \geq 0$ and $\sum_{q=1}^{Q} w_q = 1$.

There exist several ways, proposed in the literature, to compute the weights of such a static combination of dissimilarity matrices. The most natural one is to deduce the weights from a quality score measured on each view. For example, this principle has been used for multi-scale image classification [17] where each view is a version of the image at a given scale, i.e. the weights are derived directly from the scale factor associated with the view. Obviously, this only makes sense with regard to the application, for which the scale factor gives an indication of the reliability for each view.

Another, more generic and classification-specific approach is to evaluate the quality of the dissimilarity matrix through the performance of a classifier. This makes it possible to estimate whether a dissimilarity matrix sufficiently reflects class membership [12, 17]. For example, one can train an SVM classifier from each dissimilarity matrix and use its accuracy as an estimate of the corresponding weight [17]. $k$NN classifiers are also very often used for that purpose [12, 16]. The reason is that a good dissimilarity measure is expected to produce good neighborhoods, or in other words, the most similar instances should belong to the same class.

Since kernel matrices can be viewed as similarity matrices, there are also a few solutions in the kernel methods literature that could be used to estimate the quality of a dissimilarity matrix. The most notable is the Kernel Alignment (KA) estimate [9], for measuring the similarity between two kernel matrices $K_1$ and $K_2$:

$KA(K_1, K_2) = \frac{\langle K_1, K_2 \rangle_F}{\sqrt{\langle K_1, K_1 \rangle_F \langle K_2, K_2 \rangle_F}}$   (12)

where $\langle \cdot, \cdot \rangle_F$ denotes the Frobenius inner product [9].

In order to use the KA measure to estimate the quality of a given kernel matrix, a target matrix must be defined beforehand. This target matrix is an ideal theoretical similarity matrix with regard to the task. For example, for binary classification, the ideal target matrix is usually defined as $K^* = \mathbf{y}\mathbf{y}^T$, where $\mathbf{y}$ is the vector of the true labels of the training instances, in $\{-1, +1\}$. Thus, each value in $K^*$ is:

$K^*[i][j] = y_i y_j = \begin{cases} +1 & \text{if } y_i = y_j \\ -1 & \text{otherwise} \end{cases}$   (13)

In other words, the ideal matrix is the similarity matrix in which instances are considered similar ($K^*[i][j] = 1$) if and only if they belong to the same class. This estimate is transposed to multi-class classification problems as follows [4]:

$K^*[i][j] = \begin{cases} 1 & \text{if } y_i = y_j \\ \frac{-1}{C - 1} & \text{otherwise} \end{cases}$   (14)

where $C$ is the number of classes.

Both the $k$NN and KA methods presented above are used in the experimental part for comparison purposes (cf. Section 5). However, in order to use the KA method for our problem, some adaptations are required. Firstly, the dissimilarity matrices need to be transformed into similarity matrices, by $K^{(q)} = \mathbf{1} - D_{RFD}^{(q)}$, where $\mathbf{1}$ is the all-ones matrix. The following heuristic is then used to deduce the weight $w_q$ from the KA measure [19]:

$w_q = \frac{KA(K^{(q)}, K^*)}{\sum_{l=1}^{Q} KA(K^{(l)}, K^*)}$   (15)

Strictly speaking, for the similarity matrices $K^{(q)}$ to be considered as kernel matrices, it must be proven that they are p.s.d. When such matrices are proven to be p.s.d., the KA estimate is necessarily non-negative, and the corresponding weights $w_q$ are also non-negative [9, 19]. However, as it is not proven that the similarity matrices built from our $D_{RFD}^{(q)}$ matrices are p.s.d., we propose to use the softmax function to normalize the weights and to ensure they are strictly positive:

$w_q = \frac{\exp\left( KA(K^{(q)}, K^*) \right)}{\sum_{l=1}^{Q} \exp\left( KA(K^{(l)}, K^*) \right)}$   (16)
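A minimal sketch of this KA-based weighting, with the adaptations just described (the helper name ka_weights is introduced here for illustration; each similarity matrix is obtained as $\mathbf{1} - D$, and the alignment scores of Equation 12 are normalized with the softmax of Equation 16):

    import numpy as np

    def ka_weights(D_views, y):
        # Multi-class ideal target matrix K* of Equation 14.
        C = len(np.unique(y))
        K_star = np.where(y[:, None] == y[None, :], 1.0, -1.0 / (C - 1))

        def alignment(K):
            # Equation 12: Frobenius inner product over Frobenius norms.
            return (K * K_star).sum() / (np.linalg.norm(K) * np.linalg.norm(K_star))

        scores = np.array([alignment(1.0 - D) for D in D_views])
        w = np.exp(scores)     # softmax of Equation 16 ...
        return w / w.sum()     # ... guarantees strictly positive weights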

The main drawback of the methods mentioned above is that they evaluate the quality of the dissimilarity matrices based solely on the training set. This is the very essence of these methods, which are designed to evaluate (dis)similarity matrices built from a sample, e.g. the training set. However, this may cause overfitting issues when these dissimilarity matrices are used for classification purposes, as is the case in our framework. Ideally, the weights should be set from the quality of the dissimilarity representations estimated on an independent validation dataset. Obviously, this requires additional labeled instances. The method we propose in this section makes it possible to estimate the quality of the dissimilarity representations without the use of additional validation instances.

The idea behind our method is that the relevance of a RFD space is reflected by the accuracy of the RF classifier used to build it. This accuracy can be efficiently estimated with a mechanism called the Out-Of-Bag (OOB) error. This OOB error is an estimate supplied by the Bagging principle, known to be a reliable estimate of the generalization error [2]. Since the RF classifiers in our framework are built with the Bagging principle, the OOB error can be used to estimate their generalization error without the need of an independent validation dataset.

Let us briefly explain here how the OOB error is obtained from a RF: let $B$ denote a bootstrap sample formed by randomly drawing $n$ instances from $T$, with replacement. When $n = N$, $N$ being the number of instances in $T$, it can be proven that about one third of $T$, on average, will not be drawn to form $B$ [2]. These instances are called the OOB instances of $B$. When using Bagging for growing a RF classifier, each tree in the forest is trained on a bootstrap sample, that is to say using only about two thirds of the training instances. Similarly, each training instance is used for growing about two thirds of the trees in the forest. The remaining trees are called the OOB trees of this instance. The OOB error is the error rate measured on the whole training set by only using the OOB trees of each training instance.

Therefore, the method we propose consists in deriving the weight of each view directly from the OOB error of the RF classifier trained on that view. This method is noted $SW_{OOB}$ in the following.
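A sketch of this $SW_{OOB}$ weighting with scikit-learn, which exposes the OOB accuracy of a forest as oob_score_ (turning the OOB errors into normalized weights, as done in the last two lines, is one plausible realization of the heuristic, not necessarily the exact one):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # One forest per view, with the OOB estimate activated.
    forests = [RandomForestClassifier(n_estimators=512, oob_score=True).fit(X_q, y)
               for X_q in views]
    oob_errors = np.array([1.0 - H_q.oob_score_ for H_q in forests])
    w = 1.0 - oob_errors   # the more accurate a view, the larger its weight
    w = w / w.sum()        # normalize so that the weights sum to 1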

4.2 Dynamic combination

In contrast to static weighting, dynamic weighting aims at assigning different weights to the views for each instance to predict [10]. The intuition behind using dynamic weighting in our framework is that the prediction for different instances may rely on different types of information, i.e. different views. In that case, it is crucial to use different weights for building the joint dissimilarity representation from one instance to predict to another.

However, such a dynamic weighting process is particularly complex in our framework. Let us recall that the framework we propose to use in this work is composed of two stages: (i) inferring the dissimilarity matrix from each view, and (ii) combining the per-view dissimilarity matrices to form a new training set. The weights we want to determine are the weights used to compute the final joint dissimilarity matrix in stage (ii). As a consequence, if these weights change for each instance to predict, the joint dissimilarity matrix must be completely recalculated and a new classifier must also be re-trained afterwards. This means that, for every new instance to predict, a whole training procedure has to be performed. This is computationally expensive and quite inefficient from our point of view.

To overcome this problem, we propose to use Dynamic Classifier Selection (DCS) instead of dynamic weighting. DCS is a generic strategy, amongst the most successful ones in the Multiple Classifier Systems literature [10]. It typically aims at selecting one classifier in a pool of candidate classifiers, for each instance to predict. This is essentially done through two steps [3]: (i) the generation of a pool of candidate classifiers and (ii) the selection of the most competent classifier in this pool for the instance to predict. The solutions we propose for these steps are illustrated in Figure 2, the first step in the upper part and the second step in the lower part. The whole procedure is also detailed in Algorithm 2 and described in the following.

4.2.1 Generation of the pool of classifiers

The generation of the pool is the first key step of DCS. As the aim is to select the most competent classifier on the fly for each given test instance, the classifiers in the pool must be as diverse and as individually accurate as possible. In our case, the challenge is not to create diversity amongst the classifiers, since they are trained on different joint dissimilarity matrices, generated with different sets of weights. The challenge is rather to generate the different weight tuples used to compute these joint dissimilarity matrices. For such a task, a traditional grid search strategy could be used. However, the number of candidate solutions increases exponentially with the number of views. For example, suppose that we sample the weights with 10 values in $[0, 1]$. For $Q$ views, it would result in $10^Q$ different weight tuples. Six views would thus imply generating 1 million weight tuples and training 1 million classifiers afterwards. Here again, this is obviously highly inefficient.

The alternative approach we propose is to select a subset of views for every candidate in the pool, instead of considering a weighted combination of all of them. By doing so, for each instance to predict, only the views that are considered informative enough are expected to be used for its prediction. The selected views are then combined by averaging. For example, if a problem is described with six views, there are $2^6 - 1 = 63$ possible combinations (the situation in which none of the views is selected is obviously ignored), which will result in a pool of 63 classifiers. Lines 1 to 6 of Algorithm 2 give a detailed implementation of this procedure.

4.2.2 Evaluation and selection of the best classifier

The selection of the most competent classifier is the second key step of DCS. Generally speaking, this selection is made through two steps [10]: (i) the definition of a region of competence for the instance $\mathbf{x}$ to predict and (ii) the evaluation of each classifier in the pool on this region of competence, in order to select the most competent one.

The region of competence of each instance $\mathbf{x}$ is the region used to estimate the competence of the classifiers for predicting that instance. The usual way to define it is to rely on clustering methods or on the $k$ nearest neighbors ($k$NN) of $\mathbf{x}$. For clustering [22], the principle is usually to define the region of competence as the closest cluster to $\mathbf{x}$, according to the distances of $\mathbf{x}$ to the centroids of the clusters. As the clusters are fixed once and for all, many different instances might share the same region of competence. In contrast, $k$NN methods give different regions of competence from one instance to another, which allows for more flexibility, but at the expense of a higher computational cost [11].

The most important part of the selection process is to define the criterion that measures the competence level of each classifier in the pool. There are many methods for doing so, which differ in the way they estimate the competence, using for example a ranking, the classifier accuracies, a data complexity measure, etc. [10]. Nevertheless, the general principle is most of the time the same: calculating the measure on the region of competence exclusively. We do not give an exhaustive survey of the ways this can be done here, but briefly explain the most representative method, namely the Local Classifier Accuracy (LCA) method [24], as an illustration.

The LCA method measures the local accuracy of a candidate classifier $h_l$, with respect to the prediction $\tilde{y} = h_l(\mathbf{x})$ it gives for an instance $\mathbf{x}$:

$LCA(h_l, \mathbf{x}) = \frac{\left| \{ \mathbf{x}_j \in R_{\mathbf{x}, \tilde{y}} : h_l(\mathbf{x}_j) = y_j \} \right|}{\left| R_{\mathbf{x}, \tilde{y}} \right|}$   (17)

where $R_{\mathbf{x}}$ is the region of competence for $\mathbf{x}$, and $R_{\mathbf{x}, \tilde{y}} \subseteq R_{\mathbf{x}}$ is the set of instances from $R_{\mathbf{x}}$ that belong to the class $\tilde{y}$. Therefore, $LCA(h_l, \mathbf{x})$ represents the percentage of correct classifications within the region of competence, by only considering the instances that belong to the same class as the one the classifier predicts for $\mathbf{x}$. In this calculation, the instances in $R_{\mathbf{x}}$ generally come from a validation set, independent of the training set [10].

The alternative method we propose here is to use a selection criterion that does not rely on an independent validation set, but rather on the OOB estimate. To do so, the region of competence is formed by the $k$ nearest neighbors of $\mathbf{x}$ amongst the training instances. These nearest neighbors are determined in the joint dissimilarity space with the RFD measure (instead of the traditional Euclidean distance). This choice is related to the fact that each candidate classifier is trained in this dissimilarity space, but also to the fact that the RFD measure is more robust to high dimensional spaces than traditional distance measures. Finally, the competence of each classifier is estimated with its OOB error on the $k$ nearest neighbors of $\mathbf{x}$. Lines 7 to 15 of Algorithm 2 give all the details of this process.

To sum up, the key mechanisms of the DCS method we propose, noted $DS_{OOB}$ and detailed in Algorithm 2, are:

  • Create the pool of classifiers by using all the possible subsets of views, to avoid the expensive grid search for the weights generation (lines 4-5 of Algorithm 2).

  • Define the region of competence in the dissimilarity space by using the RFD dissimilarity measure, to circumvent the issues that arise from high dimensional spaces (lines 12-13 of Algorithm 2).

  • Evaluate the competence of each candidate classifier with its OOB error rate, so that no additional validation instances are required (line 14 of Algorithm 2).

  • Select the best classifier for $\mathbf{x}$ (lines 16-17 of Algorithm 2).

These steps are also illustrated in Figure 2, with the generation of the pool of classifiers in the upper part, and the evaluation and selection of the classifier in the lower part of the figure. For illustration purposes, the classifier ultimately selected for predicting the class of $\mathbf{x}$ is assumed to be the second candidate (in red).

Input: $T^{(1)}, \ldots, T^{(Q)}$: the $Q$ training sets, each composed of $N$ instances
Input: $D_{RFD}^{(1)}, \ldots, D_{RFD}^{(Q)}$: the $Q$ RFD matrices, built from the $Q$ views
Input: $H^{(1)}, \ldots, H^{(Q)}$: the $Q$ RF classifiers used to build the $D_{RFD}^{(q)}$
Input: $RF(.)$: the RF learning procedure
Input: $d_{RFD}(.,.)$: the RFD measure
Input: $k$: the number of neighbors to define the region of competence
Input: $\mathbf{x}$: an instance to predict
Output: $\hat{y}$: the prediction for $\mathbf{x}$
// 1 - Generate the pool of classifiers:
1 $S \leftarrow$ all the $2^Q - 1$ possible non-zero $Q$-sized 0/1 vectors
2 $P \leftarrow$ an empty pool of classifiers
3 forall $\mathbf{s} \in S$ do
        // The candidate classifier in the pool, $s_q$ being the $q$th value of $\mathbf{s}$, either equal to 1 or 0:
4       $D_{\mathbf{s}} = \frac{1}{\sum_{q} s_q} \sum_{q=1}^{Q} s_q D_{RFD}^{(q)}$
5       add $H_{\mathbf{s}} = RF(D_{\mathbf{s}})$ to $P$
6 end forall
// 2 - Evaluate the candidate classifiers for $\mathbf{x}$:
7 for $q = 1$ to $Q$ do
        // the dissimilarity representation of $\mathbf{x}$ in the $q$th view:
8       $\mathbf{d}^{(q)} = \left[ d_{RFD}(\mathbf{x}, \mathbf{x}_1), \ldots, d_{RFD}(\mathbf{x}, \mathbf{x}_N) \right]$, computed with $H^{(q)}$
9 end for
10 forall $H_{\mathbf{s}} \in P$ do
        // The averaged dissimilarity representation of $\mathbf{x}$:
11      $\mathbf{d}_{\mathbf{s}} = \frac{1}{\sum_{q} s_q} \sum_{q=1}^{Q} s_q \mathbf{d}^{(q)}$
        // The region of competence, each neighbor being a row of $D_{\mathbf{s}}$:
12      find the $k$ NN of $\mathbf{x}$ amongst the training instances, according to $\mathbf{d}_{\mathbf{s}}$
13      $R_{\mathbf{x}} \leftarrow$ the corresponding $k$ rows of $D_{\mathbf{s}}$
        // The competence of $H_{\mathbf{s}}$ on $R_{\mathbf{x}}$:
14      $comp(H_{\mathbf{s}}) \leftarrow$ the OOB accuracy of $H_{\mathbf{s}}$ on $R_{\mathbf{x}}$
15 end forall
// 3 - Select the best classifier for $\mathbf{x}$ and predict its class:
16 $H_{\mathbf{s}^*} \leftarrow \arg\max_{H_{\mathbf{s}} \in P} comp(H_{\mathbf{s}})$
17 $\hat{y} = H_{\mathbf{s}^*}(\mathbf{d}_{\mathbf{s}^*})$
Algorithm 2 The $DS_{OOB}$ method
Figure 2: The $DS_{OOB}$ procedure, with the training and prediction phases. In this illustration, the best candidate classifier, which gives the final prediction for $\mathbf{x}$, is the second one (in red).
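A condensed Python sketch of this dynamic selection, under the same assumptions as the previous sketches (the helper names build_pool and ds_oob_predict are introduced here for illustration; D_views are the per-view RFD matrices, d_views the per-view dissimilarity vectors of the instance to predict, and scikit-learn's oob_decision_function_ provides the OOB prediction of each training instance, from which the OOB competence on the region of competence is computed):

    import numpy as np
    from itertools import combinations
    from sklearn.ensemble import RandomForestClassifier

    def build_pool(D_views, y, n_trees=512):
        # One candidate per non-empty subset of views (lines 1-6 of Algorithm 2).
        pool, Q = {}, len(D_views)
        for r in range(1, Q + 1):
            for S in combinations(range(Q), r):
                D_S = np.mean([D_views[q] for q in S], axis=0)
                pool[S] = RandomForestClassifier(
                    n_estimators=n_trees, oob_score=True).fit(D_S, y)
        return pool

    def ds_oob_predict(d_views, pool, y, k=7):
        best_acc, best_pred = -1.0, None
        for S, clf in pool.items():
            d_S = np.mean([d_views[q] for q in S], axis=0)  # joint representation
            region = np.argsort(d_S)[:k]                    # k-NN in the RFD space
            # OOB predictions of the neighbors (may contain NaN for
            # instances drawn in every bootstrap; rare with many trees).
            oob = clf.oob_decision_function_[region]
            oob_pred = clf.classes_[np.argmax(oob, axis=1)]
            acc = np.mean(oob_pred == y[region])            # OOB competence
            if acc > best_acc:
                best_acc = acc
                best_pred = clf.predict(d_S.reshape(1, -1))[0]
        return best_pred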

5 Experiments

5.1 Experimental protocol

Both the $SW_{OOB}$ and the $DS_{OOB}$ methods are evaluated on several real-world multi-view datasets in the following, and compared to state-of-the-art methods: the simple average of the view-specific dissimilarity matrices ($Avg$) as a baseline, and the two static weighting methods presented in Section 4.1, namely the $SW_{kNN}$ and the $SW_{KA}$ methods.

The multi-view datasets used in this experiment are described in Table 1. All of them are real-world multi-view datasets, supplied with several views of the same instances: NonIDH1, IDHcodel, LowGrade and Progression are medical imaging classification problems, with different families of features extracted from different types of radiographic images; LSVT and Metabolomic are two other medical classification problems, the first one for Parkinson's disease recognition and the second one for colorectal cancer detection; BBC and BBCSport are text classification problems from news articles; Cal7, Cal20, Mfeat, NUS-WIDE2, NUS-WIDE3, AWA8 and AWA15 are image classification problems made up with different families of features extracted from the images. More details about how these datasets have been constituted can be found in the paper (and references therein) cited in the caption of Table 1.

              features  instances  views  classes  IR
AWA8          10940     640        6      8        1
AWA15         10940     1200       6      15       1
BBC           13628     2012       2      5        1.34
BBCSport      6386      544        2      5        3.16
Cal7          3766      1474       6      7        25.74
Cal20         3766      2386       6      20       24.18
IDHcodel      6746      67         5      2        2.94
LowGrade      6746      75         5      2        1.4
LSVT          309       126        4      2        2
Metabolomic   476       94         3      2        1
Mfeat         649       600        6      10       1
NonIDH1       6746      84         5      2        3
NUS-WIDE2     639       442        5      2        1.12
NUS-WIDE3     639       546        5      3        1.43
Progression   6746      84         5      2        1.68

Table 1: Real-world multi-view datasets [6]. IR stands for Imbalance Ratio, i.e. the number of instances from the majority class over the number of instances from the minority class.

All the methods used in these experiments include the same first stage, i.e. building the RF classifier from each view and then building the view-specific RFD matrices. Therefore, for a fair comparison on each dataset, all the methods use the exact same RF classifiers, made up with the same 512 trees [6]. As for the other important parameters of the RF learning procedure, the $mtry$ parameter is set to $\sqrt{m_q}$, where $m_q$ is the dimension of the view, and all the trees are grown to their maximum depth (i.e. with no pre-pruning).

The methods compared in this experiment differ in the way they combine the view-specific RFD matrices afterwards. We recall below these differences:

  • $Avg$ denotes the baseline method, for which the joint dissimilarity representation is formed by a simple average of the view-specific dissimilarity representations.

  • $SW_{kNN}$ and $SW_{KA}$ both denote static weighting methods for determining $Q$ weights, one per view. The first one derives the weights from the performance of a $k$NN classifier applied on each RFD matrix; the second one uses the KA method to estimate the relevance of each RFD matrix with regard to the classification problem.

  • $SW_{OOB}$ is the static weighting method we propose in this work, presented in Section 4.1; it computes the weight of each view from the OOB error rate of its RF classifier.

  • $DS_{OOB}$ is the dynamic selection method we propose in this work, presented in Section 4.2; it computes a different combination of the RFD matrices for each instance to predict, based on its $k$ nearest neighbors, with $k = 7$ following the recommendation in the literature [10].

After each method has determined its set of weights, the joint RFD matrix is computed. This matrix is then used as a new training set for a final RF classifier learnt with the same parameters as above (512 trees, $mtry = \sqrt{N}$ with $N$ the number of training instances, fully grown trees).

As for the pre-processing of the datasets, a stratified random splitting procedure is repeated 10 times, with 50% of the instances used for training and 50% for testing. The mean accuracies, with standard deviations, are computed over the 10 runs and reported in Table 2. Bold values in this table are the best average performance obtained on each dataset.

Datasets (rows): AWA8, AWA15, BBC, BBCSport, Cal7, Cal20, IDHCodel, LowGrade, LSVT, Metabolomic, Mfeat, NonIDH1, NUS-WIDE2, NUS-WIDE3, Progression

Method      Avg    $SW_{kNN}$   $SW_{KA}$   $SW_{OOB}$   $DS_{OOB}$
Avg rank    3.67   3.50         3.30        2.40         2.13

Table 2: Accuracy (mean ± standard deviation) and average ranks of the five methods on the 15 datasets.

5.2 Results and discussion

The first observation one can make from the results gathered in Table 2 is that the best performance is obtained with one of the two proposed methods on 13 of the 15 datasets. This is confirmed by the average ranks, which place these two methods in the first two positions. To better assess the extent to which these differences are significant, a pairwise analysis based on the Sign test is computed on the number of wins, ties and losses between the baseline method and each of the other methods. The result is shown in Figure 3.

Figure 3: Pairwise comparison between each method and the baseline $Avg$. The vertical lines mark the levels of statistical significance according to the Sign test.

From this statistical test, one can observe that none of the static weighting methods reaches the significance level of wins over the baseline method. This indicates that the simple average combination, when using dissimilarity representations for multi-view learning, is quite a strong baseline. It also underlines that all the views are globally relevant for the final classification task: no view is consistently irrelevant for all the predictions.

Figure 3 also shows that the dynamic selection method proposed in this work is the only method that predominantly improves the accuracy over this baseline, to the point of reaching the level of statistical significance. From our point of view, this shows that the views do not all contribute to the same extent to the good prediction of every instance. Some instances are better recognized when the dissimilarities are computed by relying on some views more than on the others. These views are certainly not the same ones from one instance to another, and some instances may need the dissimilarity information from all the views at some point. Nevertheless, this highlights that the confusion between the classes is not always consistent from one view to another. In that sense, the views complement each other, and this can be efficiently exploited for multi-view learning, provided that the most reliable views can be identified for every instance, one by one.

6 Conclusion

Multi-view data are now very common in real world applications. Whether they arise from multiple sources or from multiple feature extractors, the different views are supposed to provide a more accurate and complete description of objects than a single description would do. Our proposal in this work was to address multi-view classification tasks using dissimilarity strategies, which give an efficient way to handle the heterogeneity of the multiple views.

The general framework we proposed consists in building an intermediate dissimilarity representation for each view, and in combining these representations afterwards for learning. The key mechanism is to use Random Forest classifiers to measure the dissimilarities. Random Forests embed a (dis)similarity measure that takes the class membership into account in such a way that instances from the same class are similar. The resulting dissimilarity representations can be efficiently merged since they are fully comparable from one view to another.

Using this framework, our main contribution was to propose a dynamic view selection method that provides a better way of merging the per-view dissimilarity representations: a subset of views is selected for each instance to predict, in order to take the decision on the most relevant views while ignoring, as much as possible, the irrelevant ones. This subset of views is potentially different from one instance to another, because the views do not all contribute to the same extent to the prediction of each instance. This has been confirmed on several real-world multi-view datasets, for which the dynamic combination of views has yielded much better results than static combination methods.

However, in its current form, the dynamic selection method proposed in this chapter strongly depends on the number of candidate classifiers in the pool. To allow for more versatility, it could be interesting to decompose each view into several sub-views, for example by using the Bagging and Random Subspace principles before computing the view-specific dissimilarities. In this way, the dynamic combination could select some specific parts of each view, instead of considering the views as a whole.

Acknowledgement

This work is part of the DAISI project, co-financed by the European Union with the European Regional Development Fund (ERDF) and by the Normandy Region.

References

  • [1] G. Biau and E. Scornet. A random forest guided tour. TEST, 25:197–227, 2016.
  • [2] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
  • [3] A. S. Britto Jr, R. Sabourin, and L. E. Oliveira. Dynamic selection of classifiers - a comprehensive review. Pattern Recognition, 47(11):3665–3680, 2014.
  • [4] J. E. Camargo and F. A. González. A multi-class kernel alignment method for image collection summarization. In Proceedings of the 14th Iberoamerican Conference on Pattern Recognition: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications (CIARP), pages 545–552. Springer-Verlag, 2009.
  • [5] H. Cao. Random Forest For Dissimilarity Based Multi-View Learning: Application To Radiomics. PhD thesis, University of Rouen Normandy, 2019.
  • [6] H. Cao, S. Bernard, R. Sabourin, and L. Heutte. Random forest dissimilarity based multi-view learning for radiomics application. Pattern Recognition, 88:185–197, 2019.
  • [7] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3d object detection network for autonomous driving. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6526–6534, 2017.
  • [8] Y. M. G. Costa, D. Bertolini, A. S. Britto, G. D. C. Cavalcanti, and L. E. S. de Oliveira. The dissimilarity approach: a review. Artificial Intelligence Review, pages 1–26, 2019.
  • [9] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. S. Kandola. On kernel-target alignment. In Advances in Neural Information Processing Systems (NeurIPS), pages 367–373, 2002.
  • [10] R. M. Cruz, R. Sabourin, and G. D. Cavalcanti. Dynamic classifier selection: Recent advances and perspectives. Information Fusion, 41:195–216, 2018.
  • [11] M. C. De Souto, R. G. Soares, A. Santana, and A. M. Canuto. Empirical comparison of dynamic classifier selection methods based on diversity and accuracy for building ensembles. In IEEE International Joint Conference on Neural Networks (IJCNN), pages 1480–1487. IEEE, 2008.
  • [12] R. P. Duin and E. Pekalska. The dissimilarity space: Bridging structural and statistical pattern recognition. Pattern Recognition Letters, 33(7):826–832, 2012.
  • [13] C. Englund and A. Verikas. A novel approach to estimate proximity in a random forest: An exploratory study. Expert Systems with Applications, 39(17):13046–13050, 2012.
  • [14] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15:3133–3181, 2014.
  • [15] K. R. Gray, P. Aljabar, R. A. Heckemann, A. Hammers, and D. Rueckert. Random forest-based similarity measures for multi-modal classification of alzheimer’s disease. NeuroImage, 65:167–175, 2013.
  • [16] D. Li and Y. Tian. Survey and experimental study on metric learning methods. Neural Networks, 2018.
  • [17] Y. Li, R. P. Duin, and M. Loog. Combining multi-scale dissimilarities for image classification. In International Conference on Pattern Recognition (ICPR), pages 1639–1642. IEEE, 2012.
  • [18] E. Pekalska and R. P. W. Duin. The Dissimilarity Representation for Pattern Recognition: Foundations And Applications (Machine Perception and Artificial Intelligence). World Scientific Publishing Co., Inc., 2005.
  • [19] S. Qiu and T. Lane. A framework for multiple kernel support vector regression and its applications to sirna efficacy prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 6(2):190–199, 2009.
  • [20] L. Rokach. Decision forest: Twenty years of research. Information Fusion, 27:111–125, 2016.
  • [21] M. R. Smith, T. Martinez, and C. Giraud-Carrier. An instance level analysis of data complexity. Machine Learning, 95(2):225–256, 2014.
  • [22] R. G. Soares, A. Santana, A. M. Canuto, and M. C. P. de Souto. Using accuracy and diversity to select classifiers to build ensembles. In IEEE International Joint Conference on Neural Network (IJCNN), pages 1310–1316. IEEE, 2006.
  • [23] A. Verikas, A. Gelzinis, and M. Bacauskiene. Mining data with random forests: A survey and results of new tests. Pattern Recognition, 44(2):330 – 349, 2011.
  • [24] K. Woods, W. P. Kegelmeyer, and K. Bowyer. Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):405–410, 1997.