Multiple Classifier Systems (MCS) aim to combine classifiers in order to increase the recognition accuracy in pattern recognition systems kittler; kuncheva; wozniak2013hybrid; wozniak2014survey. MCS are composed of three phases Alceu2014: (1) Generation, (2) Selection, and (3) Integration. In the first phase, a pool of classifiers is generated. In the second, a single classifier or a subset containing the best classifiers of the pool is selected. We refer to the subset of classifiers as the Ensemble of Classifiers (EoC). In the last phase, called integration, the predictions of the selected classifiers are combined to obtain the final decision.
The classifier selection phase can be either static or dynamic. In static selection, the ensemble is selected during the training stage. The classifiers with the best performance according to the selection criteria, considering the whole training or validation distribution, are selected to compose the ensemble. Then, the ensemble is used for the classification of all unseen data. In dynamic approaches, the ensemble of classifiers is selected during the test phase. For each test sample, the competence of the base classifiers is estimated according to a selection criterion. Then, only the classifier(s) that attain a certain competence level are used to predict the label of the given test sample. Recent works in the MCS literature have shown that dynamic ensemble selection (DES) techniques achieve higher classification accuracy when compared to static ones Alceu2014; CruzPR; knora. This is especially true for ill-defined problems, i.e., for problems where the size of the training data is small, and there are not enough data available to train the classifiers paulo2; logid. Moreover, using dynamic ensemble selection, we can solve classification problems with a complex non-linear decision boundary using only a few linear classifiers, while static ensemble techniques, such as Bagging and AdaBoost, cannot reportarXiv.
When dealing with DES, the key issue is to define a suitable criterion to select the most competent classifiers to predict the label of a specific query sample. Several criteria have previously been proposed, based on different sources of information, such as the classifier local accuracy estimates in small regions of the feature space surrounding the query instance, called the region of competence lca; knora, probabilistic models Woloszynski; WoloszynskiKPS12; Kurzynski2010, ranking classrank and classifier behavior mcb; paulo2. In our previous work CruzPR, we proposed a novel DES framework using meta-learning, called META-DES. The framework is divided into three steps: (1) Overproduction, where the pool of classifiers is generated; (2) Meta-training, where the meta-features are extracted using the training data, and used as inputs to train a meta-classifier that works as a classifier selector; and (3) the Generalization phase, in which the meta-features are extracted from each query sample and used as input to the meta-classifier. The meta-classifier decides whether the base classifier is competent enough to classify the test sample.
The main advantage of the META-DES framework is its modularity. Any criterion used to estimate the level of competence of base classifiers can be encoded as a new set of meta-features and added to the system. A total of five sets of meta-features were proposed in CruzPR, each one representing a different DES criterion, such as local accuracy information and degree of confidence. Moreover, in reportarXiv, a case study is presented demonstrating how the use of multiple criteria leads to a more robust dynamic selection technique. Using multiple sets of meta-features, even though one criterion might fail due to imprecisions in the local regions of the feature space or due to low confidence results, the system can still achieve a good performance as other meta-features are considered by the selection scheme. Since the META-DES framework considers the dynamic selection problem as a meta-classification problem, we can significantly improve the recognition accuracy of the system by focusing only on optimizing the performance of the meta-classifier.
However, there are some drawbacks to the META-DES framework. First, there are different sources of information that were not considered by the previous version of the system, such as probabilistic models, ambiguity, and ranking. Secondly, all sets of meta-features are used for every classification problem with no pre-processing step at all. As stated by the “No Free Lunch” theorem freelunch, there is no criterion for dynamic selection that outperforms all others over all possible classes of problems. Different classification problems may require distinct sets of meta-features. The meta-classifier training process is not optimized for each classification problem. This can also lead to low classification results, since we found that the training of the meta-classifier is problem-dependent icpr2014. For these reasons, the results obtained by the META-DES framework were still far from those achieved by the Oracle. The Oracle is an abstract model defined in Kuncheva:2002, which always selects the classifier that predicted the correct label for the given query sample, if such a classifier exists. Although it is possible to achieve results higher than the Oracle by working on the supports given by the base classifiers wozniak2010designing; wozniak2014survey, from a dynamic selection point of view, the Oracle represents the perfect dynamic selection scheme, since it always selects the classifiers that predict the correct label DidaciGRM05. As stated by Ko et al. knora, to achieve better results using dynamic selection methods, we need to better understand the behavior of the Oracle. However, addressing its behavior is more complex than applying a single selection criterion, since distinct classification problems may require the use of different selection criteria, as they are associated with distinct degrees of data complexity HoB02.
In this paper, we propose a new optimization scheme for the META-DES framework in order to better address the behavior of the Oracle. In the first stage, a pool of linear classifiers is generated using the Bagging technique bagging. In this case, the Perceptron classifier is considered as the base classifier model, since we demonstrated in reportarXiv that, using dynamic selection, it is possible to solve non-linear classification problems with complex decision boundaries using a pool containing only five linear base classifiers. Even though the individual accuracy of each base classifier is approximately 50%, the selection mechanism embedded in the framework is able to select the most competent ones for the classification of a given query instance.
In the second stage, 15 sets of meta-features are proposed, using sources of information that were not explored in the previous version of the framework, such as ranking, ambiguity and probabilistic models applied over the supports obtained by the meta-classifier, for a better estimation of the competence level of the base classifiers. The additional meta-features are motivated by a recent analysis conducted in Cruz2014ANNPR, demonstrating that using different sources of information to estimate the competence level of the base classifiers leads to a more robust DES technique. The meta-features are used as input to a meta-classifier that is trained to identify whether or not a base classifier is competent enough for the classification of an input sample.
Following that, a meta-feature selection scheme is applied in order to optimize the performance of the meta-classifier, based on a formal definition of the Oracle. A Binary Particle Swarm Optimization (BPSO) using a V-shaped and an S-shaped transfer function MirjaliliL13 is used in the optimization process. The difference between the level of competence estimated by the meta-classifier and that estimated by the Oracle is used as the fitness function for the BPSO. In other words, the optimization scheme seeks a meta-features vector that minimizes the difference between the behavior of the meta-classifier and that of the Oracle in estimating the competence level of the base classifiers. Thus, the meta-classifier is more likely to present results that are closer to those of the Oracle. We call the proposed system META-DES.Oracle, since the formal definition of the Oracle is used during the training stage of the meta-classifier.
The classification stage is performed using a hybrid dynamic selection and weighting scheme. First, the meta-classifier is used to estimate the competence level of each base classifier. The classifiers that attain a certain level of competence are selected to compose the ensemble. Next, the meta-classifier is used to compute the weights of the selected base classifiers to be used in a weighted majority voting scheme. In this way, the base classifiers that present a higher level of competence have greater influence on the ensemble decision.
Experiments are conducted over 30 classification problems derived from different data repositories. We compare the results obtained by the proposed META-DES.Oracle with 10 state-of-the-art dynamic selection techniques, as well as static ensemble methods (e.g., AdaBoost boosting and Random Forests breiman2001random; rokach2016decision) and single classifier models, such as Support Vector Machines (SVM) with Gaussian Kernel and the Multi-Layer Perceptron (MLP) Neural Network. The goal of the experimental study is to answer the following research questions: (1) Are different sets of meta-features better suited for different problems? (2) Are all 15 sets of meta-features relevant? (3) Does the META-DES.Oracle obtain a significant gain in classification accuracy when compared to the previous versions of the META-DES framework? (4) Does the META-DES.Oracle outperform state-of-the-art DES techniques? (5) Is the performance obtained by the proposed framework comparable to that of the best families of classifiers in the literature delgado14a?
In a nutshell, the contributions of this work are: (1) A novel DES framework based on meta-learning which selects the best set of meta-features in order to mimic the selection mechanism of the Oracle. (2) The definition of 15 sets of meta-features as well as a categorization of several DS criteria based on their source of information. (3) A formal definition of the Oracle as the ideal classifier selection scheme. (4) Optimization of the META-DES framework based on the formal definition of the Oracle. (5) An extensive comparison of the proposed META-DES.Oracle with 10 state-of-the-art techniques as well as static ensembles and the best single classifier models based on delgado14a. As far as we know, this is the first paper in the dynamic selection literature that performs a comparison among several DS techniques and different classification schemes.
This paper is organized as follows: Section 2 introduces state-of-the-art techniques for dynamic classifier and ensemble selection. The META-DES.Oracle is detailed in Section 3. In Section 4, we describe the 15 sets of meta-features proposed in this work. An illustrative example using synthetic data is shown in Section 5. The experimental study is conducted in Section 6. Finally, our conclusions and proposals for future work are given in the last section.
2 Related Works
2.1 Dynamic selection
In static ensemble methods, such as Decision Forests rokach2016decision and Boosting methods boosting, the ensemble of classifiers is defined in the training phase, and is used to predict the label of all test samples during the generalization phase. In contrast, dynamic ensemble selection techniques classrank; knora; docs; paulo2; Woloszynski; lca; vriesmann2015combining; cruz2016prototype consist of finding, from a pool of classifiers, a single classifier or an ensemble containing the most competent classifiers to predict the label of a specific test sample. The ensemble is selected in a dynamic fashion according to each new test sample. This property makes dynamic ensemble selection techniques a robust approach for many pattern recognition applications, such as handwritten recognition knorabashbaghi2016robust, remote sensing image classification Smits_2002, offline signature verification batista2012dynamic and the recognition of EMG signals in a bio-prosthetic hand kurzynski2011dynamic.
In addition, recent works have demonstrated that dynamic selection techniques can also be used in different classification contexts. One example is one-class classification Krawczyk201643, where the system has no access to counterexamples during the training stage and may need to select the most appropriate classifiers on-the-fly. Another context where dynamic selection has shown some success is in One-Versus-One (OVO) decomposition strategies Galar20111761. OVO works by dividing a multi-class classification problem into as many binary problems as there are possible pairs of classes Galar20111761. Each base classifier is trained solely to distinguish between one pair of classes. When a new query sample is presented for classification, the outputs of all base classifiers are combined to predict its label. The problem with OVO strategies lies in the fact that each base classifier is only trained to distinguish between two classes. Not all base classifiers are competent to classify the query sample, since they might not have been trained on its true class at all. The outputs of such non-competent classifiers may hinder the performance of the system Galar20133412.
Galar et al. Galar20133412 proposed the Dynamic-OVO strategy, which applies a dynamic selection mechanism in order to prevent non-competent classifiers from weighing in on the ensemble decision. In this strategy, the neighborhood of the query instance is computed using the K-Nearest Neighbors method. Only the classifiers that were trained considering the classes present in the neighborhood of the query sample are used in the combination scheme. An updated version of the Dynamic-OVO, the Distance-based Relative Competence Weighting combination (DRCW-OVO), was proposed in Galar201528 to further reduce the impact of non-competent classifiers using a weighting mechanism. The outputs of the selected classifiers are weighted depending on the closeness of the query instance to the nearest neighbors of each class in the problem. The larger the distance, the lower the weight of the classifier, and vice versa Galar201528. Another interesting strategy is the DYNOVO technique Mendialdua2015298. This method performs dynamic classifier selection in each sub-problem of the OVO decomposition, and selects the best base classifiers to classify the query sample. In this case, an adaptation of the Overall Local Accuracy (LCA) lca strategy for OVO is proposed to estimate the competence of the base classifiers.
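The selection rule of Dynamic-OVO can be illustrated with a minimal sketch. It assumes the rule is read as "keep a pairwise classifier only when both of its classes appear among the neighbors of the query"; the function names and the toy neighborhood below are hypothetical and are not taken from the cited works.

```python
from itertools import combinations

def ovo_pairs(classes):
    """Enumerate every pair of classes; OVO trains one binary classifier per pair."""
    return list(combinations(sorted(classes), 2))

def select_competent_pairs(pairs, neighbor_labels):
    """Keep only the pairs whose two classes both occur in the neighborhood
    of the query sample (assumed reading of the Dynamic-OVO rule)."""
    present = set(neighbor_labels)
    return [p for p in pairs if p[0] in present and p[1] in present]

# Toy problem with four classes, hence six pairwise classifiers.
pairs = ovo_pairs([0, 1, 2, 3])

# Hypothetical labels of the nearest neighbors of a query sample.
selected = select_competent_pairs(pairs, [1, 1, 3, 1, 3])
print(selected)
```

In this toy case only the classifier trained on classes 1 and 3 survives, so the five classifiers that never saw a plausible class for the query do not vote.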
Nevertheless, the most important component of DES techniques is the criterion used to measure the level of competence of a base classifier for the classification of a given query sample. The most common approach involves estimating the accuracy of the base classifiers in small regions of the feature space surrounding the query sample, called the region of competence. This region is usually defined based on the nearest neighbor rule applied to either the training lca or validation data knora. Based on the region of competence, several sources of information can be used to measure the competence of the classifier in the DES literature Alceu2014: measures based solely on accuracy, such as the Overall Local Accuracy (OLA) lca, Local Classifier Accuracy (LCA) lca and Modified Local Accuracy (MLA) lca; ranking information, such as the Classifier Rank classrank and the simplified classifier rank lca; probabilistic information calculated over the decisions obtained by the base classifiers, such as the Kullback-Leibler divergence (DES-KL) WoloszynskiKPS12 and the randomized reference classifier (DES-PRC) Woloszynski; classifier behavior calculated using output profiles, such as the KNOP technique paulo2; and Oracle information, as used by the KNORA family of techniques knora. Brun et al. Brun2016 also presented the use of data complexity measures such as the Fisher’s Discriminant Ratio HoB02 to aid in the search for the most competent classifiers. Furthermore, there are some selection criteria that estimate the competence level of a whole ensemble of classifiers rather than the competence of each base classifier individually, such as the degree of consensus used in the Dynamic Overproduction and Choose technique (DOCS) docs, diversity YasarSaglam2016; anne and data handling dceid.
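To make the notion of a region of competence concrete, the following minimal sketch estimates an overall-local-accuracy style competence: the fraction of the k nearest validation samples that a classifier labels correctly. The function names, the toy one-dimensional data and the two constant classifiers are illustrative assumptions, not the implementation used in the cited works.

```python
import math

def region_of_competence(query, dsel_X, k):
    """Indices of the k validation samples closest to the query (Euclidean distance)."""
    order = sorted(range(len(dsel_X)), key=lambda i: math.dist(query, dsel_X[i]))
    return order[:k]

def overall_local_accuracy(predict, query, dsel_X, dsel_y, k=3):
    """Fraction of the region of competence that `predict` labels correctly."""
    region = region_of_competence(query, dsel_X, k)
    hits = sum(1 for i in region if predict(dsel_X[i]) == dsel_y[i])
    return hits / len(region)

# Toy validation set: class 0 on the left, class 1 on the right.
dsel_X = [(0.0,), (1.0,), (2.0,), (8.0,), (9.0,), (10.0,)]
dsel_y = [0, 0, 0, 1, 1, 1]

# Two hypothetical weak classifiers, each competent in one region only.
always_zero = lambda x: 0
always_one = lambda x: 1

left_query, right_query = (1.0,), (9.0,)
print(overall_local_accuracy(always_zero, left_query, dsel_X, dsel_y))
print(overall_local_accuracy(always_one, left_query, dsel_X, dsel_y))
```

Each classifier is only 50% accurate globally, yet its local accuracy identifies the region in which it should be selected.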
An important concept in the DES literature is the definition of the Oracle. The Oracle is an abstract model defined in Kuncheva:2002, which always selects the classifier that predicted the correct label for the given query sample, if such a classifier exists. In other words, it represents the ideal classifier selection scheme. The Oracle is used in the DES literature in order to determine whether the results obtained by the proposed DES techniques are close to the ideal accuracy or whether there is still room for improvement. As reported in a recent survey Alceu2014, the results obtained by DES techniques based solely on one source of information are still far from those achieved by the Oracle. As stated by Ko et al. knora, addressing the behavior of the Oracle is much more complex than applying a simple neighborhood approach, and the task of figuring out its behavior based merely on the pattern feature space is not an easy one. In addition, in our previous work ijcnn2011, we demonstrated that the use of local accuracy estimates alone is insufficient to achieve good generalization performance.
To address these issues, in CruzPR we proposed a novel DES framework using meta-learning, called META-DES. From a meta-learning perspective, the dynamic selection problem can be seen as another classification problem, called the meta-problem. This meta-problem uses different criteria regarding the behavior of a base classifier in order to decide whether or not it is competent enough to classify a given sample. In this paper, our aim is therefore to optimize the performance of the meta-classifier, using the meta-classification environment, to obtain results closer to those of the Oracle.
2.2 Feature selection using Binary Particle Swarm Optimization (BPSO)
Given a set of features, the objective of feature selection is to identify its most informative subset. The reasons for using feature selection methods KhushabaAA11 are: removal of redundant and irrelevant features, reduction of dimensionality, reduction of the computational complexity of the system, as well as improvement of the classification accuracy. There are two main factors when dealing with feature selection: the evaluation method, which is applied to compute the fitness of each solution, and the search strategy, which is used to explore the feature space in the search for a more suitable subset of features.
Particle Swarm Optimization (PSO) is an evolutionary computation technique inspired by the social behavior of bird flocking Kennedy:PSO. PSO is one of the most used evolutionary algorithms, due to its simplicity and low computational cost. The technique is based on a group of particles flying around in the search space to find the best solution. Recent works have shown a preference for PSO over other classical optimization techniques, such as Genetic Algorithms (GA), because GA has too many parameters to set. Moreover, GA is very sensitive to the probabilities of the crossover and mutation operators, as well as to the initial population of solutions. Therefore, it is likely to get stuck in local minima KhushabaAA11. For these reasons, BPSO has been shown to outperform other optimization algorithms in performing feature selection Kennedy:PSO; Chuang:2008; FirpiG04.
3 The META-DES.Oracle
The META-DES framework is based on the assumption that the dynamic ensemble selection problem can be considered as a meta-problem icpr2014. This meta-problem uses different criteria regarding the behavior of a base classifier in order to decide whether it is competent enough to classify a given test sample. The meta-problem is defined as follows CruzPR:
The meta-classes are either “competent” (1) or “incompetent” (0) to classify the test sample.
Each set of meta-features corresponds to a different criterion for measuring the level of competence of a base classifier.
The meta-features are encoded into a meta-features vector.
A meta-classifier is trained based on the meta-features to predict whether or not the base classifier will achieve the correct prediction for the test sample, i.e., whether it is competent enough to classify it.
An overview of the META-DES framework is illustrated in Figure 1. The framework is divided into three phases: (1) Overproduction, (2) Meta-training, and (3) Generalization. Phases (1) and (2) are performed in offline mode, i.e., during the training stage of the framework. In the overproduction phase, the pool of classifiers is generated using the training set. The following step is the meta-training stage, in which the meta-features are extracted for the training of the meta-classifier. In this stage, the meta-features are extracted from the meta-training set and from the dynamic selection dataset (DSEL). The meta-data extracted from the meta-training set are used for the training of the meta-classifier, and those extracted from DSEL are used as validation data during the BPSO optimization process. Phase (3) is conducted on-the-fly, with the arrival of each new test sample coming from the generalization dataset. For each base classifier, a meta-features vector is extracted, corresponding to the behavior of the base classifier for the classification of the test sample. This vector is passed down to the meta-classifier, which estimates whether the base classifier is competent enough to predict the label of the test sample. After all the classifiers in the pool are evaluated, the selected classifiers are combined using a weighted majority voting approach to predict the label of the test sample. The main changes to the META-DES framework proposed in this paper are highlighted in different colors:
The meta-feature extraction process, in which 15 sets of meta-features are extracted. Ten new sets of meta-features are proposed in this work in order to explore different sources of information for estimating the competence level of the base classifiers, such as probabilistic models, ambiguity, behavior and ranking. The meta-feature extraction process is presented in Section 4.
The meta-features selection using Binary Particle Swarm Optimization and guided by Oracle information for achieving a behavior closer to the Oracle. The meta-features selection process is detailed in Section 3.2.2.
The combination approach, where a hybrid dynamic selection and weighting approach is considered for the classification of the query sample (Section 3.3).
Similarly to CruzPR, the Overproduction phase is performed using the Bagging technique bagging. The Bagging technique works by randomly selecting different bootstraps of the data for training each base classifier. Each bootstrap uses 50% of the training data. The pool of classifiers is composed of 100 linear Perceptrons for the two-class problems and 100 multi-class linear Perceptrons for the multi-class problems. The use of linear classifiers is motivated by the findings in reportarXiv; AlpaydinJ96; KunchevaR07 showing that the META-DES framework can solve complex non-linear classification problems with complex decision boundaries using only linear classifiers reportarXiv.
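The data side of this overproduction step can be sketched as follows. The sketch only draws the index sets on which each of the 100 base classifiers would be trained; it assumes that each bootstrap is drawn with replacement and contains half as many indices as there are training samples, and the function name and seed are hypothetical.

```python
import random

def make_bootstraps(n_samples, n_classifiers=100, fraction=0.5, seed=0):
    """Draw one bootstrap (list of sample indices) per base classifier.

    Assumption: indices are sampled with replacement, and each bootstrap
    holds `fraction` times the number of training samples.
    """
    rng = random.Random(seed)
    size = int(fraction * n_samples)
    return [[rng.randrange(n_samples) for _ in range(size)]
            for _ in range(n_classifiers)]

# One bootstrap per Perceptron in a pool of 100, for 200 training samples.
bootstraps = make_bootstraps(n_samples=200)
print(len(bootstraps), len(bootstraps[0]))
```

Each index list would then be used to fit one linear Perceptron, so that the classifiers in the pool differ only through the data they were trained on.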
3.2 Meta-training Phase
In this stage, the meta-features are extracted for the training of the meta-classifier. In this version of the framework, we extract meta-data from two sets: the meta-training set and the dynamic selection (validation) set. The meta-data extracted from the meta-training set are used for the training of the meta-classifier. The meta-data extracted from the dynamic selection set are used as validation data in the BPSO optimization scheme to prevent overfitting.
3.2.1 Sample Selection
The first step in the meta-data generation process is the sample selection mechanism. The sample selection mechanism is employed in order to focus the training of the meta-classifier on cases in which the extent of consensus of the pool is low, i.e., when the classifiers in the pool disagree about the correct label. For each instance coming from either the meta-training set or the dynamic selection dataset DSEL, the consensus of the pool is computed as the percentage of base classifiers in the pool that predict its correct label. If the percentage falls below the consensus threshold, the sample is passed down to the meta-features extraction process.
Next, for each base classifier, 15 sets of meta-features are computed. Each set of meta-features is detailed in Section 4. The meta-feature vector containing the 15 sets of meta-features is obtained at the end of the process. The meta-feature vector represents the behavior of the base classifier for the classification of the query sample. If the base classifier predicts the correct label for the sample, the class attribute of the meta-feature vector is set to 1 (it belongs to the meta-class “competent”); otherwise, it is set to 0 (it belongs to the meta-class “incompetent”). The meta-feature vector is then stored in the meta-data set that corresponds to its source, i.e., either the meta-training set or DSEL.
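The sample selection and meta-labelling steps can be sketched together. The sketch assumes that the predictions of the pool are already available as a table and it omits the meta-feature extraction itself; the function names and the threshold value of 0.7 are illustrative assumptions.

```python
def pool_consensus(predictions, true_label):
    """Fraction of base classifiers that predict the correct label."""
    return sum(1 for p in predictions if p == true_label) / len(predictions)

def build_meta_labels(pool_predictions, labels, consensus_threshold):
    """Return (sample index, classifier index, meta-class) triples.

    pool_predictions[j][i] is the prediction of classifier i for sample j.
    Only samples whose consensus falls below the threshold are kept.
    The meta-class is 1 ("competent") for a correct prediction, else 0.
    """
    meta = []
    for j, (preds, y) in enumerate(zip(pool_predictions, labels)):
        if pool_consensus(preds, y) < consensus_threshold:
            for i, p in enumerate(preds):
                meta.append((j, i, 1 if p == y else 0))
    return meta

# Three samples of class 1, classified by a hypothetical pool of four classifiers.
pool_predictions = [[1, 1, 1, 1], [0, 1, 1, 0], [0, 0, 0, 1]]
labels = [1, 1, 1]
meta = build_meta_labels(pool_predictions, labels, consensus_threshold=0.7)
print(meta)
```

The first sample is skipped because the whole pool agrees on its label, so the meta-classifier is trained only on the two samples for which selection actually matters.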
3.2.2 Meta-Feature Selection Using Binary Particle Swarm Optimization (BPSO)
Since we are dealing with feature selection, a binary version of the PSO algorithm, BPSO, is considered. BPSO has been shown in many applications to outperform other optimization algorithms in performing feature selection Kennedy:PSO; Chuang:2008; FirpiG04. There are many versions of the BPSO algorithm, such as the Improved BPSO Chuang:2008, CatfishBPSO ChuangTY11 and MBPSO WangWFZ08. Mirjalili et al. MirjaliliL13 show that the most important factor for achieving good convergence and avoiding local minima is the transfer function, which is responsible for mapping the continuous search space into a binary space. Generally speaking, there are two main types of transfer functions, S-shaped and V-shaped MirjaliliL13. The main difference between the two families derives from the observation that the S-shaped functions force the particles to take a value of 0 or 1 at each generation, while the V-shaped transfer functions encourage particles to stay in their current position when their velocity values are low, and to switch values only when the velocity is high. For these reasons, V-shaped transfer functions were shown to be better both in terms of robustness to local minima and convergence speed. In this work, we consider one S-shaped transfer function and one V-shaped function, which presented the best overall performance, considering 25 benchmark functions MirjaliliL13.
Each particle (solution) is composed of a binary string whose length is the number of meta-features, where every bit represents a single meta-feature. The value “1” means the meta-feature is selected, and “0” means it is not.
At each generation, the velocity of the i-th particle is computed using Equation 1:
Each particle makes use of its private memory, which represents the best position the i-th particle has visited, as well as the knowledge of the swarm, which represents the global best position visited, considering the whole swarm. The update also involves an inertia weight, two acceleration coefficients, and a randomly generated number. One term of the update represents the private knowledge of the i-th particle, and another represents the collaboration of particles.
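For reference, the velocity update commonly used in the PSO literature has the form below. The symbols are assumptions introduced here for illustration: $v_i^{t}$ and $x_i^{t}$ are the velocity and position of the i-th particle at generation $t$, $w$ is the inertia weight, $c_1$ and $c_2$ are the acceleration coefficients, $\mathrm{rand}$ is a random number, $pbest_i$ is the best position visited by the particle and $gbest$ is the best position visited by the swarm. This is the standard form and may differ in notation from Equation 1.

```latex
v_i^{t+1} = w\, v_i^{t}
          + c_1 \cdot \mathrm{rand} \cdot \left( pbest_i - x_i^{t} \right)
          + c_2 \cdot \mathrm{rand} \cdot \left( gbest - x_i^{t} \right)
```

Under this reading, the second term carries the private knowledge of the particle and the third term carries the collaboration among particles.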
When dealing with binary search spaces, updating the position of a particle means switching between “0” and “1”, i.e., whether or not the meta-feature is selected. The switching is conducted based on the velocity of the particle. The higher the velocity of a particle, the higher its probability of changing positions should be. However, the velocities are computed in the real space rather than in the binary space (as shown in Equation 1). The velocity of the particle therefore needs to be converted into a probability value, representing the probability of changing the position of the particle from “0” to “1” and vice versa. This step is conducted using a transfer function. A transfer function should work in such a way that the higher the velocity value, the higher the probability of changing position will be, since particles with higher velocity values are probably far from the best solutions (the personal best and the global best positions). Similarly, a transfer function must present a lower probability of switching position for lower velocity values MirjaliliL13. The position of the i-th particle is updated according to Equation 2.
Generally speaking, there are two main types of transfer functions, S-shaped and V-shaped MirjaliliL13. In this work we consider one S-shaped transfer function proposed in Kennedy:PSO and one V-shaped transfer function proposed in MirjaliliL13, in Equations 3 and 4, respectively. These transfer functions were selected since they obtained the best results in several optimization benchmarks MirjaliliL13.
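The two families of transfer functions can be contrasted with a minimal sketch. The sigmoid is the S-shaped function of the original binary PSO; the arctangent-based function is one member of the V-shaped family, and its choice here, as well as the function names, is an assumption rather than a restatement of Equations 3 and 4.

```python
import math
import random

def s_shaped(v):
    """Sigmoid transfer function: maps a velocity to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def v_shaped(v):
    """Arctangent-based V-shaped transfer function (assumed family member)."""
    return abs((2.0 / math.pi) * math.atan((math.pi / 2.0) * v))

def update_bit_s(bit, v, rng):
    """S-shaped rule: the bit is set to 1 with probability T(v), otherwise to 0."""
    return 1 if rng.random() < s_shaped(v) else 0

def update_bit_v(bit, v, rng):
    """V-shaped rule: the bit is flipped with probability T(v), otherwise kept."""
    return 1 - bit if rng.random() < v_shaped(v) else bit

rng = random.Random(0)
# With zero velocity, the V-shaped rule never moves the particle,
# whereas the S-shaped rule still redraws the bit with probability 0.5.
print(v_shaped(0.0), s_shaped(0.0))
print(update_bit_v(1, 0.0, rng))
```

This shows why V-shaped functions let a particle keep its position at low velocity, while S-shaped functions reassign the bit at every generation regardless of its current value.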
3.2.2.1 Fitness Function - distance to the Oracle
The optimization of the meta-classifier is conducted based on the definition of the Oracle. From the dynamic selection point of view, the Oracle is seen as the ideal dynamic selection technique, which always selects a classifier that predicts the correct label and rejects it otherwise. From the classifier competence point of view, the selection mechanism employed by the Oracle serves as the ideal dynamic classifier selection scheme. In this work, we formalize the Oracle as an ideal selection scheme using Equation 5.
The level of competence of a base classifier is equal to 1 if it predicts the correct label for the query sample, and 0 otherwise. In the META-DES framework, we want the meta-classifier to perform similarly to the Oracle, in the sense that it should identify which base classifiers in the pool are competent to predict the label of an unknown test instance and should be selected to compose the ensemble. In order to achieve such behavior, we measure the difference between the estimation of competence achieved by the ideal selection scheme, represented by the Oracle, and the estimation of competence obtained by the meta-classifier in the fitness function of the optimization scheme.
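Written out, an Oracle competence consistent with the meta-classes “competent” (1) and “incompetent” (0) would take the form below. The symbols are assumptions introduced for illustration: $c_i$ is the i-th base classifier, $\mathbf{x}_j$ is the j-th sample and $\delta^{Oracle}_{i,j}$ is the competence the Oracle assigns to $c_i$ for $\mathbf{x}_j$.

```latex
\delta^{Oracle}_{i,j} =
\begin{cases}
1, & \text{if } c_i \text{ predicts the correct label of } \mathbf{x}_j \\
0, & \text{otherwise}
\end{cases}
```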
The fitness function is computed as follows. Consider the levels of competence of a base classifier for the classification of a given instance, as computed by the META-DES framework and by the Oracle, respectively. The difference between both techniques is calculated as the mean squared difference between their competence estimates (Equation 6).
In Equation 6, the mean is taken over the size of the dataset and the size of the pool of classifiers.
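A minimal sketch of this fitness computation is given below. It assumes that the distance is the squared difference between the two competence estimates, averaged over every (sample, classifier) pair; the function names and the toy values are hypothetical.

```python
def oracle_competence(prediction, true_label):
    """Oracle estimate: 1 if the base classifier is correct, 0 otherwise."""
    return 1.0 if prediction == true_label else 0.0

def distance_to_oracle(meta_competence, pool_predictions, labels):
    """Mean squared difference between meta-classifier and Oracle competences.

    meta_competence[j][i]  : competence estimated for classifier i on sample j
    pool_predictions[j][i] : label predicted by classifier i for sample j
    """
    n_samples, n_classifiers = len(labels), len(pool_predictions[0])
    total = 0.0
    for j in range(n_samples):
        for i in range(n_classifiers):
            target = oracle_competence(pool_predictions[j][i], labels[j])
            total += (meta_competence[j][i] - target) ** 2
    return total / (n_samples * n_classifiers)

# Two samples, two base classifiers.
pool_predictions = [[1, 0], [0, 0]]
labels = [1, 0]
perfect = [[1.0, 0.0], [1.0, 1.0]]      # meta-classifier that mimics the Oracle
undecided = [[0.5, 0.5], [0.5, 0.5]]    # meta-classifier with no information
print(distance_to_oracle(perfect, pool_predictions, labels))
print(distance_to_oracle(undecided, pool_predictions, labels))
```

A particle whose selected meta-features yield a smaller distance is preferred by the search, since its meta-classifier behaves more like the Oracle.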
Therefore, the BPSO optimization searches for a meta-classifier that minimizes this distance. In other words, we search for a meta-classifier that presents a behavior closer to the ideal dynamic selection technique for estimating the competence level of the base classifiers. We call the proposed system META-DES.Oracle since the optimization of the meta-classifier is based on the definition of the Oracle.
3.2.2.2 Overfitting Control Scheme
Since the fitness function takes into account the performance of the meta-classifier, i.e., the wrapper approach, the optimization process becomes another learning process and may be prone to overfitting RadtkeWS06; SantosSM09; SantosOSM08. The best solution found during the optimization routine may have overfitted the optimization dataset, and may not have a good generalization performance. To avoid overfitting, the sets used in the BPSO feature selection scheme are divided as illustrated in Figure 2. The meta-feature dataset is split evenly: 50% is used for the training of the meta-classifier and 50% forms the optimization dataset, which is used to guide the search in the BPSO scheme. The meta-feature vectors extracted from the dynamic selection dataset are used to validate the solutions found by the BPSO algorithm.
There are three common methods for controlling overfitting in optimization systems SantosSM09: Partial Validation (PV), Backwarding Validation (BV), and Global Validation (GV). In this work, we use the GV approach since previous works in the literature demonstrate that GV is a more robust alternative for controlling overfitting in optimization techniques RadtkeWS06; SantosSM09. In the GV scheme (see Algorithm 1), at each generation, the fitness of all particles is evaluated using the validation set (line 18 of the algorithm). If the fitness of a particle is better than the fitness of the particle kept in the archive, it is stored in the archive (lines 19 and 20). Thus, at the end of the optimization process, the particle kept in the archive is the one presenting the best fitness value, considering the validation data. The solution kept in the archive is used as the meta-classifier.
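The archive update at the core of the GV scheme can be illustrated with a short sketch. It assumes that particles are tuples of bits (one per meta-feature) and that a lower fitness is better, since the fitness is a distance to the Oracle; `global_validation` and `validation_fitness` are hypothetical names, and the swarm update itself is omitted.

```python
def global_validation(generations, validation_fitness):
    """Return the particle with the best validation fitness over all generations.

    generations: iterable of swarms, each a list of particles (bit tuples).
    validation_fitness: callable scoring a particle on the validation set;
    lower is better because the fitness is a distance to the Oracle.
    """
    archive, archive_score = None, float("inf")
    for swarm in generations:
        for particle in swarm:
            score = validation_fitness(particle)
            # Replace the archived particle only when a strictly better one appears.
            if score < archive_score:
                archive, archive_score = particle, score
    return archive, archive_score
```

Because the archive is scored on data that does not guide the search, a particle that merely overfits the optimization dataset is not retained unless it also performs well on the validation data.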
3.3 Generalization Phase
The generalization procedure is formalized by Algorithm 2. Given the query sample, the region of competence is computed using the samples from the dynamic selection dataset. Following that, the output profile of the query sample is calculated. The set of similar output profiles is obtained through the Euclidean distance applied over the output profiles of the dynamic selection dataset.
For each base classifier belonging to the pool of classifiers, the meta-feature extraction process is called (Section 4), returning the meta-feature vector (lines 5 and 6). Only the selected meta-features, which are kept in the archive, are extracted. The meta-feature vector is then used as input to the meta-classifier. The support obtained by the meta-classifier for the “competent” meta-class is taken as the level of competence of the base classifier for the classification of the query sample. The classification of the query sample is performed using a hybrid dynamic selection and weighting approach. First, the base classifiers whose level of competence satisfies the selection criterion are considered competent and are selected to compose the ensemble (lines 7 to 9). Next, the decision of each selected base classifier is weighted by its level of competence, using a weighted majority voting scheme (line 13), to predict the label of the query sample. Thus, the base classifiers that attained a higher level of competence have more influence on the final decision.
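The hybrid selection and weighting step can be sketched as below. This is only an illustration: the threshold value of 0.5 and the fallback to the whole pool when no classifier is selected are assumptions made for the example, and `select_and_vote` is a hypothetical name.

```python
from collections import defaultdict


def select_and_vote(competences, predictions, threshold=0.5):
    """Select competent classifiers, then combine them by weighted majority voting.

    competences: level of competence estimated for each base classifier.
    predictions: label predicted by each base classifier for the query sample.
    threshold: assumed selection threshold (illustrative value).
    """
    selected = [i for i, c in enumerate(competences) if c > threshold]
    if not selected:
        # Assumption: fall back to the whole pool if nobody is deemed competent.
        selected = list(range(len(competences)))
    votes = defaultdict(float)
    for i in selected:
        # Each vote is weighted by the classifier's level of competence.
        votes[predictions[i]] += competences[i]
    return max(votes, key=votes.get), selected
```

With competences `[0.9, 0.4, 0.6, 0.7]` and predictions `['a', 'b', 'b', 'b']`, the second classifier is discarded and class `'b'` wins with a total weight of 1.3 against 0.9.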
4 Meta-Feature Extraction
A total of 15 sets of meta-features are considered, with ten sets proposed in this paper, and five coming from our previous work CruzPR. Each set captures a different property of the behavior of the base classifier, and can be seen as a different criterion to dynamically estimate the level of competence of the base classifier, such as the classification performance estimated in a local region of the feature space and the classifier confidence for the classification of the input sample. Using 15 distinct sets of meta-features, even though one criterion might fail due to imprecisions in the local regions of the feature space or due to low confidence results, the system can still achieve a good performance, as other meta-features are considered by the selection scheme.
Table 1 shows the criterion used by each set of meta-features, the object used to extract it (e.g., the region of competence), and its categorization based on the DES taxonomy suggested in Alceu2014. Each set of meta-features may generate more than one feature. The size of the feature vector is $(8 \times K) + K_p + 6$, where $K$ is the size of the region of competence and $K_p$ is the number of similar output profiles.
|Meta-Feature||Criterion||Domain||Object||No. of Features|
|*||Classification of the K-Nearest Neighbors||Accuracy|
|*||Posterior probability obtained for the K-Nearest Neighbors||Probabilistic|
|*||Overall accuracy in the region of competence||Accuracy||1|
|Conditional accuracy in the region of competence||Accuracy||1|
|*||Degree of confidence for the input sample||Confidence||1|
|Ambiguity in the vector of class supports||Ambiguity||1|
|Logarithmic difference between the class supports||Probabilistic|
|Probability of Random Classifier||Probabilistic|
|Minimum difference between the predictions||Probabilistic|
|Entropy in the vector of class supports||Probabilistic|
|Exponential difference between the class supports||Probabilistic|
|*||Output profiles classification||Behavior|
|Classifier ranking in the feature space||Ranking||DSEL||1|
|Classifier ranking in the decision space||Behavior and Ranking||1|
Given a new sample, the first step in extracting the meta-features involves computing its region of competence. The region of competence is defined in the dynamic selection dataset using the K-Nearest Neighbors algorithm. Then, the sample is transformed into an output profile, in which each element is the decision yielded by one base classifier of the pool for that sample paulo2. Next, the similarity between this output profile and the output profiles of the samples in the dynamic selection dataset is obtained through the Euclidean distance. The most similar output profiles are selected to form the set of similar output profiles, where each output profile is associated with a label.
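The two neighborhood computations described above can be sketched as follows, assuming numeric feature vectors, base classifiers that are plain callables returning a numeric decision, and a brute-force nearest-neighbor search; `k_nearest` and `output_profile` are hypothetical helper names.

```python
import math


def k_nearest(query, points, k):
    """Indices of the k points closest to the query under the Euclidean distance."""
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    return order[:k]


def output_profile(sample, pool):
    """Decisions of every base classifier in the pool for one sample."""
    return [classifier(sample) for classifier in pool]
```

The region of competence is obtained by calling `k_nearest` on the feature vectors of the dynamic selection dataset, while the set of similar output profiles is obtained by calling the same function on the output profiles of that dataset, with the output profile of the query as the first argument.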
4.1 Local Accuracy Meta-Features
These meta-features are based on the performance of the base classifier in a local region of the feature space surrounding the query instance. Three sets of meta-features using local accuracy estimation are considered:
4.1.1 Overall Local accuracy:
The accuracy of the base classifier over the whole region of competence is computed and encoded as a single meta-feature (Equation 7).
4.1.2 Conditional Local Accuracy:
The local accuracy of the base classifier is estimated with respect to the output class assigned to the query sample by the base classifier, for the samples belonging to the region of competence (Equation 8).
4.1.3 Neighbors’ hard classification:
First, a vector with one element per sample in the region of competence is created. For each instance belonging to the region of competence, if the base classifier correctly classifies it, the corresponding position of the vector is set to 1; otherwise it is set to 0. Thus, one meta-feature per neighbor is computed.
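The three local accuracy meta-features can be sketched with hard (crisp) decisions as follows. The conditional accuracy shown here is one common hard-decision reading (the fraction of correct decisions among the neighbors that the classifier assigns to the class predicted for the query) and is an assumption rather than a transcription of Equation 8; all function names are hypothetical.

```python
def hard_classification(neighbor_preds, neighbor_labels):
    """One meta-feature per neighbor: 1 if it is correctly classified, else 0."""
    return [1 if p == y else 0 for p, y in zip(neighbor_preds, neighbor_labels)]


def overall_local_accuracy(neighbor_preds, neighbor_labels):
    """Accuracy of the base classifier over the whole region of competence."""
    hits = hard_classification(neighbor_preds, neighbor_labels)
    return sum(hits) / len(hits)


def conditional_local_accuracy(neighbor_preds, neighbor_labels, query_class):
    """Accuracy restricted to the neighbors assigned to the query's predicted class."""
    assigned = [y for p, y in zip(neighbor_preds, neighbor_labels) if p == query_class]
    if not assigned:
        return 0.0  # assumption: no evidence for this class in the neighborhood
    return sum(1 for y in assigned if y == query_class) / len(assigned)
```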
4.2 Ambiguity
Ambiguity measures the level of confidence the base classifier has in its answer. A common concept used to estimate the confidence of a classifier is based on the margin theory boosting; Breiman:1999. The margin of a classifier is regarded as a good indicator of the classifier’s confidence. Two meta-features are considered: one based on the maximum margin theory and one based on the minimum margin theory. Since these meta-features do not take into account the correct label of the sample, they are extracted directly from the query sample.
4.2.1 Classifier’s confidence:
The perpendicular distance between the input sample and the decision boundary of the base classifier is calculated and encoded as a single meta-feature. Its value is normalized to the [0, 1] range using Min-max normalization.
4.2.2 Ambiguity:
This information is computed as the difference between the scores of the class with the highest support and the class with the second highest support for the query sample. For example, in a 3-class classification problem, the ambiguity value is the gap between the two largest of the three scores obtained by the base classifier for the query sample. A higher value means that the classifier decision is less ambiguous.
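A minimal sketch of the two measures in this group is given below. The scores used in the usage note are hypothetical and are not the values of the original example; `min_max` and `ambiguity` are illustrative names.

```python
def min_max(values):
    """Rescale a list of values linearly to the [0, 1] range."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def ambiguity(scores):
    """Gap between the highest and the second highest class support."""
    top, second = sorted(scores, reverse=True)[:2]
    return top - second
```

For hypothetical supports of 0.5, 0.4 and 0.1, the gap is about 0.1, whereas supports of 0.8, 0.1 and 0.1 give a gap of about 0.7, which corresponds to a less ambiguous decision.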
4.3 Probabilistic Meta-Features
This class of meta-features is based on probabilistic models that are applied over the vector of class supports produced by the base classifier for the classification of a given query sample. The motivation behind probabilistic measures derives from the observation that classifiers that perform worse than the random classifier, i.e., a classifier that randomly selects the classes with equal probabilities, deteriorate the majority voting performance. In contrast, if the base classifiers are significantly better than the random classifier, they are likely to improve the majority voting accuracy WoloszynskiKPS12. Hence, each set of meta-features in this group estimates the probability that the performance of a given base classifier is significantly different from that of a random classifier, derived from different probabilistic and information theory perspectives Woloszynski; WoloszynskiK10; WoloszynskiKPS12; zbMATH05935973; WoloszynskiK09.
The definitions below rely on the vector of class supports estimated by the base classifier for a given sample, where each value represents the support given to the corresponding class, and on the support given by the base classifier to the correct class label of that sample. The output of the random classifier follows a uniform distribution over the classes.
4.3.1 Posterior probability:
First, a vector with one element per sample in the region of competence is created. Then, for each instance belonging to the region of competence, the posterior probability obtained by the base classifier is computed and inserted into the corresponding position of the vector. Consequently, one meta-feature per neighbor is computed.
4.3.2 Logarithmic difference:
First, a vector with one element per sample in the region of competence is created. For each instance belonging to the region of competence, the support obtained by the base classifier for the correct class label is estimated. Then, a logarithmic function WoloszynskiK09 is applied to this support (Equation 9). The function is chosen such that the value of the meta-feature is negative if the support obtained for the correct class label is lower than the support obtained from random guessing, and positive otherwise. The result of the logarithmic function is inserted into the corresponding position of the vector. Hence, one meta-feature per neighbor is computed.
4.3.3 Entropy:
The entropy measures the degree of uncertainty in the vector of supports obtained by the base classifier. The meta-feature is calculated as follows: first, a vector with one element per sample in the region of competence is created. Then, for each instance belonging to the region of competence, the entropy of the vector of class supports is computed and inserted in the corresponding position of the vector (Equation 10). Thus, one meta-feature per neighbor is computed.
4.3.4 Minimal difference:
First, a vector with one element per sample in the region of competence is created. Then, for each sample belonging to the region of competence, the differences between the support obtained by the base classifier for the correct class label and the supports it obtained for each of the other classes are calculated, as proposed in zbMATH05935973. The difference with the minimal value is inserted in the corresponding position of the vector (Equation 11). Thus, one meta-feature per neighbor is computed.
4.3.5 Kullback-Leibler Divergence:
The Kullback-Leibler (KL) divergence Kullback51klDivergence estimates the competence of a base classifier from the information theory perspective WoloszynskiKPS12. The meta-feature is computed as follows: first, a vector with one element per sample in the region of competence is created. Then, for each member of the region of competence, the KL divergence between the vector of class supports obtained by the base classifier and the one obtained by the random classifier is computed. The result of the KL divergence is inserted in the corresponding position of the vector (Equation 12). Consequently, one meta-feature per neighbor is calculated.
4.3.6 Exponential difference:
First, a vector with one element per sample in the region of competence is created. For each sample belonging to the region of competence, the support obtained by the base classifier for the correct class label is estimated. Next, an exponential function WoloszynskiK09 is applied over this support (Equation 13). Using the exponential function, the value of the meta-feature increases exponentially when the support for the correct class is higher than that obtained from random guessing, and is negative otherwise. The result of the exponential function is inserted in the corresponding position of the vector. Hence, one meta-feature per neighbor is computed.
4.3.7 Randomized Reference Classifier:
First, a vector with one element per sample in the region of competence is created. For each sample belonging to the region of competence, the conditional probability of correct classification is estimated by the randomized reference classifier (RRC) proposed in Woloszynski (Matlab code for this technique is available at http://www.mathworks.com/matlabcentral/fileexchange/28391-a-probabilistic-model-of-classifier-competence). The result is inserted in the corresponding position of the vector. Thus, one meta-feature per neighbor is computed.
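Several of the probabilistic measures in this section can be illustrated for a single vector of class supports. The sketch below is written under stated assumptions: the logarithmic form is one function that satisfies the sign property described above (zero at the random-guessing support), the entropy is the Shannon entropy with natural logarithms, and the KL divergence is taken from the class supports to the uniform distribution. None of these should be read as a transcription of Equations 9 to 12, and all function names are hypothetical.

```python
import math


def log_measure(support_correct, n_classes):
    """Negative below the random-guessing support 1/n_classes, positive above it."""
    return 2.0 * support_correct ** (math.log(2) / math.log(n_classes)) - 1.0


def entropy(supports):
    """Shannon entropy of the vector of class supports (natural logarithm)."""
    return -sum(s * math.log(s) for s in supports if s > 0)


def minimal_difference(supports, correct):
    """Smallest gap between the support of the correct class and any other class."""
    return min(supports[correct] - s for l, s in enumerate(supports) if l != correct)


def kl_from_uniform(supports):
    """KL divergence between the class supports and the uniform random classifier."""
    n_classes = len(supports)
    return sum(s * math.log(s * n_classes) for s in supports if s > 0)
```

Each function is applied to every sample of the region of competence in turn, so that each measure contributes one meta-feature per neighbor.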
4.4 Behavior meta-features
These measures take into consideration information extracted from the decision space, i.e., the outputs or behavior of the classifiers in the pool, rather than information from the feature space. Global information about the whole pool of classifiers is considered. Furthermore, many authors have successfully utilized DES criteria based on classifier behavior in estimating the competence of base classifiers paulo2; logid; CruzPR.
4.4.1 Output profiles classification:
First, a vector with one element per similar output profile is created. Then, for each member of the set of similar output profiles, if the label produced by the base classifier for that member is equal to its true label, the corresponding position of the vector is set to 1; otherwise it is set to 0. One meta-feature per similar output profile is extracted.
4.5 Ranking Meta-Features
Ranking methods for estimating the competence of base classifiers have been proposed in classrank. The ranking is computed such that classifiers with higher ranking values are more likely to be competent. In this work, we consider two types of ranking meta-features, one based on the feature space, and the other on the decision space. They are defined below:
4.5.1 Simplified classifier rank:
This meta-feature is inspired by the simplified classifier rank technique proposed in lca. The first step in extracting the ranking meta-feature is to order the instances in the dynamic selection dataset by their distance to the query sample. The meta-feature is then computed as the number of consecutive correct predictions made by the base classifier, starting from the sample closest to the query. The search stops when the first misclassification is made.
4.5.2 Classifier rank OP:
This meta-feature is computed similarly to the previous one. However, the search is conducted in the decision space, using the output profiles rather than the dynamic selection dataset. Hence, the first step is to order the output profiles by their similarity to the output profile of the query sample. Then, the meta-feature is computed as the number of consecutive correct predictions made by the base classifier.
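Both ranking meta-features reduce to counting consecutive correct predictions over an ordered list of samples. A minimal sketch, assuming numeric feature vectors and a base classifier given as a callable (`consecutive_correct` and `simplified_rank` are hypothetical names):

```python
import math


def consecutive_correct(correct_flags):
    """Count correct predictions until the first misclassification."""
    rank = 0
    for is_correct in correct_flags:
        if not is_correct:
            break
        rank += 1
    return rank


def simplified_rank(query, samples, labels, classifier):
    """Rank of a base classifier in the feature space around the query."""
    # Order the samples from the closest to the farthest from the query.
    order = sorted(range(len(samples)), key=lambda i: math.dist(query, samples[i]))
    return consecutive_correct(classifier(samples[i]) == labels[i] for i in order)
```

The decision-space variant follows the same pattern: the samples are replaced by output profiles and the ordering is computed from the similarity to the output profile of the query.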
5 Case study using synthetic data
In this section, we conduct experiments using a synthetic dataset in order to illustrate the benefits of the meta-feature selection process and compare different versions of the META-DES framework for solving a problem with a complex non-linear geometry using a pool composed of linear classifiers. The P2 is a two-class problem, presented by Valentini Valentini05, in which each class is defined in multiple decision regions delimited by polynomial and trigonometric functions (Equations 14, 15, 16 and 17). As in henniges, one of these functions was modified such that the area of each class is approximately equal. The P2 problem is illustrated in Figure 3. It is impossible to solve this problem using a single linear classifier, and the performance of the best possible linear classifier is around 50%.
For this illustrative example, the P2 problem was generated as in reportarXiv: 500 samples for training, 500 instances for the meta-training dataset, 500 instances for the dynamic selection dataset, and 2000 samples for the test set. The pool of classifiers is composed of 5 Perceptrons (shown in Figure 4). The best classifier of the pool (Single Best) achieves an accuracy of 53.5%. The performance of all other base classifiers is around the 50% mark. The Oracle result of this pool is a recognition performance of 99.5%. In other words, there is at least one base classifier that predicts the correct label for 99.5% of the test instances. The problem lies in selecting the competent classifiers in order to achieve a classification accuracy close to that of the Oracle.
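The Oracle performance quoted above counts a test instance as correct whenever at least one base classifier in the pool predicts its label. A short sketch of that computation, with a hypothetical function name and toy data:

```python
def oracle_accuracy(predictions, labels):
    """Fraction of instances correctly labelled by at least one base classifier.

    predictions: one row per base classifier, one predicted label per instance.
    """
    hits = sum(any(row[j] == y for row in predictions) for j, y in enumerate(labels))
    return hits / len(labels)
```

The Oracle therefore gives an upper bound on what any selection scheme can achieve with a given pool, which is why it is used as the reference behavior for the meta-classifier.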
Figures 5 (a) and (b) show the decision boundaries obtained by the META-DES CruzPR and the proposed META-DES.Oracle, respectively (the results achieved by different dynamic and static ensemble techniques for the P2 problem are presented in reportarXiv). We can observe that the META-DES.Oracle obtains a close approximation of the real decision boundary of the P2 problem. The META-DES.Oracle proposed in this paper obtained a recognition accuracy of 97%, while the accuracy of the META-DES was 94.5% reportarXiv. Using the extended sets of meta-features and the meta-feature selection procedure based on the Oracle definition, we observed a gain of 2.5 percentage points for the P2 problem. Thus, it is possible to reduce the gap that exists between the performance of the current state-of-the-art DES techniques and that of the ideal one, the Oracle.
The experiments were conducted on the same 30 classification datasets used in our previous work CruzPR; ijcnn2015. The key features of each dataset are shown in Table 2.
|Database||# samples||# features||# Classes||Repository|
|Steel Plate Faults||1941||27||7||UCI|
|MAGIC Gamma Telescope||19020||10||2||KEEL|
6.2 Experimental protocol
For each dataset, the experiments were conducted using 20 replications. For each replication, the datasets were divided using the holdout method hastie_09 on the basis of 50% for training, 25% for the dynamic selection dataset, and 25% for the test set. The divisions were performed while maintaining the prior probabilities of each class. For the proposed META-DES.Oracle, 25% of the training data was used in the meta-training process.
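A stratified holdout split of this kind can be sketched as follows. The rounding policy and the fixed seed are assumptions made for the illustration, and `stratified_holdout` is a hypothetical name; the original experiments were not necessarily implemented this way.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, fractions=(0.5, 0.25, 0.25), seed=0):
    """Split sample indices into parts while preserving the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    parts = [[] for _ in fractions]
    for indices in by_class.values():
        rng.shuffle(indices)
        start = 0
        for p, fraction in enumerate(fractions):
            # The last part receives whatever remains, so no sample is dropped.
            if p == len(fractions) - 1:
                stop = len(indices)
            else:
                stop = start + round(fraction * len(indices))
            parts[p].extend(indices[start:stop])
            start = stop
    return parts
```

Calling the function with the default fractions returns the training, dynamic selection and test indices, in that order; changing the seed for each of the 20 replications yields a different partition each time.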
For the two-class classification problems, the pool of classifiers was composed of 100 Perceptrons generated using the Bagging technique. For the multi-class problems, the pool of classifiers was composed of 100 multi-class Perceptrons. The use of linear Perceptron classifiers was motivated by the results reported in Section 5, which show that the META-DES framework can solve non-linear classification problems with complex decision boundaries using only a few linear classifiers. The values of the hyper-parameters $K$, $K_p$ and $h_C$ were set at 7, 5 and 70%, respectively. They were selected empirically based on previous publications ijcnn2011; icpr2014; CruzPR. Hence, the size of the meta-feature vector is 67 ($(8 \times 7) + 5 + 6$).
The parameters of the BPSO algorithm were set based on previous work in the literature ChuangTY11; Chuang:2008; FirpiG04, with a population size of 20 and a fixed maximum number of generations. The weight function and the acceleration coefficients were set using the standard values from Kennedy:PSO. Moreover, the optimization process was stopped if the fitness of the best solution failed to improve after 5 consecutive iterations. Since BPSO is a stochastic algorithm, it was run 30 times for each replication. The best result, considering the Global Validation overfitting control scheme, was used for the generalization phase.
6.3 Analysis of the selected meta-features
In this section, we analyze the set of meta-features that are selected by the proposed technique. The objective of this analysis is: (1) to verify whether different sets of meta-features are better suited for different classification problems; and (2) to identify whether or not the proposed sets of meta-features are relevant.
In the first analysis, we compare how often each individual meta-feature was selected. Figure 6 illustrates the selection frequency per meta-feature, considering 20 replications. We present the results for each dataset separately. Each square represents an individual meta-feature. The color of each square represents the frequency that each meta-feature is selected. A white square indicates that the corresponding meta-feature was selected less than 25% of the time. A light grey square means the meta-feature was selected with a frequency between 25% and 50%. A dark grey square represents a frequency of 50% to 75%, and a black square represents a frequency of selection higher than 75%.
It can be seen that the frequency at which each meta-feature is selected varies considerably between different datasets. For instance, the meta-feature based on the classification of the neighbor samples, , was selected with a frequency between 25 and 50% in the majority of datasets. However, for the Wine dataset, it was not selected at all. The only exceptions are for the meta-feature sets, , which presented a 100% frequency of selection for all 30 datasets, and . This finding demonstrates that distinct classification problems require a different set of meta-features in order to better address the behavior of the Oracle. Different problems are associated with different degrees of data complexity HoB02, and may require a distinct set of meta-features in order to obtain a meta-classifier that presents a behavior closer to the Oracle for estimating the competence of the base classifiers. Hence, the results show that the choice of the best set of meta-features is problem-dependent. In addition, we can see that each individual meta-feature is selected for at least 20% of the datasets, considering all 30 classification problems (Figure 7). Hence, we believe that all sets of meta-features proposed in this work are relevant.
6.4 Comparative study
In this section, we compare the results obtained by the proposed META-DES.Oracle, which is based on 15 sets of meta-features, against the previous versions of the META-DES framework, which are based only on five sets of meta-features defined in CruzPR. The objective of this comparative study is to answer the following research questions: (1) Does the optimization based on the Oracle behavior lead to a significant gain in classification accuracy? (2) Does the use of more meta-features lead to a more robust DES system?
The following versions of the META-DES framework are compared in this section:
S-shaped GV: The proposed META-DES.Oracle using S-shaped transfer function with global validation.
V-shaped GV: The proposed META-DES.Oracle using V-shaped transfer function with global validation.
S-Shaped: The proposed META-DES.Oracle using S-shaped transfer function without global validation.
V-Shaped: The proposed META-DES.Oracle using V-shaped transfer function without global validation.
META-DES.ALL: The framework using the 15 sets of meta-features proposed in this work without the optimization process.
META-DES.H: The Hybrid version, META-DES.H proposed in ijcnn2015.
META-DES: The first version of the META-DES framework CruzPR.
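The S-shaped and V-shaped variants differ in how a particle velocity is turned into a bit. The sketch below uses two transfer functions that are common in the BPSO literature (the logistic sigmoid and the absolute hyperbolic tangent); these particular functions and the update rules are assumptions for illustration, since the exact functions used in the experiments are not stated in this section.

```python
import math
import random


def s_shaped(velocity):
    """Logistic sigmoid: probability that the bit is set to 1."""
    return 1.0 / (1.0 + math.exp(-velocity))


def v_shaped(velocity):
    """Absolute hyperbolic tangent: probability that the bit is flipped."""
    return abs(math.tanh(velocity))


def update_bit_s(bit, velocity, rng):
    """S-shaped rule: the new bit depends only on the velocity."""
    return 1 if rng.random() < s_shaped(velocity) else 0


def update_bit_v(bit, velocity, rng):
    """V-shaped rule: the current bit is kept unless the velocity is large."""
    return 1 - bit if rng.random() < v_shaped(velocity) else bit
```

Under these assumptions, a zero velocity leaves a bit unchanged with the V-shaped rule, whereas the S-shaped rule resets it to 0 or 1 with equal probability. In the meta-feature selection setting, each bit indicates whether one meta-feature is kept.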
|Dataset||S-Shaped GV||V-Shaped GV||S-Shaped||V-Shaped||META-DES.ALL||META-DES.H ijcnn2015||META-DES CruzPR|
|Magic||85.69 (1.37)||86.02 (2.20)||85.79 (1.21)||85.80 (2.54)||85.25 (3.21)||85.65 (2.27)||84.35 (3.27)|
|Wilcoxon Signed Test||
Classification accuracies are reported in Table 3. The best result achieved for each dataset is highlighted in bold. The Friedman test Friedman is used to compare the results of all techniques over the 30 classification datasets. The Friedman test is a non-parametric equivalent of the repeated-measures ANOVA, used to compare several techniques over multiple datasets Demsar:2006. For each dataset, the Friedman test ranks each algorithm, with the best performing one getting rank 1, the second best rank 2, and so forth. Then, the average rank and its standard deviation are computed, considering all datasets. The best algorithm is the one presenting the lowest average rank. Since we are comparing seven techniques, the number of degrees of freedom is 6. We set the level of significance at $\alpha = 0.05$, i.e., 95% confidence. The Friedman test shows that there is a significant difference between the seven approaches. Then, a post-hoc Bonferroni-Dunn test was conducted for a pairwise comparison between the ranks achieved by each technique. The performance of two classifiers is significantly different if their difference in average rank is higher than the critical difference. The critical difference is computed as $CD = q_{\alpha}\sqrt{\frac{k(k+1)}{6N}}$, where $k$ is the number of techniques, $N$ is the number of datasets, and the critical value $q_{\alpha}$ is based on the Studentized range statistic divided by $\sqrt{2}$. The results of the post-hoc test are presented using the critical difference diagram proposed in Demsar:2006 (Figure 8). Techniques with no statistical difference are connected by a black bar in the CD diagram.
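The ranking step and the critical difference can be sketched as follows. Ties receive the average of the ranks they span. The critical value is obtained here as a Bonferroni-adjusted normal quantile, which is an assumption of this sketch for the Bonferroni-Dunn test; in practice it is usually read from a published table. Function names are hypothetical.

```python
import math
from statistics import NormalDist


def average_ranks(accuracy_table):
    """Average rank per technique; rank 1 goes to the highest accuracy.

    accuracy_table: one row per dataset, one column per technique.
    """
    n_techniques = len(accuracy_table[0])
    totals = [0.0] * n_techniques
    for row in accuracy_table:
        order = sorted(range(n_techniques), key=lambda t: -row[t])
        pos = 0
        while pos < n_techniques:
            end = pos
            while end + 1 < n_techniques and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2.0 + 1.0  # mean of ranks pos+1 .. end+1
            for t in order[pos:end + 1]:
                totals[t] += shared
            pos = end + 1
    return [total / len(accuracy_table) for total in totals]


def critical_difference(n_techniques, n_datasets, alpha=0.05):
    """Critical difference CD = q * sqrt(k (k + 1) / (6 N))."""
    # Assumption: Bonferroni-adjusted two-sided normal quantile as critical value.
    q_alpha = NormalDist().inv_cdf(1.0 - alpha / (2.0 * (n_techniques - 1)))
    return q_alpha * math.sqrt(n_techniques * (n_techniques + 1) / (6.0 * n_datasets))
```

Under this assumption, seven techniques compared over 30 datasets give a critical difference of roughly 1.47, so two techniques would need average ranks at least that far apart to be declared significantly different.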
One interesting fact is that all techniques proposed in this work obtained lower rank values when compared to the previous version of the META-DES framework. The META-DES.Oracle using the V-shaped transfer function obtained the best overall performance, achieving an average rank of 3.00. Moreover, the results obtained by this technique were also significantly better than those obtained by both the META-DES and META-DES.H.
The second statistical analysis is conducted in a pairwise fashion in order to verify whether the META-DES.Oracle significantly improves the classification accuracy when compared to the previous versions of the framework. To that end, the Wilcoxon non-parametric signed rank test with a significance level of $\alpha = 0.05$ was used, since it was suggested in Demsar:2006 as a robust method for a pairwise comparison between classification algorithms over several datasets. The results of the Wilcoxon statistical test are shown in the last row of Table 3. Techniques that achieve performances equivalent to the META-DES.H are marked with "~"; those that achieve statistically superior performance are marked with a "+", and those with inferior performance are marked with a "-". The p-values are also shown in the last row of Table 3.
The results of the Wilcoxon signed rank test also demonstrate that the META-DES.Oracle using the V-shaped transfer function and the global validation overfitting control scheme obtained classification results that are significantly superior to those of both the META-DES.H and the META-DES, with 95% confidence over the 30 datasets considered in this work. Thus, based on the analysis, we can answer the two research questions posed at the beginning of this section. The meta-feature selection optimization process does indeed significantly improve the classification performance of the system when compared to the previous versions of the framework. In addition, we can also see that the system using 15 sets of meta-features without meta-feature selection, META-DES.ALL, achieves results similar to those of the previous versions of the framework (e.g., META-DES and META-DES.H). This suggests that simply adding more meta-features does not always lead to better classification accuracy. The meta-feature selection stage is important for better addressing the behavior of the Oracle.
For the sake of simplicity, in the rest of this paper, META-DES.Oracle refers to the version of the framework using the V-shaped transfer function and global validation.
6.5 Comparison with the state-of-the-art DES techniques
In this section, we compare the accuracy obtained by the proposed META-DES.Oracle against ten state-of-the-art dynamic selection techniques Alceu2014. The goal of this analysis is to determine whether the performance of the proposed system is significantly superior to that of state-of-the-art DES techniques. The dynamic selection techniques used in this analysis are: Local Classifier Accuracy (LCA) lca, Overall Local Accuracy (OLA) lca, Modified Local Accuracy (MLA) Smits_2002, K-Nearest Oracles-Eliminate (KNORA-E), K-Nearest Oracles-Union (KNORA-U) knora, DES-FA ijcnn2011, K-Nearest Output Profiles (KNOP) paulo2, Multiple Classifier Behavior (MCB) mcb, Randomized Reference Classifier (DES-RRC) Woloszynski and DCS-Rank classrank. These techniques were selected because they presented the best results in the dynamic selection literature according to a recent survey on this topic Alceu2014.
The same pool of classifiers is used for all techniques in order to ensure a fair comparison. For all techniques, the size of the region of competence, $K$, was set at 7 since it achieved the best result in previous experiments ijcnn2011; CruzPR. The results are shown in Table 4. For each dataset, we performed a pairwise comparison between the results obtained by the proposed META-DES.Oracle and those obtained by each state-of-the-art DES technique. The comparison was conducted using the Kruskal-Wallis non-parametric statistical test, with a 95% confidence level. Results that are significantly better are marked in the table. In addition, the average rank of each technique, as well as the result of the sign test, are presented at the end of Table 4.
|Database||META-DES.Oracle||KNORA-E knora||KNORA-U knora||DES-FA ijcnn2011||LCA lca||OLA lca||MLA Smits_2002||MCB mcb||KNOP paulo2||DES-RRC Woloszynski||DCS-Rank classrank|
Figure 9 illustrates the average rank of each technique using the CD diagram. Similarly to Section 6.4, the CD was calculated using the Bonferroni-Dunn post-hoc test. The META-DES.Oracle obtained the lowest average rank, 2.73, followed by the technique based on probabilistic models, DES-RRC Woloszynski, with an average rank of 4.40. Hence, the performance of the META-DES.Oracle is significantly better than that of the majority of the state-of-the-art DES techniques. Only the DES-RRC obtained a statistically equivalent performance. However, when we compared those two techniques in terms of wins, ties and losses as reported in Table 4, we could see that the META-DES.Oracle obtained the best accuracy for 24 datasets, while the DES-RRC outperformed the META-DES.Oracle in only 4 datasets. For two datasets, the results of both techniques were tied. Furthermore, we also performed the Wilcoxon non-parametric signed rank test with a significance level of $\alpha = 0.05$ for a pairwise comparison between the results obtained by the META-DES.Oracle and those of the state-of-the-art DES techniques over the 30 datasets. The results of the Wilcoxon test are presented in the last row of Table 4.
When a pairwise comparison between the techniques is performed, we can see that the META-DES.Oracle dominates the previous DES techniques. Its performance is statistically better than that of any of the 10 state-of-the-art techniques. This can be explained by two factors. First, state-of-the-art DES techniques are based on only one criterion to estimate the competence of the base classifier, such as local accuracy, ranking, or probabilistic models, whereas the META-DES framework combines several of them. For instance, the ranking and probabilistic criteria used by the DCS-Rank and DES-RRC techniques are embedded in the META-DES framework as meta-features. Second, through the BPSO meta-feature selection scheme, only the meta-features that are relevant for the given classification problem are selected and used for the training of the meta-classifier. As shown in Figure 6, the selected meta-features vary considerably according to different classification problems. Thus, it is expected that the proposed framework obtains a significant gain in performance when compared to previous DES techniques.
6.6 Comparison with Static techniques
In this section, we compare the results obtained by the META-DES.Oracle against static ensemble techniques as well as single classifier models. For the static ensemble techniques, we evaluate the performance of the AdaBoost boosting, Bagging bagging, Random Forest breiman2001random; rokach2016decision, the classifier with the highest accuracy in the validation data (Single Best) and a static ensemble selection method based on the majority voting error proposed in classmaj. Furthermore, two single classifier models were considered: Multi-Layer Perceptron (MLP) Neural Network and a Support Vector Machine with Gaussian Kernel (SVM). These classifiers were selected based on a recent study delgado14a that ranked the best classifiers in a comparison considering a total of 179 classifiers over 121 classification datasets.
The objective of this study is to determine whether the proposed META-DES.Oracle obtain recognition accuracy that is either statistically better or equivalent to the ones achieved by the best classifiers in the literature delgado14a. This is an important analysis since the DES literature still lacks a comparison with classical classification approaches that do not use ensembles. In the DES literature, the accuracy of the proposed techniques are only compared either with other DES techniques or with static ensemble selection considering the same pool of classifiers Alceu2014.
All classifiers were evaluated using the Matlab PRTOOLS toolbox PRTools. Since static techniques require neither a meta-training nor a dynamic selection phase, the training set and the meta-training set were merged into a single training set. The dynamic selection dataset (DSEL) was used as the validation dataset. The test set remained unchanged. For each replication, the hyper-parameters of each classifier model were set as follows:
MLP Neural Network: We varied the number of neurons in the hidden layer from 10 to 100 in steps of 10. The configuration that achieved the best results on the validation data was used. The MLP training process was conducted using the Levenberg-Marquardt algorithm. Training was stopped if the performance on the validation set decreased or failed to improve for five consecutive epochs.
SVM with a Gaussian Kernel: A grid search was performed in order to set the values of the regularization parameter and the kernel spread parameter.
Random Forest: We varied the number of trees from 25 to 200 in steps of 25. The configuration with the highest performance on the validation dataset was used for generalization.
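The selection protocol described above can be sketched in a few lines. This is a minimal illustration in Python rather than the Matlab PRTools setup used in the experiments; the helper names (`select_by_validation`, `early_stop_epoch`, `toy_validation_score`) and the SVM grid values are hypothetical and are not taken from the experiments.

```python
from itertools import product

def select_by_validation(candidates, validation_score):
    """Return (config, score) for the candidate with the highest validation score.

    The first candidate wins ties. `validation_score` is any callable that
    trains a model with the given configuration and scores it on the
    validation set (DSEL in the protocol above).
    """
    best = None
    for cfg in candidates:
        score = validation_score(cfg)
        if best is None or score > best[1]:
            best = (cfg, score)
    if best is None:
        raise ValueError("no candidate configurations were given")
    return best

def early_stop_epoch(val_scores, patience=5):
    """Return the epoch index at which training stops, or None if it never does.

    Training stops once the validation score has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best_score = float("-inf")
    stale = 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None

# Grids stated in the text.
MLP_HIDDEN_NEURONS = list(range(10, 101, 10))  # 10, 20, ..., 100
RF_NUM_TREES = list(range(25, 201, 25))        # 25, 50, ..., 200

# Hypothetical SVM grid: the text does not give the searched values,
# so these (regularization, kernel spread) pairs are placeholders only.
SVM_GRID = list(product([0.1, 1.0, 10.0], [0.01, 0.1, 1.0]))

def toy_validation_score(n_neurons):
    """Hypothetical stand-in for 'train, then score on validation data'."""
    return 1.0 - abs(n_neurons - 40) / 100.0

best_neurons, best_score = select_by_validation(MLP_HIDDEN_NEURONS, toy_validation_score)
```

In practice, `toy_validation_score` would be replaced by a function that trains the model on the merged training set and returns its accuracy on the validation set.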
The classification accuracy of each technique is reported in Table 5. For each dataset, we performed a pairwise comparison between the results obtained by the proposed META-DES.Oracle and those obtained by each static ensemble technique and single classifier model. The comparison was conducted using the Kruskal-Wallis non-parametric statistical test at a 95% confidence level. Results that are significantly better at this confidence level are marked in the table. Moreover, we also report the average ranks and the results of the Wilcoxon test at the end of Table 5. Figure 10 illustrates the critical difference diagram.
|Database|META-DES.Oracle|Single Best Alceu2014|Bagging bagging|AdaBoost boosting|Static Selection classmaj|MLP NN|SVM|Random Forest|
|Wilcoxon Signed Test|n/a|
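The average ranks reported with the table can be computed as sketched below. This is a minimal illustration, assuming that rank 1 is assigned to the most accurate technique on each dataset and that tied techniques share the average of the tied positions; the accuracy values are hypothetical and are not taken from Table 5.

```python
def rank_row(accuracies):
    """Rank the techniques on one dataset: 1 = best, ties share the average rank."""
    order = sorted(range(len(accuracies)), key=lambda i: -accuracies[i])
    ranks = [0.0] * len(accuracies)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of techniques tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and accuracies[order[end + 1]] == accuracies[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks

def average_ranks(table):
    """table[d][t] is the accuracy of technique t on dataset d."""
    n_techniques = len(table[0])
    sums = [0.0] * n_techniques
    for row in table:
        for t, r in enumerate(rank_row(row)):
            sums[t] += r
    return [s / len(table) for s in sums]

# Hypothetical accuracies: 2 datasets (rows) x 3 techniques (columns).
toy_table = [
    [0.9, 0.8, 0.7],
    [0.6, 0.8, 0.7],
]
toy_average_ranks = average_ranks(toy_table)
```

A lower average rank indicates a technique that is more often among the most accurate across datasets.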
Based on the results, we can conclude that the META-DES.Oracle outperforms static ensemble techniques. This result was expected, since many works in the DES literature have shown that dynamic selection outperforms static combination rules in many applications Alceu2014. Moreover, this claim is especially true when a pool of weak linear classifiers is considered, since they become experts in different regions of the feature space. As reported in reportarXiv, a static combination of base classifiers in such a case may not yield good classification performance, since the classifiers in the pool may never reach a consensus on the correct answer. However, when dynamic selection is used, only the most competent classifiers for the given query sample are selected to predict its label. As such, the classifiers that are not experts in the local region do not negatively influence the ensemble decision.
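The contrast between static combination and dynamic selection can be illustrated with a toy example. The sketch below is not the META-DES.Oracle: it assumes a plain local-accuracy criterion computed over the k nearest validation samples, a hypothetical one-dimensional problem, and three hand-written threshold classifiers.

```python
from collections import Counter

def majority_vote(votes):
    """Most frequent label; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]

def static_predict(query, pool):
    """Static combination: every classifier in the pool always votes."""
    return majority_vote([clf(query) for clf in pool])

def dynamic_predict(query, pool, dsel_x, dsel_y, k=3):
    """Dynamic selection using local accuracy in the region of competence."""
    # Region of competence: the k validation (DSEL) samples nearest to the query.
    neighbours = sorted(range(len(dsel_x)), key=lambda i: abs(dsel_x[i] - query))[:k]
    # Competence of each classifier: its accuracy on those neighbours.
    local_acc = [
        sum(clf(dsel_x[i]) == dsel_y[i] for i in neighbours) / len(neighbours)
        for clf in pool
    ]
    # Only the locally most accurate classifiers are allowed to vote.
    best = max(local_acc)
    selected = [clf for clf, acc in zip(pool, local_acc) if acc == best]
    return majority_vote([clf(query) for clf in selected])

# Hypothetical problem: class 1 on both outer sides, class 0 in the middle.
dsel_x = [-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0]
dsel_y = [1, 1, 1, 0, 0, 0, 1, 1, 1]

# Hypothetical pool of weak classifiers, each an expert in one region only.
pool = [
    lambda x: 1 if x > 1 else 0,   # correct in the middle and on the right
    lambda x: 1 if x < -1 else 0,  # correct in the middle and on the left
    lambda x: 0,                   # correct only in the middle
]

static_label = static_predict(2.2, pool)                    # outvoted by non-experts
dynamic_label = dynamic_predict(2.2, pool, dsel_x, dsel_y)  # only the local expert votes
```

For the query at 2.2, whose true class in this toy problem is 1, the static vote is dominated by the two classifiers that are not experts on the right-hand side, while the dynamic scheme lets only the locally accurate classifier decide.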
When compared with single classifier models, the META-DES.Oracle obtained the lowest average rank. The results achieved by the META-DES.Oracle are statistically equivalent to those achieved by the SVM classifier, based on both the Friedman test with the Bonferroni-Dunn post-hoc test and the Wilcoxon sign test at the significance level considered. Hence, the analysis demonstrates that the classification performance achieved by the proposed META-DES.Oracle is among the best in the literature, since both SVM and Random Forests presented the overall best performance in the analysis conducted by Delgado et al. delgado14a.
It is important to point out that the META-DES.Oracle obtained a small advantage in terms of wins, ties and losses when compared to the SVM classifier. The META-DES.Oracle presented the best recognition accuracy in 16 datasets, while the SVM obtained a higher accuracy in 12 datasets. For two datasets (Vertebral Column and Mammographic), the results were tied. This result may be explained by the fact that most of the datasets used in this analysis are ill-defined, i.e., small sample size datasets. For such datasets, the training data may not contain enough samples to train a single classifier model and select the best hyper-parameters, e.g., the number of neurons in the hidden layer of an MLP neural network, or the regularization parameter and the kernel spread parameter of an SVM. In addition, since the training set is small, the training and test distributions may differ. The META-DES.Oracle obtained the best results for several ill-defined problems, such as Liver disorders, Blood transfusion, Heart, Laryngeal1, Wine and Thyroid. Those are all small datasets, with fewer than 500 samples available for training. One advantage of the META-DES framework is that the pool is composed of linear classifiers, which do not require the selection of any hyper-parameters. Thus, the training can be performed using small datasets. Since the training set is relatively small, the classifiers may specialize in local regions of the feature space. Using dynamic selection, only the most competent classifiers in the local region where the test sample is located are used to predict its label. Thus, through DES, it is still possible to obtain high classification accuracy even for ill-defined problems.
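The win, tie and loss counts above can be tallied and examined with a simple exact sign test, as sketched below. This is an illustrative alternative and not the Wilcoxon or Friedman analysis used in this work; the function names are hypothetical, and tied datasets are discarded before the test, as is usual for a sign test.

```python
from math import comb

def win_tie_loss(acc_a, acc_b):
    """Count the datasets on which technique A beats, ties with, or loses to B."""
    wins = sum(a > b for a, b in zip(acc_a, acc_b))
    ties = sum(a == b for a, b in zip(acc_a, acc_b))
    losses = sum(a < b for a, b in zip(acc_a, acc_b))
    return wins, ties, losses

def sign_test_p_value(wins, losses):
    """Two-sided exact sign test under H0: wins and losses are equally likely."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    # Probability of observing at least k successes out of n fair coin flips.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Counts reported in the text: 16 wins, 12 losses, 2 ties (ties are dropped).
p_value = sign_test_p_value(wins=16, losses=12)
```

A large p-value from such a test is what one would expect when two techniques are statistically equivalent despite a small difference in the number of wins.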
Furthermore, the optimization process of the META-DES.Oracle framework is conducted in the meta-problem, using the meta-data extracted in the meta-training stage. Several meta-feature vectors are generated for each training sample in the meta-training phase. Consequently, the size of the meta-training dataset is given by the number of training samples available for the meta-training stage multiplied by the number of weak classifiers in the pool. There is therefore more data to train the meta-classifier than to generate the pool of classifiers itself. Hence, even though the classification problem may be ill-defined given the size of the training set, the proposed framework can overcome this limitation, since the meta-problem is up to 100 times bigger than the classification problem.
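The size argument above can be made concrete with a short sketch. The figures used here are illustrative assumptions: 200 samples is a hypothetical meta-training set size, and a pool of 100 classifiers is assumed only because it matches the "up to 100 times" ratio stated in the text.

```python
from itertools import product

def meta_training_pairs(n_samples, pool_size):
    """Enumerate the (sample index, classifier index) pairs of the meta-problem.

    Each pair can contribute at most one meta-feature vector, so the number
    of pairs is an upper bound on the size of the meta-training dataset.
    """
    return list(product(range(n_samples), range(pool_size)))

n_samples = 200   # hypothetical number of meta-training samples
pool_size = 100   # assumed pool size, consistent with the "up to 100 times" ratio

pairs = meta_training_pairs(n_samples, pool_size)
max_meta_size = len(pairs)               # n_samples * pool_size
growth_factor = max_meta_size // n_samples
```

The growth factor equals the pool size, which is why a larger pool gives the meta-classifier more training data even when the original training set is small.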
In this paper, we propose a novel DES framework using meta-learning and Oracle information, called META-DES.Oracle. Fifteen sets of meta-features are proposed, using different sources of information found in the DES literature for dynamically estimating the level of competence of base classifiers; these include local accuracy, ranking, probabilistic, ambiguity and behavior. Next, a meta-feature selection scheme using overfitting-cautious Binary Particle Swarm Optimization (BPSO) is performed to optimize the performance of the meta-classifier. The optimization process is guided by a formal definition of the Oracle. Thus, the meta-classifier can better address the complex behavior of the Oracle.
We have conducted a case study using the P2 problem, which is a synthetic dataset with a complex non-linear decision boundary. We demonstrate that, using a pool composed of 5 linear Perceptron classifiers, it is possible to approximate the complex decision boundary of the P2 problem with the proposed framework. The proposed META-DES.Oracle obtained a recognition performance of 97%, which is closer to the result obtained by the Oracle and compares very favorably against previous versions of the META-DES framework.
Experiments were conducted using 30 classification problems. First, we performed an analysis of the meta-features that were selected for each problem. The analysis demonstrated that the selected sets of meta-features vary considerably from one dataset to another. In addition, each meta-feature was selected in at least 20% of the datasets. All sets of meta-features were thus relevant in better addressing the complex behavior of the Oracle. Next, the performance obtained by the proposed META-DES.Oracle was compared with previous versions of the META-DES framework, as well as ten state-of-the-art dynamic selection techniques. Experimental results demonstrate that the META-DES.Oracle outperforms the previous versions of the technique in the majority of the datasets. In addition, the gain in performance obtained by the META-DES.Oracle is shown to be statistically significant based on both the Friedman test with a post-hoc Bonferroni-Dunn correction and the Wilcoxon signed-rank test. Thus, the BPSO meta-feature selection scheme proposed in this paper does indeed significantly improve the classification performance of the framework.
When compared with static and single classifier methods, the results achieved by the proposed META-DES.Oracle are comparable with the best performing classifiers. Moreover, the results confirm the claim that DES techniques outperform single classifier models for ill-defined problems. Since the optimization process of the META-DES.Oracle is performed using the meta-data generated during the meta-training stage, there is enough data to train and optimize the meta-classifier. Thus, the proposed framework can deal with small sample size classification problems.
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), the École de technologie supérieure (ÉTS Montréal) and CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico).