A Self-adaptive Weighted Differential Evolution Approach for Large-scale Feature Selection

Recently, many evolutionary computation methods have been developed to solve the feature selection problem. However, most studies have focused on small-scale problems, and the resulting methods suffer from stagnation in local optima and numerical instability when applied to large-scale feature selection. To address these challenges, this paper proposes a novel weighted differential evolution algorithm based on a self-adaptive mechanism, named SaWDE, for large-scale feature selection. First, a multi-population mechanism is adopted to enhance the diversity of the population. Then, we propose a new self-adaptive mechanism that selects several strategies from a strategy pool to capture the diverse characteristics of the datasets from historical information. Finally, a weighted model is designed to identify the important features, which enables our model to generate the most suitable feature-selection solution. We demonstrate the effectiveness of our algorithm on twelve large-scale datasets. SaWDE outperforms six non-EC algorithms and six other EC algorithms on both the training and test datasets and in subset size, indicating that our algorithm is a favorable tool for solving the large-scale feature selection problem. Moreover, we have compared SaWDE with six EC algorithms on twelve higher-dimensional datasets, which demonstrates that SaWDE is more robust and efficient than those state-of-the-art methods. The SaWDE source code is available on GitHub at https://github.com/wangxb96/SaWDE.


1 Introduction

In the context of the rise of Internet technology, data is growing exponentially, leading to the accumulation of numerous large-scale datasets [cai2018feature]. Hence, it is essential to extract valuable information from such huge amounts of data [ayesha2020overview]. However, the increasing amount of high-dimensional data brings the "curse of dimensionality", which poses computational challenges, especially for classification [xue2019self, song2020variable]. In addition, existing computational algorithms reveal limitations such as high complexity, computational expense, poor robustness, and low generalizability [wainwright2019high]. Therefore, it is necessary to develop effective computational models that select important features for data classification and thereby discover valuable information [A17].

Feature selection [A18] is an effective method to reduce redundant features and has proved useful in data processing. It can be regarded as a combinatorial optimization problem with $2^n$ candidate solutions for $n$-dimensional features [xue2019self]. In the past, many heuristic methods have been proposed to address feature selection problems; they can be divided into three categories [sheikhpour2017survey]: filter methods, wrapper methods, and embedded methods. Filter methods [labani2018novel] uncover the importance of features using the internal structure of the training data, which takes less time in the training stage; for instance, Liu et al. [liu1996probabilistic] proposed a probabilistic filter solution for feature selection. Yu et al. [yu2003feature] developed a fast correlation-based filter solution to tackle the high-dimensional feature selection problem. Hancer et al. [hancer2018differential] combined information theory and feature ranking with DE for feature selection. However, filter methods cannot always provide good results. The wrapper method [sheikhpour2016particle] adopts the performance of the downstream learner as the evaluation criterion for feature subsets, while the embedded method [zhang2015embedded] automatically selects features during learning and training by combining the filter and wrapper approaches. For instance, Kabir et al. [kabir2010new] proposed a neural network-based wrapper feature selection method. Mafarja et al. [mafarja2018whale] applied the whale optimization algorithm to feature selection. Maldonado et al. [maldonado2009wrapper] proposed an SVM-based wrapper approach. Wang et al. [wang2015embedded] proposed an embedded unsupervised feature selection method. Maldonado et al. [maldonado2018dealing] developed an embedded strategy that penalizes the cardinality of the feature set by a scaling-factor technique. Lu et al. [lu2019embedded] proposed a new embedded method that accounts for unknown data heterogeneity. However, these heuristic methods can easily become trapped in local optima because they select and evaluate features individually with greedy strategies. Moreover, their performance is not sufficiently stable, since it often depends on the particular scenario and the design approach.

To improve the quality of feature selection and to avoid stagnation in local optima, evolutionary computation (EC) methods [A3], including genetic algorithms (GAs) [sayed2019nested], particle swarm optimization (PSO) [zhang2017pso], ant colony optimization (ACO) [manoj2019aco], the artificial bee colony algorithm (ABC) [hancer2018pareto], and differential evolution (DE) [zhang2020binary], have been proposed to address the aforementioned problems. For example, Chen et al. [chen2020evolutionary] proposed a multitasking-based evolutionary feature selection approach for high-dimensional classification. Sayed et al. [sayed2019nested] developed a Nested-GA method to find the optimal feature subset in high-dimensional cancer microarray datasets. Xue et al. [xue2012particle] designed a multi-objective PSO feature selection approach. Tran et al. [tran2017new] proposed a potential particle swarm optimization (PPSO) algorithm with a new representation method for feature selection. Xue et al. proposed a novel initialization and updating mechanism in PSO for feature selection. Kashef et al. [kashef2015advanced] presented an advanced binary ACO (ABACO), which treats each feature as a graph node and regards the feature selection problem as a graph model. Hancer et al. integrated a similarity search strategy into the ABC algorithm [hancer2015binary]. Mlakar et al. [mlakar2017multi] proposed a multi-objective differential evolution feature selection method for facial expression recognition. Zorarpac et al. [zorarpaci2016hybrid] combined the ABC and DE algorithms to construct a hybrid method for feature selection. Zhang et al. [zhang2020binary] proposed a binary DE algorithm with a self-learning strategy for the multi-objective feature selection problem. Khushaba et al. [khushaba2011feature] proposed a repair mechanism to integrate with DE for feature selection.

Although many EC algorithms have been employed to address feature selection problems, most encounter stagnation in local optima and numerical instability when dealing with large-scale feature selection, since large-scale data contains more irrelevant and redundant features [xue2012particle, chen2020evolutionary]. The reason may be that many EC methods are unable to explore and exploit the search space in a balanced manner under different conditions. Recently, the self-adaptive mechanism has proved to be an effective strategy for feature selection. For instance, Xue et al. [xue2020self] proposed a self-adaptive strategy-based PSO algorithm to exploit global and local information for feature selection. Aladeemy et al. [aladeemy2017new] proposed self-adaptive cohort intelligence (SACI) for simultaneous feature selection and model selection. Xue et al. [xue2019self] developed a self-adaptive PSO feature selection method for large-scale datasets. Xue et al. [xue2014ensemble] proposed an ensemble algorithm based on self-adaptive learning techniques for high-dimensional numerical optimization. Brester et al. [brester2014self] investigated several multi-objective genetic algorithms for choosing the most important features in a dataset. Huang et al. [huang2014music] used a self-adaptive harmony search (SAHS) algorithm to select local feature subsets for an automatic music genre-classification system. In essence, these algorithms are able to address problems that standard EC methods cannot.

The differential evolution algorithm, proposed by R. Storn and K. Price [storn1997differential], is a heuristic stochastic search algorithm for solving optimization problems based on population differences; its advantages include fast convergence, few control parameters, and simple setup. DE evolves multiple solutions through mutation, crossover, and selection to search for the best solution. Several studies have already applied DE to feature selection. For instance, Zainudin et al. [zainudin2017feature] combined Relief-F with DE for feature selection, using a self-adaptive mechanism to adjust the population and generation size. Aladeemy et al. [aladeemy2020new] proposed an opposition-based self-adaptive cohort intelligence (OSACI) algorithm [aladeemy2017new]. Ghosh et al. [ghosh2013self] proposed self-adaptive differential evolution (SADE) to generate feature subsets. Fister et al. [fister2018novel] used a threshold mechanism in self-adaptive differential evolution to eliminate irrelevant features. Gaspar-Cunha et al. [gaspar2014self] designed a self-adaptive multi-objective evolutionary approach (MOEA) in which the parameters of the classifier are dynamically updated. However, most of these studies lacked generalization ability and neglected high-dimensional data.

Meanwhile, since large-scale data implies a larger search space, an improper search may lead to high time consumption and low classification performance. Hence, population partitioning techniques have been developed to diversify candidate solutions and strategies for large-scale feature selection problems. Zhang et al. [zhang2019novel] proposed a multi-population niche GA (MPNGA) for feature selection; it combines several filter methods and prior knowledge to reduce the barriers to enhancing the search ability of multiple populations, and its experiments show that the multi-population structure helps keep the population diversified. Park et al. [park2020multi] proved that a multi-population approach can prevent premature convergence during evolution. Chen et al. [chen2020efficient] designed a multi-population original fruit fly algorithm (MOFOA) to boost search ability and improve feature selection results. Meanwhile, Nseef et al. [nseef2016adaptive] proposed an adaptive multi-population ABC algorithm for dynamic optimization problems to maintain diversity and cope with dynamic changes. In addition, Chen et al. [chen2020multi] combined DE with three different embedded multi-population mechanisms for Harris hawks optimization, showing that the method can effectively enhance exploratory and exploitative performance.

From this perspective, we designed a weighted self-adaptive DE algorithm (SaWDE) for feature selection. In this algorithm, a self-adaptive search mechanism from global to local is proposed, and five equal sub-populations are generated. A pool of candidate solution generation strategies is constructed to find the most suitable evolutionary strategy dynamically: eight mutation strategies are considered, and the five best-performing ones are selected to form the strategy pool. Further, a weighted model is designed to identify the most important features, enabling the model to generate the best solution. The proposed SaWDE algorithm is tested on twelve benchmark datasets and compared with several benchmark algorithms; the results indicate competitive performance. Moreover, we have compared SaWDE with six EC algorithms on twelve higher-dimensional datasets, which demonstrates that SaWDE is more robust and efficient than those state-of-the-art methods.

The rest of the paper is organized as follows: Section II describes the basics of the original DE. Section III describes the proposed method in detail. Section IV details the experimental design. Section V presents the experimental results with discussion. In Section VI, we draw conclusions and present future work.

2 Differential Evolution

The differential evolution algorithm, proposed by R. Storn and K. Price [storn1997differential], is a stochastic heuristic search algorithm for solving optimization problems. At the beginning, each candidate solution can be denoted as $X_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,D})$, where $i = 1, 2, \ldots, N$, $N$ is the population size, and $D$ is the dimension of the data. Each $X_i$ is initialized randomly as follows:

$x_{i,j}^{0} = Lb_j + rand(0,1) \cdot (Ub_j - Lb_j)$    (1)

where $j = 1, 2, \ldots, D$, the superscript denotes the generation index $g$ (here $g = 0$, with $g = 0, 1, \ldots, G$ and $G$ the number of generations), $Lb_j$ and $Ub_j$ are the lower and upper bounds of the $j$-th dimension, and $rand(0,1)$ denotes a random number on the interval [0,1].

After initialization, the DE algorithm mutates individuals through a differential mutation strategy. Taking the mutation strategy “DE/best/1” as an example, the newly generated mutation vector is:

$V_i^g = X_{best}^g + F \cdot (X_{r_1}^g - X_{r_2}^g)$    (2)

where $r_1$ and $r_2$ are two mutually exclusive random integers within the range $[1, N]$, $X_{best}^g$ is the best individual in generation $g$, and $F$ is the scaling factor for scaling the difference vector.

Then, the DE algorithm mixes each mutation vector with its target individual through the crossover operation, and the trial vector is generated as follows:

$u_{i,j}^g = \begin{cases} v_{i,j}^g, & \text{if } rand(0,1) \le CR \text{ or } j = j_{rand} \\ x_{i,j}^g, & \text{otherwise} \end{cases}$    (3)

where $CR$ is the crossover probability and $j_{rand}$ is a randomly chosen dimension that guarantees at least one component is inherited from the mutation vector. After that, the selection operation keeps the better individual:

$X_i^{g+1} = \begin{cases} U_i^g, & \text{if } f(U_i^g) \text{ is better than } f(X_i^g) \\ X_i^g, & \text{otherwise} \end{cases}$    (4)

where $f(\cdot)$ denotes the objective function.
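For concreteness, the canonical DE loop of Eqs. (1)-(4) can be sketched as follows. This is a minimal illustration assuming a generic real-valued objective to maximize and the “DE/best/1” mutation; none of the identifiers below come from the SaWDE code base.

```python
import numpy as np

def de_best_1_optimize(f, lb, ub, N=50, F=0.5, CR=0.1, generations=200, seed=0):
    """Minimal DE with 'DE/best/1' mutation and binomial crossover.

    f      : objective function to maximize, f(x) -> float
    lb, ub : 1-D arrays with per-dimension lower/upper bounds
    """
    rng = np.random.default_rng(seed)
    D = len(lb)
    X = lb + rng.random((N, D)) * (ub - lb)           # Eq. (1): random init
    fit = np.array([f(x) for x in X])
    for _ in range(generations):
        best = X[np.argmax(fit)]
        for i in range(N):
            r1, r2 = rng.choice([k for k in range(N) if k != i], 2, replace=False)
            v = np.clip(best + F * (X[r1] - X[r2]), lb, ub)   # Eq. (2)
            jrand = rng.integers(D)
            mask = rng.random(D) <= CR
            mask[jrand] = True                        # Eq. (3): binomial crossover
            u = np.where(mask, v, X[i])
            fu = f(u)
            if fu >= fit[i]:                          # Eq. (4): greedy selection
                X[i], fit[i] = u, fu
    return X[np.argmax(fit)], fit.max()
```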

3 Methods

3.1 Methodology Overview of SaWDE

In this study, we develop the SaWDE algorithm to find the best feature subsets on large-scale data. A schematic overview of the algorithm is given in Fig. 1, and its main steps are summarized in Algorithm 1. First of all, the multi-population mechanism is employed to increase the diversity of the population. In this experiment, the original population is divided into five equal-sized sub-populations, each of which chooses specific solution generation strategies through the self-adaptive mechanism. Each sub-population evolves separately, and a search operation is carried out on each sub-population. Moreover, to maintain the diversity of each sub-population, its individuals are dynamically changed at each generation.

Figure 1: Overview of the proposed SaWDE algorithm. The initialized population is divided equally into five sub-populations, $P_1$, $P_2$, $P_3$, $P_4$, $P_5$. Then, each sub-population selects a strategy (EnS) from the strategy pool through the self-adaptive mechanism. After that, the sub-population is evaluated and updated by the selected EnS. Meanwhile, the EnSs are evaluated for additional rewards and for further sub-strategy pool construction. Finally, a weighted model is proposed to assess the importance of each feature and search for the best solution by evaluating these features in a combinatorial way.

Second, a strategy pool containing ten ensemble strategies is employed in evolution. We name these strategies EnS$_1$ to EnS$_{10}$; each EnS contains three different single candidate mutation scenarios (CMSs), and the ten EnSs correspond to all $\binom{5}{3} = 10$ three-element combinations of the five selected CMSs. CMSs represent DE mutation strategies, which are detailed in Section 3.4. Strategies are first selected randomly and the performance of each strategy is recorded. Then, a self-adaptive selection mechanism is proposed to select strategies from the strategy pool. On this basis, the strategy pool is further reduced according to the past performance of each strategy, moving the selection from global to local; this preserves the diversity of strategies while strengthening the search for the best-performing strategy. Moreover, during evolution, the self-adaptive mechanism performs an additional incentive selection every twenty generations to further improve the search ability of our algorithm, inspired by [wu2016differential].

Input: population (P), population size (N), the dimension of the data (D), number of consumed fitness evaluations (FES), maximum FES (MaxFES), subset size (SZ), candidate mutation scenarios (CMS), classification accuracy (Acc), the objective function f(·), and so on.
Output: The classification accuracy and the subset size;
Diversification initialization;
FES = 0;
while (FES < MaxFES) do
          Randomly partition P into 5 sub-populations P_1, ..., P_5;
          for i = 1 to 5 do
                   Use the self-adaptive strategy mechanism to select an EnS for P_i;
                   for m = 1 to 3 do
                            Take the m-th CMS in the selected EnS;
                            for k = 1 to N/5 do
                                     Mutate, cross over and, if better, update individual k of P_i;
                   Update FES;
                   weight ← use the weighted model to calculate the feature importance;
                   Use the self-adaptive strategy mechanism to evaluate the performance of the selected EnS;
          Update the population P by the weighted model;
Algorithm 1 Pseudo Code of the SaWDE Algorithm.

Finally, a weighted model is proposed to discover the important features of the dataset at each generation. In this model, we first record the updated features of the current population and those features in the top 20% of individuals in the sub-populations. After that, the model selects and evaluates the features in a combinatorial way according to their rank determined in the previous step.

3.2 Representation of Solutions

Since the standard DE algorithm is a continuous optimizer, its continuous encoding scheme cannot directly address large-scale feature selection problems. In our study, we transfer a continuous vector into a binary string using a threshold $\lambda$. First, so that an individual can represent a feature subset, the population $P$ of $N$ individuals with $D$-dimensional vectors is defined as follows:

$P = \{X_1, X_2, \ldots, X_N\}$    (5)
$X_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,D}), \quad x_{i,j} \in [0, 1]$    (6)

After that, we use the threshold $\lambda$ to transfer each element of an individual into a binary value $y_{i,j}$. If the value of the $j$-th dimension of individual $X_i$ is greater than $\lambda$, we set $y_{i,j}$ to 1; otherwise, $y_{i,j}$ is set to 0. Each value in the binary string $Y_i$ is thus 0 or 1, where 1 means that the $j$-th feature is selected and 0 means that it is not:

$y_{i,j} = \begin{cases} 1, & \text{if } x_{i,j} > \lambda \\ 0, & \text{otherwise} \end{cases}$    (7)
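This encoding is a one-liner in practice; a minimal sketch follows, where the threshold value 0.5 is purely illustrative (the paper's setting of $\lambda$ is not restated here):

```python
import numpy as np

def binarize(X, threshold=0.5):
    """Map a continuous population X (N x D, values in [0, 1]) to binary
    feature masks: 1 = feature selected, 0 = not selected (Eq. 7).
    The threshold here is illustrative, not the paper's setting."""
    return (X > threshold).astype(int)

# Example: one 6-dimensional individual.
x = np.array([[0.12, 0.81, 0.47, 0.66, 0.03, 0.59]])
print(binarize(x))  # [[0 1 0 1 0 1]] -> features 2, 4 and 6 are selected
```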

3.3 Multi-population-based Strategy

In this work, the population is randomly and dynamically partitioned into equal-sized sub-populations. To maintain population diversity without increasing the complexity of our algorithm, five random sub-populations $P_1$, $P_2$, $P_3$, $P_4$ and $P_5$ are generated in each iteration, as follows:

$P = P_1 \cup P_2 \cup P_3 \cup P_4 \cup P_5$    (8)

We use $N$ as the population size of the parent population, and $N_1$, $N_2$, $N_3$, $N_4$ and $N_5$ represent the sizes of the sub-populations $P_1$, $P_2$, $P_3$, $P_4$ and $P_5$, respectively. In our study, each sub-population has an equal size:

$N = N_1 + N_2 + N_3 + N_4 + N_5$    (9)
$N_1 = N_2 = N_3 = N_4 = N_5 = N / 5$    (10)

Next, each sub-population is independently optimized in turn at each generation. During the evolution, each sub-population separately selects its own strategy via the self-adaptive mechanism, and then evaluates and evolves the individuals in its subspace. Since each sub-population selects its strategies independently, different strategies may be used in different sub-populations, which increases the exploration ability of the algorithm from several perspectives.

In addition, the size of each sub-population does not change between generations, while its membership is re-drawn dynamically each generation, maintaining the diversity of the sub-populations and avoiding entrapment in local optima.
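A sketch of this dynamic partitioning step, re-drawn every generation (function and variable names are illustrative):

```python
import numpy as np

def partition_population(X, n_subpops=5, rng=None):
    """Randomly split population X (N x D) into n_subpops equal-sized
    sub-populations; called anew each generation so that membership
    changes dynamically while sizes stay fixed (Eqs. 8-10)."""
    rng = rng or np.random.default_rng()
    N = X.shape[0]
    assert N % n_subpops == 0, "N must be divisible by the sub-population count"
    idx = rng.permutation(N)
    return [X[part] for part in np.split(idx, n_subpops)]

pops = partition_population(np.random.rand(100, 617))
print([p.shape for p in pops])  # five (20, 617) sub-populations
```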

3.4 Construction of the Mutation Strategy Pool

In DE, particular mutation strategies perform differently on different datasets, and some mutation strategies may be more appropriate at specific stages of the evolution than a single fixed strategy. We therefore select multiple strategies and establish a strategy pool to accelerate convergence. Before constructing the pool, selecting appropriate mutation strategies is an important step because of their potentially different performance.

There are two aspects to consider: how many mutation strategies should be selected to build the pool, and which ones. We first chose eight typical mutation operators, representative of current DE algorithms, to build eight candidate mutation scenarios (CMSs). In our investigation, CMS$_1$ to CMS$_8$ represent “DE/current to best/1”, “DE/current to rand/1”, “DE/rand/3”, “DE/best/1”, “DE/rand to best/1”, “DE/rand/2”, “DE/best/2” and “DE/best/3” [wang2011differential, qin2008differential].

During the evolution, DE generates a mutation vector $V_i^g = (v_{i,1}^g, \ldots, v_{i,D}^g)$ for each individual $X_i^g$ in the $g$-th generation. The indices $r_1, r_2, \ldots, r_7$ are mutually exclusive random integers within the range $[1, N]$. $X_{best}^g$ denotes the best individual in the present population, and $F$ is the scaling factor for scaling the difference vectors (a code sketch of several of these operators follows the list below).

  • “DE/current to best/1”

    $V_i^g = X_i^g + F \cdot (X_{best}^g - X_i^g) + F \cdot (X_{r_1}^g - X_{r_2}^g)$    (11)
  • “DE/current to rand/1”

    $V_i^g = X_i^g + rand \cdot (X_{r_1}^g - X_i^g) + F \cdot (X_{r_2}^g - X_{r_3}^g)$    (12)
  • “DE/rand/3”

    $V_i^g = X_{r_1}^g + F \cdot (X_{r_2}^g - X_{r_3}^g) + F \cdot (X_{r_4}^g - X_{r_5}^g) + F \cdot (X_{r_6}^g - X_{r_7}^g)$    (13)
  • “DE/best/1”

    $V_i^g = X_{best}^g + F \cdot (X_{r_1}^g - X_{r_2}^g)$    (14)
  • “DE/rand to best/1”

    $V_i^g = X_{r_1}^g + F \cdot (X_{best}^g - X_{r_1}^g) + F \cdot (X_{r_2}^g - X_{r_3}^g)$    (15)
  • “DE/rand/2”

    $V_i^g = X_{r_1}^g + F \cdot (X_{r_2}^g - X_{r_3}^g) + F \cdot (X_{r_4}^g - X_{r_5}^g)$    (16)
  • “DE/best/2”

    $V_i^g = X_{best}^g + F \cdot (X_{r_1}^g - X_{r_2}^g) + F \cdot (X_{r_3}^g - X_{r_4}^g)$    (17)
  • “DE/best/3”

    $V_i^g = X_{best}^g + F \cdot (X_{r_1}^g - X_{r_2}^g) + F \cdot (X_{r_3}^g - X_{r_4}^g) + F \cdot (X_{r_5}^g - X_{r_6}^g)$    (18)
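These operators map naturally onto small vectorized functions. The sketch below follows the standard definitions cited above; since the paper's exact equations were lost in extraction, treat these as the usual textbook variants rather than the authors' precise forms:

```python
import numpy as np

# Each operator builds a mutant V from the population X (N x D), the current
# index i, the best individual, a scale factor F, and pre-drawn distinct
# random indices r (mutually exclusive and different from i).
def de_best_1(X, i, best, F, r):
    return best + F * (X[r[0]] - X[r[1]])

def de_current_to_best_1(X, i, best, F, r):
    return X[i] + F * (best - X[i]) + F * (X[r[0]] - X[r[1]])

def de_rand_2(X, i, best, F, r):
    return X[r[0]] + F * (X[r[1]] - X[r[2]]) + F * (X[r[3]] - X[r[4]])

def de_best_3(X, i, best, F, r):
    return (best + F * (X[r[0]] - X[r[1]])
                 + F * (X[r[2]] - X[r[3]])
                 + F * (X[r[4]] - X[r[5]]))

# A CMS pool can then simply be a list of these callables.
CMS_POOL = [de_current_to_best_1, de_rand_2, de_best_1, de_best_3]
```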

Since practice is the ultimate test, we identify effective CMSs experimentally. First, the strategies are tested individually on a variety of datasets; then, the top five CMSs are selected to form the strategy pool.

To enhance search ability and prevent overfitting, we combine every three different CMSs into an ensemble strategy, called an EnS. With five CMSs selected, ten different EnSs are generated according to the combination principle ($\binom{5}{3} = 10$), each containing a distinct triple of the selected CMSs. These EnSs compose our initial strategy pool, as shown in Fig. 2.
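The pool construction itself is one line with the standard library; a sketch, where the CMS names are placeholders for the five experimentally selected scenarios:

```python
from itertools import combinations

selected_cms = ["CMS_a", "CMS_b", "CMS_c", "CMS_d", "CMS_e"]  # the 5 survivors
strategy_pool = list(combinations(selected_cms, 3))
print(len(strategy_pool))  # 10 ensemble strategies (EnS)
```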

Figure 2: The process of strategy pool construction. First, the CMSs are queued and each is run on the training datasets in sequence. All CMSs are then evaluated on their performance, and the top 5 CMSs are selected to construct the strategy pool. On this basis, every three different CMSs are combined into an EnS ensemble strategy, and all of these make up the initial strategy pool.

However, once a strategy pool is built, it usually does not change. Some EnSs in the pool then play no active role in the evolution process, which increases the computational cost of the algorithm without improving its performance. Moreover, a single strategy is insufficient to meet the requirements of different evolutionary stages, due to the lack of diversity noted above. Therefore, we propose a self-adaptive mechanism to choose and evaluate the EnSs.

3.5 Self-Adaptive Mechanism

In our study, the adaptive parameters of our algorithm are F and CR, similar to the references [wu2016differential, zhang2009jade]. On top of the adaptive F and CR, we propose a self-adaptive mechanism that automatically selects the most appropriate EnS depending on the characteristics of the datasets and on the EnSs' performance. Two aspects must be considered: (1) how to choose the EnSs, and (2) how to evaluate each EnS during evolution. Moreover, the performance of DE is strongly related to the setting of the strategy and control parameters [mallipeddi2011differential]. The proposed self-adaptive mechanism is described in Algorithm 2.

First, the self-adaptive selection mechanism chooses EnSs at random during the first half of the iterations to maintain fair competition. We use a counter, EnSNum, to record how many times each EnS has been selected. Meanwhile, for each EnS that successfully improves individuals in the population, we record the increased accuracy, IncreAcc, which may be defined as follows:

$\Delta Acc_i = Acc(X_i^{new}) - Acc(X_i^{old})$    (19)

$IncreAcc_k = \sum_{i \in S_k} \Delta Acc_i$    (20)

where $\Delta Acc_i$ denotes the increased accuracy of a successful individual $X_i$ and $S_k$ is the set of individuals successfully improved by EnS$_k$. Every twenty generations, the EnS with the best performance is rewarded according to its prior performance: the reward goes to the EnS with the maximum ratio of increased accuracy to consumed function evaluations ($FES_k$), i.e.

$Reward \leftarrow \arg\max_k \; IncreAcc_k / FES_k$    (21)
Input: sub-population (P_i), consumed FES, the increased accuracy of changed individuals (IncreAcc), the number of times each EnS has been selected (EnSNum), the number of rewards of each EnS (Reward), the ratio Reward/EnSNum used to relate strategy choice and reward, the top five strategies according to that ratio, and the change rate in accuracy of individuals.
Output: The selected EnS.
while (FES < MaxFES) do
          for i = 1 to 5 do
                   if FES < MaxFES/2 then
                            Select an EnS from the full strategy pool at random;
                   else
                            Compute Ratio_k = Reward_k / EnSNum_k for every EnS_k;
                            Build the sub-strategy pool from the top five EnSs by Ratio;
                            Select an EnS from the sub-strategy pool at random;
                   if mod(generations, 20) == 0 then
                            Find the EnS with the maximum IncreAcc/FES since the last reward;
                            Select that EnS once more as its reward;
                            Reward = Reward + 1 for that EnS;
                   EnSNum = EnSNum + 1 for the selected EnS;
                   Update IncreAcc and FES;
          Update FES;
Algorithm 2 Pseudo Code of the Self-adaptive Strategy Mechanism.
Figure 3: The workflow of the self-adaptive mechanism. Throughout evolution, each EnS in the strategy pool is selected with equal probability during the first half of the run. The selected EnS is then evaluated and its performance recorded. The EnS with the best performance over each twenty-generation window receives an extra selection as a reward. During the first half of the run, the strategy pool is the initialized pool of Fig. 2; in the second half it is reduced by half according to the performance of the EnSs: before each function evaluation, the top 5 EnSs are dynamically placed in the sub-strategy pool based on their performance, and EnSs are then selected from it in the same way as in the first half.
$Ratio_k = Reward_k / EnSNum_k$    (22)

Based on this, the EnSs are selected from global to local based on their performance in the first half of the run. Specifically, a sub-strategy pool with half the content of the original strategy pool is constructed to move the strategy search from global to local. To build it, we first calculate the performance of each EnS before each evaluation; then, the top five EnSs with the highest ratio of total rewards (Reward) to total selections (EnSNum), as in Eq. (22), are placed in the sub-strategy pool for further search. The EnS is then selected from the sub-strategy pool in the same way as in the first half, which ensures that appropriate EnSs are selected.
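A compact sketch of this bookkeeping follows; the counter and reward names are reconstructions, since the paper's own symbols were lost in extraction:

```python
import numpy as np

class StrategySelector:
    """Select an EnS index: uniformly at random in the first half of the
    budget, then uniformly from the top-5 EnSs ranked by the
    reward/selection ratio of Eq. (22) in the second half."""
    def __init__(self, n_strategies=10, rng=None):
        self.rng = rng or np.random.default_rng()
        self.num_selected = np.zeros(n_strategies)   # EnSNum
        self.num_rewards = np.zeros(n_strategies)    # Reward

    def select(self, fes, max_fes):
        if fes < max_fes / 2:                        # global phase
            k = self.rng.integers(len(self.num_selected))
        else:                                        # local phase: sub-pool
            ratio = self.num_rewards / np.maximum(self.num_selected, 1)
            sub_pool = np.argsort(ratio)[-5:]        # top 5 by Eq. (22)
            k = self.rng.choice(sub_pool)
        self.num_selected[k] += 1
        return k

    def reward(self, k):
        self.num_rewards[k] += 1   # extra credit every 20 generations
```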

3.6 Weighted Model

To assess the importance of each feature, a weighted model is proposed to calculate feature weights during evolution. The weighted model has two main processes: the first records and assesses the importance of each feature at each generation, and the second searches for the solution feature subset.

Figure 4: The recording procedure of the weighted model. During the evolutionary stage, the new features of an individual are recorded in a matrix CF when the individual updates successfully. After an EnS finishes, all individuals are ranked by their performance, and all features of the top 20% of individuals are recorded in a matrix AF.

The first process consists of two steps, each with a different way of assessing important features. The first step works in the evolutionary stage: it stores, in CF, the newly selected features of individuals that improved during evolution. The second step, in the ranking stage, directly records in AF all the features of the best 20% of individuals at the end of an EnS. This two-step process can be formulated as follows:

$CF_j \leftarrow CF_j + 1 \quad \text{for each feature } j \text{ newly selected in a successfully updated individual}$    (23)
$AF_j \leftarrow AF_j + 1 \quad \text{for each feature } j \text{ selected in a top-20\% individual}$    (24)

The aim is to rank the importance of the features and then reduce the search space to find a good solution efficiently and effectively. A more vivid view of the first weighting process is depicted in Fig. 4.

In the second process, solutions are searched for every twenty generations using the results of the first process. Features are first selected according to their rank in the corresponding weight matrices (the two weightings of process 1), and each candidate feature subset is then evaluated sequentially at each stage. After that, the best resulting individual is compared with the worst individual in the parent population; if it performs better, it replaces that individual. The procedure of the weighted model is given in Algorithm 3.
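A sketch of the two recording matrices and the rank-then-evaluate search follows. This is a simplified reading of the description above; the exact subset-enumeration schedule is not fully specified here, so the incremental growth below is an assumption:

```python
import numpy as np

def record_weights(CF, AF, updated_masks, pop_masks, pop_fitness):
    """Process 1: accumulate feature counts (Eqs. 23-24)."""
    for mask in updated_masks:            # newly improved individuals
        CF += mask
    top = np.argsort(pop_fitness)[-max(1, len(pop_fitness) // 5):]
    for i in top:                         # top 20% of the sub-population
        AF += pop_masks[i]
    return CF, AF

def weighted_search(CF, AF, evaluate, max_subsets=50):
    """Process 2: grow subsets feature-by-feature in weight order and
    keep the best one found."""
    order = np.argsort(CF + AF)[::-1]     # most important features first
    best_mask, best_fit = None, -np.inf
    mask = np.zeros_like(CF)
    for j in order[:max_subsets]:
        mask = mask.copy()
        mask[j] = 1
        fit = evaluate(mask)              # e.g. KNN accuracy on the subset
        if fit > best_fit:
            best_mask, best_fit = mask, fit
    return best_mask, best_fit
```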

3.7 Time Complexity Analysis

In this section, we analyze the time complexity of SaWDE method. In the initialization phase, SaWDE costs , where , is the number of individuals in the population and

, is the dimension of the dataset. After that, as each dataset is classified via KNN, it costs

, where , is the number of samples and , is the parameter used to determine the number of nearest neighbors. On evolution, SaWDE costs in the transformation from real numbers to 0 and 1 before function evaluation, and costs for function evaluation in each generation. In addition, the self-mechanism costs, . What’s more, the weighted model costs in the first step and in the second step of the first process at each generation, where , is the number of changed individuals and , is the number of updated features. In the second process, the weighted model costs, for searching the potential solutions and costs, for function evaluations.

Input: selected features of each individual, updated features of each individual, sub-population (P_i), CF = zeros(1, D), AF = zeros(1, D).
Output: A new population P.
while FES < MaxFES do
          for i = 1 to 5 do
                   Evolve sub-population P_i;
                   for each newly updated feature k of a successfully improved individual do
                            CF(k) = CF(k) + 1;
                   Rank the individuals of P_i by fitness;
                   for each of the top 20% individuals do
                            for each selected feature j of that individual do
                                     AF(j) = AF(j) + 1;
          if mod(generations, 20) == 0 then
                   Rank the features by CF and by AF;
                   for i = 1 to D do
                            Evaluate the subset of the top-i features ranked by CF;
                   for i = D + 1 to 2D do
                            Evaluate the subset of the top-(i − D) features ranked by AF;
                   if the best evaluated subset outperforms the worst individual in P then
                            Replace that individual with the best subset;
          Update FES;
Algorithm 3 Pseudo Code of the Weighted Model.

4 Experimental Design

Our experimental strategy is first to construct a good strategy pool for SaWDE by selecting the five best performers from the eight candidate CMSs. Every three of the five selected CMSs are then combined into an EnS to form a robust strategy for evolution. After that, the performance of SaWDE is compared with that of the eight CMSs.

In the second experiment, we test which population size ($N$) is best for the main experiment. Different population sizes affect evaluation performance differently: a smaller population is not conducive to global search, while a larger population increases the computational cost. Therefore, a suitable parent population size is useful preparation for evolution. In this experiment, we set $N$ to 50, 100, 200, 300 or 500 on all twelve datasets.

The main experiment tests the performance of the SaWDE algorithm on the twelve datasets and compares it with the other algorithms on both the training and test datasets.

Finally, without loss of generality, twelve higher-dimensional datasets are used to evaluate the SaWDE model.

4.1 Datasets

We use twelve datasets from the University of California Irvine (UCI) Machine Learning Repository [xue2019self]. Each dataset contains information about instances, features and classes; the details are shown in Table 1. Each dataset was randomly divided into a training set and a test set at a ratio of 7 to 3, respectively.
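For reference, such a 7:3 random split can be reproduced as follows (a generic sketch with scikit-learn; the stand-in data and the random seed are illustrative, since the paper's seeds are not reported):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 649)            # stand-in data matrix
y = np.random.randint(0, 10, 1000)       # stand-in class labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)  # 7:3 random split, as in the paper
```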

To further validate the performance of SaWDE, twelve higher-dimensional datasets [de2008clustering] are employed, which can be downloaded from "https://schlieplab.org/Static/Supplements/CompCancer/datasets.htm". The details of these datasets are also summarized in Table 1; they cover a wide range of cancer types, mostly above 1000 dimensions, with a maximum dimensionality of 4553.

Dataset I Instances Features Classes Dataset II Instances Features Classes
grammaticalfacialexpression01 1,062 301 2 Alizadeh-2000-v1 42 1095 2
SemeionHandwrittenDigit 675 256 10 Alizadeh-2000-v2 62 2093 3
isolet5 1,040 617 26 Armstrong-2002-v1 72 1081 2
MultipleFeaturesDigit 1000 649 10 Bittner-2000 38 2201 2
HAPTDataSet 1200 561 12 Dyrskjot-2003 40 1203 3
har 900 561 6 Garber-2001 66 4553 4
UJIIndoorLoc 900 522 3 Liang-2005 37 1411 3
MadelonValid 600 500 2 Nutt-2003-v2 28 1070 2
OpticalRecognitionofHandwritten 1000 64 10 Pomeroy-2002-v1 34 857 2
ConnectionistBenchData 208 60 2 Pomeroy-2002-v2 42 1379 5
wdbc 596 30 2 Shipp-2002-v1 77 798 2
LungCancer 32 56 3 West-2001 49 1198 2
Table 1: Datasets used to evaluate the performance of SaWDE: the twelve UCI datasets (Dataset I) and the twelve higher-dimensional cancer datasets (Dataset II). Each dataset contains information about instances, features and classes.

4.2 Parameter Settings

We use MaxFES (the maximum number of function evaluations) as the stopping criterion; the same MaxFES is used in all experiments. Meanwhile, SaWDE ends the evolution early when 100% accuracy is achieved in training and the feature subset size is less than half of the original size. The parameter settings of the comparison algorithms are summarized in Table 2. For our proposed SaWDE, we set the population size to 100, and the eight initial F values are set to 0.5, 1, 0.6, 0.9, 0.5, 0.9, 0.6 and 1, with corresponding initial CR values of 0.1, 0.2, 0.9, 0.8, 0.9, 0.1, 0.8 and 0.2. KNN is used to classify all datasets during the evolution process, with the number of nearest neighbors set to 3 [xue2015survey]; we use 3-fold cross-validation in KNN [xue2019self].
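The fitness evaluation implied by these settings can be sketched as follows. This is a hedged reading: the paper's exact fitness formula (for example, any subset-size penalty term) is not restated here:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X_train, y_train):
    """3-fold cross-validated 3-NN accuracy on the selected feature subset."""
    if mask.sum() == 0:
        return 0.0                       # an empty subset is worthless
    knn = KNeighborsClassifier(n_neighbors=3)
    scores = cross_val_score(knn, X_train[:, mask.astype(bool)],
                             y_train, cv=3)
    return scores.mean()
```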

Algorithms Parameter values
LRS21 l = 2; r = 1
LRS32 l = 3; r = 2
DE F = 0.5; CR = 0.1; MaxFES =
SaDE Initial CR = 0.5; ; LP = 10; MaxFES =
GA CR = 0.7; MR = 0.1; SR = 0.5; MaxFES =
Original PSO C1 = C2 = 1.49618; w = 0.7298; MaxFES =
Standard PSO C1 = C2 = 1.49618; ; MaxFES =
SaPSO = 0.2; LP = 10; ps = 100; = 0.6 ; Ub = 1; Lb = 0;
Ubv = 0.5; Lbv = -0.5; MaxFES =
SaWDE Initial CR = [0.1, 0.2, 0.9, 0.8, 0.9, 0.1, 0.8, 0.2];
Initial F = [0.5, 1, 0.6, 0.9, 0.5, 0.9, 0.6, 1]; MaxFES =
Table 2: Parameter values of the comparison algorithms [xue2019self]

5 Results

5.1 Computational Results and Comparisons

In this part, we compare the experimental results of our SaWDE algorithm with those of six non-EC and six EC algorithms. The results are summarized in four parts: the subset size, the classification accuracy and the convergence curves on the training datasets, and the classification accuracy on the test datasets. As the experimental results show, our SaWDE algorithm outperforms all of the compared EC and non-EC algorithms.

5.1.1 Results of Subset Size on Training Datasets

Table 3 shows the subset sizes obtained by the SaWDE algorithm, the six non-EC algorithms, and the six EC algorithms. The upper half of Table 3 shows the results of the non-EC algorithms and SaWDE, and the lower half the results of the EC algorithms and SaWDE. In the table, the first column identifies the dataset, the subset sizes are given as mean values over all trials, and '%' denotes the reduction rate between the subset size and the original size. The best results for each dataset are highlighted in bold text.

In the non-EC comparison, although our SaWDE algorithm does not always achieve the best percentage reduction, it reaches a reduction rate of 90% or close to it on over half of the datasets. At the same time, SaWDE performs nearly as well as the best subset-size results on many datasets.

In the EC comparison, SaWDE's subset sizes are much better than those of the other EC algorithms on all but the eleventh dataset. The advantage of SaWDE in subset size becomes more apparent as the feature dimension of the dataset grows. For example, compared with SaPSO, the best of the other EC algorithms, SaWDE reduces the subset size by a further 10 to 20 percentage points on most datasets. In addition, as the table shows, the standard deviation of our algorithm's subset sizes is relatively low, indicating its stability.

Datasets SFS SBS LRS21 LRS32 SFFS SBFS SaWDE
Mean Std % Mean Std % Mean Std % Mean Std % Mean Std % Mean Std % Mean Std %
grammaticalfacialexpression01 4.9 1.4 98.3 298.3 1.4 0.8 4.8 1.1 98.4 5.6 1.3 98.1 5.7 1.6 98.1 298.4 0.9 0.8 7.5 1.0 97.5
SemeionHandwrittenDigit 14.8 5.0 94.2 254.4 0.9 0.6 14.0 4.0 94.5 14.1 3.8 94.4 11.8 2.4 95.3 253.9 0.3 0.8 79.5 13.0 68.9
isolet5 16.0 3.9 97.4 615.2 0.9 0.2 15.9 3.0 97.4 16.3 3.5 97.3 13.6 2.0 97.7 614.8 0.3 0.3 75 2.0 87.8
MultipleFeaturesDigit 10.2 2.5 98.4 646.1 0.9 0.4 10.1 2.7 98.4 9.5 2.4 98.5 10.8 1.6 98.3 646.5 0.6 0.3 108.3 55.5 83.3
HAPTDataSet 8.5 2.0 98.4 559.2 1.1 0.3 8.3 1.6 98.5 8.7 2.1 98.4 9.6 2.3 98.2 558.6 0.6 0.4 51.5 23.0 90.8
har 7.7 2.3 98.6 559.3 1.0 0.3 7.1 1.6 98.7 7.1 2.1 98.7 9.0 2.3 98.3 558.9 0.3 0.3 60.8 2.5 89.2
UJIIndoorLoc 1.8 0.4 99.6 521.0 0.0 0.1 2.0 0.0 99.6 2.0 0.0 99.6 2.0 0.4 99.6 519.8 0.6 0.4 1.0 0.0 99.8
MadelonValid 3.1 2.4 99.3 498.1 0.9 0.3 5.7 2.7 98.8 3.7 2.5 99.2 6.5 2.0 98.6 497.8 0.4 0.4 17.8 7.2 96.5
OpticalRecognitionofHandwritten 15.3 2.5 76 62.4 0.8 2.4 15.4 3.1 75.8 14.9 2.6 76.6 9.7 2.5 84.7 61.7 0.6 3.5 25.8 5.5 59.8
ConnectionistBenchData 4.1 1.6 93.1 58.1 0.9 3 4.5 1.5 92.3 3.9 1.2 93.3 4.4 1.6 92.6 57.7 0.5 3.7 8.8 4.9 85.4
wdbc 3.4 1.0 88.5 28.4 0.7 5.2 3.5 1.0 88.2 3.3 1.1 89 3.1 1.0 89.5 27.8 0.3 7.1 11.3 4.2 62.5
LungCancer 3.1 1.4 94.4 54.1 0.8 3.2 2.9 1.0 94.7 3.3 1.4 94.1 3.5 1.7 93.7 53.8 0.4 3.8 5.8 3.2 89.7
Datasets GA Original PSO Standard PSO SaPSO DE SaDE SaWDE
Mean Std % Mean Std % Mean Std % Mean Std % Mean Std % Mean Std % Mean Std %
grammaticalfacialexpression01 124.6 13.8 58.6 123.7 10.1 58.8 121.0 8.0 59.8 77.1 10.1 74.3 121.7 8.4 59.5 114.0 7.7 62.1 7.5 1.0 97.5
SemeionHandwrittenDigit 188.1 22.5 26.5 150.0 13.9 41.3 165.3 15.8 35.4 107.5 4.8 58 110.9 17.2 56.6 108.3 7.2 57.6 79.5 13.0 68.9
isolet5 339.3 51.6 45 262.3 20.8 57.4 286.9 36.8 53.4 159.3 8.1 74.1 244.5 11.4 60.3 233.6 9.9 62.1 75 2.0 87.8
MultipleFeaturesDigit 333.8 48.5 48.5 294.3 24.6 54.6 299.9 24.0 53.7 147.4 14.9 77.2 252.7 12.7 61 249.3 11.2 61.5 108.3 55.5 83.3
HAPTDataSet 324.9 54.8 42 273.5 24.4 51.2 286.8 31.2 48.8 122.9 15.6 78 227.0 11.6 59.5 220.2 11.7 60.7 51.5 23.0 90.8
har 342.1 49.9 39 289.3 29.2 48.4 308.9 34.7 44.9 123.0 15.8 78 224.1 11.6 60 216.5 10.8 61.3 60.8 2.5 89.2
UJIIndoorLoc 225.4 45.0 56.8 85.4 8.8 83.6 23.0 6.7 95.5 3.4 4.1 99.3 171.2 2.9 67.1 155.7 4.5 70.1 1.0 0.0 99.8
MadelonValid 290.6 54.1 41.8 228.9 20.6 54.2 255.9 38.4 48.8 111.8 10.8 77.6 201.0 11.6 59.7 185.8 10.5 62.8 17.8 7.2 96.5
OpticalRecognitionofHandwritten 42.2 4.3 34 39.9 2.7 37.6 42.8 3.3 33.1 32.8 1.7 48.6 36.4 7.2 43 34.5 3.1 46 25.8 5.5 59.8
ConnectionistBenchData 23.1 3.7 61.4 21.5 3.7 64.1 23.4 3.0 60.8 18.2 2.5 69.6 21.4 2.7 64.2 20.2 3.2 66.2 8.8 4.9 85.4
wdbc 13.4 3.6 55.1 13.4 2.3 55.3 15.0 3.4 50 9.9 2.3 67 11.3 2.1 62.1 11.3 1.8 62.3 11.3 4.2 62.5
LungCancer 17.0 4.5 69.5 18.3 3.6 67.2 17.2 3.9 69.2 11.5 2.5 79.4 19.6 3.8 65 18.6 4.0 66.7 5.8 3.2 89.7
Table 3: Subset sizes of six other non-EC, six EC methods and SaWDE on training datasets

5.1.2 Results of Classification Accuracy on Training Datasets

The classification accuracy results of the non-EC and EC algorithms on the training datasets are shown in Table 4. In the upper half of Table 4, we observe that the SaWDE algorithm outperforms the other six non-EC methods in classification accuracy across all twelve training datasets. In fact, SaWDE improves the classification accuracy on more than half of the datasets by at least 5%, and on some by 15%, compared with the best results of the six non-EC algorithms. The lower half of Table 4 shows that SaWDE also performs well against the six EC algorithms, producing the best classification results on ten of the twelve datasets. These results show that our SaWDE algorithm is effective compared with existing EC and non-EC methods, providing a valid strategy for solving feature selection problems.

Datasets SFS SBS LRS21 LRS32 SFFS SBFS SaWDE
Mean Std Mean Std Mean Std Mean Std Mean Std Mean Std Mean Std
grammaticalfacialexpression01 0.8929 0.0082 0.8147 0.0113 0.8924 0.0078 0.8965 0.0077 0.9002 0.0051 0.8170 0.0108 0.9250 0.0020
SemeionHandwrittenDigit 0.6832 0.0492 0.8471 0.0041 0.6772 0.0511 0.6822 0.0453 0.6396 0.0350 0.8475 0.0034 0.9094 0.0011
isolet5 0.7481 0.0392 0.7615 0.0041 0.7460 0.0277 0.7475 0.0361 0.6949 0.0363 0.7599 0.0033 0.9100 0.0041
MultipleFeaturesDigit 0.8449 0.0208 0.9493 0.0017 0.8453 0.0209 0.8373 0.0193 0.9194 0.0150 0.9478 0.0030 0.9843 0.0000
HAPTDataSet 0.9244 0.0111 0.9192 0.0022 0.9261 0.0081 0.9266 0.0120 0.9282 0.0074 0.9185 0.0020 0.9702 0.0000
har 0.9200 0.0110 0.9134 0.0031 0.9196 0.0091 0.9170 0.0114 0.9268 0.0119 0.9137 0.0031 0.9774 0.0040
UJIIndoorLoc 1.0000 0.0000 1.0000 0.0000 1.0000 0.0000 1.0000 0.0000 1.0000 0.0000 1.0000 0.0000 1.0000 0.0000
MadelonValid 0.6595 0.0858 0.7155 0.0072 0.7392 0.1062 0.6803 0.0804 0.8160 0.0399 0.7145 0.0060 0.9042 0.0068
OpticalRecognitionofHandwritten 0.9482 0.0108 0.9675 0.0021 0.9490 0.0116 0.9470 0.0121 0.8185 0.0737 0.9676 0.0028 0.9857 0.0020
ConnectionistBenchData 0.8476 0.0230 0.8095 0.0093 0.8492 0.0190 0.8444 0.0181 0.8540 0.0243 0.8080 0.0111 0.9643 0.0035
wdbc 0.9419 0.0052 0.9474 0.0023 0.9452 0.0046 0.9449 0.0051 0.9442 0.0060 0.9466 0.0024 0.9705 0.0037
LungCancer 0.7300 0.0657 0.6440 0.0223 0.7167 0.0460 0.7419 0.0514 0.7470 0.0481 0.6315 0.0253 0.8929 0.0275
Datasets GA Original PSO Standard PSO SaPSO DE SaDE SaWDE
Mean Std Mean Std Mean Std Mean Std Mean Std Mean Std Mean Std
grammaticalfacialexpression01 0.9078 0.0035 0.9107 0.0022 0.9127 0.0040 0.9159 0.0026 0.8988 0.0024 0.9029 0.0023 0.9250 0.0020
SemeionHandwrittenDigit 0.8702 0.0057 0.8753 0.0041 0.8809 0.0060 0.9031 0.0049 0.8464 0.0034 0.8510 0.0044 0.9094 0.0011
isolet5 0.8205 0.0076 0.8446 0.0089 0.8495 0.0138 0.8899 0.0025 0.7941 0.0035 0.8059 0.0047 0.9100 0.0041
MultipleFeaturesDigit 0.9688 0.0029 0.9705 0.0034 0.9734 0.0041 0.9831 0.0015 0.9585 0.0017 0.9610 0.0017 0.9843 0.0000
HAPTDataSet 0.9467 0.0032 0.9510 0.0027 0.9529 0.0024 0.9693 0.0043 0.9351 0.0020 0.9391 0.0020 0.9702 0.0000
har 0.9379 0.0056 0.9429 0.0056 0.9476 0.0064 0.9745 0.0048 0.9262 0.0028 0.9269 0.0025 0.9774 0.0040
UJIIndoorLoc 1.0000 0.0000 1.0000 0.0000 1.0000 0.0000 1.0000 0.0000 1.0000 0.0000 1.0000 0.0000 1.0000 0.0000
MadelonValid 0.7837 0.0116 0.8157 0.0110 0.8221 0.0169 0.8722 0.0050 0.7608 0.0082 0.7724 0.0046 0.9042 0.0068
OpticalRecognitionofHandwritten 0.9860 0.0018 0.9838 0.0018 0.9849 0.0019 0.9863 0.0009 0.9693 0.0023 0.9748 0.0016 0.9857 0.0020
ConnectionistBenchData 0.9447 0.0108 0.9467 0.0114 0.9461 0.0146 0.9537 0.0086 0.9024 0.0055 0.9124 0.0062 0.9643 0.0035
wdbc 0.9657 0.0035 0.9670 0.0037 0.9665 0.0042 0.9682 0.0033 0.9644 0.0013 0.9658 0.0014 0.9705 0.0037
LungCancer 0.9355 0.0289 0.9431 0.0232 0.9484 0.0303 0.9310 0.0192 0.8718 0.0245 0.8855 0.0211 0.8929 0.0275
Table 4: Classification accuracy of six other non-EC, six EC methods and SaWDE on training dataSets
Figure 5: Comparative convergence curves of different algorithms in terms of training classification accuracy and feature subset size on different datasets.

5.1.3 Results of Convergence curves on Training Datasets

The convergence of SaWDE and the other evolutionary methods in terms of training classification accuracy and feature subset size is shown in Fig. 5. It can be observed that all algorithms converge similarly in the early stages; even so, SaWDE outperforms GA, original PSO, standard PSO, DE, SaDE, and SaPSO in classification accuracy and feature subset size on most datasets from the early convergence stage onward. Moreover, SaWDE retains excellent search ability in the late evolutionary stage. Based on the convergence curves, we conclude that SaWDE is robust compared with the other evolutionary models.

5.1.4 Results of Classification Accuracy on Test Datasets

Table 5 shows the classification accuracy of the non-EC and EC algorithms on the test datasets. Test-set accuracy is an important measure of an algorithm's robustness. As can be seen from Table 5, the SaWDE algorithm performs well on the test datasets: compared with the non-EC algorithms, SaWDE performs better on all datasets except two, and compared with the EC algorithms, only one of its accuracy results is inferior. This shows that SaWDE is more robust and performs better than the other algorithms in most cases.

Datasets SFS SBS LRS21 LRS32 SFFS SBFS SaWDE
Mean Std Mean Std Mean Std Mean Std Mean Std Mean Std Mean Std
grammaticalfacialexpression01 0.8447 0.0133 0.7301 0.0313 0.8471 0.0150 0.8446 0.0169 0.8165 0.0211 0.7319 0.0333 0.8967 0.0059
SemeionHandwrittenDigit 0.5478 0.0663 0.7994 0.0167 0.5495 0.0613 0.5542 0.0587 0.4355 0.0548 0.7991 0.0188 0.7963 0.0062
isolet5 0.6568 0.0498 0.7144 0.0162 0.6562 0.0482 0.6586 0.0505 0.4620 0.0578 0.7139 0.0167 0.8414 0.0153
MultipleFeaturesDigit 0.7666 0.0261 0.9308 0.0090 0.7598 0.0242 0.7609 0.0250 0.8444 0.0374 0.9308 0.0087 0.9615 0.0039
HAPTDataSet 0.8520 0.0220 0.8479 0.0130 0.8531 0.0185 0.8516 0.0219 0.7764 0.0253 0.8449 0.0127 0.9120 0.0112
har 0.8702 0.0204 0.8545 0.0144 0.8779 0.0165 0.8720 0.0267 0.8017 0.0397 0.8554 0.0162 0.9271 0.0098
UJIIndoorLoc 0.9981 0.0032 0.9996 0.0011 0.9974 0.0038 0.9989 0.0021 0.9979 0.0029 0.9993 0.0022 0.9996 0.0001
MadelonValid 0.5400 0.0728 0.6173 0.0291 0.6029 0.1033 0.5797 0.0779 0.6482 0.0673 0.6141 0.0325 0.7841 0.0074
OpticalRecognitionofHandwritten 0.8833 0.0216 0.9317 0.0084 0.8843 0.0244 0.8831 0.0287 0.7248 0.0789 0.9319 0.0111 0.9493 0.0037
ConnectionistBenchData 0.6634 0.0498 0.6513 0.0624 0.6608 0.0581 0.6456 0.0391 0.6649 0.0737 0.6494 0.0595 0.7728 0.0149
wdbc 0.8622 0.0248 0.8887 0.0148 0.8774 0.0293 0.8719 0.0300 0.8881 0.0513 0.8881 0.0141 0.9144 0.0039
LungCancer 0.4870 0.2037 0.6204 0.0990 0.4417 0.1694 0.4139 0.1892 0.4269 0.1658 0.5926 0.1085 0.5455 0.0134
Datasets GA Original PSO Standard PSO SaPSO DE SaDE SaWDE
Mean Std Mean Std Mean Std Mean Std Mean Std Mean Std Mean Std
grammaticalfacialexpression01 0.8513 0.0151 0.8449 0.0131 0.8507 0.0126 0.8448 0.0157 0.8506 0.0133 0.8470 0.0117 0.8967 0.0059
SemeionHandwrittenDigit 0.7872 0.0201 0.7821 0.0272 0.7892 0.0254 0.7795 0.0216 0.7690 0.0207 0.7685 0.0183 0.7963 0.0062
isolet5 0.7487 0.0258 0.7875 0.0232 0.7777 0.0202 0.8106 0.0167 0.7375 0.0214 0.7534 0.0223 0.8414 0.0153
MultipleFeaturesDigit 0.9356 0.0091 0.9327 0.0105 0.9348 0.0125 0.9388 0.0092 0.9300 0.0106 0.9334 0.0104 0.9615 0.0039
HAPTDataSet 0.8530 0.0159 0.8585 0.0146 0.8599 0.0152 0.8814 0.0121 0.8496 0.0199 0.8504 0.0172 0.9120 0.0112
har 0.8679 0.0146 0.8727 0.0156 0.8713 0.0174 0.9140 0.0212 0.8630 0.0212 0.8673 0.0243 0.9271 0.0098
UJIIndoorLoc 0.9962 0.0053 0.9901 0.0064 0.9952 0.0043 0.9986 0.0026 0.9908 0.0082 0.9919 0.0103 0.9996 0.0001
MadelonValid 0.6421 0.0291 0.6598 0.0352 0.6503 0.0402 0.6863 0.0351 0.6400 0.0357 0.6387 0.0346 0.7841 0.0074
OpticalRecognitionofHandwritten 0.9312 0.0097 0.9278 0.0098 0.9324 0.0078 0.9364 0.0107 0.9187 0.0135 0.9247 0.0136 0.9493 0.0037
ConnectionistBenchData 0.6908 0.0556 0.6972 0.0462 0.6804 0.0423 0.7005 0.0511 0.6779 0.0442 0.6841 0.0553 0.7728 0.0149
wdbc 0.9125 0.0232 0.9162 0.0191 0.9092 0.0213 0.9099 0.0135 0.9021 0.0200 0.9021 0.0127 0.9144 0.0039
LungCancer 0.4750 0.1602 0.4565 0.1211 0.4565 0.1348 0.5102 0.1442 0.4852 0.1523 0.4481 0.1446 0.5455 0.0134
Table 5: Classification accuracy of six other non-EC, six EC methods and SaWDE on test dataSets

5.2 Performance of Different Mutation Strategies

Fig. 6 shows the mean classification accuracy of all eight CMSs and our SaWDE algorithm on the training and test datasets. The heatmap presents the classification accuracy on the training datasets, with red indicating superior results. The diagram presents the classification accuracy on the test datasets; the datasets are denoted by specific colors, and the best result in each dataset is distinctly colored.

Figure 6: The classification accuracy on training datasets (a) and test datasets (b). The numbers 1 to 12 in both (a) and (b) denote the datasets grammaticalfacialexpression01, SemeionHandwrittenDigit, isolet5, MultipleFeaturesDigit, HAPTDataSet, har, UJIIndoorLoc, MadelonValid, OpticalRecognitionofHandwritten, ConnectionistBenchData, wdbc and LungCancer, respectively.
Figure 7: The subset size of each single CMS and SaWDE on training datasets. The numbers on the ring denote the corresponding algorithms' subset sizes. The best result of all of the algorithms is shown beneath the crown in the center of each ring. Sizes are rounded.

Fig. 7 shows the mean subset size of all eight CMSs and our SaWDE algorithm on the training datasets. In Fig. 7, each color on a ring represents a different algorithm, and the number in each color denotes the subset size. The best result of all of the algorithms is shown beneath the crown in the center of each ring. As can be seen from Fig. 7, SaWDE generally delivers the smallest solution sizes.

For each dataset, we count the CMSs with the worst classification accuracy (the worst three per dataset; for OpticalRecognitionofHandwritten the worst value is shared by four CMSs, so four are counted there, and UJIIndoorLoc is excluded because all strategies achieve identical accuracy on it). Across the datasets, CMS$_1$ to CMS$_8$ are counted 3, 1, 8, 3, 5, 3, 5 and 6 times, respectively. Therefore, the CMS with 8 counts and the CMS with 6 counts are excluded first. Two strategies are counted 5 times; of these, the one that is less accurate than the other on most datasets is also eliminated. Based on the above, the remaining five CMSs are selected for further strategy-pool building.
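This tallying step is easy to express programmatically; a sketch, under the assumption that per-dataset accuracies are available in a matrix (names are illustrative only):

```python
import numpy as np

def tally_worst(acc, k=3):
    """acc: (datasets x strategies) accuracy matrix.
    Count how often each strategy falls among the k worst per dataset,
    skipping datasets where all strategies tie."""
    counts = np.zeros(acc.shape[1], dtype=int)
    for row in acc:
        if np.ptp(row) == 0:          # all strategies identical: skip
            continue
        worst = np.argsort(row)[:k]   # ties beyond k are ignored here
        counts[worst] += 1
    return counts  # strategies with the highest counts are excluded first
```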

Figure 8: Classification accuracy convergence curves on data 1 to data 12. Each shape denotes an algorithm, and the abscissa shows the number of function evaluations, measured in FES.

Fig. 8 depicts the convergence curves of classification accuracy for CMS$_1$ to CMS$_8$ and SaWDE on all datasets. The results show that different strategies converge to different degrees at different stages. Given that the search capability of each strategy differs, combining strategies into a strategy pool can efficiently extend the search capability of the algorithm and prevent it from falling into local optima. Comparing SaWDE with the eight separate CMSs across all datasets, SaWDE obtains the best results on most of them.

5.3 Influence of Different Population Sizes on Datasets

Fig. 9 shows the subset sizes on the training datasets for different population sizes. Figs. 10 and 11 show the classification accuracy of different population sizes on the training datasets and on the test datasets, respectively. The aim of this experiment is to find the most suitable population size for the best evolutionary results.

In this section, we compare our SaWDE algorithm with SaPSO in optimizing the feature subsets. As shown in Fig. 9, when $N$ is 50 or 500, all subset sizes of SaWDE are smaller than those of SaPSO. When $N$ is 100, 200, or 300, SaWDE yields one, three, and two slightly larger solutions than SaPSO, respectively. Different population sizes thus have some effect on subset size, but overall SaWDE has an advantage over SaPSO in terms of subset size.

In Fig. 10, when $N$ is 50, SaWDE has only five results that are better than or equal to SaPSO's, which indicates that a population size of 50 loses a certain degree of diversity. However, when $N$ is 100, 200, 300, or 500, SaWDE has only two results inferior to SaPSO's, indicating that a population size of at least 100 basically satisfies the diversity requirements of the population.

Figure 9: The influence of different N on subset size. Each color denotes a setting; a larger area indicates a larger subset size.

In Fig. 11, when $N$ is 50 or 300, SaWDE's classification accuracy on the test datasets is worse than SaPSO's in one case, and when $N$ is 200 or 500, in two cases. However, when $N$ is 100, SaWDE's classification accuracy is superior to SaPSO's on all test datasets. Combining these three aspects and considering issues such as computational complexity, we finally set the population size to 100.

Figure 10: The influence of different N on the classification accuracy of the training datasets. The accuracy values of each case are connected with broken lines to show the differences more clearly, and a crown marks the best case.
Figure 11: The influence of different N on the classification accuracy of the test datasets. Columns represent accuracy; a higher column indicates a better result.

5.4 Effect of the Number of Different sub-populations on Datasets

The number of sub-populations can affect the degree of evolution: too few sub-populations may not achieve the desired results, while too many increase the complexity of the algorithm. To address this, we conducted an experiment on the effect of the number of sub-populations. In our study, since all sub-populations are of equal size, we set the number of sub-populations to 2, 4, 5, or 10 so that the total population is divisible by the number of sub-populations.

Datasets 2-Subs 4-Subs 5-Subs 10-Subs
grammaticalfacialexpression01 0.9233 0.9246 0.9250 0.9179
SemeionHandwrittenDigit 0.9026 0.9195 0.9094 0.9048
isolet5 0.9080 0.8970 0.9100 0.8860
MultipleFeaturesDigit 0.9843 0.9843 0.9843 0.9829
HAPTDataSet 0.9679 0.9714 0.9702 0.9679
har 0.9746 0.9730 0.9774 0.9810
UJIIndoorLoc 1.0000 1.0000 1.0000 1.0000
MadelonValid 0.8952 0.9592 0.9042 0.9048
OpticalRecognitionofHandwritten 0.9800 0.9648 0.9857 0.9800
ConnectionistBenchData 0.9521 0.9167 0.9643 0.9386
wdbc 0.9674 0.8976 0.9705 0.9599
LungCancer 0.9167 0.9843 0.8929 0.8750
Average 0.9477 0.9494 0.9495 0.9415
Table 6: Classification accuracy of different numbers of sub-populations on training datasets
Datasets 2-Subs 4-Subs 5-Subs 10-Subs
grammaticalfacialexpression01 0.8934 0.8903 0.8967 0.8903
SemeionHandwrittenDigit 0.7980 0.8128 0.7963 0.8079
isolet5 0.8205 0.8333 0.8414 0.8013
MultipleFeaturesDigit 0.9633 0.9700 0.9615 0.9600
HAPTDataSet 0.9278 0.8917 0.9120 0.8972
har 0.9407 0.9556 0.9271 0.9185
UJIIndoorLoc 1.0000 1.0000 0.9996 1.0000
MadelonValid 0.7444 0.7419 0.7841 0.7778
OpticalRecognitionofHandwritten 0.9233 0.9298 0.9493 0.9300
ConnectionistBenchData 0.7742 0.3000 0.7728 0.7903
wdbc 0.9006 0.8056 0.9144 0.9415
LungCancer 0.2000 0.9233 0.5455 0.6000
Average 0.8239 0.8379 0.8584 0.8596
Table 7: Classification accuracy of different numbers of sub-populations on test datasets

The experimental results are summarized in Tables 6 and 7. The best results for each dataset are highlighted in bold text, and the average performance on all datasets is given in the last row of each table. In these tables, 2-Subs, 4-Subs, 5-Subs and 10-Subs denote the proposed algorithm with 2, 4, 5, and 10 sub-populations, respectively.

Table 6 summarizes the classification accuracy of different numbers of sub-populations on the training datasets. The algorithm with 5 sub-populations achieves the best result on 7 of the 12 training sets and provides the best average performance. We then used the trained models with different numbers of sub-populations to predict the test data; Table 7 tabulates their performance. The 10-Subs setting provides the best average performance on the test data, with 5-Subs giving similar results, but 10-Subs does not produce better results in the training phase. Following standard practice in hyperparameter optimization in machine learning [bergstra2011algorithms], parameters should be chosen based on the training phase rather than the test phase. Therefore, the number of sub-populations was set to 5 to allow a fair comparison.

5.5 Extend Performance Comparisons for Higher-dimensional Datasets

To illustrate the robustness and generalization ability of SaWDE, twelve higher-dimensional datasets are used for further experiments in this section. The experimental results are tabulated in Tables 8-10. The best results for each dataset are highlighted in bold text, and the average performance of each algorithm on all datasets is given in the last row of each table.

Datasets GA Original PSO Standard PSO SaPSO DE SaDE SaWDE
Alizadeh-2000-v1 0.6667 0.8333 0.5833 0.5000 0.8333 0.7500 0.8333
Alizadeh-2000-v2 0.8333 1.0000 0.9444 0.8889 0.9444 1.0000 1.0000
Armstrong-2002-v1 0.8571 0.8571 0.9048 0.9107 0.9048 0.9524 0.9048
Bittner-2000 0.7273 0.6364 0.7273 0.7500 0.8182 0.7273 0.7273
Dyrskjot-2003 0.6667 0.7500 0.7500 0.5833 0.6667 0.7500 0.8333
Garber-2001 0.6842 0.8421 0.8421 0.8095 0.7895 0.7895 0.8421
Liang-2005 0.8182 0.8182 0.8182 0.7222 0.8182 0.8182 0.8182
Nutt-2003-v2 0.6250 0.6250 0.6250 0.5000 0.5000 0.8750 0.8750
Pomeroy-2002-v1 0.9000 0.7000 0.7000 0.6111 0.6000 0.9000 0.8000
Pomeroy-2002-v2 0.7500 0.5833 0.6667 0.5222 0.5833 0.7500 0.6667
Shipp-2002-v1 0.8261 0.7826 0.7391 0.7798 0.9130 0.8261 0.8696
West-2001 0.4286 0.7143 0.6429 0.8500 0.7143 0.7857 0.8571
Average 0.7319 0.7619 0.7453 0.7023 0.7571 0.8270 0.8356
Table 8: Test accuracy of six EC methods and SaWDE on twelve higher-dimensional datasets
Datasets GA Original PSO Standard PSO SaPSO DE SaDE SaWDE
Alizadeh-2000-v1 564.0 505.0 557.0 9.0 563.0 548.0 217.0
Alizadeh-2000-v2 1059.0 1036.0 1016.0 23.0 1037.0 1044.0 833.0
Armstrong-2002-v1 547.0 526.0 550.0 11.0 537.0 507.0 411.0
Bittner-2000 1160.0 1088.0 1114.0 29.0 1081.0 1091.0 110.0
Dyrskjot-2003 598.0 597.0 583.0 6.0 595.0 593.0 401.0
Garber-2001 2253.0 2240.0 2256.0 1173.0 2956.0 2278.0 3.0
Liang-2005 673.0 708.0 710.0 4.0 712.0 698.0 1.0
Nutt-2003-v2 526.0 526.0 548.0 5.0 558.0 542.0 191.0
Pomeroy-2002-v1 447.0 440.0 437.0 6.0 437.0 420.0 130.0
Pomeroy-2002-v2 680.0 660.0 666.0 282.0 804.0 657.0 5.0
Shipp-2002-v1 385.0 376.0 387.0 103.0 387.0 407.0 243.0
West-2001 589.0 598.0 609.0 9.0 616.0 583.0 245.0
Average 790.1 775.0 786.1 138.3 856.9 780.7 232.5
Table 9: Subset size of six EC methods and SaWDE on twelve higher-dimensional datasets
Datasets GA Original PSO Standard PSO SaPSO DE SaDE SaWDE
Alizadeh-2000-v1 9624.1 10088.5 10373.7 9472.2 10089.5 9799.1 67.8
Alizadeh-2000-v2 10227.5 11332.6 11703.9 9460.7 10966.4 10542.3 15.9
Armstrong-2002-v1 9677.1 10604.9 10276.6 10496.1 10140.4 9939.2 6.8
Bittner-2000 9933.9 11154.0 11264.5 11153.6 10505.9 10084.0 16.2
Dyrskjot-2003 9528.7 10482.4 10105.1 10553.1 9840.0 9685.7 38.0
Garber-2001 11238.3 14121.2 14264.2 17146.3 13098.3 12787.3 68932.2
Liang-2005 11544.8 12094.2 11767.9 11005.1 15487.9 9746.0 19040.3
Nutt-2003-v2 9467.6 9978.9 9914.4 10594.2 9752.0 9526.1 90.0
Pomeroy-2002-v1 9456.1 9853.7 9732.0 10152.7 9641.3 9504.8 44.3
Pomeroy-2002-v2 9695.7 10643.3 10351.3 9665.6 9910.3 9862.0 16729.5
Shipp-2002-v1 9709.8 10286.8 9979.0 10539.6 9975.2 9901.0 163.0
West-2001 9668.9 10298.7 10311.2 10666.4 10010.9 9798.0 4049.5
Average 9981.0 10911.6 10837.0 10908.8 10784.9 10098.0 9099.5