In statistical pattern recognition, each pattern represents a real-world object described by a set of features (synonymously called dimensions hereafter). The more features are used, the better the description of the object. However, not all features are important for the decision-making problem at hand. For instance, a student can be described by features such as height, weight, regularity, father's name, family income, etc. For a problem such as selecting a student for a basketball team, features like height and weight are highly relevant, whereas features like father's name and family income are irrelevant. The features regularity and family income, in turn, are highly relevant for deciding whether the student should be awarded a fellowship. Hence, selecting the best features for the problem at hand is important for quality decision making.
Nowadays, with the development of high-throughput technologies, it is possible to measure hundreds of feature values for each object, which has resulted in large volumes of high-dimensional data for analysis. In hyperspectral image analysis, advanced hyperspectral instruments can measure hundreds of feature values (each corresponding to one spectral band) for each object on earth. In contemporary scientific applications, such large volumes of high-dimensional data are common and pose a very challenging problem for analysis.
In pattern classification, irrelevant (and sometimes redundant or noisy) features affect the classification accuracy. It has been shown that, in the presence of a large number of features, learning models tend to overfit the training data, which leads to poor generalizability of the trained model and poses a great challenge for pattern classification and prediction problems. Thus, feature selection is considered a pre-processing step that eliminates irrelevant and redundant features, which is critical for decision making in real-world applications [3, 4, 5, 6, 7].
Feature selection algorithms have been widely used in many application areas such as genomic analysis, text classification, information retrieval, intrusion detection, and bioinformatics. A comprehensive survey of feature selection methods has also been published. Empirical studies on feature selection algorithms for real-world problems are presented in [14, 15, 16, 17].
Feature selection is an optimization problem which aims to determine an optimal subset of the features in the input data that maximizes the classification or prediction accuracy. Performing an exhaustive search over all possible candidate feature subsets, based on some evaluation criterion, is computationally infeasible, and the problem becomes NP-hard as the number of features increases. Hence, other search strategies such as complete, sequential, and random search have been explored. However, most of these approaches suffer from the local minima problem. Therefore, Evolutionary Computation (EC) techniques, which aim at a global or near-global optimum, such as Genetic Algorithms (GAs), Genetic Programming (GP), and Particle Swarm Optimization (PSO), have been used in many feature selection problems. PSO is reported to be simpler to understand and easier to implement than GP and GAs, to handle optimization problems with multiple local optima reasonably well, to require fewer parameters, and to converge more quickly. However, the efficiency of PSO depends on various input parameters that must be tuned properly. A more detailed study of PSO and its improvements has also been presented.
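To illustrate why exhaustive search is infeasible, the short sketch below counts the non-empty feature subsets that would have to be evaluated. The function names are hypothetical and chosen only for this illustration.

```python
from itertools import combinations


def count_feature_subsets(n_features: int) -> int:
    """Number of non-empty feature subsets an exhaustive search must score."""
    return 2 ** n_features - 1


def enumerate_subsets(features):
    """Yield every non-empty subset; only practical for a handful of features."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)
```

With 13 features there are 8191 candidate subsets; with 60 features the count exceeds 10^18, which is why heuristic search strategies are preferred.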
In standard PSO, the swarm size is an important parameter: a very small swarm tends to get trapped in local minima, while a large swarm slows down the algorithm. To address this issue, in the present work we vary the population size of the particles in standard PSO based on the data set at hand. A new objective function has been developed which integrates the accuracy of the classifier with the modified F-Score. Finally, we propose a new PSO search method for feature selection using a tunable swarm size configuration. The efficiency of the proposed method is compared with other popular contemporary feature selection methods.
This paper is organized as follows. Section II presents a brief review of existing methods for feature selection. Section III briefly outlines the standard PSO methodology and presents the motivation for the tunable swarm size configuration. Section IV outlines the Alternating Decision Tree classifier, which is used along with standard PSO for feature subset selection. The proposed Tunable Particle Swarm Size Optimization (TPSO) algorithm is presented in Section V. The experiments and results are presented in Section VI. Conclusions and discussion are presented in Section VII.
II Feature Selection Methods
In the literature, feature selection methods are broadly classified into three categories, viz., filter, wrapper, and embedded methods. Filter methods select features based on the given data, irrespective of the classifier. In the wrapper model, feature selection is driven by the feedback of a predefined learning model. Wrapper-based methods find better feature subsets with high accuracy because they take the feedback of the learning model into account, but they require expensive computation. However, filters have been reported to have better generalization capabilities than wrapper-based methods.
In algorithms with embedded models, such as C4.5 and least angle regression (LARS), the variable or feature selection process is incorporated as part of the training process, and the relevance of the selected features is analyzed by the objective function of the learning model under consideration. Both filter and embedded approaches may return either a subset of selected features or weights that represent the relevance or importance of all features.
Some feature selection methods compute the ranks of all features using a ranking criterion; such methods are simple and computationally efficient. These rank-based methods are more robust against overfitting, introducing more bias but less variance [4, 26]. Further, statistical approaches such as T-statistics, F-statistics, and the Chi-square test have been explored extensively in the literature. A few other feature selection approaches are based on concepts of information theory such as information gain, mutual information [4, 30], and entropy-based measures [31, 32, 33]. More recently, evolutionary computing techniques such as Genetic Algorithms (GAs) and Particle Swarm Optimization (PSO) have become popular for feature selection. Bing Xue et al. explored the performance of PSO and its various improvements. PSO is widely used for feature selection on high-dimensional data sets. A good survey on novel population topologies for improving the performance of population-based optimization algorithms for single-objective optimization, multiobjective optimization, and other classes of optimization problems has also been presented.
This paper presents an improvement over the standard PSO, a wrapper-based approach, to improve the classification accuracy with a reduced number of features.
III Standard Particle Swarm Optimization
Particle Swarm Optimization imitates the movement of a flock of birds, where each bird uses its own intelligence to find the best direction to move in while the flock as a whole reaches the destination. In standard PSO, each candidate solution is considered a particle in the search space. Each particle has a fitness value, computed using the fitness function to be optimized, and a velocity, which determines its movement. During movement, each particle updates its position based on its previous position and velocity, while also considering the positions of neighbouring particles.
The standard PSO starts with a randomly initialized population (swarm) of particles. The $i$-th particle is identified as a point $X_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$ in the $D$-dimensional search space. $pbest$ represents the best positions of the particles, that is, the positions with the best fitness values found so far, given by $P_i = (p_{i1}, p_{i2}, \ldots, p_{iD})$. $gbest$ is identified by the index $g$ of the particle that has the best fitness value in the swarm. The velocity of the $i$-th particle is represented by $V_i = (v_{i1}, v_{i2}, \ldots, v_{iD})$.
The iterative approach starts with initial random solutions (the particles in the initial swarm). In each iteration, for each particle, the velocity and the position are updated using the following equations:
$$v_{id} = w\, v_{id} + c_1\, rand()\, (p_{id} - x_{id}) + c_2\, Rand()\, (p_{gd} - x_{id}) \tag{1}$$
$$x_{id} = x_{id} + v_{id} \tag{2}$$
where $d = 1, 2, \ldots, D$, and the inertia weight $w$ is a positive linear function of time which is updated according to the generation iteration. The constants $c_1$ and $c_2$ represent the acceleration terms that pull the particles towards $pbest$ and $gbest$. The functions $rand()$ and $Rand()$ are random number generators which produce values uniformly distributed in $[0, 1]$. The terms $p_{id}$ and $p_{gd}$ represent $pbest$ and $gbest$ in the $d$-th dimension, respectively. The velocities of the particles are bounded by a maximum limit $V_{max}$. If $V_{max}$ is too small, the search may end up in a local optimum, and if $V_{max}$ is too large, the particles may fly past good solutions.
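The update rules described above can be sketched as follows. This is a minimal illustration for a continuous minimization problem, not the feature selection variant used in this work. The function name `pso_minimize`, the parameter names and default values (`w_start`, `w_end`, `c1`, `c2`, `v_max`, `bounds`), and the toy `sphere` objective are all assumptions made for the example.

```python
import random


def pso_minimize(objective, dim, swarm_size=20, iterations=100,
                 w_start=0.9, w_end=0.4, c1=2.0, c2=2.0,
                 v_max=1.0, bounds=(-5.0, 5.0), seed=0):
    """Minimize `objective` with a basic global-best PSO (illustrative defaults)."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Random initial positions and velocities for every particle.
    x = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(swarm_size)]
    v = [[rng.uniform(-v_max, v_max) for _ in range(dim)] for _ in range(swarm_size)]
    pbest = [p[:] for p in x]
    pbest_val = [objective(p) for p in x]
    g = min(range(swarm_size), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for t in range(iterations):
        # Inertia weight as a linear function of the iteration counter.
        w = w_start + (w_end - w_start) * t / max(1, iterations - 1)
        for i in range(swarm_size):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel = (w * v[i][d]
                       + c1 * r1 * (pbest[i][d] - x[i][d])
                       + c2 * r2 * (gbest[d] - x[i][d]))
                # Clamp the velocity to the maximum limit.
                v[i][d] = max(-v_max, min(v_max, vel))
                x[i][d] += v[i][d]
            val = objective(x[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = x[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = x[i][:], val
    return gbest, gbest_val


def sphere(p):
    """Toy objective with its minimum at the origin."""
    return sum(c * c for c in p)
```

Calling `pso_minimize(sphere, dim=3)` returns the best position found and its objective value. The `swarm_size` argument is the parameter whose choice motivates the present work.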
The swarm size is a critical parameter in the standard PSO algorithm: too few particles cause the algorithm to get stuck in local optima, while too many particles slow it down. This is the key factor that motivated the present research work.
In this paper, we propose a new particle swarm optimization search for feature subset selection using tunable swarm size configuration, which is explained in Section V.
IV Alternating Decision Trees (ADT)
Alternating Decision Trees (ADT) are often considered a generalization of conventional decision trees. An ADT generates decision rules based on majority voting, taking all simple rules into account. It consists of decision nodes and prediction nodes. A prediction node contains a numeric value with a positive or negative sign, and a decision node specifies a condition. The decision nodes are splitting nodes, whereas the prediction nodes are either root or leaf nodes. An instance is classified by traversing the tree from the root and following all paths for which the decision nodes are true. A positive sum of all traversed prediction nodes implies membership of one class, and a negative sum implies membership of the other class. Empirical studies have shown that, under some favourable conditions, ADTs are more robust than the conventional decision trees C4.5 and J48.
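The classification rule of an ADT, summing the prediction values along every path that the instance satisfies and taking the sign, can be sketched as follows. The tree below, including its feature names, thresholds, and prediction values, is entirely hypothetical and serves only to make the traversal concrete. Learning the tree is not shown.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PredictionNode:
    value: float
    splitters: List["DecisionNode"] = field(default_factory=list)


@dataclass
class DecisionNode:
    condition: Callable[[Dict[str, float]], bool]
    if_true: PredictionNode
    if_false: PredictionNode


def adt_score(node: PredictionNode, instance: Dict[str, float]) -> float:
    """Sum the prediction values on all paths consistent with the instance."""
    total = node.value
    for splitter in node.splitters:
        branch = splitter.if_true if splitter.condition(instance) else splitter.if_false
        total += adt_score(branch, instance)
    return total


def adt_classify(root: PredictionNode, instance: Dict[str, float]) -> int:
    """Return +1 for a positive sum and -1 otherwise."""
    return 1 if adt_score(root, instance) > 0 else -1


# Hypothetical tree: two decision nodes under the root, one nested below.
example_tree = PredictionNode(0.2, [
    DecisionNode(
        lambda r: r["height"] > 180,
        PredictionNode(0.7),
        PredictionNode(-0.5, [
            DecisionNode(lambda r: r["weight"] > 80,
                         PredictionNode(0.4), PredictionNode(-0.3)),
        ]),
    ),
    DecisionNode(lambda r: r["weight"] > 70,
                 PredictionNode(0.1), PredictionNode(-0.2)),
])
```

For the instance `{"height": 185, "weight": 75}` the traversed prediction values are 0.2, 0.7 and 0.1, so the sum is positive and the instance is assigned to the positive class.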
V Tunable Particle Swarm Size Optimization Algorithm (TPSO)
In this section we present our new algorithm, called the Tunable Particle Swarm Size Optimization Algorithm (TPSO), which finds the best initial swarm size for the given data to overcome the local minima problem.
The data set is split into training and testing folds using a stratified k-fold cross validation procedure. For each training data set, we first initialize the swarm size and then select the features using standard PSO and the Alternating Decision Tree (ADT). We then compute the test accuracy using the feature subset identified in the previous step and the ADT classifier. A new feature score, which measures how well a categorical or numeric feature discriminates between the two sets of values defined by the decision attribute, is then computed following the procedure in Section V-A.
V-A New Feature Discrimination Score
Consider a data set in which each row is an instance and each column is a feature, with the last feature being the decision class. If the decision class is binary, the instances are divided into positive and negative instances. In earlier work, a feature discriminatory score using the mean of the attribute values is computed. In our approach we employ the median instead of the mean, as it is a better representative of the central tendency of data sets with skewed distributions. The feature score of a feature (Equation 3) is defined in terms of the following quantities: the median of the values of the attribute corresponding to the positive decision class, the median of the attribute values corresponding to the negative decision class, the median of all the values of the attribute, the median value of the feature of the positive instance, and the median value of the feature of the negative instance.
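The following is a minimal sketch of a median-based discrimination score. It assumes that the score follows the form of the classical F-score, with medians substituted for means: the squared deviations of the two class medians from the overall median, divided by the within-class spread around each class median. This form, the function name `median_f_score`, and the toy data are assumptions for illustration and may differ from the exact definition used here.

```python
from statistics import median


def median_f_score(values, labels, positive=1):
    """F-score-like discrimination of one feature, using medians (assumed form)."""
    pos = [v for v, y in zip(values, labels) if y == positive]
    neg = [v for v, y in zip(values, labels) if y != positive]
    if len(pos) < 2 or len(neg) < 2:
        raise ValueError("need at least two instances in each class")
    m_all, m_pos, m_neg = median(values), median(pos), median(neg)
    # Separation of the class medians from the overall median.
    between = (m_pos - m_all) ** 2 + (m_neg - m_all) ** 2
    # Spread of each class around its own median.
    within = (sum((v - m_pos) ** 2 for v in pos) / (len(pos) - 1)
              + sum((v - m_neg) ** 2 for v in neg) / (len(neg) - 1))
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


# Toy data: the first feature separates the classes, the second does not.
toy_labels = [0, 0, 0, 1, 1, 1]
separating_feature = [1, 2, 3, 7, 8, 9]
noisy_feature = [1, 8, 3, 7, 2, 9]
```

Under this assumed form, a feature whose class medians lie far from the overall median relative to the within-class spread receives a high score.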
V-B Fitness function
We develop a new fitness function to evaluate the effectiveness of the feature subsets. It combines three quantities: the accuracy obtained using ADT, the sum of the discriminatory scores of the features in the reduced feature subset, and the sum of the discriminatory scores of all the features in the data set.
In the proposed algorithm we perform a stratified ten-fold cross validation, splitting the data set into ten pairs of training and test data sets. For each training data set we extract the feature subset using standard PSO and the ADT classifier. We then compute the feature discrimination score using Equation 3. We compare the new feature score with the previous scores, and the algorithm increases the particle population size in the standard PSO in steps of one until a local maximum is found. To locate the local maximum, we obtain the first and second derivatives of the feature discriminatory score with respect to the number of particles in the iteration. The loop is terminated when the conditions for a local maximum are met.
To evaluate the performance of the proposed Tunable Particle Swarm Size Optimization Algorithm, we first obtain the feature subset corresponding to the number of particles found using the above procedure. We then train an ADT using the features identified in the previous step and then compute test accuracies on the test data set of the corresponding fold. The procedure is repeated for all the ten folds and the average accuracy is computed. The above procedure is given as Algorithm 1.
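The swarm size search described above can be sketched as follows, with the derivatives approximated by finite differences over consecutive swarm sizes. The callable `score_for_swarm_size` is a hypothetical stand-in for one run of PSO with ADT followed by the score computation, and `max_size` is an assumed safety bound that is not part of the description above.

```python
def tune_swarm_size(score_for_swarm_size, initial_size=50, max_size=200):
    """Grow the swarm by one particle at a time until the score peaks."""
    sizes = [initial_size, initial_size + 1]
    scores = [score_for_swarm_size(z) for z in sizes]
    while sizes[-1] < max_size:
        z = sizes[-1] + 1
        sizes.append(z)
        scores.append(score_for_swarm_size(z))
        # Finite-difference estimates around the middle of the last three sizes.
        d_prev = scores[-2] - scores[-3]
        d_next = scores[-1] - scores[-2]
        second = d_next - d_prev
        # First derivative changes sign and second derivative is negative.
        if d_prev >= 0 and d_next <= 0 and second < 0:
            return sizes[-2], scores[-2]
    # No peak found within the bound: fall back to the best size seen.
    best = max(range(len(sizes)), key=scores.__getitem__)
    return sizes[best], scores[best]
```

For example, with a toy score that peaks at 55 particles, `tune_swarm_size(lambda z: -(z - 55) ** 2)` stops as soon as the score at 56 particles drops and returns the size 55.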
VI Experiments and Results
We have conducted experiments on benchmark data sets obtained from the University of California, Irvine (UCI) data repository (StatLog project), and the Keel and Bangor data repositories (https://www.bangor.ac.uk/). The performance of the proposed algorithm TPSO is compared with standard PSO and GA with the alternating decision tree classifier. The description of the data sets is given in Table I.
|Name|Source|Number of attributes|Number of classes|Number of instances|
|Heart (HRT)|UCI Statlog|13|2|270|
|distress syndrome (RDS)|Bangor|17|2|85|
In the present methodology we employ a stratified $k$-fold cross validation ($k = 10$) procedure. The folds are selected so that the mean response value is approximately equal in all the folds. In the case of a dichotomous classification, this means that each fold has roughly the same proportions of the two types of class labels. Table II provides details of the default parameters of GA and PSO used in the current experimental study.
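A stratified split of this kind can be sketched as follows. The function name `stratified_k_fold` and the round-robin assignment are illustrative choices, not necessarily the procedure used in the experiments.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs with class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin over the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for test_fold in folds:
        test_set = set(test_fold)
        train = [i for i in range(len(labels)) if i not in test_set]
        yield train, sorted(test_fold)
```

With 60 instances of one class and 40 of the other, each of the ten test folds receives six instances of the first class and four of the second.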
|PSO|Initial number of particles Z = 50|
VI-A Computational Complexity and Scalability
The computational complexity is a measure of the performance of the algorithm. For each data set, we select only the subset of records in which missing values are present. The distances are computed for all attributes excluding the decision attribute, so the cost of the distance computation grows with the number of attributes and the number of selected records, as do the cost of selecting the nearest records and the cost of computing the frequency of occurrences for nominal attributes and the average for numeric attributes. In the proposed method, the basic cost is the time complexity of wrapper-based feature subset identification using standard PSO and ADT. With cross validation this cost is incurred once per fold, and with changes in the swarm size it is incurred once more for every swarm size tried. Therefore, for a given data set with $k$-fold cross validation, the time complexity of TPSO is asymptotically linear.
A plot of the time taken for processing by the proposed algorithm (TPSO) against the varying sizes of the data sets is shown in Fig. 1. We also fitted a linear regression to our results to obtain the relation between the time taken (T) and the data size (D).
The linear trend between the time taken and the database size indicates the numerical scalability of TPSO in terms of asymptotic linearity.
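A linear fit of this kind, together with its coefficient of determination, can be computed by ordinary least squares as sketched below. The function name `fit_line` is hypothetical, and the numbers in the usage note are synthetic placeholders rather than the measurements behind Fig. 1.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit y = intercept + slope * x, with R squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Share of the variance in y explained by the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared
```

For instance, `fit_line([100, 200, 300, 400], [205, 405, 605, 805])` recovers a slope of 2 and an intercept of 5 with an R squared of 1, because the synthetic points lie exactly on a line.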
VI-B Performance Comparison on Benchmark Data sets
Firstly, we compared the accuracy of the proposed TPSO method with accuracies of ADT classifier without employing any feature selection. The results are tabulated in Table III.
|Data set|TPSO|ADT without feature selection|
|AUS|87.54 ± 3.82|84.35 ± 3.33|
|CON|87.89 ± 10.31|84.78 ± 12.81|
|GER|74.4 ± 3.75|72.8 ± 4.59|
|HRT|83.33 ± 7.46|78.89 ± 10.48|
|ION|94.86 ± 3.24|90.02 ± 4.92|
|LAR|86.86 ± 6.91|82.23 ± 9.37|
|RDS|90.56 ± 11.37|90.56 ± 11.37|
|SNR|86.95 ± 8.00|83.05 ± 9.23|
|WBD|97.19 ± 3.12|94.38 ± 3.86|
|WEA|87.42 ± 4.38|81.44 ± 6.35|
Later, we considered GA and standard PSO algorithms for feature subset selection and ADT classifier as wrapper for feature evaluation. A comparison of the performances of TPSO with GA+ADT and Standard PSO+ADT methods on benchmark data sets is shown in Table IV.
|AUS|87.54 ± 3.82|85.51 ± 3.86|84.49 ± 2.65|
|CON|87.89 ± 10.31|77.78 ± 14.56|80.78 ± 9.99|
|GER|74.4 ± 3.75|69.8 ± 4.9|70.4 ± 5.64|
|HRT|83.33 ± 7.46|83.7 ± 7.24|76.67 ± 9.08|
|ION|94.86 ± 3.24|89.75 ± 6.74|92.01 ± 3.77|
|LAR|86.86 ± 6.91|79.81 ± 7.62|79.83 ± 9.19|
|RDS|90.56 ± 11.37|90.56 ± 8.23|89.31 ± 10.9|
|SNR|86.95 ± 8.00|81.19 ± 11.33|81.12 ± 10.10|
|WBD|97.19 ± 3.12|95.26 ± 3.7|95.25 ± 3.42|
|WEA|87.42 ± 4.38|84.78 ± 3.81|82.76 ± 5.88|
The mean and the standard deviation of the number of features selected over the folds of the cross validation procedure are shown in Table V.
|AUS|14|3.8 ± 2.20|5.3 ± 1.77|2.8 ± 2.78|
|CON|27|13.2 ± 5.12|9.5 ± 2.46|11.5 ± 4.28|
|GER|20|10 ± 2.31|7.8 ± 2.04|8.5 ± 2.68|
|HRT|13|9.3 ± 1.49|4.3 ± 2.11|9.3 ± 1.16|
|ION|34|13.8 ± 4.57|12.6 ± 2.63|11.2 ± 3.88|
|LAR|16|4.9 ± 2.18|5.4 ± 2.12|6.8 ± 2.1|
|RDS|17|10.0 ± 2.26|6.4 ± 1.35|9.3 ± 1.49|
|SNR|60|33.8 ± 10.21|28.6 ± 4.65|35.0 ± 9.93|
|WBD|30|13.4 ± 3.06|9.4 ± 2.84|13.0 ± 2.36|
|WEA|17|9.6 ± 2.17|8.3 ± 1.25|9.7 ± 1.89|
From Table V it can be observed that the TPSO methodology achieves higher accuracies while using only a fraction of the original set of attributes.
To substantiate the improvement in classification accuracy obtained with the TPSO methodology, a statistical test based on the Wilcoxon method is employed, and the results are presented in Table VI.
From Table VI we infer that TPSO is superior to the standard PSO feature selection method, as indicated by its positive rank sum and significance level. The TPSO method also shows remarkable performance when compared with GA feature selection, again with a favourable positive rank sum.
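The positive and negative rank sums on which the Wilcoxon signed-rank test is based can be computed as sketched below. As an illustration only, the sketch is applied to the two columns of mean accuracies in Table III, not to the comparisons with PSO and GA reported in Table VI. The function name `signed_rank_sums` is hypothetical, and the computation of the significance level is not shown.

```python
def signed_rank_sums(first, second):
    """Return (W+, W-) for paired samples, dropping zero differences."""
    diffs = [a - b for a, b in zip(first, second) if a != b]
    ordered = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(ordered):
        # Group tied absolute differences and give them their average rank.
        j = i
        while (j + 1 < len(ordered)
               and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]])):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Mean accuracies copied from the two columns of Table III (ten data sets).
table3_first = [87.54, 87.89, 74.4, 83.33, 94.86, 86.86, 90.56, 86.95, 97.19, 87.42]
table3_second = [84.35, 84.78, 72.8, 78.89, 90.02, 82.23, 90.56, 83.05, 94.38, 81.44]
```

For these two columns one pair is tied (RDS) and is dropped, and the remaining nine differences are all positive, so the positive rank sum is 45 and the negative rank sum is 0.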
VII Conclusions and Discussion
In this paper, we have discussed the issues related to the high dimensionality of data sets and feature selection as a solution to the curse of dimensionality. Feature selection methods such as filters and wrappers have been discussed. Particle Swarm Optimization (PSO) is a population-based optimization technique which has been shown to find good feature subsets provided the necessary input parameters are properly tuned. The particle swarm size is a critical parameter in standard PSO. To address this issue, we proposed a novel tunable swarm size configuration approach that determines the population size of the particles based on the data set. The proposed algorithm is named the Tunable Particle Swarm Size Optimization Algorithm (TPSO). A new fitness function has been developed which integrates the accuracy of the classifier with the modified F-Score. Empirically, we compared the performance of our new algorithm with other state-of-the-art methods on benchmark data sets obtained from the UCI, Keel, and Bangor data repositories. The Wilcoxon statistical test confirmed that the proposed algorithm improves the classification accuracies in comparison to the other methods.
References
[1] J. M. Bioucas-Dias, A. Plaza, G. Camps-Valls, P. Scheunders, N. Nasrabadi, and J. Chanussot, “Hyperspectral remote sensing data analysis and future challenges,” IEEE Geoscience and Remote Sensing Magazine, vol. 1, no. 2, pp. 6–36, 2013.
[2] J. Hua, W. D. Tembe, and E. R. Dougherty, “Performance of feature-selection methods in the classification of high-dimension data,” Pattern Recognition, vol. 42, no. 3, pp. 409–424, 2009.
[3] H. Liu, J. Li, and L. Wong, “A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns,” Genome Informatics Series, pp. 51–60, 2002.
[4] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
[5] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for cancer classification using support vector machines,” Machine Learning, vol. 46, pp. 389–422, 2002.
[6] H. Liu, E. R. Dougherty, J. G. Dy, K. Torkkola, E. Tuv, H. Peng, C. Ding, F. Long, M. Berens, L. Parsons, Z. Zhao, L. Yu, and G. Forman, “Evolving feature selection,” IEEE Intelligent Systems, vol. 20, pp. 64–76, 2005.
[7] H. Yang, Q. Du, and G. Chen, “Particle swarm optimization-based hyperspectral dimensionality reduction for urban land cover classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 5, no. 2, pp. 544–554, 2012.
[8] I. Inza, P. Larrañaga, R. Blanco, and A. J. Cerrolaza, “Filter versus wrapper gene selection approaches in DNA microarray domains,” Artificial Intelligence in Medicine, vol. 31, no. 2, pp. 91–103, 2004.
[9] G. Forman, “An extensive empirical study of feature selection metrics for text classification,” Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[10] D. Swets and J. Weng, “Efficient content-based image retrieval using automatic feature selection,” in IEEE International Symposium on Computer Vision, 1995, pp. 85–90.
[11] W. Lee, S. J. Stolfo, and K. W. Mok, “Adaptive intrusion detection: A data mining approach,” AI Review, vol. 14, no. 6, pp. 533–567, 2000.
[12] Y. Saeys, I. Inza, and P. Larrañaga, “A review of feature selection techniques in bioinformatics,” Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[13] G. Chandrashekar and F. Sahin, “A survey on feature selection methods,” Computers & Electrical Engineering, vol. 40, no. 1, pp. 16–28, 2014.
[14] T. Li, C. Zhang, and M. Ogihara, “A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression,” Bioinformatics, vol. 20, no. 15, pp. 2429–2437, 2004.
[15] Y. Sun, C. F. Babbs, and E. J. Delp, “A comparison of feature selection methods for the detection of breast cancers in mammograms: Adaptive sequential floating search vs. genetic algorithm,” in 27th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (IEEE-EMBS 2005), 2005, pp. 6532–6535.
[16] S. Ma, “Empirical study of supervised gene screening,” BMC Bioinformatics, vol. 7, pp. 537+, 2006.
[17] C. Murie, O. Woody, A. Lee, and R. Nadon, “Comparison of small n statistical tests of differential expression applied to microarrays,” BMC Bioinformatics, vol. 10, no. 1, p. 45, 2009.
[18] M. L. Raymer, W. F. Punch, E. D. Goodman, L. A. Kuhn, and A. K. Jain, “Dimensionality reduction using genetic algorithms,” IEEE Transactions on Evolutionary Computation, vol. 4, no. 2, pp. 164–171, 2000.
[19] B. Xue, M. Zhang, and W. N. Browne, “New fitness functions in binary particle swarm optimisation for feature selection,” in 2012 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2012, pp. 1–8.
[20] Y. Shi and R. C. Eberhart, “Parameter selection in particle swarm optimization,” in EP ’98: Proceedings of the 7th International Conference on Evolutionary Programming VII. London, UK: Springer-Verlag, 1998, pp. 591–600.
[21] P. Angeline, “Evolutionary optimization versus particle swarm optimization: Philosophy and performance differences,” in Evolutionary Programming VII. Springer, 1998, pp. 601–610.
[22] B. Xue, M. Zhang, W. N. Browne, and X. Yao, “A survey on evolutionary computation approaches to feature selection,” IEEE Transactions on Evolutionary Computation, vol. 20, no. 4, pp. 606–626, 2016.
[23] F. van den Bergh and A. P. Engelbrecht, “Effects of swarm size on cooperative particle swarm optimisers,” in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), 2001, pp. 892–899.
[24] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1993.
[25] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regression,” Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.
[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2003.
[27] X. Jin, A. Xu, R. Bie, and P. Guo, “Machine learning techniques and chi-square feature selection for cancer classification using SAGE gene expression profiles,” in International Workshop on Data Mining for Biomedical Applications. Springer, 2006, pp. 106–115.
[28] S. Wang, C.-L. Liu, and L. Zheng, “Feature selection by combining Fisher criterion and principal feature analysis,” in 2007 International Conference on Machine Learning and Cybernetics, vol. 2, 2007, pp. 1149–1154.
[29] Y. Liu, “A comparative study on feature selection methods for drug discovery,” Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823–1828, 2004.
[30] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
[31] Q. Tan, M. Thomassen, K. Jochumsen, J. Zhao, K. Christensen, and T. Kruse, “Evolutionary algorithm for feature subset selection in predicting tumor outcomes using microarray data,” in Bioinformatics Research and Applications, ser. Lecture Notes in Computer Science, I. Mandoiu, R. Sunderraman, and A. Zelikovsky, Eds. Springer Berlin / Heidelberg, 2008, vol. 4983, pp. 426–433.
[32] S. Winkler, M. Affenzeller, G. Kronberger, M. Kommenda, S. Wagner, W. Jacak, and H. Stekel, “Analysis of selected evolutionary algorithms in feature selection and parameter optimization for data based tumor marker modeling,” Computer Aided Systems Theory–EUROCAST 2011, pp. 335–342, 2012.
[33] M. B. Åberg, L. Löken, and J. Wessberg, “An evolutionary approach to multivariate feature selection for fMRI pattern analysis,” in BIOSIGNALS (2), 2008, pp. 302–307.
[34] B. Tran, B. Xue, M. Zhang, and S. Nguyen, “Investigation on particle swarm optimisation for feature selection on high-dimensional data: Local search and selection bias,” Connection Science, vol. 28, no. 3, pp. 270–294, 2016.
[35] N. Lynn, M. Z. Ali, and P. N. Suganthan, “Population topologies for particle swarm optimization and differential evolution,” Swarm and Evolutionary Computation, 2017.
[36] Y. Freund and L. Mason, “The alternating decision tree learning algorithm,” in ICML, vol. 99, 1999, pp. 124–133.
[37] M. N. Kumar, “Alternating decision trees for early diagnosis of dengue fever,” arXiv preprint arXiv:1305.7331, 2013.
[38] Y. Liu, G. Wang, H. Chen, H. Dong, X. Zhu, and S. Wang, “An improved particle swarm optimization for feature selection,” Journal of Bionic Engineering, vol. 8, no. 2, pp. 191–200, 2011.
[39] A. Frank and A. Asuncion, “UCI machine learning repository,” 2010. [Online]. Available: http://archive.ics.uci.edu/ml
[40] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera, “Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework,” Journal of Multiple-Valued Logic and Soft Computing, 2010.