An optimization problem is the task of systematically choosing a set of values to maximize or minimize a given function over a given set of input data. More specifically, optimization is the task of selecting the "best available" value of a specific objective function over a specified domain, given that a variety of objective functions and domains is available. The past few years have seen an overwhelming growth in the application of single-objective and multi-objective optimization algorithms in artificial intelligence, predominantly in feature selection (FS), which is a common preprocessing task for many machine learning applications. FS is the task of efficiently selecting a subset of a larger feature set by keeping the most relevant attributes, thereby reducing the dimensionality of the feature set while retaining sufficient information to classify the data. This is important because irrelevant or redundant features often lead to poor classification performance in machine learning problems, at unnecessary computational cost.
In recent years, nature-inspired metaheuristic optimization algorithms, for example Particle Swarm Optimization (PSO), Grey Wolf Optimization (GWO), Genetic Algorithm (GA), and Bee Swarm Optimization (BSO), have been widely used to find good approximations to the optima of various complex optimization problems, though they do not guarantee finding the best solution. The task of feature selection is challenging because, for an original feature set of cardinality n, the optimal subset must be selected from an intractable number of candidates: the number of candidate feature subsets grows exponentially with the number of available features.
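The combinatorial explosion mentioned above can be made concrete with a short sketch (the helper name is ours, for illustration):

```python
def candidate_subsets(n: int) -> int:
    """Count the non-empty candidate subsets of an n-element feature set.

    Exhaustive wrapper-based feature selection would have to evaluate
    2**n - 1 subsets, which quickly becomes intractable.
    """
    return 2 ** n - 1


for n in (10, 20, 40):
    print(n, candidate_subsets(n))
```

Even at n = 40 features there are over a trillion candidate subsets, which is why metaheuristic search is used instead of enumeration.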
In the literature, different feature-selection algorithms based on nature-inspired, metaheuristic, or heuristic optimization [4, 25, 13, 31] have recently been employed for different machine learning applications. Swarm-intelligence-based optimization algorithms like Ant Colony Optimization [17, 10] and Particle Swarm Optimization [1, 32] have been modified and applied widely in recent years for feature selection. Despite the efficient performance of metaheuristic feature selection over traditional machine learning approaches, the ever-increasing amount of data makes the task difficult. Hence, research has been conducted on hybrid optimization algorithms to improve feature-selection performance [30, 18, 14, 5].
In this paper, we propose Reinforced Swarm Optimization (RSO), a novel optimization algorithm for feature selection that incorporates features of both reinforcement learning and the swarm-intelligence-based BSO algorithm. BSO is a metaheuristic optimization algorithm that mimics the foraging activities of a bee colony and has been used in various domains including cloud computing, the maximum satisfiability problem (MAX-SAT), document retrieval, parallel computing, biomedical image analysis [11, 2], and many more. Reinforcement learning (RL) is integrated into BSO to make it more adaptive and robust, powered by a suitable balance between diversification and intensification of the search space and compensating for the limitations of the local search performed by the BSO search agents.
II Proposed method
In this section, we describe the detailed working principle of natural bees and the inspiration behind the BSO algorithm in Section II-A, the BSO algorithm for optimal feature selection in Section II-B, reinforcement learning and its effects on feature selection in Section II-C, and the proposed RSO algorithm, which incorporates reinforcement learning within BSO, in Section II-D.
II-A Intuitive behaviour of natural bees
Unlike other population-based methods, the BSO algorithm imitates the social hierarchy of natural bees, namely scouts, foragers, onlookers, etc. A prospective forager is a bee agent with zero information about the surrounding environment or search space, unaware of the possible location and type of any food source or potential threat. The scouts, usually small in number, have the task of exploring the search space, gathering information about food sources, and passing it to the onlookers, who rest in the nest and process the information collected from the foragers. An onlooker follows a probabilistic approach to identify the most profitable food source based on the gathered information, selects among the numerous employed foragers advertising their findings, and can redirect the exploration trajectory towards the most profitable food source. After collecting nectar from a food source, the forager returns to the hive and enters a decision-making process:
1) The food source is abandoned if the remaining nectar becomes scarce or is completely exhausted; the employed forager then turns into an unemployed forager.
2) The search continues without additional recruiters if a sufficient amount of nectar remains in the food source.
3) A waggle dance is performed by the forager to inform its nestmates about the source, and the collection of nectar from the source continues.
II-B BSO algorithm
Bee Swarm Optimization (BSO) is a metaheuristic optimization algorithm inspired by the intelligent self-organization, adaptation, and hierarchical task-management of a natural bee colony. The BSO algorithm is an iterative search method that solves a particular instance of an optimization problem by imitating the intelligent foraging behaviour and probabilistic decision-making of natural bees in selecting and exploiting the most profitable food source. Initially, a first reference solution, known as Sref, is generated using a heuristic and serves as the reference from which other, similar solutions are determined, together forming the search region. The search region is defined by a set of solutions equidistant from Sref, and this distance is inversely proportional to a parameter named Flip, which determines the convergence of the search process. Each of these solutions is considered the starting point of a local search, and a bee agent is assigned to each of them. The best and fittest solution found is shared with the congeners through a dance table, which is further used to select the next reference solution. The reference solutions already visited are stored in a taboo list to avoid cycling. To avoid converging to a local optimum instead of the global one, a parameter MaxChances is defined carefully: it is the maximum number of chances given to an artificial bee agent to explore a search region before being assigned another one. If a better solution is found within this range, intensification is performed; otherwise, diversification is performed. The search stops after reaching MaxIter, the maximum number of iterations, or after finding the global optimum. The working principle of the BSO algorithm is explained in Algorithm 1.
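The iterative scheme described above can be sketched for binary solution vectors as follows. This is a minimal illustration under our own simplifying assumptions (a single-bit-improvement local search and a generic fitness callback), not the authors' exact implementation:

```python
import random


def bso(fitness, n_bits, n_bees=10, flip=5, max_chances=3, max_iter=50, seed=0):
    """Minimal BSO sketch: flip-based search regions around a reference
    solution, per-bee local search, a dance table for sharing results,
    and a taboo list to avoid revisiting reference solutions."""
    rng = random.Random(seed)
    s_ref = [rng.randint(0, 1) for _ in range(n_bits)]
    best, best_fit = s_ref[:], fitness(s_ref)
    taboo, chances = {tuple(s_ref)}, max_chances
    for _ in range(max_iter):
        # Search region: solutions derived from s_ref by flipping every
        # flip-th bit, starting from different offsets.
        region = []
        for k in range(n_bees):
            s = s_ref[:]
            for i in range(k % flip, n_bits, flip):
                s[i] ^= 1
            region.append(s)
        # Each bee runs a simple local search (single-bit moves) and
        # stores its best-found solution in the dance table.
        dance = []
        for s in region:
            cur, cur_fit = s[:], fitness(s)
            for i in range(n_bits):
                t = cur[:]
                t[i] ^= 1
                if fitness(t) > cur_fit:
                    cur, cur_fit = t, fitness(t)
            dance.append((cur_fit, cur))
        dance.sort(key=lambda p: p[0], reverse=True)
        fit, s_new = dance[0]
        if fit > best_fit:
            best, best_fit, chances = s_new[:], fit, max_chances
        else:
            chances -= 1  # exhaust MaxChances before diversifying
        if chances <= 0 or tuple(s_new) in taboo:
            s_new = [rng.randint(0, 1) for _ in range(n_bits)]
            chances = max_chances
        taboo.add(tuple(s_new))
        s_ref = s_new
    return best, best_fit
```

On a trivial one-max objective (`fitness=sum`), the local-search pass alone reaches the all-ones optimum, which makes the sketch easy to sanity-check.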
II-C Reinforcement Learning
Reinforcement learning (RL) is a machine-learning paradigm in which an agent interacts with an environment to maximize a cumulative reward based on the outcomes of previously performed actions; Q-learning is one of its most widely used algorithms. It has been defined as "a way of programming agents by reward and punishment without needing to specify how the task is to be achieved". Let S be the set of states and A be the set of actions that can be performed in those states. A reward r_t is received for every action a_t performed in state s_t. The algorithm tries to learn a policy mapping S to A in order to maximize the cumulative reward function R, which is defined by Equation 1:

R = \sum_{t=0}^{\infty} \gamma^{t} r_{t}    (1)

where γ is the "discount parameter", with range 0 ≤ γ ≤ 1. The search agents tend towards long-term rewards as γ tends to 1, and towards short-term or immediate rewards as γ tends to 0.
Temporal Difference (TD) learning is one of the widely used approaches in reinforcement learning, incorporating features of both the Monte Carlo (MC) method and the Markov Decision Process (MDP) formulation. Following the original work, we implement the recursive Q-learning approach, a specific TD method, to estimate the value of performing action a_t in state s_t, as given by Equation 2:

Q(s_t, a_t) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')    (2)
where s_{t+1} is the resulting state after performing action a_t in state s_t, and a' is another action. However, in this paper, we have slightly modified the equation to fit our purpose, as given by Equation 3:

Q(s_t, a_t) \leftarrow (1 - \alpha) \, Q(s_t, a_t) + \alpha \left[ r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a') \right]    (3)
where α is the learning rate, with 0 < α ≤ 1. The pseudo-code of the RL algorithm is given in Algorithm 2.
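The learning-rate-weighted Q-learning update described above can be sketched as a single tabular step. The dictionary-based table and the default parameter values are our assumptions for illustration:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step:
    Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a')).

    Q is a dict mapping (state, action) pairs to values; unseen pairs
    default to 0.0.
    """
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
    return Q[(s, a)]
```

Starting from an empty table, a reward of 1.0 yields Q(s, a) = 0.5 with α = 0.5, and subsequent updates propagate value backwards through γ.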
Table I: Summary of the datasets used (# attributes, # instances, # classes per dataset).
II-D RSO: Reinforced Swarm Optimization
In this paper, we integrate Reinforcement Learning (RL) into Bee Swarm Optimization (BSO) to improve the learning process by making search agents learn from their previous experiences. One shortcoming of the BSO algorithm is the absence of intelligence or memory in its local search process, which prevents the agents from memorizing the locations of previously found optima. This often results in the algorithm getting stuck in a local optimum instead of the global one and makes it inefficient compared to other swarm-intelligence algorithms. To address this, we propose a new algorithm that replaces the local-search procedure with Q-learning, enabling each agent to benefit from the experiences of the other search agents. In the context of FS, the inclusion or deletion of a feature in the optimal feature subset is considered the action, whereas the reward is the improvement in classification accuracy, with the reduction of the feature subset as a secondary constraint.
In the t-th iteration, let a_t be the action performed in state s_t. The reward obtained in state s_t is computed by leveraging the classification accuracy together with the number of elements in the feature subset.
The performance boost obtained by incorporating reinforcement learning into the BSO method can be justified by the fact that each search agent learns from its previous experiences along with the experiences of the other search agents. In the plain BSO algorithm, one of the search agents may get stuck at a local optimum and, considering it the global one, the other agents converge towards that point. In the proposed RSO method, as the agents learn from the experiences of the other search agents, the probability of reaching the global optimum increases significantly.
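The action/reward formulation described above can be sketched as a single feature-selection step. The paper's exact reward weighting is not reproduced here; the `penalty` term and the `evaluate` callback are assumptions for illustration:

```python
def fs_step(subset, feature, evaluate, penalty=0.01):
    """One feature-selection action in the RL formulation: toggle one
    feature in the current subset (inclusion or deletion) and compute a
    reward that trades the change in classification accuracy against the
    change in subset size.

    `evaluate` maps a frozenset of feature indices to an accuracy in [0, 1].
    """
    new_subset = set(subset) ^ {feature}  # toggle = inclusion/deletion action
    gain = evaluate(frozenset(new_subset)) - evaluate(frozenset(subset))
    reward = gain - penalty * (len(new_subset) - len(subset))
    return new_subset, reward
```

Adding an informative feature yields a positive reward net of the size penalty, while adding an uninformative one is penalized, which steers the Q-table towards small, accurate subsets.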
III Experimental Results
The experiments for this work were performed in a Python 3.1 environment on a PC with an Intel Core 7th-generation CPU and 4 GB RAM. The RSO algorithm was used to select the optimal feature subset, after which classification performance was measured using the selected subset and a KNN classifier.
III-A Dataset description
To validate the performance of the proposed RSO algorithm, we have used 25 publicly available datasets from the UCI machine learning repository (https://archive.ics.uci.edu/ml/index.php) and the Knowledge Extraction based on Evolutionary Learning (KEEL) repository (https://sci2s.ugr.es/keel/datasets.php). The datasets were selected to maintain considerable diversity in the number of feature attributes, number of instances, and number of classes. A summary of the datasets is shown in Table I.
III-B Parameter setting
Parameter tuning plays a pivotal role in the superior performance of any optimization algorithm. Hence, we have experimented with different parameters of the BSO and reinforcement learning components. In selecting the optimal set of parameters, the primary motive was to improve classification accuracy while reducing execution time; the optimal parameter setting was fixed experimentally by making a suitable compromise between these two objectives. Figure 1 shows the experimental results under different parameter settings of the RSO algorithm on one of the datasets. The integer-valued BSO parameters were varied from 1 to 10 in steps of 1, whereas the real-valued RL parameters were varied over the interval from 0 to 1 in steps of 0.1, as shown in Table II.
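A minimal sketch of such a parameter sweep is shown below; the single swept BSO parameter (`flip`), the single RL parameter (`alpha`), and the scoring callback are illustrative placeholders for the full grid used in the paper:

```python
from itertools import product


def grid_search(score):
    """Sweep an integer BSO parameter over 1..10 and a real-valued RL
    parameter over 0.0, 0.1, ..., 1.0, keeping the best-scoring setting.

    `score` is a callback returning the quality of a (flip, alpha) pair,
    e.g. validation accuracy minus an execution-time penalty.
    """
    best = None
    for flip, alpha10 in product(range(1, 11), range(0, 11)):
        alpha = alpha10 / 10.0
        s = score(flip, alpha)
        if best is None or s > best[0]:
            best = (s, flip, alpha)
    return best
```

The sweep evaluates 110 settings, which is cheap compared to running the optimizer itself.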
Table III: Classification performance (accuracy (%), precision, recall, F1 score) per dataset without any optimization algorithm, with the proposed method, and with BSO.
Table IV: Per-dataset comparison of accuracy (%), precision, recall, and F1 score across the compared optimization algorithms.
III-C Performance evaluation
The performance of the proposed method was evaluated on 25 standard datasets: the optimal feature subset was selected using RSO, followed by classification using the KNN classifier. The classification performance was evaluated using the metrics given by Equations 5-8:

\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}    (5)

\text{Precision} = \frac{TP}{TP + FP}    (6)

\text{Recall} = \frac{TP}{TP + FN}    (7)

\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}    (8)
where TP = True Positive, FP = False Positive, TN = True Negative, and FN = False Negative.
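These four metrics follow directly from the confusion-matrix counts:

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 score from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```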
III-D Comparison with existing methods
We have evaluated the performance of the proposed RSO method against different existing optimization algorithms. Table III compares the experimental results, in terms of accuracy, precision, recall, F1 score, execution time, and number of selected features, obtained by our proposed method with those obtained by the BSO algorithm. It is evident from the table that our proposed RSO method outperforms the BSO algorithm in 22 out of the 25 cases in terms of classification accuracy, while using a significantly smaller feature subset and thereby reducing the execution time.
We have also compared the obtained results with several feature selection algorithms, namely Particle Swarm Optimization (PSO), Grey Wolf Optimization (GWO), Genetic Algorithm (GA), Harris Hawk Optimization (HHO), Multi-Verse Optimization (MVO), Moth Flame Optimization (MFO), and the Whale Optimization Algorithm (WOA), as shown in Table IV. The proposed method outperformed all the compared methods in 19 out of 25 cases in terms of the fitness of the selected features, which is reflected in the classification accuracy. However, our model performed worse on one dataset, reaching a best classification accuracy of 97.22%, whereas most of the other methods produced a superior feature subset resulting in a classification accuracy of 100%. On another dataset, PSO, MVO, WOA, and GWO performed best with a classification accuracy of 76.62%, compared to 72.72% for the proposed RSO method. The MVO method performed best on a further dataset, compared to 78.67% for our method. On one dataset, all the compared methods produced a classification accuracy of 100%, whereas RSO produced 97.85%. MVO, MFO, WOA, and HHO produced results similar to BSO on one dataset with a classification accuracy of 95%, compared to 93.5% for our method. In another case, GWO performed best with a classification accuracy of 100%, compared to 98.31% for RSO. Our proposed method performed best on all other datasets, as shown in Table IV.
IV Conclusion and future work
In this paper, we propose a new hybrid wrapper-based feature selection algorithm named RSO, which integrates reinforcement learning with the metaheuristic BSO algorithm. Experimental results show that our proposed method outperforms BSO, as well as existing and popular metaheuristic optimization algorithms, on the feature selection task in terms of accuracy while selecting comparatively fewer features. In the future, we plan to extend our research by experimenting with and observing the performance of different hybrid optimization algorithms. We also plan to observe the performance of RSO on deep features, to study the impact of RL on the performance of feature selection algorithms.
-  (2018) Optimizing multi-objective pso based feature selection method using a feature elitism mechanism. Expert Systems with Applications 113, pp. 499–514. Cited by: §I.
-  (2021) Cervical cytology classification using pca & gwo enhanced deep features selection. arXiv preprint arXiv:2106.04919. Cited by: §I.
-  (1997) Selection of relevant features and examples in machine learning. Artificial intelligence 97 (1-2), pp. 245–271. Cited by: §I.
-  (2020) Benchmark for filter methods for feature selection in high-dimensional classification data. Computational Statistics & Data Analysis 143, pp. 106839. Cited by: §I.
-  (2020) Optimizing speech emotion recognition using manta-ray based feature selection. arXiv preprint arXiv:2009.08909. Cited by: §I.
-  (2018) Bees swarm optimization guided by data mining techniques for document information retrieval. Expert Systems with Applications 94, pp. 126–136. Cited by: §I.
-  (2019) Exploiting gpu parallelism in improving bees swarm optimization for mining big transactional databases. Information Sciences 496, pp. 326–342. Cited by: §I.
-  (2019) Bee swarm optimization for solving the maxsat problem using prior knowledge. Soft Computing 23 (9), pp. 3095–3112. Cited by: §I.
-  (1995) A new optimizer using particle swarm theory. In MHS’95. Proceedings of the Sixth International Symposium on Micro Machine and Human Science, pp. 39–43. Cited by: §I, §III-D.
-  (2018) A new hybrid ant colony optimization algorithm for solving the no-wait flow shop scheduling problems. Applied Soft Computing 72, pp. 166–176. Cited by: §I.
Cancer classification based on support vector machine optimized by particle swarm optimization and artificial bee colony. Molecules 22 (12), pp. 2086. Cited by: §I.
-  (2019) Harris hawks optimization: algorithm and applications. Future generation computer systems 97, pp. 849–872. Cited by: §III-D.
-  (2020) Improved binary grey wolf optimizer and its application for feature selection. Knowledge-Based Systems 195, pp. 105746. Cited by: §I.
-  (2012) A new hybrid ant colony optimization algorithm for feature selection. Expert Systems with Applications 39 (3), pp. 3747–3763. Cited by: §I.
-  (1996) Reinforcement learning: a survey. Journal of artificial intelligence research 4, pp. 237–285. Cited by: §II-C.
-  (2005) An idea based on honey bee swarm for numerical optimization. Technical report Citeseer. Cited by: §I, §I, §II-B.
-  (2008) An efficient ant colony optimization approach to attribute reduction in rough set theory. Pattern Recognition Letters 29 (9), pp. 1351–1357. Cited by: §I.
-  (2017) Hybrid whale optimization algorithm with simulated annealing for feature selection. Neurocomputing 260, pp. 302–312. Cited by: §I.
-  (2019) Energy-aware resource utilization based on particle swarm optimization and artificial bee colony algorithms in cloud computing. The Journal of Supercomputing 75 (5), pp. 2455–2496. Cited by: §I.
-  (1949) The monte carlo method. Journal of the American statistical association 44 (247), pp. 335–341. Cited by: §II-C.
-  (2016) The whale optimization algorithm. Advances in engineering software 95, pp. 51–67. Cited by: §III-D.
-  (2016) Multi-verse optimizer: a nature-inspired algorithm for global optimization. Neural Computing and Applications 27 (2), pp. 495–513. Cited by: §III-D.
-  (2014) Grey wolf optimizer. Advances in engineering software 69, pp. 46–61. Cited by: §I, §III-D.
-  (2015) Moth-flame optimization algorithm: a novel nature-inspired heuristic paradigm. Knowledge-based systems 89, pp. 228–249. Cited by: §III-D.
Optimal feature selection-based medical image classification using deep learning model in internet of medical things. IEEE Access 8, pp. 58006–58017. Cited by: §I.
-  (1995) Fault tolerant design using single and multicriteria genetic algorithm optimization. Ph.D. Thesis, Massachusetts Institute of Technology. Cited by: §I, §III-D.
-  (2000) Optimal electricity supply bidding by markov decision process. IEEE transactions on power systems 15 (2), pp. 618–624. Cited by: §II-C.
-  (2006) Bee colony optimization: principles and applications. 2006 8th Seminar on Neural Network Applications in Electrical Engineering, pp. 151–156. Cited by: §II-A.
-  (1995) Temporal difference learning and td-gammon. Communications of the ACM 38 (3), pp. 58–68. Cited by: §II-C.
-  (2019) Hybrid binary coral reefs optimization algorithm with simulated annealing for feature selection in high-dimensional biomedical datasets. Chemometrics and Intelligent Laboratory Systems 184, pp. 102–111. Cited by: §I.
-  (2020) Top-k feature selection framework using robust 0-1 integer programming. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §I.
-  (2017) A pso-based multi-objective multi-label feature selection method in classification. Scientific reports 7 (1), pp. 1–12. Cited by: §I.