1 Introduction
Feature selection is of great interest in all machine learning tasks, since it reduces the computational complexity of models, frequently improves generalization, and helps in understanding the data. In general, feature selection methods are divided into the following categories [1]:

 Filter methods use feature metrics, such as correlation or information gain, to distinguish between useful and useless features.

 Wrapper methods use the feedback of model metrics to optimize the selected feature subset. This problem can be solved exactly only by brute force, which makes it intractable in the majority of cases. Numerous heuristics have been suggested (recent research has mainly focused on swarm intelligence optimization [2, 3, 4]), but they cannot guarantee optimality.

 Embedded methods exist for certain algorithms that create a feature importance score during training. Classical examples are decision-tree-based algorithms and penalized linear models.
Models that can automatically find optimal features are the most desirable type of feature selector, since they provide both a trained model and a subset of important features simultaneously. Unfortunately, this is usually possible only for very simple models, while deep neural networks (NN), one of the most important state-of-the-art algorithms, are unable to perform feature selection during training. This paper is devoted to the development of a method that resolves this issue by augmenting the network with a stochastic variant of $L_1$ penalization, which can be interpreted as a stochastic search in the feature space.
2 $L_1$ penalization for neural networks
The most straightforward way to achieve sparsity in neural networks is to add an $L_1$ penalty. This method is widely used to achieve representation sparsity [5, 6] by penalizing neuron activations, or sparsity of convolutional kernels [7, 8], which improves the performance of convolutional models. Although $L_1$ penalization efficiently sparsifies networks, the structure of the resulting sparse representation is unpredictable and thus cannot be used for feature selection or neuron pruning. The work of Wen et al. [9] handles this issue by explicitly introducing structure, penalizing individual components of the network such as channels, layers, etc. At the same time, $L_1$ penalization has not yet been applied to feature selection in neural networks.
We suggest how the well-known sparsity constraints can be applied to the input of a neural network for the purpose of feature selection. The proposed method is highly universal and can be applied to the selection of input features, convolutional kernels, regions of importance, etc. It should not be confused with the widely used regularization of weights or activations.
3 Related works
Sparsification of neural networks is a popular research subject of significant importance, since it makes large and computationally demanding neural networks smaller and more efficient to run on mobile devices. The application of a structured penalty to the optimization of neural network architecture was suggested by Wen et al. [9] and Scardapane et al. [10]. Both approaches are deterministic.
Since the proposed method is stochastic, it shares common properties with a wide variety of stochastic regularization techniques derived from the original Dropout [11]. Energy-based dropout [12] regularizes and prunes the network by optimizing a scalar energy with a differential evolution algorithm. The work of Srinivas et al. [13] defines a family of Dropout-like techniques. One of them, Dropout++, uses stochastic neuron dropping with trainable parameters, derived through a Bayesian NN, which leads to a similar although not identical formulation of filtering units. Adaptive Dropout [14] tunes the dropping probabilities by augmenting the neural network with a binary belief network.
4 Binary stochastic filtering
The main idea of the proposed method (BSF) is the application of an $L_1$ penalty to the involvement of a variable in the training/prediction process. This is done by element-wise multiplication of the input datum $\mathbf{x}$ by a random vector $\mathbf{z}$ such that $z_i \sim \mathrm{Bernoulli}(w_i)$, where the vector $\mathbf{w}$ defines a tunable set of parameters. This is similar to the Dropout technique, which performs the same multiplication, but with a predefined constant in place of the trainable weights. The vector $\mathbf{w}$ is penalized with the $L_1$ norm, which stochastically forces the model to use only the most important features. Another way to picture it is as a stochastic investigation of the parameter space that at the same time penalizes the number of involved features.
Gradients
To make the layer weights $\mathbf{w}$ trainable, it is necessary to define two gradients for backpropagation to work, namely $\partial \tilde{\mathbf{x}} / \partial \mathbf{x}$ and $\partial \tilde{\mathbf{x}} / \partial \mathbf{w}$, where $\tilde{\mathbf{x}} = \mathbf{x} \odot \mathbf{z}$ is the output of the layer. We define the first gradient as $\partial \tilde{x}_i / \partial x_i = z_i$, which is a natural way to describe a variable being passed or dropped, similarly to Dropout. Defining $\partial \tilde{\mathbf{x}} / \partial \mathbf{w}$ is trickier because of its randomness. Instead, we can differentiate the expected value, $\partial \mathbb{E}[\tilde{x}_i] / \partial w_i = \partial (x_i w_i) / \partial w_i = x_i$, and use it as a gradient estimate. Moreover, it was empirically found that it is useful to scale the gradient by the weight value, i.e. to redefine the gradient as $\partial \tilde{x}_i / \partial w_i = x_i w_i$. This modification has a clear interpretation: the lower the weight, the lower the involvement of the feature in the training process, and thus the more slowly the weight of this feature should change. This modification stabilizes training and prevents already disabled features from being re-enabled. The behavior of the filtering layer during the inference phase is altered by setting a threshold and deterministically passing features whose weights are above the threshold, while features whose weights are below the threshold are dropped. This replacement makes the layer deterministic at inference, which stabilizes validation metrics. An implementation of the BSF layer in the TensorFlow 2 framework can be found in the repository https://github.com/Trel725/BSFilter.
Analysis
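To get some understanding of how this method works, we will investigate its behavior on the simplest possible model, linear regression. We will start with the general formula for linear regression

$$\min_{\boldsymbol{\beta}} \lVert \mathbf{y} - X \boldsymbol{\beta} \rVert_2^2,$$

where $\mathbf{y}$ is a vector of target values, $\boldsymbol{\beta}$ is a vector of model weights, and $X$ is a matrix of input data, such that each row of the matrix is a single observation vector $\mathbf{x}$. Now, our goal is to investigate how the optimization objective changes if we multiply each $\mathbf{x}$ by a random vector $\mathbf{z}$. Since our objective is now random, we will minimize its expected value, i.e.

$$\min_{\boldsymbol{\beta}, \mathbf{w}} \; \mathbb{E}_{Z} \lVert \mathbf{y} - (Z \odot X) \boldsymbol{\beta} \rVert_2^2 + \lambda \lVert \mathbf{w} \rVert_1,$$

where $Z$ is a matrix such that $Z_{ij} \sim \mathrm{Bernoulli}(w_j)$. It can be shown (the derivation of the equation below is given in the supporting information) that the optimization objective is equivalent to

$$\min_{\boldsymbol{\beta}, \mathbf{w}} \; \lVert \mathbf{y} - X (\mathbf{w} \odot \boldsymbol{\beta}) \rVert_2^2 + \lVert \Gamma (\sqrt{\mathbf{w} \odot (1 - \mathbf{w})} \odot \boldsymbol{\beta}) \rVert_2^2 + \lambda \lVert \mathbf{w} \rVert_1,$$

where $\Gamma = (\operatorname{diag}(X^\top X))^{1/2}$, i.e. its diagonal elements correspond to standard deviations of the features in $X$ (supposing they are centered), the square root is taken element-wise, and $\odot$ denotes the Hadamard product. We can see that if $w_i = p$ for all $i$, the factor $p(1 - p)$ can be taken out of the norm expression, which gives an expression identical to the one derived in [11] (when $\lambda = 0$). From this objective we can get some insights into the model behavior:

 For $0 < w_i < 1$, the $i$-th feature is effectively penalized with the $L_2$ norm, where the penalty is additionally scaled by the standard deviation of that feature. Thus, weights for strongly varying features are penalized more, which is similar to classical Dropout.

 If $w_i = 0$, which is forced by the $L_1$ penalty, or $w_i = 1$, the middle term vanishes and the weight of the $i$-th feature is not penalized.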
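The equivalence used in this analysis can be checked numerically on a tiny problem. The sketch below is a sanity check under assumed notation (targets `y`, data matrix `X`, regression weights `beta`, pass probabilities `w`, with every mask entry drawn independently as Bernoulli(`w[j]`)); it is not the derivation from the supporting information, and all names and numbers are illustrative. It enumerates every binary mask to obtain the exact expectation of the squared error and compares it with the closed form, i.e. the squared error of the mean prediction plus the variance term. The penalty on `w` is omitted because it is identical on both sides.

```python
from itertools import product


def squared_error(y, X, beta, Z):
    """Squared error of linear regression on the masked data Z * X."""
    total = 0.0
    for n, row in enumerate(X):
        pred = sum(Z[n][j] * row[j] * beta[j] for j in range(len(beta)))
        total += (y[n] - pred) ** 2
    return total


def expected_error_by_enumeration(y, X, beta, w):
    """Exact expectation over all binary masks, Z[n][j] ~ Bernoulli(w[j])."""
    n_rows, n_cols = len(X), len(beta)
    expectation = 0.0
    for bits in product((0, 1), repeat=n_rows * n_cols):
        Z = [bits[n * n_cols:(n + 1) * n_cols] for n in range(n_rows)]
        prob = 1.0
        for n in range(n_rows):
            for j in range(n_cols):
                prob *= w[j] if Z[n][j] else 1.0 - w[j]
        expectation += prob * squared_error(y, X, beta, Z)
    return expectation


def expected_error_closed_form(y, X, beta, w):
    """||y - X (w * beta)||^2 + sum_j Gamma_jj^2 w_j (1 - w_j) beta_j^2."""
    n_cols = len(beta)
    fit = 0.0
    for n, row in enumerate(X):
        mean_pred = sum(row[j] * w[j] * beta[j] for j in range(n_cols))
        fit += (y[n] - mean_pred) ** 2
    # Gamma_jj^2 is the sum of squares of column j of X.
    gamma_sq = [sum(row[j] ** 2 for row in X) for j in range(n_cols)]
    variance_term = sum(gamma_sq[j] * w[j] * (1.0 - w[j]) * beta[j] ** 2
                        for j in range(n_cols))
    return fit + variance_term


# Illustrative toy data (made up for this check).
y = [1.0, -2.0, 0.5]
X = [[1.0, 2.0], [-1.0, 0.5], [0.0, -1.5]]
beta = [0.7, -1.2]
w = [0.3, 0.9]

enumerated = expected_error_by_enumeration(y, X, beta, w)
closed_form = expected_error_closed_form(y, X, beta, w)
print(abs(enumerated - closed_form) < 1e-9)
```

Setting every entry of `w` to 0 or 1 makes the variance term vanish, which matches the second observation in the list above.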
Stochastic vs deterministic
It is not immediately clear why stochastic regularization should be preferred to deterministic regularization. $L_1$ penalization of weights is clearly sufficient to achieve sparsity in simple shallow models like Lasso regression. At the same time, deep models can effectively rescale near-zero features back in the hidden layers. Stochastic regularization is free from this issue since it has only two possible states: a feature is either passed unchanged or set to zero. Moreover, it is well known in the machine learning literature that the addition of noise to the network has positive effects on model generalization and convergence [15, 16, 17]. It was observed in experiments that stochastic models are actually more stable during training and produce features that are better separated into important and unimportant ones. An example of the model convergence curves and selected feature importances is given in Fig. 1, left.
5 Experiments
The binary stochastic filtering layer was implemented in the TensorFlow 2 framework [18] according to the definition above. A collection of datasets from the OpenML-CC18 benchmark suite [19] was used in the experiments. It contains 72 classification datasets that satisfy a number of desired properties, including balancing, a reasonable number of features and observations, moderate classification difficulty, etc. Moreover, the authors provide reference preprocessing and cross-validation splits, which facilitates replication of experiments. NN models typically require hyperparameter tuning to give fair results, thus a subset of 10 datasets was selected from OpenML-CC18 and used in further experiments. The threshold was set to 0.25 and the classification score was used as the main evaluation metric in all experiments.
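The behavior of the filtering layer defined in Section 4, with the threshold of 0.25 used here, can be sketched without any framework. The class below is an illustration only and is not the TensorFlow 2 implementation from the repository. The class name, the initial weight of 0.5, the clipping of weights to [0, 1], the strict comparison with the threshold, and the unscaled penalty gradient are all assumptions of this sketch.

```python
import random


class BinaryStochasticFilter:
    """Framework-free sketch of a binary stochastic filtering layer."""

    def __init__(self, n_features, init_weight=0.5, threshold=0.25, seed=0):
        self.w = [init_weight] * n_features  # pass probabilities in [0, 1]
        self.threshold = threshold
        self.rng = random.Random(seed)
        self.z = [1.0] * n_features          # mask of the last forward pass
        self.x = [0.0] * n_features          # input of the last forward pass

    def forward(self, x, training):
        if training:
            # Stochastic phase: z_i ~ Bernoulli(w_i).
            self.z = [1.0 if self.rng.random() < wi else 0.0 for wi in self.w]
        else:
            # Deterministic phase: pass features whose weight exceeds the
            # threshold (strict comparison is an assumption).
            self.z = [1.0 if wi > self.threshold else 0.0 for wi in self.w]
        self.x = list(x)
        return [xi * zi for xi, zi in zip(self.x, self.z)]

    def backward(self, grad_out, l1_coef):
        # Gradient w.r.t. the input is the mask itself, as in Dropout.
        grad_x = [g * zi for g, zi in zip(grad_out, self.z)]
        # Gradient estimate w.r.t. the weights: d E[x_i z_i] / d w_i = x_i,
        # scaled by w_i. The L1 term adds l1_coef because w_i >= 0.
        grad_w = [g * xi * wi + l1_coef
                  for g, xi, wi in zip(grad_out, self.x, self.w)]
        return grad_x, grad_w

    def apply_gradient(self, grad_w, lr):
        # Plain gradient step followed by clipping to [0, 1] (assumption).
        self.w = [min(1.0, max(0.0, wi - lr * g))
                  for wi, g in zip(self.w, grad_w)]


layer = BinaryStochasticFilter(n_features=3)
layer.w = [0.9, 0.1, 0.25]
print(layer.forward([2.0, 3.0, 4.0], training=False))
```

Because the weight gradient is multiplied by `w_i`, a feature whose weight has reached zero receives no data gradient, so only the penalty acts on it and it stays disabled.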
5.1 Feature selection
[Figure caption fragment: "... penalized models. Early stopping after 20 epochs without loss improvement was used (right)."]
Table 1
ID      BSF     DT      KBF     KBMI    RFE     SVC     Features (selected/total)
16      0.0995  0.1490  0.1285  0.1280  0.4835  0.2950  6/64
32      0.0014  0.0007  0.0058  0.0063  0.0019  0.0055  13/16
45      0.0169  0.0169  0.0191  0.0185  0.0031  0.0053  6/60
219     0.0371  0.0271  0.0217  0.0230  0.0218  0.0375  7/8
3481    0.0213  0.0303  0.1445  0.1468  0.0182  0.0355  56/617
9910    0.0192  0.0080  0.0056  0.0061  0.0357  0.0075  166/1776
9957    0.0057  0.0048  0.0009  0.0010  0.0114  0.0048  23/41
9977    0.0333  0.0187  0.0606  0.0607  0.0218  0.0180  7/118
14952   0.0131  0.0024  0.0116  0.0111  0.0194  0.0149  15/30
146825  0.0244  0.0304  0.1025  0.1027  n/a     0.1425  102/784
167140  0.0053  0.0050  0.0822  0.0813  0.0031  0.0057  10/180
For the main experiment, features were selected from each experimental dataset by training a penalized model. The penalization coefficient was manually tuned to achieve the maximal reduction in the number of features while keeping the metrics reasonable. Other popular methods (implemented in scikit-learn [20]) were selected for comparison; the corresponding abbreviations are given in parentheses:

 Filtering features based on mutual information (KBMI) and ANOVA F-value (KBF)

 Recursive feature elimination with SVM as a base classifier (RFE) [21]

 Embedded methods: penalized SVM (SVC) and decision tree (CART algorithm, DT)
The same number of features was selected with each of these methods, and an NN model was trained on each of the selected feature subsets. Metrics for each cross-validation split were collected, and the differences between the reference full-featured score and the feature-selected one were used as a measure of feature selection efficiency. Cross-validation splits were the same for all experiments. Results are provided in Fig. 1, which visualizes the distribution of the score differences, i.e. positive values correspond to a feature-selected score higher than the original one. It follows from the data that BSF leads to the smallest decrease in classification score. Although the difference from its closest rival (DT) is small, it is statistically significant according to the Wilcoxon test. Exact values are tabulated in Tab. 1 (RFE feature selection for dataset 146825 was intractable, thus this value is missing from the table). It is important to note that augmenting a model with a BSF layer has only a minor impact on its convergence (Fig. 1, left), thus the filtering layer can be added to any model almost without overhead.
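The efficiency measure used in this section, the per-split difference between the feature-selected and the full-featured score, can be sketched as follows. The helper name `score_differences` and all numbers are hypothetical and only illustrate the sign convention (positive means the feature-selected model scored higher); they are not results from the experiments.

```python
from statistics import mean


def score_differences(full_scores, selected_scores):
    """Per-split difference: feature-selected score minus full-featured score."""
    if len(full_scores) != len(selected_scores):
        raise ValueError("scores must come from the same cross-validation splits")
    return [sel - full for full, sel in zip(full_scores, selected_scores)]


# Hypothetical scores on five shared cross-validation splits.
full = [0.90, 0.88, 0.91, 0.89, 0.92]
selected = [0.89, 0.88, 0.92, 0.87, 0.91]

deltas = score_differences(full, selected)
print([round(d, 2) for d in deltas], round(mean(deltas), 3))
```

Pairing the scores split by split is what makes a paired test such as the Wilcoxon signed-rank test applicable to the resulting differences.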
5.2 Neuron pruning
For the second experiment, every dropout layer was replaced with a penalized BSF layer. The regularization coefficient was shared among layers, but normalized by the initial number of neurons in each layer to achieve equal penalization. Every model was trained on the same selected datasets, the BSF layers were then removed, and the neurons corresponding to low BSF values were pruned by removing the corresponding columns and/or rows from the weight matrix of each layer (Fig. 2). Differences in score for the resulting pruned models are plotted against the relative number of kept weights in Fig. 3. The same figure demonstrates how the number of weights can be decreased further at the price of a reduction in classification metrics.
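The pruning step can be illustrated on a single hidden layer. The sketch below assumes dense layers stored as nested lists with shape (inputs, outputs), hypothetical helpers `prune_hidden_layer` and `per_layer_coefficient`, and the 0.25 threshold from the experiments. A neuron whose BSF weight does not exceed the threshold loses its column in the incoming weight matrix, its bias, and its row in the outgoing matrix. It illustrates the idea and is not the code used for the experiments.

```python
def prune_hidden_layer(w_in, b_in, w_out, bsf_weights, threshold=0.25):
    """Drop hidden neurons whose BSF weight does not exceed the threshold.

    w_in:  incoming weights, shape (n_inputs, n_hidden)
    b_in:  biases of the hidden layer, length n_hidden
    w_out: outgoing weights, shape (n_hidden, n_outputs)
    """
    keep = [j for j, w in enumerate(bsf_weights) if w > threshold]
    pruned_w_in = [[row[j] for j in keep] for row in w_in]  # remove columns
    pruned_b_in = [b_in[j] for j in keep]
    pruned_w_out = [w_out[j] for j in keep]                 # remove rows
    return pruned_w_in, pruned_b_in, pruned_w_out, keep


def per_layer_coefficient(shared_coef, n_neurons):
    """Shared regularization coefficient normalized by the layer width."""
    return shared_coef / n_neurons


# Toy network: 2 inputs -> 3 hidden neurons -> 1 output (made-up numbers).
w_in = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
b_in = [0.0, 0.1, 0.2]
w_out = [[1.0], [2.0], [3.0]]
bsf_weights = [0.95, 0.02, 0.60]

new_w_in, new_b_in, new_w_out, kept = prune_hidden_layer(
    w_in, b_in, w_out, bsf_weights)
print(kept, new_w_in, new_b_in, new_w_out)
```

In this toy case the second hidden neuron is removed, so the number of weights and biases drops from 12 to 8.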
5.3 Region selection in spectra
Spectra are among the most common types of data in the natural sciences. Automated recognition of spectra is highly useful in all branches of chemistry [22, 23, 24] and in biology and medicine [25, 26, 27]. Such signals share an important property: the existence of importance regions, i.e. areas that are crucial for their interpretation. While for images the relative positions of features matter (these are usually extracted with convolutional layers), spectra are recognized based on the global positions of peaks or other features. Although this may seem like a problem for which a fully connected network is more suitable, convolutional layers are still advantageous for processing spectral information, since they learn preprocessing of the data, such as background subtraction, noise filtering, etc. Extraction of the importance regions from spectral data is exceptionally useful, since it sheds light on the processes that generate the data. Numerous approaches have been proposed to highlight the most salient regions in order to explain model decisions, including Grad-CAM [28], LIME [29] and SHAP [30]. Unfortunately, these methods, developed to explain individual predictions, frequently produce an overly complicated picture, highlighting noise and clearly useless regions. Combining individual explanations into a dataset-wise explanation is nontrivial, and its interpretation is frequently unclear.
Although this problem can be formulated as classical feature selection, that is a poor approach, since it disrupts the continuity of the spectra and breaks the convolutional preprocessing. The desired selection of importance regions can be accomplished by selecting features at the output of the convolutional part of the network, which can be done with a BSF layer that shares weights along the channel axis. For the experiment, a custom Raman spectra dataset of glycoproteins was classified with a simple convolutional classifier, and the importance regions were analyzed with Grad-CAM (adapted for 1D convolutional networks), the SHAP explainer, and BSF. The results are presented in Fig. 4. As mentioned above, the region importances detected by SHAP and Grad-CAM are cumbersome and practically useless, while BSF clearly selected the most informative regions, which have a clear chemical interpretation. This approach was successfully used in two analytical projects [31, 32].
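Sharing the BSF weights along the channel axis means that one weight is attached to each spectral position and the same binary decision is applied to every channel at that position. A minimal inference-time sketch follows, assuming a feature map stored as nested lists with shape (positions, channels), hypothetical helpers `filter_regions` and `important_regions`, and the 0.25 threshold from the experiments; the numbers are made up.

```python
def filter_regions(feature_map, position_weights, threshold=0.25):
    """Zero out whole positions of a (positions, channels) feature map.

    One BSF weight per position is shared by all channels, so a position is
    either kept for every channel or dropped for every channel.
    """
    if len(feature_map) != len(position_weights):
        raise ValueError("need exactly one weight per position")
    mask = [1.0 if w > threshold else 0.0 for w in position_weights]
    return [[value * m for value in channels]
            for channels, m in zip(feature_map, mask)]


def important_regions(position_weights, threshold=0.25):
    """Group consecutive kept positions into (start, end) index ranges."""
    regions, start = [], None
    for i, w in enumerate(position_weights):
        if w > threshold and start is None:
            start = i
        elif w <= threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(position_weights) - 1))
    return regions


# Toy feature map: 5 positions x 2 channels.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
weights = [0.05, 0.90, 0.80, 0.10, 0.70]

print(filter_regions(features, weights))
print(important_regions(weights))
```

Because the mask is indexed by position only, the kept positions form contiguous index ranges that can be read directly as regions of the spectrum, which per-feature selection on the raw input would not guarantee.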
6 Conclusion
The conducted experiments demonstrated that BSF selects features at least as efficiently as the best of the classical methods. At the same time, it can be embedded directly in the NN model, eliminating the need for an external feature selector. Moreover, thanks to its differentiability, it can be used not only to drop nodes from the input layer (i.e. features) but can also be placed in the middle of the model, which enables neuron pruning. This approach is also applicable to filtering of convolutional channels by simply sharing the weights of the BSF layer along all axes except the channel axis. If, instead, the aim is selection of regions of importance, BSF can be applied by sharing weights along the channel axis. It was shown that for some datasets this method reduces the network size to approximately 1% of the original without a significant reduction in classification metrics. BSF has the potential to become an indispensable tool for processing spectral data, which is particularly valuable in the natural sciences.
References
 [1] Girish Chandrashekar and Ferat Sahin. A survey on feature selection methods. Computers & Electrical Engineering, 40(1):16–28, 2014.
 [2] Shenkai Gu, Ran Cheng, and Yaochu Jin. Feature selection for high-dimensional classification using a competitive swarm optimizer. Soft Computing, 22(3):811–822, 2018.
 [3] Emrah Hancer, Bing Xue, Mengjie Zhang, Dervis Karaboga, and Bahriye Akay. Pareto front feature selection based on artificial bee colony optimization. Information Sciences, 422:462–479, 2018.
 [4] Majdi Mafarja, Ibrahim Aljarah, Ali Asghar Heidari, Abdelaziz I Hammouri, Hossam Faris, Al-Zoubi Ala’M, and Seyedali Mirjalili. Evolutionary population dynamics and grasshopper optimization approaches for feature selection problems. Knowledge-Based Systems, 145:25–45, 2018.

 [5] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 315–323, 2011.
 [6] Andrew Ng et al. Sparse autoencoder. CS294A Lecture notes, 72(2011):1–19, 2011.
 [7] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 806–814, 2015.
 [8] Martin Engelcke, Dushyant Rao, Dominic Zeng Wang, Chi Hay Tong, and Ingmar Posner. Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1355–1361. IEEE, 2017.
 [9] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in neural information processing systems, pages 2074–2082, 2016.
 [10] Simone Scardapane, Danilo Comminiello, Amir Hussain, and Aurelio Uncini. Group sparse regularization for deep neural networks. Neurocomputing, 241:81–89, 2017.
 [11] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
 [12] Hojjat Salehinejad and Shahrokh Valaee. EDropout: Energy-based dropout and pruning of deep neural networks. arXiv preprint arXiv:2006.04270, 2020.
 [13] Suraj Srinivas and R Venkatesh Babu. Generalized dropout. arXiv preprint arXiv:1611.06791, 2016.
 [14] Jimmy Ba and Brendan Frey. Adaptive dropout for training deep neural networks. In Advances in neural information processing systems, pages 3084–3092, 2013.
 [15] Salah Rifai, Xavier Glorot, Yoshua Bengio, and Pascal Vincent. Adding noise to the input of a model trained with a regularized objective. arXiv preprint arXiv:1104.3250, 2011.
 [16] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
 [17] Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.
 [18] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16), pages 265–283, 2016.
 [19] Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Frank Hutter, Michel Lang, Rafael G. Mantovani, Jan N. van Rijn, and Joaquin Vanschoren. OpenML benchmarking suites, 2017.
 [20] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

 [21] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using support vector machines. Machine learning, 46(1–3):389–422, 2002.
 [22] Kunal Ghosh, Annika Stuke, Milica Todorović, Peter Bjørn Jørgensen, Mikkel N Schmidt, Aki Vehtari, and Patrick Rinke. Deep learning spectroscopy: Neural networks for molecular excitation spectra. Advanced science, 6(9):1801367, 2019.
 [23] Chenhao Cui and Tom Fearn. Modern practical convolutional neural networks for multivariate regression: Applications to NIR calibration. Chemometrics and Intelligent Laboratory Systems, 182:9–20, 2018.
 [24] Xiaolei Zhang, Tao Lin, Jinfan Xu, Xuan Luo, and Yibin Ying. DeepSpectra: An end-to-end deep learning approach for quantitative spectral analysis. Analytica chimica acta, 1058:48–57, 2019.
 [25] Sigurdur Sigurdsson, Peter Alshede Philipsen, Lars Kai Hansen, Jan Larsen, Monika Gniadecka, and Hans-Christian Wulf. Detection of skin cancer by classification of Raman spectra. IEEE transactions on biomedical engineering, 51(10):1784–1793, 2004.
 [26] Yiding Chen, Shu Zheng, Jiekai Yu, and Xun Hu. Artificial neural networks analysis of surface-enhanced laser desorption/ionization mass spectra of serum protein pattern distinguishes colorectal cancer from healthy population. Clinical Cancer Research, 10(24):8380–8385, 2004.
 [27] Jindřich Charvát, Aleš Procházka, Matěj Fričl, Oldřich Vyšata, and Lucie Himmlová. Diffuse reflectance spectroscopy in dental caries detection and classification. Signal, Image and Video Processing, pages 1–8, 2020.
 [28] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
 [29] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1135–1144, 2016.
 [30] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in neural information processing systems, pages 4765–4774, 2017.
 [31] O Guselnikova, A Trelin, A Skvortsova, P Ulbrich, P Postnikov, A Pershina, D Sykora, V Svorcik, and O Lyutakov. Label-free surface-enhanced Raman spectroscopy with artificial neural network technique for recognition photoinduced DNA damage. Biosensors and Bioelectronics, 145:111718, 2019.
 [32] M Erzina, A Trelin, O Guselnikova, B Dvorankova, K Strnadova, A Perminova, P Ulbrich, D Mares, V Jerabek, R Elashnikov, et al. Precise cancer detection via the combination of functionalized SERS surfaces and convolutional neural network with independent inputs. Sensors and Actuators B: Chemical, 308:127660, 2020.