
Binary Stochastic Filtering: feature selection and beyond

Feature selection is one of the most decisive tools in understanding data and machine learning models. Among other methods, sparsity induced by the $L_1$ penalty is one of the simplest and best-studied approaches to this problem. Although such regularization is frequently used in neural networks to achieve sparsity of weights or unit activations, it is unclear how it can be employed in the feature selection problem. This work aims to extend neural networks with the ability to select features automatically by rethinking how sparsity regularization can be used, namely by stochastically penalizing feature involvement instead of the layer weights. The proposed method demonstrated superior efficiency compared with a few classical methods, achieved with minimal or no computational overhead, and can be directly applied to any existing architecture. Furthermore, the method is easily generalizable to neuron pruning and to the selection of regions of importance in spectral data.



1 Introduction

Feature selection is of great interest in all machine learning tasks, since it reduces the computational complexity of models, frequently improves generalization, and helps in understanding the data. In general, feature selection methods are divided into the following categories [1]:

  • Filter methods use feature metrics, such as correlation or information gain, to distinguish between useful and useless features.

  • Wrapper methods use the feedback of model metrics to optimize the selected feature subset. This problem can be solved exactly only by brute force, which makes it intractable in the majority of cases. Numerous heuristics have been suggested (recent research mainly focuses on swarm intelligence optimization [2, 3, 4]), but they cannot guarantee optimality.

  • Embedded methods exist for certain algorithms that create a feature importance score during training. Classical examples are decision tree-based algorithms and $L_1$-penalized linear models.

Models able to find optimal features automatically are the most desirable type of feature selector, since they provide both a trained model and the subset of important features simultaneously. Unfortunately, that is usually possible only for very simple models, while deep neural networks (NN), one of the most important state-of-the-art algorithms, are unable to perform feature selection during training. This paper is devoted to the development of a method that resolves this issue by augmenting the network with a stochastic variant of $L_1$ penalization, which can be interpreted as a stochastic search in the feature space.

2 $L_1$ penalization for neural networks

The most straightforward way to achieve sparsity with neural networks is to add an $L_1$ penalty. This method is widely used to achieve representation sparsity [5, 6] by penalizing neuron activations, or sparsity of convolutional kernels [7, 8], which improves the performance of convolutional models. Although $L_1$ penalization efficiently sparsifies networks, the structure of the obtained sparse representation is unpredictable and thus cannot be used for feature selection or neuron pruning. The work of Wen et al. [9] handles that issue by explicitly introducing structure, penalizing individual components of the network such as channels, layers, etc. At the same time, $L_1$ penalization for feature selection has not yet been applied to neural networks.

We suggest how the well-known sparsity constraints can be applied to the input of a neural network for the purpose of feature selection. The proposed method is highly universal and can be applied to the selection of input features, convolutional kernels, regions of importance, etc. It should not be confused with the widely used weight or activation regularization.

3 Related works

Sparsification of neural networks is a popular research subject of significant importance, since it makes large and computationally demanding neural networks smaller and more efficient to run on mobile devices. The application of a structured penalty to the optimization of neural network architecture was suggested by Wen et al. [9] and Scardapane et al. [10]. Both approaches are deterministic.

Since the proposed method is stochastic, it shares common properties with a wide variety of stochastic regularization techniques derived from the original Dropout [11]. Energy-based dropout [12] regularizes and prunes the network by optimizing a scalar energy with a differential evolution algorithm. The work of Srinivas et al. [13] defines a family of Dropout-like techniques. One of them, Dropout++, uses stochastic neuron dropping with trainable parameters, derived through a Bayesian NN, which leads to a similar although not identical formulation of filtering units. Adaptive Dropout [14] achieves tuning of dropping probabilities by augmenting the neural network with a binary belief network.

4 Binary stochastic filtering

The main idea of the proposed method (BSF) is the application of an $L_1$ penalty on the involvement of each variable in the training/prediction process. This is done by element-wise multiplication of the input datum $x$ by a random vector $z$ such that $z_i \sim \mathrm{Bernoulli}(w_i)$, where the vector $w$ defines a tunable set of parameters. This is similar to the Dropout technique, which performs the same multiplication but with predefined constant weights. The vector $w$ is penalized with the $L_1$ norm, which stochastically forces the model to use only the most important features. Another way to think about it is as a stochastic investigation of the parameter space that at the same time penalizes the number of involved features.
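The training-time behavior described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' TensorFlow layer: the function names `bsf_forward` and `l1_penalty`, the coefficient name `lam`, and the choice to sample an independent mask for every sample in the batch are all hypothetical.

```python
import numpy as np

def bsf_forward(x, w, rng):
    """Training-time BSF pass: multiply inputs by a Bernoulli mask.

    x   : (batch, features) input data
    w   : (features,) keep probabilities in [0, 1] (the tunable BSF weights)
    rng : numpy random Generator
    """
    # Assumption: each feature of each sample is kept independently
    # with probability w[j], otherwise it is set to zero.
    z = (rng.random(x.shape) < w).astype(x.dtype)
    return x * z, z

def l1_penalty(w, lam):
    """L1 penalty on the BSF weights; `lam` is a hypothetical coefficient name."""
    return lam * np.sum(np.abs(w))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
w = np.array([1.0, 0.5, 0.0])   # always keep, keep half the time, always drop
h, z = bsf_forward(x, w, rng)
penalty = l1_penalty(w, lam=0.1)
```

With these weights the first feature always passes unchanged and the third is always zeroed, while the penalty grows with the total keep probability, which is what pushes unimportant features toward being dropped.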


To make the layer weights $w$ trainable, it is necessary to define two gradients for backpropagation to work, namely $\partial h / \partial x$ and $\partial h / \partial w$, where $h = x \odot z$ is the layer output. We define the first gradient as

$$\frac{\partial h_i}{\partial x_i} = z_i,$$

which is a natural way to describe a variable being passed or dropped, similarly to Dropout. It is trickier to define $\partial h / \partial w$ due to its randomness. Instead, we can differentiate the expected value

$$\frac{\partial\, \mathbb{E}[h_i]}{\partial w_i} = \frac{\partial (x_i w_i)}{\partial w_i} = x_i$$

and use it as a gradient estimate. Moreover, it was empirically found that it is useful to scale the gradient by the weight value, i.e. to redefine the gradient as $\partial h_i / \partial w_i = w_i x_i$. This modification has a clear interpretation: the lower the weight, the lower the involvement of the feature in the training process, and thus the weight of this feature should change more slowly. This modification stabilizes training and prevents already disabled features from being re-enabled.
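The two gradient definitions can be illustrated with a hand-written backward pass. This is a minimal NumPy sketch, assuming the layer output is the element-wise product of the input and the sampled mask; the function name `bsf_backward` and the summation of the weight gradient over the batch are assumptions made for illustration.

```python
import numpy as np

def bsf_backward(x, w, z, grad_out):
    """Gradient estimates for h = x * z with z ~ Bernoulli(w).

    grad_out : dLoss/dh, same shape as x
    Returns (dLoss/dx, dLoss/dw).
    """
    # dh/dx = z: the gradient flows only through features that were passed.
    grad_x = grad_out * z
    # E[h] = x * w, so dE[h]/dw = x; the estimate is additionally scaled by w.
    grad_w = np.sum(grad_out * x * w, axis=0)   # summed over the batch
    return grad_x, grad_w

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
w = np.array([1.0, 0.5, 0.0])
z = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0]])
grad_out = np.ones_like(x)
grad_x, grad_w = bsf_backward(x, w, z, grad_out)
```

Note that a weight equal to zero yields a zero gradient from the data term, which matches the remark that disabled features are not re-enabled.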

The behavior of the filtering layer during the inference phase is altered by setting a threshold and deterministically passing features whose weights are above the threshold, while features whose weights are below the threshold are dropped. This replacement makes the layer deterministic at inference, which stabilizes validation metrics. An implementation of the BSF layer in the TensorFlow 2 framework can be found in the repository.
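The inference-time rule can be sketched as follows. The function name `bsf_inference` is hypothetical, and treating a weight exactly equal to the threshold as dropped is an assumption, since the text does not specify that case. The value 0.25 is the threshold reported in the experiments.

```python
import numpy as np

def bsf_inference(x, w, threshold):
    """Deterministic BSF pass: keep features whose weight exceeds the threshold."""
    keep = w > threshold            # boolean mask, one entry per feature
    return x * keep, keep

x = np.array([[1.0, 2.0, 3.0, 4.0]])
w = np.array([0.9, 0.3, 0.2, 0.0])
out, keep = bsf_inference(x, w, threshold=0.25)
```

The boolean mask `keep` is also the selected feature subset, so the same array can be reused to slice the dataset for a downstream model.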



To get some understanding of how this method works, we investigate its behavior on the simplest possible model, linear regression. We start with the general formula for linear regression

$$\min_{\beta} \; \lVert y - X\beta \rVert_2^2,$$

where $y$ is a vector of target values, $\beta$ is a vector of model weights, and $X$ is a matrix of input data, such that each row of the matrix is a single observation vector $x_i$. Now, our goal is to investigate how the optimization objective changes if we multiply each $x_i$ by a random vector $z_i$ with $z_{ij} \sim \mathrm{Bernoulli}(w_j)$. Since our objective is now random, we minimize its expected value, i.e.

$$\min_{\beta, w} \; \mathbb{E}\, \lVert y - (Z \odot X)\beta \rVert_2^2 + \lambda \lVert w \rVert_1,$$

where $Z$ is a matrix, such that $Z_{ij} \sim \mathrm{Bernoulli}(w_j)$, and $\lambda$ is the $L_1$ penalization coefficient. It can be shown (the derivation of the equation below is given in the supporting information) that the optimization objective is equivalent to

$$\min_{\beta, w} \; \lVert y - X(w \odot \beta) \rVert_2^2 + \lVert \Gamma (\sqrt{w \odot (1 - w)} \odot \beta) \rVert_2^2 + \lambda \lVert w \rVert_1,$$

where $\Gamma = (\operatorname{diag}(X^\top X))^{1/2}$, i.e. its diagonal elements correspond to standard deviations of features in $X$ (supposing they are centered), and $\odot$ denotes the Hadamard product. We can see that if all $w_j$ are equal to the same value $w$, the factor $w(1 - w)$ can be taken out of the norm expression, which gives an expression identical to the one derived in [11] (when $\lambda = 0$). From that objective we can get some insights into model behavior:

  1. For $0 < w_j < 1$ the $j$th feature is effectively penalized with the $L_2$ norm, where the penalty is additionally scaled by the standard deviation of that feature. Thus, weights for strongly varying features are penalized more, which is similar to classical Dropout.

  2. If $w_j = 0$, which is forced by the $L_1$ penalty, or $w_j = 1$, the middle term vanishes and the weight of the $j$th feature is not penalized.
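The equivalence between the expected squared error under Bernoulli masking and the closed-form objective can be checked numerically on a toy problem. The sketch below is an illustration under assumptions rather than the derivation from the supporting information: the symbols `beta` (regression coefficients) and `w` (keep probabilities) and both function names are chosen here, masks are assumed independent per observation and per feature, and the deterministic $L_1$ term is left out because it is identical on both sides.

```python
import itertools
import numpy as np

def expected_loss_exact(X, y, beta, w):
    """E||y - (Z o X) beta||^2, enumerating all binary masks for each row."""
    n, d = X.shape
    total = 0.0
    for i in range(n):
        for mask in itertools.product([0.0, 1.0], repeat=d):
            z = np.array(mask)
            # Probability of this mask under independent Bernoulli(w_j) draws.
            prob = np.prod(np.where(z == 1.0, w, 1.0 - w))
            total += prob * (y[i] - np.sum(z * X[i] * beta)) ** 2
    return total

def expected_loss_closed_form(X, y, beta, w):
    """||y - X (w o beta)||^2 + ||Gamma (sqrt(w o (1 - w)) o beta)||^2."""
    gamma = np.sqrt(np.sum(X ** 2, axis=0))     # diagonal of diag(X^T X)^(1/2)
    fit = np.sum((y - X @ (w * beta)) ** 2)
    middle = np.sum((gamma * np.sqrt(w * (1.0 - w)) * beta) ** 2)
    return fit + middle

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
beta = rng.normal(size=3)
w = np.array([0.9, 0.5, 0.0])

exact = expected_loss_exact(X, y, beta, w)
closed = expected_loss_closed_form(X, y, beta, w)
```

The two values agree up to floating-point error, and setting every entry of `w` to 0 or 1 makes the middle term vanish, which corresponds to the second observation in the list above.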

Stochastic vs deterministic

It is not immediately clear why stochastic regularization should be preferred to deterministic regularization. Firstly, weight penalization is clearly enough to achieve sparsity for simple shallow models like Lasso regression. At the same time, deep models can efficiently rescale near-zero features back in the hidden layers. Stochastic regularization is free from that issue since it has only two possible states: a feature is either passed without changes or set to zero. Moreover, it is well known in the machine learning literature that the addition of noise to the network has positive effects on model generalization and convergence [15, 16, 17]. It was observed in experiments that stochastic models are actually more stable during training and produce features that are better separated into important and unimportant ones. An example of the model convergence curves and selected feature importances is given in Fig. 1, left.
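The rescaling argument can be made concrete with a toy example. The sketch below assumes a hypothetical network whose first dense layer has weight matrix `W`; it shows that a deterministic near-zero scaling of a feature can be undone exactly by the following layer, whereas a binary mask removes the feature entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
W = rng.normal(size=(4, 5))          # first dense layer of a hypothetical network

# Deterministic scaling: feature 2 is shrunk close to zero ...
s = np.array([1.0, 1.0, 1e-3, 1.0])
# ... but the next layer can undo it by growing the matching weight row.
W_comp = W / s[:, None]
same_output = np.allclose((x * s) @ W_comp, x @ W)

# Binary mask: a dropped feature is exactly zero, so the output equals
# that of a network that never saw the feature at all.
z = np.array([1.0, 1.0, 0.0, 1.0])
x_without = np.delete(x, 2, axis=1)
W_without = np.delete(W, 2, axis=0)
truly_removed = np.allclose((x * z) @ W, x_without @ W_without)
```

Both flags evaluate to true: the scaled feature still contributes fully after compensation, while the masked feature cannot be recovered by any finite weight.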

5 Experiments

The binary stochastic filtering layer was implemented in the TensorFlow 2 framework [18] according to the definition above. A collection of datasets from the OpenML-CC18 benchmark suite [19] was used in the experiments. It contains 72 classification datasets that satisfy a number of desired properties, including balancing, a reasonable number of features and observations, moderate classification difficulty, etc. Moreover, the authors provide reference preprocessing and cross-validation splits, which facilitates replication of experiments. NN models typically require tuning of hyperparameters to get fair results, thus a subset of 10 datasets was selected from OpenML-CC18 and used in further experiments. The threshold was set to 0.25, and the $F_1$ score was used as the main evaluation metric in all experiments.

5.1 Feature selection

Figure 1: Change in $F_1$ score after feature selection with different methods, sorted in ascending order according to the group means (left). Examples of validation loss evolution for the reference model and $L_1$-penalized models; early stopping after 20 epochs without loss improvement was used (right).

16 -0.0995 -0.1490 -0.1285 -0.1280 -0.4835 -0.2950 6/64
32 -0.0014 0.0007 -0.0058 -0.0063 -0.0019 -0.0055 13/16
45 0.0169 0.0169 0.0191 0.0185 -0.0031 -0.0053 6/60
219 0.0371 0.0271 0.0217 0.0230 0.0218 0.0375 7/8
3481 -0.0213 -0.0303 -0.1445 -0.1468 -0.0182 -0.0355 56/617
9910 0.0192 0.0080 0.0056 0.0061 0.0357 0.0075 166/1776
9957 0.0057 0.0048 0.0009 -0.0010 0.0114 0.0048 23/41
9977 -0.0333 -0.0187 -0.0606 -0.0607 -0.0218 -0.0180 7/118
14952 -0.0131 -0.0024 -0.0116 -0.0111 -0.0194 -0.0149 15/30
146825 -0.0244 -0.0304 -0.1025 -0.1027 -0.1425 102/784
167140 -0.0053 -0.0050 -0.0822 -0.0813 -0.0031 -0.0057 10/180
Table 1: Mean differences between metrics for model trained on full and feature-selected datasets.

For the main experiment, features were selected from each experimental dataset by training an $L_1$-penalized model. The penalization coefficient was manually tuned to achieve the maximal reduction in the number of features while keeping the metrics reasonable. Other popular methods (implemented in scikit-learn [20]) were selected for comparison; the corresponding abbreviations are given in parentheses:

  • Filtering features based on mutual information (KB-MI) and ANOVA F-value (KB-F)

  • Recursive feature elimination with SVM as a base classifier (RFE)


  • Embedded methods: $L_1$-penalized SVM (SVC) and decision tree (CART algorithm, DT)

The same number of features was selected with these methods, and an NN model was trained on each of the selected feature subsets. Metrics for each cross-validation split were collected, and the differences between the feature-selected score and the reference full-featured score were used as a measure of feature selection efficiency. Cross-validation splits were the same for all experiments. Results are provided in Fig. 1, which visualizes the distribution of $F_1^{\mathrm{selected}} - F_1^{\mathrm{full}}$, i.e. positive values correspond to a feature-selected score higher than the original one. It follows from the data that BSF leads to the lowest decrease in classification score. Although the difference from its closest rival (DT) is small, it is statistically significant according to the Wilcoxon test. Exact values are tabulated in Tab. 1; RFE feature selection for dataset 146825 was intractable, thus this value is missing from the table. It is important to note that augmenting a model with a BSF layer has only a minor impact on its convergence (Fig. 1, right), thus the filtering layer can be added to any model almost without overhead.

5.2 Neuron pruning

Figure 2: Visualization of pruning with BSF. Neurons and BSF units are drawn as circles and squares, respectively. Weights of BSF units are shown as the saturation of the square fill.
Figure 3: Change in classification metrics after pruning vs fraction of kept weights. Values for the optimized regularization coefficient for all datasets (left); trade-off between model accuracy and complexity for different regularization coefficients for two selected datasets (right). Dataset IDs are represented by colors.

For the second experiment, every dropout layer was replaced with an $L_1$-penalized BSF layer. The regularization coefficient was shared among layers but normalized by the starting number of neurons in the layer to achieve equal penalization. Every model was trained on the same selected datasets, the BSF layers were removed, and neurons corresponding to low BSF values were pruned, which was achieved by removing the corresponding columns and/or rows from the weight matrix of each layer (Fig. 2). Differences in score for the obtained pruned models are plotted against the relative amount of kept weights in Fig. 3. The same figure demonstrates how the number of weights can be decreased further at the price of a reduction in classification metrics.
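The pruning step, removing columns and rows of the weight matrices that correspond to low BSF values, can be sketched for a single hidden layer. This is an illustrative NumPy example under assumptions, not the authors' code: the function names, the ReLU activation, the placement of the BSF mask after the activation, and the reuse of the 0.25 threshold are all choices made here.

```python
import numpy as np

def prune_hidden_layer(W1, b1, W2, bsf_w, threshold):
    """Remove hidden neurons whose BSF weight is not above the threshold.

    W1 : (n_in, n_hidden) weights into the hidden layer
    b1 : (n_hidden,) hidden biases
    W2 : (n_hidden, n_out) weights out of the hidden layer
    """
    keep = bsf_w > threshold
    # Drop columns of the incoming matrix and rows of the outgoing one.
    return W1[:, keep], b1[keep], W2[keep, :]

def forward(x, W1, b1, W2, mask=None):
    h = np.maximum(x @ W1 + b1, 0.0)       # ReLU hidden layer
    if mask is not None:
        h = h * mask                        # deterministic BSF at inference
    return h @ W2

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
W1, b1, W2 = rng.normal(size=(4, 5)), rng.normal(size=5), rng.normal(size=(5, 2))
bsf_w = np.array([0.9, 0.1, 0.8, 0.0, 0.6])

W1p, b1p, W2p = prune_hidden_layer(W1, b1, W2, bsf_w, threshold=0.25)
full = forward(x, W1, b1, W2, mask=(bsf_w > 0.25))
pruned = forward(x, W1p, b1p, W2p)
```

The pruned network has three hidden neurons instead of five and produces the same outputs as the full network with the thresholded BSF mask, so removing the BSF layer after pruning does not change predictions.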

5.3 Region selection in spectra

Figure 4: Regions of importance selected with the Grad-CAM and SHAP methods (left); regions of importance selected by BSF (right). For visualization, two Raman spectra (of human glycoproteins) from the dataset are plotted above, with the selected regions highlighted in red.

Spectra are among the most common types of data in the natural sciences. Automated recognition of spectra is highly useful in all branches of chemistry [22, 23, 24] and in biology and medicine [25, 26, 27]. Such signals share an important property: the existence of regions of importance, areas that are crucial for their interpretation. While for images the relative positions of features matter (which are usually extracted with convolutional layers), spectra are recognized based on the global positions of peaks or other features. Although this may seem like a problem for which a fully connected network is more suitable, convolutional layers are still advantageous for processing spectral information since they learn preprocessing of the data, such as background subtraction, noise filtering, etc. Extraction of the regions of importance from spectral data is exceptionally useful, since it sheds light on the processes that generate the data. Numerous approaches have been proposed to highlight the most salient regions in order to explain model decisions, including Grad-CAM [28], LIME [29] and SHAP [30]. Unfortunately, these methods, developed to explain individual predictions, frequently produce an overly complicated picture, highlighting noise and clearly useless regions. Combining individual explanations into a dataset-wise explanation is nontrivial, and its interpretation is frequently unclear.

Although this problem can be formulated as classical feature selection, that is a poor approach since it disrupts the continuity of the spectra and breaks the convolutional preprocessing. The desired selection of regions of importance can be accomplished by selecting features at the output of the convolutional part of the network, which can be performed with a BSF layer that shares weights along the channel axis. For the experiment, a custom Raman spectra dataset of glycoproteins was classified with a simple convolutional classifier, and the obtained regions of importance were analyzed with Grad-CAM (adapted for the analysis of 1D convolutional networks), the SHAP explainer, and BSF. The obtained results are presented in Fig. 4. As mentioned above, the regions of importance detected by SHAP and Grad-CAM are cumbersome and practically useless, while BSF has clearly selected the most informative regions, which have a clear chemical interpretation. This approach was successfully used in two analytical projects [31, 32].
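Weight sharing along an axis amounts to giving the BSF mask a broadcastable shape. The sketch below illustrates this in NumPy for the thresholded (inference) mask, assuming feature maps laid out as (batch, length, channels); the array names and the example weights are hypothetical. It shows the region-selection variant described above and, for comparison, the channel-filtering variant mentioned in the conclusion.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, length, channels = 2, 6, 3
features = rng.normal(size=(batch, length, channels))   # output of 1D conv layers

# Region selection: one weight per spectral position, shared across channels.
w_region = np.array([0.9, 0.8, 0.1, 0.0, 0.7, 0.2])
region_mask = (w_region > 0.25)[None, :, None]          # shape (1, length, 1)
regions_out = features * region_mask

# Channel filtering: one weight per channel, shared across positions.
w_channel = np.array([0.9, 0.1, 0.6])
channel_mask = (w_channel > 0.25)[None, None, :]        # shape (1, 1, channels)
channels_out = features * channel_mask
```

In the region variant whole positions of the spectrum are switched off in every channel, which keeps the selected regions contiguous with respect to the convolutional preprocessing; in the channel variant whole feature maps are removed.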

6 Conclusion

The conducted experiments demonstrated that BSF selects features at least as efficiently as the best of the classical methods. At the same time, it can be embedded directly in the NN model, eliminating the need for an external feature selector. Moreover, thanks to its differentiability, it can be used not only to drop nodes from the input layer (i.e. features) but can also be placed in the middle of the model, where it serves for neuron pruning. This approach is also applicable to the filtering of convolutional channels by simply sharing the weights of the BSF layer along all axes except the channel axis. If, instead, the selection of regions of importance is the aim, BSF can be applied by sharing weights along the channel axis. It was shown that for some datasets this method reduces the network size to approximately 1% of the original size without a significant reduction in classification metrics. BSF has the potential to become an indispensable tool for the processing of spectral data, which is particularly valuable in the natural sciences.


References

  • [1] Girish Chandrashekar and Ferat Sahin. A survey on feature selection methods. Computers & Electrical Engineering, 40(1):16–28, 2014.
  • [2] Shenkai Gu, Ran Cheng, and Yaochu Jin. Feature selection for high-dimensional classification using a competitive swarm optimizer. Soft Computing, 22(3):811–822, 2018.
  • [3] Emrah Hancer, Bing Xue, Mengjie Zhang, Dervis Karaboga, and Bahriye Akay. Pareto front feature selection based on artificial bee colony optimization. Information Sciences, 422:462–479, 2018.
  • [4] Majdi Mafarja, Ibrahim Aljarah, Ali Asghar Heidari, Abdelaziz I Hammouri, Hossam Faris, Al-Zoubi Ala’M, and Seyedali Mirjalili. Evolutionary population dynamics and grasshopper optimization approaches for feature selection problems. Knowledge-Based Systems, 145:25–45, 2018.
  • [5] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 315–323, 2011.
  • [6] Andrew Ng et al. Sparse autoencoder. CS294A Lecture notes, 72(2011):1–19, 2011.
  • [7] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 806–814, 2015.
  • [8] Martin Engelcke, Dushyant Rao, Dominic Zeng Wang, Chi Hay Tong, and Ingmar Posner. Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1355–1361. IEEE, 2017.
  • [9] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in neural information processing systems, pages 2074–2082, 2016.
  • [10] Simone Scardapane, Danilo Comminiello, Amir Hussain, and Aurelio Uncini. Group sparse regularization for deep neural networks. Neurocomputing, 241:81–89, 2017.
  • [11] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
  • [12] Hojjat Salehinejad and Shahrokh Valaee. Edropout: Energy-based dropout and pruning of deep neural networks. arXiv preprint arXiv:2006.04270, 2020.
  • [13] Suraj Srinivas and R Venkatesh Babu. Generalized dropout. arXiv preprint arXiv:1611.06791, 2016.
  • [14] Jimmy Ba and Brendan Frey. Adaptive dropout for training deep neural networks. In Advances in neural information processing systems, pages 3084–3092, 2013.
  • [15] Salah Rifai, Xavier Glorot, Yoshua Bengio, and Pascal Vincent. Adding noise to the input of a model trained with a regularized objective. arXiv preprint arXiv:1104.3250, 2011.
  • [16] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • [17] Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.
  • [18] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16), pages 265–283, 2016.
  • [19] Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Frank Hutter, Michel Lang, Rafael G. Mantovani, Jan N. van Rijn, and Joaquin Vanschoren. Openml benchmarking suites, 2017.
  • [20] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [21] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using support vector machines. Machine learning, 46(1-3):389–422, 2002.
  • [22] Kunal Ghosh, Annika Stuke, Milica Todorović, Peter Bjørn Jørgensen, Mikkel N Schmidt, Aki Vehtari, and Patrick Rinke. Deep learning spectroscopy: Neural networks for molecular excitation spectra. Advanced science, 6(9):1801367, 2019.
  • [23] Chenhao Cui and Tom Fearn. Modern practical convolutional neural networks for multivariate regression: Applications to nir calibration. Chemometrics and Intelligent Laboratory Systems, 182:9–20, 2018.
  • [24] Xiaolei Zhang, Tao Lin, Jinfan Xu, Xuan Luo, and Yibin Ying. Deepspectra: An end-to-end deep learning approach for quantitative spectral analysis. Analytica chimica acta, 1058:48–57, 2019.
  • [25] Sigurdur Sigurdsson, Peter Alshede Philipsen, Lars Kai Hansen, Jan Larsen, Monika Gniadecka, and Hans-Christian Wulf. Detection of skin cancer by classification of raman spectra. IEEE transactions on biomedical engineering, 51(10):1784–1793, 2004.
  • [26] Yi-ding Chen, Shu Zheng, Jie-kai Yu, and Xun Hu. Artificial neural networks analysis of surface-enhanced laser desorption/ionization mass spectra of serum protein pattern distinguishes colorectal cancer from healthy population. Clinical Cancer Research, 10(24):8380–8385, 2004.
  • [27] Jindřich Charvát, Aleš Procházka, Matěj Fričl, Oldřich Vyšata, and Lucie Himmlová. Diffuse reflectance spectroscopy in dental caries detection and classification. Signal, Image and Video Processing, pages 1–8, 2020.
  • [28] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
  • [29] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1135–1144, 2016.
  • [30] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in neural information processing systems, pages 4765–4774, 2017.
  • [31] O Guselnikova, A Trelin, A Skvortsova, P Ulbrich, P Postnikov, A Pershina, D Sykora, V Svorcik, and O Lyutakov. Label-free surface-enhanced raman spectroscopy with artificial neural network technique for recognition photoinduced dna damage. Biosensors and Bioelectronics, 145:111718, 2019.
  • [32] M Erzina, A Trelin, O Guselnikova, B Dvorankova, K Strnadova, A Perminova, P Ulbrich, D Mares, V Jerabek, R Elashnikov, et al. Precise cancer detection via the combination of functionalized sers surfaces and convolutional neural network with independent inputs. Sensors and Actuators B: Chemical, 308:127660, 2020.