
Sequential Attention for Feature Selection

Feature selection is the problem of selecting a subset of features for a machine learning model that maximizes model quality subject to a resource budget constraint. For neural networks, prior methods, including those based on ℓ_1 regularization, attention, and stochastic gates, typically select all of the features in one evaluation round, ignoring the residual value of the features during selection (i.e., the marginal contribution of a feature conditioned on the previously selected features). We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. This algorithm is based on an efficient implementation of greedy forward selection and uses attention weights at each step as a proxy for marginal feature importance. We provide theoretical insights into our Sequential Attention algorithm for linear regression models by showing that an adaptation to this setting is equivalent to the classical Orthogonal Matching Pursuit algorithm [PRK1993], and thus inherits all of its provable guarantees. Lastly, our theoretical and empirical analyses provide new explanations for the effectiveness of attention and its connections to overparameterization, which might be of independent interest.





1 Introduction

Feature selection is a classic problem in machine learning and statistics where one is asked to find a subset of features from a larger set of features, such that the prediction quality of the model trained using the subset of features is approximately as good as training on all features. Finding such a feature subset is desirable for a number of reasons: improving model interpretability, reducing inference latency, decreasing model size, and regularizing models by removing redundant or noisy features to reduce overfitting and improve generalization. We direct the reader to LCWMTTL2017 for a comprehensive survey on the role of feature selection in machine learning.

The widespread success of deep learning has prompted an intense study of feature selection algorithms for neural networks, especially in the supervised setting. While many methods have been proposed, we concentrate on a particular line of work centered around using attention for feature selection. The attention mechanism in machine learning roughly refers to applying a trainable softmax mask to the input of a layer. This allows the model to “focus on” important parts of the input during training. Attention has recently led to major breakthroughs in computer vision, natural language processing, and several other areas of machine learning (VSPUJGKP2017). For feature selection, the works of WZSZ2014; GGH2019; SDLP2020; WC2020; LLY2021 all present new approaches for feature attribution, ranking, and selection that are inspired by attention.
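At its core, the masking operation described above is just an elementwise product between the input and a softmax over trainable logits. A minimal NumPy sketch of such a feature mask (an illustration only, not the implementation used in any of the cited works):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax.
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_mask(x, theta):
    """Apply a trainable softmax mask to an input vector.

    x:     (d,) input features
    theta: (d,) trainable attention logits
    Returns the masked input x * softmax(theta).
    """
    return x * softmax(theta)

# With uniform logits, the mask scales every feature by 1/d.
x = np.array([1.0, 2.0, 3.0, 4.0])
theta = np.zeros(4)
print(attention_mask(x, theta))  # each entry scaled by 0.25
```

During training, gradients flow into `theta`, so the mask learns to concentrate on the parts of the input that reduce the loss.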

One problem with naively using attention for feature selection is that it can ignore the residual values of features, i.e., the marginal contribution a feature has on the loss conditioned on the presence of the previously-selected features. This can lead to several problems such as selecting redundant features or ignoring features that are uninformative in isolation but valuable in the presence of others.

This work introduces the Sequential Attention algorithm for supervised feature selection. Our algorithm addresses the shortcomings above by using attention-based selection adaptively over multiple rounds. Sequential Attention simplifies earlier attention-based feature selection algorithms by directly training one global feature mask instead of aggregating many instance-wise feature masks that are the outputs of different subnetworks. This technique reduces the overhead of our algorithm, removes the need to tune unnecessary hyperparameters, works directly with any model architecture, facilitates a streaming implementation, and gives higher-quality estimates for the marginal gains in prediction quality for the unselected features. Empirically, Sequential Attention achieves state-of-the-art feature selection results for neural networks on standard benchmarks.

Figure 1: Sequential Attention applied to a model. At each step, the selected features are used as direct inputs to the model, and each unselected feature is downscaled by its corresponding softmax attention weight, computed from the vector of learned attention weights.

Sequential Attention.

Our starting point for Sequential Attention is the well-known greedy forward selection algorithm, which iteratively selects the feature with the largest marginal improvement in model loss when added to the set of currently selected features (see, e.g., DK2011; EKDN2018). Greedy forward selection is known to select high-quality features, but selecting k out of n features requires training O(nk) models (one per candidate feature per round) and is therefore impractical for many modern machine learning problems. To reduce this cost, one natural idea is to train only k models, where the model trained in each step approximates the marginal gains of all unselected features. These approximate marginal gains are used as the criteria for greedy selection. Said another way, we relax the greedy algorithm to fractionally consider all feature candidates simultaneously rather than computing their exact marginal gains one by one with separate model trainings. We implement this idea by introducing a new set of trainable variables that represent feature importance, or attention weights. Then, in each step, we select the feature with maximum importance score and add it to the selected set. To ensure that the score-augmented models in each step (1) have differentiable architectures and (2) are encouraged to hone in on the best unselected feature, we take the softmax of the importance weights and multiply each input feature value by its corresponding softmax value, as illustrated in Figure 1.

Formally, given a dataset represented as a matrix X with n rows of examples and d feature columns, suppose we want to select k features. Let f be a differentiable model, e.g., a neural network, that outputs the predictions f(X). Let y be the labels, L(y, f(X)) be the loss between the model’s predictions and the labels, and ⊙ be the Hadamard product. Sequential Attention outputs a set S of k feature indices, and is presented below in Algorithm 1.

1:function SequentialAttention(dataset X, labels y, model f, loss L, size k)
2:     Initialize S ← ∅
3:     for t = 1 to k do
4:         Let θ* ← argmin_θ L(y, f(X ⊙ s(θ))), where s(θ)_i = softmax(θ)_i for i ∉ S and s(θ)_i = 1 for i ∈ S
5:         Set i* ← unselected feature with largest attention weight, i.e., i* ← argmax_{i ∉ S} θ*_i
6:         Update S ← S ∪ {i*}
7:     return S
Algorithm 1 Sequential Attention for feature selection.
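To make Algorithm 1 concrete, here is a self-contained sketch for the special case of a linear model trained by plain gradient descent (our illustrative implementation under these simplifying assumptions, not the paper's code; the same scheme applies to neural networks with standard optimizers):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax.
    e = np.exp(z - z.max())
    return e / e.sum()

def sequential_attention(X, y, k, steps=2000, lr=0.5):
    """Sequential Attention for a linear model trained by gradient descent.

    Each round jointly trains linear weights `beta` and attention logits
    `theta`; the mask is softmax(theta) on unselected features and 1 on
    selected ones. The unselected feature with the largest logit is then
    added to the selected set S.
    """
    n, d = X.shape
    S = []
    for _ in range(k):
        U = [i for i in range(d) if i not in S]    # unselected features
        theta = np.zeros(len(U))
        beta = np.zeros(d)
        for _ in range(steps):
            m = np.ones(d)
            s = softmax(theta)
            m[U] = s                               # mask: softmax on U, 1 on S
            w = m * beta
            g_w = (2.0 / n) * (X.T @ (X @ w - y))  # dLoss/dw
            g_beta = g_w * m
            g_mU = (g_w * beta)[U]                 # dLoss/dmask on U
            g_theta = s * (g_mU - np.dot(g_mU, s)) # softmax Jacobian, chain rule
            beta -= lr * g_beta
            theta -= lr * g_theta
        S.append(U[int(np.argmax(theta))])
    return S
```

For example, on data where y depends only on features 0 and 1, two rounds select exactly those two features.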

1.1 Our contributions

Theoretical guarantees.

We give provable guarantees for Sequential Attention for least squares linear regression by analyzing a variant of the algorithm called regularized linear Sequential Attention.

This variant (1) uses Hadamard product overparameterization directly between the attention weights and feature values, without normalizing the attention weights via the softmax, and (2) adds ℓ2 regularization to the objective, hence the “linear” and “regularized” qualifiers. Note that ℓ2 regularization, or weight decay, is common practice when using gradient-based optimizers (tibshirani2021equivalences). We give theoretical and empirical evidence that replacing the softmax by different overparameterization schemes leads to similar results (Section 4.2) while offering a more tractable analysis. In particular, our main result shows that regularized linear Sequential Attention has the same provable guarantees as the celebrated Orthogonal Matching Pursuit (OMP) algorithm of PRK1993 for sparse linear regression, without making any assumptions on the design matrix or response vector.

For linear regression, regularized linear Sequential Attention is equivalent to OMP.

We prove this equivalence using a novel two-step argument. First, we show that regularized linear Sequential Attention is equivalent to a greedy version of LASSO (Tib1996), which LC2014 call Sequential LASSO. However, prior to our work, Sequential LASSO was only analyzed in a restricted “sparse signal plus noise” setting, offering limited insight into its success in practice. Second, we prove that Sequential LASSO is equivalent to OMP in the fully general setting for linear regression by analyzing the geometry of the associated polyhedra. This ultimately allows us to transfer the guarantees of OMP to Sequential Attention.

For linear regression, Sequential LASSO (LC2014) is equivalent to OMP.

We present the full argument for our results in Section 3. This analysis takes significant steps towards explaining the success of attention in feature selection and the various theoretical phenomena at play.

Towards understanding attention.

An important property of OMP is that it provably approximates the marginal gains of features: DK2011 showed that for any subset of features, the gradient of the least squares loss at its minimizer approximates the marginal gains up to a factor that depends on the sparse condition numbers of the design matrix. This suggests that Sequential Attention could also approximate some notion of the marginal gains for more sophisticated models when selecting the next-best feature. We observe this phenomenon empirically in our marginal gain experiments in Section 4.2. These results also help refine the widely-assumed conjecture that attention weights correlate with feature importance by specifying an exact measure of “importance” at play. Since countless definitions of feature importance are used in practice, it is important to determine which one best explains how the attention mechanism works.

Connections to overparameterization.

In our analysis of regularized linear Sequential Attention for linear regression, we do not use the presence of the softmax in the attention mechanism—rather, the crucial ingredient in our analysis is the Hadamard product parameterization of the learned weights. We conjecture that the empirical success of attention-based feature selection is primarily due to this explicit overparameterization. Indeed, our experiments in Section 4.2 support this claim by showing that if we substitute the softmax in Sequential Attention with a number of different (normalized) overparameterized expressions, we achieve nearly identical performance. This line of reasoning is also supported by the recent work of YHWL2021, who claim that attention largely owes its success to the “smoother and stable [loss] landscapes” that Hadamard product overparameterization induces.

1.2 Related work

We discuss recent advances in supervised feature selection for deep neural networks (DNNs), as these works are the most relevant for our empirical results.

The group LASSO has been used in deep learning to achieve structured sparsity by pruning neurons, and even filters or channels in convolutional neural networks (LL2016; WWWCL2016; LKDSG2017). For feature selection in particular, the LASSO and group LASSO were applied to neural networks in ZHW2015; LCW2016; SCHU2017; LRAT2021.

The LASSO is the most widely-used method for relaxing the sparsity constraint in feature selection, but several recent works have proposed new relaxations based on stochastic gates (SSB2017; LWK2018; ABZ2019; TP2020; YLNK2020). This approach introduces (learnable) Bernoulli random variables for each feature during model training and minimizes the expected loss over realizations of the 0-1 variables (accepting or rejecting features).

There are several other recently suggested approaches to feature selection for DNNs. RMM2015 use the magnitudes of weights in the first hidden layer to select features. LFLN2018 proposed the DeepPINK architecture, extending the idea of knockoffs (BC2015) to neural networks. Here, each feature is paired with a “knockoff” version that competes with the original feature; if the knockoff wins, the feature is removed. BHK2019 introduced the CancelOut DNN layer, which suppresses irrelevant features via independent per-feature activation functions, i.e., sigmoids, that act as (soft) bitmasks.

In contrast to the aforementioned works, combinatorial optimization is rich with sequential algorithms that are applied in machine learning (zadeh2017scalable; fahrbach2019submodular; fahrbach2019non; chen2021feature; HSL2022; Bil2022). In fact, most influential feature selection algorithms from this literature are sequential, e.g., greedy forward and backward selection (YS2018; DJGEC2022), Orthogonal Matching Pursuit (PRK1993), and information-theoretic methods (Fle2004; bennasar2015feature). However, these methods are often not tailored to neural networks, and suffer in quality, efficiency, or both.

Finally, we study global feature selection, i.e., selecting the same subset of important features across all training examples, but there are also many important works that consider local (or instance-wise) feature selection. This problem is often categorized as model interpretability, and is also known as computing feature attribution or saliency maps. Instance-wise feature selection has been explored using a variety of techniques, including gradients (STKVW2017; STY2017; SF2019), attention (AP2021; YHWL2021), mutual information (CSWJ2018), and Shapley values from cooperative game theory.
2 Preliminaries

Before discussing our theoretical guarantees for Sequential Attention in Section 3, we present several known results about feature selection for linear regression, also called sparse linear regression. Recall that in the least squares linear regression problem, we have the objective

    min_β ‖Aβ − b‖₂².     (2)

We work in the most challenging setting for obtaining relative error guarantees for this objective by making no distributional assumptions on A, i.e., we seek a k-sparse β̃ such that

    ‖Aβ̃ − b‖₂² ≤ C · min_{‖β‖₀ ≤ k} ‖Aβ − b‖₂²     (3)

for some factor C ≥ 1, where A is not assumed to follow any particular input distribution. This is far more applicable in practice than, e.g., assuming the entries of A are i.i.d. Gaussian. In large-scale applications, the number of examples often greatly exceeds the number of features, resulting in an optimal loss that is nonzero. Therefore, we focus on the overdetermined regime and refer to (PSZ2022) for an excellent discussion on the long history of this problem.


Let A denote the design matrix with unit columns and let b denote the label vector, also assumed to be a unit vector. (These assumptions are without loss of generality by scaling.) For a subset S of features, let A_S denote the matrix consisting of the columns of A indexed by S. For singleton sets {i}, we simply write A_i for A_{{i}}. Let P_S = A_S A_S⁺ denote the projection matrix onto the column span of A_S, where A_S⁺ denotes the pseudoinverse of A_S. Let P_S^⊥ = I − P_S denote the projection matrix onto the orthogonal complement of the column span of A_S.
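In code, these projections are only a few lines (a sketch for intuition; `np.linalg.pinv` plays the role of the pseudoinverse):

```python
import numpy as np

def projector(A, S):
    """Orthogonal projector onto the column span of the columns of A in S."""
    A_S = A[:, sorted(S)]
    return A_S @ np.linalg.pinv(A_S)

def projector_perp(A, S):
    """Projector onto the orthogonal complement of that column span."""
    return np.eye(A.shape[0]) - projector(A, S)

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))
P = projector(A, {0, 2})
assert np.allclose(P @ P, P)              # projectors are idempotent
assert np.allclose(P, P.T)                # and symmetric
assert np.allclose(P @ A[:, 0], A[:, 0])  # columns in S are fixed points
```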

Feature selection algorithms for linear regression.

Perhaps the most natural algorithm for sparse linear regression is greedy forward selection, which was shown to have guarantees of the form of (3) in the breakthrough works of DK2011; EKDN2018, with a dependence on the sparse condition numbers of A, i.e., the spectrum of A restricted to subsets of its columns. Greedy forward selection can be expensive in practice, but these works also prove analogous guarantees for the more efficient Orthogonal Matching Pursuit algorithm, which we present formally in Algorithm 2.

1:function OMP(design matrix A, response b, size constraint k)
2:     Initialize S ← ∅
3:     for t = 1 to k do
4:         Set r ← P_S^⊥ b
5:         Let i* ← argmax_{i ∉ S} |⟨A_i, r⟩|, i.e., the column with maximum correlation with the residual
6:         Update S ← S ∪ {i*}
7:     return S
Algorithm 2 Orthogonal Matching Pursuit (PRK1993).
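A compact NumPy sketch of OMP (our illustration; `np.linalg.lstsq` computes the projection residual in step 4):

```python
import numpy as np

def omp(A, b, k):
    """Orthogonal Matching Pursuit: in each round, pick the column most
    correlated with the current residual, then recompute the residual of
    projecting b onto the span of all selected columns."""
    S = []
    r = b.copy()
    for _ in range(k):
        scores = np.abs(A.T @ r)
        scores[S] = -np.inf            # never re-select a chosen column
        S.append(int(np.argmax(scores)))
        coef, *_ = np.linalg.lstsq(A[:, S], b, rcond=None)
        r = b - A[:, S] @ coef
    return S

# Noiseless sparse recovery: b is a combination of columns 3 and 7.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
A /= np.linalg.norm(A, axis=0)         # unit columns, as in the text
b = 3 * A[:, 3] + 2 * A[:, 7]
print(sorted(omp(A, b, 2)))            # [3, 7]
```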

The LASSO algorithm (Tib1996) is another popular feature selection method, which simply adds ℓ1 regularization to the objective in Equation (2). Theoretical guarantees for the LASSO are known in the underdetermined regime (DE2003; CT2006), but it is an open problem whether the LASSO has guarantees of the form of Equation (3). Sequential LASSO is a related algorithm that uses the LASSO to select features one by one. LC2014 analyzed this algorithm in a specific parameter regime, but until our work, no relative error guarantees were known in full generality (e.g., in the overdetermined regime). We present the Sequential LASSO in Algorithm 3.

1:function SequentialLASSO(design matrix A, response b, size constraint k)
2:     Initialize S ← ∅
3:     for t = 1 to k do
4:         Let β(λ) denote the optimal solution to the LASSO with ℓ1 regularization of strength λ applied only to the coordinates outside of S
5:         Set λ* ← largest λ with nonzero LASSO coefficients on the coordinates outside of S
6:         Let T ← support of β(λ*) outside of S
7:         Select any non-empty T' ⊆ T, which exists by Lemma 3.2
8:         Update S ← S ∪ T'
9:     return S
Algorithm 3 Sequential LASSO (LC2014).

Note that Sequential LASSO as stated requires a search for the optimal regularization strength λ in each step. In practice, however, λ can simply be set to a large enough value to obtain similar results, since beyond a critical value of λ, the feature ranking according to LASSO coefficients does not change (EHJT2004).
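Following that practical observation, here is a simplified sketch of Sequential LASSO using a fixed λ and an ISTA (proximal gradient) solver; it selects the unselected coordinate with the largest coefficient magnitude each round, rather than tracking the critical λ of Algorithm 3. All names and defaults are ours:

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * |.| (elementwise).
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sequential_lasso(A, b, k, lam=0.1, steps=3000):
    """Select one feature per round by solving a LASSO whose l1 penalty
    is applied only to the not-yet-selected coordinates."""
    n, d = A.shape
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    S = []
    for _ in range(k):
        penalized = np.ones(d, dtype=bool)
        penalized[S] = False               # selected coordinates: no penalty
        beta = np.zeros(d)
        for _ in range(steps):
            z = beta - A.T @ (A @ beta - b) / L   # gradient step
            beta = np.where(penalized, soft_threshold(z, lam / L), z)
        cand = [i for i in range(d) if i not in S]
        S.append(max(cand, key=lambda i: abs(beta[i])))
    return S
```

On well-conditioned noiseless instances, this sketch selects the same support as OMP, consistent with the equivalence shown in Section 3.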

3 Equivalence for least squares: OMP and Sequential Attention

In this section, we show that the following algorithms are equivalent for least squares linear regression: regularized linear Sequential Attention, Sequential LASSO, and Orthogonal Matching Pursuit.

3.1 Regularized linear Sequential Attention and Sequential LASSO

We start by formalizing our modified version of Sequential Attention, for which we obtain provable guarantees.

[Regularized linear Sequential Attention] Let S be the set of currently selected features. We define the regularized linear Sequential Attention objective by removing the softmax normalization in Algorithm 1 and introducing ℓ2 regularization on the importance weights and the model parameters restricted to the unselected coordinates. That is, we consider the objective


where ⊙ denotes the Hadamard product and the ℓ2 regularization is restricted to the indices outside of S.

By a simple argument due to Hof2017 (namely, that the minimum of (u² + v²)/2 over factorizations u·v = β equals |β|), the objective function in (5) is equivalent to


It follows that attention (or more generally, overparameterization by trainable weights) can be seen as a way to implement ℓ1 regularization for least squares linear regression, i.e., the LASSO (Tib1996). This connection between overparameterization and ℓ1 regularization has also been observed in several other recent works (VKR2019; ZYH2022; tibshirani2021equivalences).
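The identity underlying this equivalence is min{(u² + v²)/2 : u·v = β} = |β|, attained at |u| = |v| = √|β|, so an ℓ2 penalty on the two factors of a Hadamard product acts as an ℓ1 penalty on their product. A quick numerical sanity check of the identity:

```python
import numpy as np

def min_l2_factorization(beta):
    """Brute-force min over u > 0 of (u^2 + (beta/u)^2) / 2, the smallest
    l2 cost of writing beta as a product u * v with u * v = beta."""
    u = np.linspace(0.01, 10.0, 100000)
    return ((u ** 2 + (beta / u) ** 2) / 2.0).min()

# The minimum matches |beta|, attained at u = v = sqrt(|beta|).
for beta in [0.25, 1.0, 2.0, 7.3]:
    assert abs(min_l2_factorization(beta) - beta) < 1e-3
```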

By this transformation and reasoning, regularized linear Sequential Attention can be seen as iteratively using the LASSO with ℓ1 regularization applied only to the unselected features—which is precisely the Sequential LASSO algorithm of (LC2014). If we instead use the softmax normalization as in (1), then this only changes the choice of regularizer, as shown in Lemma 3.1 (proof in Section A.3).

Let be the function defined by , for . Denote its range and preimage by and , respectively. Moreover, define the functions and by

Then, the following two optimization problems with respect to are equivalent:


We present contour plots of the induced penalty in Figure 2. These plots suggest that it is a concave regularizer in this regime, which would thus approximate a sparsity-inducing regularizer and yield sparse solutions (ZZ2012), as ℓ1 regularization does (Tib1996).

Figure 2: Contour plot of for at different zoom-levels of .

3.2 Sequential LASSO and OMP

This connection between Sequential Attention and Sequential LASSO gives us a new perspective about how Sequential Attention works. The only known guarantee for Sequential LASSO, to the best of our knowledge, is a statistical recovery result when the input is a sparse linear combination with Gaussian noise in the ultra high-dimensional setting (LC2014). This does not, however, fully explain why Sequential Attention is such an effective feature selection algorithm.

To bridge our main results, we prove a novel equivalence between Sequential LASSO and OMP. For the remainder of this section, let .

Let A be a design matrix with unit vector columns, and let b denote the response, also a unit vector. The Sequential LASSO algorithm maintains a set S of features such that, at each feature selection step, it selects a feature i maximizing |⟨A_i, P_S^⊥ b⟩|, where A_S is the matrix formed by the columns of A indexed by S, and P_S^⊥ is the projection matrix onto the orthogonal complement of the span of A_S.

Note that this is extremely close to saying that Sequential LASSO and OMP select the exact same set of features. The only difference appears when multiple features attain the maximum correlation with the residual. In this case, it is possible that Sequential LASSO chooses the next feature from a set of features that is strictly smaller than the set of features from which OMP chooses, so the “tie-breaking” can differ between the two algorithms. In practice, this rarely happens; for instance, if the maximizer is unique at each step, which is the case with probability 1 if random continuous noise is added to the data, then Sequential LASSO and OMP will select the exact same set of features.

It was shown in (LC2014) that Sequential LASSO is equivalent to OMP in the statistical recovery regime, i.e., when for some true sparse weight vector and i.i.d. Gaussian noise , under an ultra high-dimensional regime where the dimension  is exponential in the number of examples . We prove this equivalence in the fully general setting.

The argument below shows that Sequential LASSO and OMP are equivalent, thus establishing that regularized linear Sequential Attention and Sequential LASSO offer the same approximation guarantees as OMP.

Geometry of Sequential LASSO.

We first study the geometry of optimal solutions to Equation (4). Let be the set of currently selected features. Following work on the LASSO in (TT2011), we rewrite (4) as the following constrained optimization problem:

subject to

It can then be shown that the dual problem is equivalent to finding the projection, i.e., closest point in Euclidean distance, of onto the polyhedral section , where

and denotes the orthogonal complement of . See Appendix A.1 for the full details. The primal and dual variables are related by

Selection of features in Sequential LASSO.

Next, we analyze how Sequential LASSO selects its features. Let be the optimal solution for features restricted in . Then subtracting from both sides of (9) gives


Note that if is at least , then the projection of onto is just , so by (10),

meaning that is zero outside of . We now show that for slightly smaller than , the residual is in the span of features that maximize the correlation with .

[Projection residuals of the sequential LASSO] Let denote the projection of onto . There exists such that for all the residual lies on , for

We defer the proof of Lemma 3.2 to Appendix A.2.

By Lemma 3.2 and (10), the optimal when selecting the next feature has the following properties:

  1. if , then is equal to the -th value in the previous solution ; and

  2. if , then can be nonzero only if .

It follows that Sequential LASSO selects a feature that maximizes the correlation with the residual, just as OMP does. Thus, we have shown an equivalence between Sequential LASSO and OMP without any additional assumptions.

4 Experiments

4.1 Feature selection for neural networks

Small-scale experiments.

We investigate the performance of Sequential Attention through extensive experiments on standard feature selection benchmarks for neural networks. In these experiments, we consider six datasets used in the experiments of (LRAT2021; ABZ2019), and select features using a one-layer neural network with hidden width 67 and ReLU activation (just as in these previous works). For additional points of comparison, we implement the attention-based feature selection algorithm of (LLY2021) and the Group LASSO, which has been considered in many works that aim to sparsify neural networks, as discussed in Section 1.2. We also implement natural adaptations of Sequential LASSO and OMP for neural networks and evaluate their performance.

In Figure 3, we see that Sequential Attention is competitive with or outperforms all of these feature selection algorithms on this benchmark suite. For each algorithm, we report the mean of the prediction accuracies averaged over 5 feature selection trials. We provide additional details about the experimental setup in Section B.2, including specifications for each dataset in Table 1 and the raw mean prediction accuracies with standard deviations in Table 2. We also visualize the selected features on MNIST in Figure 6 (Appendix B.1).
Figure 3: Feature selection results on small-scale datasets. Here, SA = Sequential Attention, LLY = (LLY2021), GL = Group LASSO, SL = Sequential LASSO, and OMP = OMP.
Large-scale experiments.

To demonstrate the scalability of our algorithm, we perform large-scale feature selection experiments on the Criteo click dataset, which consists of 39 features and over three billion examples for predicting click-through rates (DiemertMeynet2017). Our results in Figure 4 show that Sequential Attention outperforms these other methods when at least 15 features are selected. In particular, these plots highlight the fact that Sequential Attention excels at finding valuable features once a few features are already in the model, and that it has substantially less variance than LASSO-based feature selection algorithms. See Appendix B.3 for further discussion.

Figure 4: AUC and log loss when selecting features for Criteo dataset.

4.2 Adaptivity, overparameterization, and connections to marginal gains


One of the key messages of this work is that adaptivity is critical for high-quality feature selection. In Figure 5 and in Appendix B.4, we empirically verify this claim by studying the quality of Sequential Attention as the number of features it selects in each iteration increases.

Figure 5: Sequential Attention with varying levels of adaptivity. We select 64 features for each model, but take a batch of features in each round, for increasing batch sizes. We plot model accuracy as a function of the batch size.
Hadamard product parameterization.

In Section 1.1, we argue that Sequential Attention has provable guarantees for least squares linear regression by showing that a version that removes the softmax and adds ℓ2 regularization results in an algorithm that is equivalent to OMP. Thus, there is a gap between the implementation of Sequential Attention in Algorithm 1 and our theoretical analysis. We empirically bridge this gap by showing that regularized linear Sequential Attention yields results that are almost indistinguishable from those of the original version. In Figure 12 (Section B.5), we compare the following Hadamard product overparameterization schemes:

  • softmax: as described in Section 1

  • : for , which captures the provable variant discussed in Section 1.1

  • : for

  • normalized: for

  • normalized: for

Further, for each of the benchmark datasets, all of these variants outperform LassoNet and the other baselines considered in (LRAT2021). See Appendix B.5 for more details.

Correlation with marginal gains.

We also investigate the relationship between attention weights and marginal gains, which measure the marginal contribution of a feature with respect to a currently selected set of features. That is, if L(S) is the loss achieved by a set of features S, then the marginal gain of a feature i with respect to S is L(S) − L(S ∪ {i}). The marginal gains provide good feature scores in practice and come with strong theoretical guarantees (DK2011; EKDN2018), but they are expensive to compute for large-scale datasets. In Appendix B.6, we empirically evaluate the correlation between the learned attention weights in Sequential Attention and the true marginal gains, showing that these scores are highly correlated.
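For least squares with an orthonormal design, this connection is exact: the marginal gain of feature i given S equals the squared correlation of column i with the projection residual. A small numerical check of that special case (our illustration; general designs match only up to condition-number factors, per DK2011):

```python
import numpy as np

def ls_loss(A, b, S):
    """Least squares loss of b regressed on the columns of A indexed by S."""
    if not S:
        return float(b @ b)
    coef, *_ = np.linalg.lstsq(A[:, S], b, rcond=None)
    return float(np.sum((b - A[:, S] @ coef) ** 2))

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((30, 8)))  # orthonormal columns
b = rng.standard_normal(30)
S = [0, 1]
resid = b - Q[:, S] @ (Q[:, S].T @ b)   # projection residual
for i in range(2, 8):
    gain = ls_loss(Q, b, S) - ls_loss(Q, b, S + [i])
    assert abs(gain - float(Q[:, i] @ resid) ** 2) < 1e-8
```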

5 Conclusion

This work introduces Sequential Attention, an adaptive attention-based feature selection algorithm designed in part for DNNs. Empirically, Sequential Attention improves significantly upon previous methods on widely-used benchmarks. Theoretically, we show that a relaxed variant of Sequential Attention is equivalent to the Sequential LASSO algorithm (LC2014). In turn, we prove a novel connection between Sequential LASSO and Orthogonal Matching Pursuit, thus transferring its provable guarantees to Sequential Attention and shedding light on our empirical results. Our analysis also provides new insight into the role of attention for feature selection via adaptivity, overparameterization, and connections to marginal gains.


Appendix A Missing proofs from Section 3

a.1 Lagrangian dual of Sequential LASSO

We will first show that the Lagrangian dual of (8) is equivalent to the following problem:

subject to

We will then use the Pythagorean theorem to replace by .

We first consider the Lagrangian dual problem:


Note that the primal problem is strictly feasible and convex, so strong duality holds (see, e.g., Section 5.2.3 of BV2004). Considering just the terms involving the variable in (12), we have that

which is minimized at as varies over . On the other hand, consider just the terms involving the variable in (12), that is,


Note that if is nonzero on any coordinate in , then (13) can be made arbitrarily negative by setting to be zero and appropriately. Similarly, if , then (13) can also be made to be arbitrarily negative. On the other hand, if and , then (13) is minimized at . This gives the dual in Equation (11).

We now show that by the Pythagorean theorem, we can project in (11) rather than . In (11), recall that is constrained to be in . Then, by the Pythagorean theorem, we have

since is orthogonal to , and both and are in . The first term in the above does not depend on and thus we may discard it. Our problem therefore reduces to projecting onto , rather than .

a.2 Proof of Lemma 3.2

Proof of Lemma 3.2.

Our approach is to reduce the projection of onto the polytope defined by to a projection onto an affine space.

We first argue that it suffices to project onto the faces of specified by . For , feature indices , and signs , we define the faces

of . Let , for to be chosen sufficiently small. Then clearly


In fact, lies on the intersection of faces for an appropriate choice of signs and . WLOG, we assume that these faces are just for . Note also that for any ,

(by the Cauchy–Schwarz inequality)

For all , for small enough, this is larger than .

Thus, for small enough, is closer to the faces for than any other face. Therefore, we set .

Now, by the complementary slackness of the KKT conditions for the projection of onto , for each face of we either have that lies on the face or that the projection does not change if we remove the face. For , note that by the above calculation, the projection cannot lie on , so is simply the projection onto

By reversing the dual problem reasoning from before, the residual of the projection onto  must lie on the column span of . ∎

a.3 Parameterization patterns and regularization

Proof of Lemma 3.1.

The optimization problem on the left-hand side of Equation (7) with respect to  is equivalent to


If we define

then the LHS of (7) and (14) are equivalent to . Re-parameterizing the minimization problem in the definition of (by setting ), we obtain . ∎

Appendix B Additional experiments

b.1 Visualization of selected features on MNIST

In Figure 6, we provide a visualization of the features selected by Sequential Attention as well as other baseline algorithms, to provide intuition on the nature of the features selected by each algorithm. Similar visualizations on MNIST can be found in works such as GGH2019; WC2020; LRAT2021; LLY2021. Notably, Sequential Attention selects highly diverse pixels owing to its sequential feature selection process. Sequential LASSO selects pixels extremely similar to those of Sequential Attention, as suggested by our theoretical analysis in Section 3. Curiously, OMP does not provide a good feature subset, demonstrating that OMP does not generalize well from least squares regression and generalized linear models to deep neural networks.

Figure 6: Visualizations of the pixels selected by the feature selection algorithms on MNIST.

b.2 Further details on small-scale experiments

We start by presenting details about each of the datasets used for neural network feature selection in ABZ2019; LRAT2021.

Dataset # Examples # Features # Classes Type
Mice 1,080 77 8 Biology
MNIST 60,000 784 10 Image
MNIST-Fashion 60,000 784 10 Image
ISOLET 7,797 617 26 Speech
COIL-20 1,440 400 20 Image
Activity 5,744 561 6 Sensor
Table 1: Benchmark datasets in ABZ2019; LRAT2021.

In Figure 3, the error bars are generated as the standard deviation over running the algorithm five times with different random seeds. The values used to generate the plot are provided in Table 2.

Mice Protein 0.993 (0.008) 0.981 (0.005) 0.985 (0.005) 0.984 (0.008) 0.994 (0.008)
MNIST 0.956 (0.002) 0.944 (0.001) 0.937 (0.003) 0.959 (0.001) 0.912 (0.004)
MNIST-Fashion 0.854 (0.003) 0.843 (0.005) 0.834 (0.004) 0.854 (0.003) 0.829 (0.008)
ISOLET 0.920 (0.006) 0.866 (0.012) 0.906 (0.006) 0.920 (0.003) 0.727 (0.026)
COIL-20 0.997 (0.001) 0.994 (0.002) 0.997 (0.004) 0.988 (0.005) 0.967 (0.014)
Activity 0.931 (0.004) 0.897 (0.025) 0.933 (0.002) 0.931 (0.003) 0.905 (0.013)
Table 2: Feature selection experimental results on small-scale datasets (see Figure 3 for a key).
Dataset Fisher HSIC-Lasso PFA LassoNet Sequential Attention
Mice Protein 0.944 0.958 0.939 0.958 0.993
MNIST 0.813 0.870 0.873 0.873 0.956
MNIST-Fashion 0.671 0.785 0.793 0.800 0.854
ISOLET 0.793 0.877 0.863 0.885 0.920
COIL-20 0.986 0.972 0.975 0.991 0.997
Activity 0.769 0.829 0.779 0.849 0.931
Table 3: Feature selection experimental results on small-scale datasets from LRAT2021.

b.2.1 Model accuracy on all features

To adjust for differences between the values reported in LRAT2021 and ours due to factors such as the implementation framework, we list in Figure 7 the accuracies obtained by training a model on all features.

Dataset All Features (LRAT2021) All Features (this paper)
Mice Protein 0.990 0.963
MNIST 0.928 0.953
MNIST-Fashion 0.833 0.869
ISOLET 0.953 0.961
COIL-20 0.996 0.986
Activity 0.853 0.954
Figure 7: Accuracy of models trained on all features.

b.2.2 The Generalization of OMP to Neural Networks

From the statement of Algorithm 2, it may not be immediately clear how OMP generalizes from a linear regression model to neural networks. To see this, first observe that OMP naturally extends to generalized linear models (GLMs) via the gradient of the link function, as shown in EKDN2018. To extend this further to neural networks, we view the neural network as a GLM for any fixing of the hidden layer weights, and use the gradient of this GLM with respect to the inputs as the feature importance scores.
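To make this concrete, the following is a minimal NumPy sketch of this GLM view for logistic regression. This is our own illustration, not the paper's implementation: `omp_glm` is a hypothetical helper, and the crude gradient-descent fit stands in for fully training the restricted model on the selected features.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def omp_glm_scores(X, y, selected, steps=500, lr=0.1):
    """Score candidate features by the input gradient of the logistic loss,
    conditioned on the already-selected features (the GLM view above).

    The gradient steps below are a crude stand-in for fully training the
    restricted model on the selected columns.
    """
    n, d = X.shape
    w = np.zeros(d)
    S = list(selected)
    if S:
        for _ in range(steps):
            p = sigmoid(X[:, S] @ w[S])
            w[S] -= lr * X[:, S].T @ (p - y) / n
    # For a GLM, the gradient of the loss with respect to the inputs is
    # driven by the residual y - g(Xw), so |X^T residual| ranks candidates.
    residual = y - sigmoid(X @ w)
    scores = np.abs(X.T @ residual)
    scores[S] = -np.inf  # never re-select a feature
    return scores

def omp_glm(X, y, k):
    """Greedy OMP-style forward selection of k features."""
    selected = []
    for _ in range(k):
        selected.append(int(np.argmax(omp_glm_scores(X, y, selected))))
    return selected
```

For a deep network, the same recipe applies with the fitted GLM replaced by the network with its hidden-layer weights frozen, scoring candidates by the loss gradient with respect to the inputs.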

b.3 Large-Scale Experiments

In this section, we give additional details and discussion on our Criteo large dataset results. In Figure 4, the error bars are generated as the standard deviation over running the algorithm three times with different random seeds. The values used to generate the plot are provided in Figures 8 and 9.

We first note that this dataset is so large that making multiple passes through it is expensive. We therefore modify all of the algorithms, both Sequential Attention and the baselines, to make only one pass through the data by using separate fractions of the data for the different “steps” of each algorithm. Hence, we select all features while “training” only one model.
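The batch routing for this one-pass scheme can be sketched as follows; this is a minimal illustration with a hypothetical helper, since the exact split is not specified above.

```python
def one_pass_phases(num_batches, num_steps):
    """Assign each batch index of a single pass over the data to a
    selection step, so that each step trains on a disjoint, contiguous
    fraction of the data. Any leftover batches go to the final step.
    """
    per_step = num_batches // num_steps
    return [min(i // per_step, num_steps - 1) for i in range(num_batches)]
```

For example, with 10 batches and 5 selection steps, each step sees 2 consecutive batches; a training loop would commit one step's features whenever the phase index changes, so the whole selection process costs a single pass.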

5 0.67232 0.63950 0.68342 0.50161 0.60278 0.67710 0.58300
(0.00015) (0.00076) (0.00585) (0.00227) (0.04473) (0.00873) (0.06360)
10 0.70167 0.69402 0.71942 0.64262 0.62263 0.70964 0.68103
(0.00060) (0.00052) (0.00059) (0.00187) (0.06097) (0.00385) (0.00137)
15 0.72659 0.72014 0.72392 0.65977 0.66203 0.72264 0.69762
(0.00036) (0.00067) (0.00027) (0.00125) (0.04319) (0.00213) (0.00654)
20 0.72997 0.72232 0.72624 0.72085 0.70252 0.72668 0.71395
(0.00066) (0.00103) (0.00330) (0.00106) (0.01985) (0.00307) (0.00467)
25 0.73281 0.72339 0.73072 0.73253 0.71764 0.73084 0.72057
(0.00030) (0.00042) (0.00193) (0.00091) (0.00987) (0.00070) (0.00444)
30 0.73420 0.72622 0.73425 0.73390 0.72267 0.72988 0.72487
(0.00046) (0.00049) (0.00081) (0.00026) (0.00663) (0.00434) (0.00223)
35 0.73495 0.73225 0.73058 0.73512 0.73029 0.73361 0.73078
(0.00040) (0.00024) (0.00350) (0.00058) (0.00509) (0.00037) (0.00102)
Figure 8: AUC of Criteo large experiments. SA is Sequential Attention, GL is generalized LASSO, and SL is Sequential LASSO. The values in the header for the LASSO methods are the regularization strengths used for each method.
5 0.14123 0.14323 0.14036 0.14519 0.14375 0.14073 0.14415
(0.00005) (0.00010) (0.00046) (0.00000) (0.00163) (0.00061) (0.00146)
10 0.13883 0.13965 0.13747 0.14339 0.14263 0.13826 0.14082
(0.00009) (0.00008) (0.00015) (0.00019) (0.00304) (0.00032) (0.00011)
15 0.13671 0.13745 0.13693 0.14227 0.14166 0.13713 0.13947
(0.00007) (0.00008) (0.00005) (0.00021) (0.00322) (0.00021) (0.00050)
20 0.13633 0.13726 0.13693 0.13718 0.13891 0.13672 0.13806
(0.00008) (0.00010) (0.00057) (0.00004) (0.00187) (0.00035) (0.00048)
25 0.13613 0.13718 0.13648 0.13604 0.13760 0.13628 0.13756
(0.00013) (0.00009) (0.00051) (0.00004) (0.00099) (0.00010) (0.00043)
30 0.13596 0.13685 0.13593 0.13594 0.13751 0.13670 0.13697
(0.00001) (0.00004) (0.00015) (0.00005) (0.00095) (0.00080) (0.00015)
35 0.13585 0.13617 0.13666 0.13580 0.13661 0.13603 0.13635
(0.00002) (0.00006) (0.00073) (0.00012) (0.00096) (0.00010) (0.00005)
Figure 9: Log loss of Criteo large experiments. SA is Sequential Attention, GL is generalized LASSO, and SL is Sequential LASSO. The values in the header for the LASSO methods are the regularization strengths used for each method.

b.4 The role of adaptivity

We show in this section the effect of varying adaptivity on the quality of the features selected by Sequential Attention. In the following experiments, we select 64 features on six datasets, selecting a fixed number of features at a time over a fixed number of training epochs. That is, we investigate the following question: for a fixed budget of training epochs, what is the best way to allocate them over the rounds of the feature selection process? For most datasets, we find that feature selection quality decreases as we select more features at once. An exception is the Mice Protein dataset, which curiously exhibits the opposite trend, perhaps indicating that its features are less redundant than those of the other datasets. Our results are summarized in Figures 5 and 10. We also illustrate the effect of adaptivity for Sequential Attention on MNIST in Figure 11: the selected pixels “clump together” as the number of features selected per round increases, indicating a greater degree of redundancy.
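The epoch-allocation scheme described above can be sketched as follows. All names here are hypothetical, and `score_fn` stands in for one round of Sequential Attention training that returns a score per feature.

```python
def adaptive_selection(num_features, per_round, epoch_budget, score_fn):
    """Split a fixed epoch budget evenly across num_features / per_round
    rounds; each round trains for its share of the budget, then commits
    `per_round` features at once. score_fn(selected, epochs) must return
    one importance score per feature index."""
    rounds = num_features // per_round
    epochs_per_round = max(1, epoch_budget // rounds)
    selected = []
    for _ in range(rounds):
        scores = score_fn(selected, epochs_per_round)
        candidates = sorted(
            (i for i in range(len(scores)) if i not in selected),
            key=lambda i: -scores[i],
        )
        selected.extend(candidates[:per_round])
    return selected
```

Selecting more features per round means fewer rounds, so each feature is chosen with less conditioning on earlier choices; the fully adaptive extreme commits one feature per round.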

Overall, our empirical results in this section suggest that adaptivity greatly enhances the quality of features selected by Sequential Attention, and more broadly, in feature selection algorithms.

Mice Protein 0.990 0.990 0.989 0.989 0.991 0.992 0.990
(0.006) (0.008) (0.006) (0.006) (0.005) (0.006) (0.007)
MNIST 0.963 0.961 0.956 0.950 0.940 0.936 0.932
(0.001) (0.001) (0.001) (0.003) (0.007) (0.001) (0.004)
MNIST-Fashion 0.860 0.856 0.852 0.852 0.847 0.849 0.843
(0.002) (0.002) (0.003) (0.004) (0.002) (0.002) (0.003)
ISOLET 0.934 0.930 0.927 0.919 0.893 0.845 0.782
(0.005) (0.003) (0.005) (0.004) (0.022) (0.021) (0.022)
COIL-20 0.998 0.997 0.999 0.998 0.995 0.972 0.988
(0.002) (0.005) (0.001) (0.003) (0.005) (0.012) (0.009)
Activity 0.938 0.934 0.928 0.930 0.915 0.898 0.913
(0.008) (0.007) (0.010) (0.008) (0.004) (0.010) (0.010)
Figure 10: Sequential Attention with varying levels of adaptivity. We select 64 features for each model, selecting an increasing number of features in each round. We show model accuracy as a function of the number of features selected per round.
Figure 11: Sequential Attention with varying levels of adaptivity on the MNIST dataset. We select 64 features for each model, selecting an increasing number of features in each round.

b.5 Variations on Hadamard product parameterization

We evaluate several variations of the Hadamard product parameterization pattern described in Section 4.2. In Figure 12, we provide the numerical values of the accuracies achieved.
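The variants compared here can be sketched as follows. The exact normalization forms are our reading of Section 4.2 and the key in Figure 13, not a verbatim reproduction of the paper's code.

```python
import numpy as np

def attention_weights(logits, variant="softmax"):
    """Hadamard-product mask variants (names follow Figure 13's key):
      softmax: exp(z) / sum(exp(z))
      l1 / l2: raw parameters used directly, with an l1 or l2 penalty
               added to the loss instead of an explicit normalization
      l1n / l2n: parameters divided by their l1 or l2 norm
    """
    z = np.asarray(logits, dtype=float)
    if variant == "softmax":
        e = np.exp(z - z.max())  # shift for numerical stability
        return e / e.sum()
    if variant == "l1n":
        return z / np.abs(z).sum()
    if variant == "l2n":
        return z / np.sqrt((z ** 2).sum())
    return z  # "l1"/"l2": raw mask; the penalty lives in the loss
```

In all variants the resulting weight vector multiplies the input features elementwise (a Hadamard product) before the first layer of the model.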

Dataset Softmax ℓ1 ℓ2 ℓ1 Norm. ℓ2 Norm.
Mice Protein 0.990 (0.006) 0.993 (0.010) 0.993 (0.010) 0.994 (0.006) 0.988 (0.008)
MNIST 0.958 (0.002) 0.957 (0.001) 0.958 (0.002) 0.958 (0.001) 0.957 (0.001)
MNIST-Fashion 0.850 (0.002) 0.843 (0.004) 0.850 (0.003) 0.853 (0.001) 0.852 (0.002)
ISOLET 0.920 (0.003) 0.894 (0.014) 0.908 (0.009) 0.921 (0.003) 0.921 (0.003)
COIL-20 0.997 (0.004) 0.997 (0.004) 0.995 (0.006) 0.996 (0.005) 0.996 (0.004)
Activity 0.922 (0.005) 0.906 (0.015) 0.908 (0.012) 0.933 (0.010) 0.935 (0.007)
Figure 12: Accuracies achieved by Sequential Attention with various Hadamard product parameterization variations.
Figure 13: Accuracies achieved by Sequential Attention with various Hadamard product parameterization variations. Here, SM = softmax, L1 = ℓ1, L2 = ℓ2, L1N = ℓ1 normalized, L2N = ℓ2 normalized.

b.6 Approximation of marginal gains

In Figure 14, we present experimental results showing the correlation between the true marginal gains and our computed sequential attention weights. In this experiment, we first compute the top features selected by Sequential Attention for . We then compute the marginal gains as well as the attention weights according to Sequential Attention conditioned on these features, and compare the two sets of scores. The marginal gains are computed by explicitly training a model for each candidate feature that could be added to the preselected features. In the first and second rows of Figure 14, we see that the top 50 pixels according to the marginal gains and the attention weights are visually similar: both avoid previously selected regions and find new areas that have become important. In the third row, we quantify the similarity via the Spearman correlation between the two feature rankings. The correlations degrade as we select more features, but this is expected: once the most important features have been removed, the marginal gains of the remaining features become similar.
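The rank statistic used in the third row is the standard Spearman correlation, which can be computed as follows (a minimal sketch assuming no ties among the scores; with ties, average ranks should be used instead).

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: the Pearson correlation of the ranks.
    Here it compares the marginal-gain scores with the attention weights
    over the candidate features. The double argsort converts scores to
    ranks, which is valid when there are no ties."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))
```

A value near 1 indicates that the attention weights order the candidate features almost identically to the explicitly computed marginal gains.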

Figure 14: Marginal gain experiments. The first and second rows show the top 50 features according to marginal gain and Sequential Attention, respectively. The third row shows the Spearman correlation between the two sets of scores.