Feature Selection is one of the main contemporary problems in Machine Learning and has been approached from many directions. One modern approach to feature selection in linear models consists in minimizing an $L_0$-regularized empirical risk. This risk encourages the model to strike a good balance between a low classification error and high sparsity (where only a few features are used for classification). As the $L_0$-regularized problem is combinatorial, many approaches such as the LASSO instead rely on more practical norms such as $L_1$. These approaches have been developed with two main goals in mind: restricting the number of features to improve classification speed, and limiting the features used to the most useful ones to prevent overfitting. These classical approaches to sparsity aim at finding a sparse representation of the feature space that is global to the entire dataset.
We propose a new approach to sparsity where the goal is to limit the number of features per datapoint, which we call datum-wise sparse classification (DWSC). Our approach allows the choice of features used for classification to vary with each datapoint: data points that are easy to classify can be inferred on without looking at many features, while more difficult datapoints can be classified using more features. The underlying motivation is that, while classical approaches balance accuracy and sparsity at the dataset level, our approach optimizes this balance at the individual datum level, thus resulting in equivalent accuracy at higher overall sparsity. This kind of sparsity is interesting for several reasons: First, simpler explanations are always to be preferred as per Occam's Razor. Second, in the knowledge extraction process, such datum-wise sparsity is able to provide unique information about the underlying structure of the data space. Typically, if a dataset is organized into two different subspaces, the datum-wise sparsity principle allows the model to automatically choose to classify using only the features of one or the other subspace.
DWSC considers feature selection and classification as a single sequential decision process. The classifier iteratively chooses which features to use for classifying each particular datum. In this sequential decision process, datum-wise sparsity is obtained by introducing a penalizing reward when the agent chooses to incorporate an additional feature into the decision process. The model is learned using an algorithm inspired by Reinforcement Learning.
The contributions of the paper are threefold: (i) We propose a new approach in which classification is seen as a sequential process where one has to choose which features to use depending on the input being inferred upon. (ii) This new approach results in a model that obtains good classification performance while maximizing datum-wise sparsity, i.e. the mean number of features used for classifying the whole dataset; it also naturally handles multi-class classification problems, solving them by using as few features as possible for all classes combined. (iii) We perform a series of experiments on 14 different corpora and compare the model with the LARS and an $L_1$-regularized SVM, thus providing a qualitative study of the behaviour of our algorithm.
The paper is organized as follows: First, we define the notion of datum-wise sparse classifiers and explain the interest of such models in Section 2. We then describe our sequential approach to classification and detail the learning algorithm in Section 3. We describe how this approach can be extended to multi-class classification in Section 4, and analyze the complexity of the approach in Section 5. We detail experiments on 14 datasets and give a qualitative analysis of the behaviour of this model in Section 6. Related work is discussed in Section 7.
2 Datum-Wise Sparse Classifiers
We consider the problem of supervised multi-class classification (note that this includes binary supervised classification as a special case), where one wants to learn a classification function $f_\theta : \mathcal{X} \rightarrow \mathcal{Y}$ that associates one category $y \in \mathcal{Y}$
to a vector $x \in \mathcal{X}$, where $\mathcal{X} = \mathbb{R}^n$, $n$ being the dimension of the input vectors. $\theta$ is the set of parameters learned from a training set composed of input/output pairs $\{(x_i, y_i)\}_{i \in [1..N]}$. These parameters are commonly found by minimizing the empirical risk defined by:
$$\theta^* = \operatorname*{argmin}_\theta \frac{1}{N}\sum_{i=1}^{N} \Delta(f_\theta(x_i), y_i), \qquad (1)$$
where $\Delta$ is the loss associated to a prediction error.
This empirical risk minimization problem does not consider any prior assumption or constraint concerning the form of the solution and can result in overfitted models. Moreover, when facing a very large number of features, the obtained solutions usually need to perform computations on all the features for classifying any input, thus negatively impacting the model's classification speed. We propose a different risk minimization problem where we add a penalization term that encourages the obtained classifier to classify using, on average, as few features as possible. In comparison to classical regularized approaches, where the goal is to constrain the number of features used at the dataset level, our approach enforces sparsity at the datum level, allowing the classifier to use different features when classifying different inputs. This results in a datum-wise sparse classifier that, when possible, only uses a few features for classifying easy inputs, and more features for classifying difficult or ambiguous ones.
We consider a different type of classifier function that, in addition to predicting a label $y$ given an input $x$, also provides information about which features have been used for classification. Let us denote $\mathcal{Z} = \{0, 1\}^n$. We define a datum-wise classification function $f_\theta$ of parameters $\theta$ as:
$$f_\theta : \mathcal{X} \rightarrow \mathcal{Y} \times \mathcal{Z}, \qquad f_\theta(x) = (y, z),$$
where $y$ is the predicted output and $z \in \mathcal{Z}$ is an $n$-dimensional vector such that $z^j = 1$ implies that feature $j$ has been taken into consideration for computing label $y$ on datum $x$. By convention, we denote the predicted label as $y_\theta(x)$ and the corresponding $z$-vector as $z_\theta(x)$. Thus, if $z_\theta^j(x) = 1$, feature $j$ has been used for classifying $x$ into category $y_\theta(x)$.
This definition of datum-wise classifiers has two main advantages: First, as we will see in the next section, because $f_\theta$ explains its use of features through $z_\theta(x)$, we can add constraints on the features used for classification. This allows us to encourage datum-wise sparsity, which we define below. Second, while this is not the main focus of our article, analysis of $z_\theta(x)$ gives a qualitative explanation of how the classification decision has been made, which we study in Section 6. Note that the way we define datum-wise classification is an extension of the usual definition of a classifier.
2.1 Datum-Wise Sparsity
Datum-wise sparsity is obtained by adding a penalization term to the empirical loss defined in equation (1) that limits the average number of features used for classifying:
$$\theta^* = \operatorname*{argmin}_\theta \frac{1}{N}\sum_{i=1}^{N} \Delta(y_\theta(x_i), y_i) + \lambda \frac{1}{N}\sum_{i=1}^{N} \|z_\theta(x_i)\|_0. \qquad (2)$$
The term $\|z_\theta(x_i)\|_0$ is the $L_0$ norm (the "$L_0$ norm" is not a proper norm, but we will refer to it as such in this paper, as is common in the sparsity community) of $z_\theta(x_i)$, i.e. the number of features selected for classifying $x_i$, that is, the number of elements in $z_\theta(x_i)$ equal to 1. In the general case, the minimization of this new risk results in a classifier that on average selects only a few features for classifying, but may use a different set of features w.r.t. the input being classified. We consider this to be the crux of the DWSC model: the classifier takes each datum into consideration differently during the inference process.
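As an illustration, the datum-wise regularized risk of equation (2) can be computed directly from a set of predictions and feature-indicator vectors. The following sketch (function and variable names are ours, not from the paper) uses the 0/1 loss for $\Delta$:

```python
import numpy as np

def datum_wise_risk(y_pred, y_true, z, lam):
    """Empirical risk with a datum-wise L0 penalty, as in equation (2).

    y_pred, y_true: arrays of predicted / true labels, shape (N,)
    z: binary feature-indicator matrix, shape (N, n); z[i, j] = 1 iff
       feature j was used to classify datum i
    lam: sparsity trade-off coefficient (lambda in the paper)
    """
    errors = (y_pred != y_true).mean()   # 0/1 classification loss, averaged
    l0 = z.sum(axis=1).mean()            # mean number of features used per datum
    return errors + lam * l0
```

Note that the penalty is the *mean* number of selected features, so two data points may use entirely different feature subsets while still contributing the same amount to the risk.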
3 Datum-Wise Sparse Sequential Classification
We consider a Markov Decision Process (MDP; the MDP is deterministic in our case) to classify an input $x \in \mathbb{R}^n$. At the beginning, we have no information about $x$, that is, we have no attribute/feature values. Then, at each step, we can choose either to acquire a particular feature of $x$, or to classify $x$. The act of classifying $x$ into a category $y$ ends an "episode" of the sequential process. The classification process is a deterministic process defined by:
A set of states $\mathcal{X} \times \mathcal{Z}$, where state $(x, z)$ corresponds to the agent currently classifying datum $x$ having selected the features specified by $z$. The number of currently selected features is thus $\|z\|_0$.
A set of actions $\mathcal{A}$, where $\mathcal{A}(x, z)$ denotes the set of possible actions in state $(x, z)$. We consider two types of actions:
$\mathcal{A}_f$ is the set of feature selection actions such that choosing $a_j \in \mathcal{A}_f$ corresponds to choosing feature $j$. Action $a_j$ corresponds to a vector with only the $j$-th element equal to 1. Note that the set of possible feature selection actions in state $(x, z)$, denoted $\mathcal{A}_f(x, z)$, is equal to the subset of currently unselected features, i.e. $\mathcal{A}_f(x, z) = \{a_j : z^j = 0\}$.
$\mathcal{A}_y$ is the set of classification actions $a_y$, one per possible label, that correspond to assigning a label to the current datum. Classification actions stop the sequential decision process.
A transition function $\mathcal{T}$ defined only for feature selection actions (since classification actions are terminal):
$$\mathcal{T}((x, z), a_j) = (x, z'),$$
where $z'$ is an updated version of $z$ such that $z'^j = 1$.
We define a parameterized policy $\pi_\theta$ which, for each state $(x, z)$, returns the best action as defined by a scoring function $s_\theta((x, z), a)$:
$$\pi_\theta(x, z) = \operatorname*{argmax}_{a \in \mathcal{A}(x, z)} s_\theta((x, z), a).$$
The policy decides which action to take by applying the scoring function to every action possible from state $(x, z)$ and greedily taking the highest-scoring action. The scoring function reflects the overall quality of taking action $a$ in state $(x, z)$, which corresponds to the total reward obtained by taking action $a$ in $(x, z)$ and thereafter following policy $\pi_\theta$ (this corresponds to the classical Q-function in Reinforcement Learning):
Here $r_t$ corresponds to the reward obtained at step $t$ while having started in state $(x, z)$ and followed the policy with parameterization $\theta$ for $t$ steps. Taking the sum of these rewards gives us the total reward from state $(x, z)$ until the end of the episode. Since the policy is deterministic, we may refer to a parameterized policy using simply $\theta$. Note that the optimal parameterization $\theta^*$ obtained after learning (see Sec. 3.3) is the parameterization that maximizes the expected reward in all state-action pairs of the process.
In practice, the initial state of such a process for an input $x$ corresponds to an empty $z$ vector where no feature has been selected. The policy sequentially picks, one by one, a set of features pertinent to the classification task, and then chooses to classify once enough features have been considered.
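The sequential decision process above can be sketched as a single greedy episode. In this sketch, `score` stands in for the learned scoring function $s_\theta$ and is an assumption of the sketch; actions are encoded as `('feature', j)` or `('label', y)` pairs, an encoding of our own choosing:

```python
import numpy as np

def classify_sequentially(x, score, labels):
    """Greedy episode of the sequential classification process (sketch).

    x: input vector; labels: list of possible labels;
    score(x, z, action): stand-in for the learned scoring function.
    Returns the predicted label and the binary feature-indicator vector z.
    """
    n = len(x)
    z = np.zeros(n, dtype=int)                       # no feature selected yet
    while True:
        # actions available: every unselected feature, plus every label
        actions = [('feature', j) for j in range(n) if z[j] == 0]
        actions += [('label', y) for y in labels]
        kind, arg = max(actions, key=lambda a: score(x, z, a))
        if kind == 'label':
            return arg, z                            # classification ends the episode
        z[arg] = 1                                   # transition: acquire feature arg
```

The loop always terminates: once every feature has been acquired, only classification actions remain, and any classification action ends the episode.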
The reward function reflects the immediate quality of taking action $a$ in state $(x, z)$ relative to the problem at hand. We define a reward function over the training set, $r((x_i, z), a)$, which reflects how good a decision taking action $a$ in state $(x_i, z)$ is for training example $x_i$ relative to our classification task. (Note that we could instead give a constant intermediate reward and add the full sparsity penalty to the reward at the end of the episode; the two approaches are interchangeable.) This reward is defined as follows:
If $a$ corresponds to a feature selection action, then the reward is $-\lambda$.
If $a$ corresponds to a classification action, i.e. $a = a_y$, we have $r((x_i, z), a_y) = -\Delta(y, y_i)$: with the 0/1 loss, the reward is 0 if the prediction is correct and $-1$ otherwise.
In practice, we set $\lambda$ small enough to avoid situations where classifying incorrectly is a better decision than choosing multiple features.
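The reward definition above can be sketched as a small function; the `(kind, argument)` action encoding is our own illustrative choice:

```python
def reward(action, y_true, lam):
    """Reward of the sequential process (sketch).

    Feature selection costs -lam; a correct classification yields 0,
    an incorrect one yields -1 (the 0/1 loss, negated).
    """
    kind, arg = action
    if kind == 'feature':
        return -lam                       # sparsity penalty per acquired feature
    return 0.0 if arg == y_true else -1.0  # classification action ends the episode
```

With this shape, an episode that selects $k$ features and then classifies accumulates exactly $-\lambda k - \Delta(y, y_i)$, which is the negated datum-wise loss of equation (2) for that example.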
3.1 Reward Maximization and Loss Minimization
As explained in Section 2, our ultimate goal is to find the parameterization $\theta^*$ that minimizes the datum-wise empirical loss defined in equation (2). The training process for the MDP described above is the maximization of a reward function. Let us therefore show that maximizing the reward function is equivalent to minimizing the datum-wise empirical loss:
$$\operatorname*{argmax}_\theta \sum_{i=1}^{N} \sum_{t=0}^{T_i} r((x_i, z_t), a_t) = \operatorname*{argmin}_\theta \sum_{i=1}^{N} \left[ \Delta(y_\theta(x_i), y_i) + \lambda \|z_\theta(x_i)\|_0 \right],$$
where $a_t$ is the action taken at time $t$ by the policy for the training example $x_i$, and $T_i$ is the length of the corresponding episode. Indeed, each episode accumulates $-\lambda$ for every selected feature and $-\Delta(y_\theta(x_i), y_i)$ for the final classification action.
Such an equivalence between risk minimization and reward maximization shows that the optimal classifier corresponds to the optimal policy in the MDP defined previously. This equivalence allows us to use classical MDP resolution algorithms in order to find the best classifier. We detail the learning procedure in Section 3.3.
3.2 Inference and Approximated Decision Processes
Due to the infinite number of possible inputs $x$, the number of states is also infinite. Moreover, the reward function is only known for the values of $x$ that are in the training set and cannot be computed for any other input. For these two reasons, it is not possible to compute the scoring function for all state-action pairs in a tabular manner, and this function has to be approximated.
The scoring function that underlies the policy is approximated with a linear model (although non-linear models such as neural networks may be used, we have chosen to restrict ourselves to a linear model to be able to properly compare performance with that of other state-of-the-art linear sparse models):
$$s_\theta((x, z), a) = \langle \theta, \phi(x, z, a) \rangle,$$
and the policy defined by such a function consists in taking in state $(x, z)$ the action that maximizes the scoring function, i.e. $\pi_\theta(x, z) = \operatorname*{argmax}_{a} s_\theta((x, z), a)$.
Due to their infiniteness, the state-action pairs are represented in a feature space. We denote $\phi(x, z, a)$ the featurized representation of the state-action pair $((x, z), a)$. Many definitions may be used for this feature representation, but we propose a simple projection: we restrict the representation of $x$ to only the selected features. Let $\mu(x, z)$ be the restriction of $x$ according to $z$:
$$\mu(x, z)^j = x^j \cdot z^j.$$
To be able to differentiate between an attribute of $x$ that is not yet known and an attribute that is simply equal to 0, we must keep the information present in $z$. Let $\psi(x, z)$ be the intermediate representation that corresponds to the concatenation of $\mu(x, z)$ with $z$. Now we simply need to keep the information present in $\psi(x, z)$ in a manner that allows each action to be easily distinguished by a linear classifier. To do this we use the block-vector trick, which consists in projecting $\psi(x, z)$ into a higher-dimensional space such that the position of $\psi(x, z)$ inside the global vector $\phi(x, z, a)$ is dependent on action $a$:
In $\phi(x, z, a)$, the block $\psi(x, z)$ is placed at the position corresponding to $i_a$, where $i_a$ is the index of action $a$ in the set of all possible actions. Thus, $\psi(x, z)$ is offset by an amount dependent on the action $a$.
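The block-vector construction can be sketched as follows, assuming actions are indexed consecutively (the function and argument names are illustrative, not from the paper):

```python
import numpy as np

def featurize(x, z, action_index, num_actions):
    """Block-vector representation phi(x, z, a) (sketch).

    The local vector psi = [x * z, z] is placed in the block of the
    global vector that corresponds to the action, so a single linear
    scorer can effectively learn separate weights for each action.
    """
    local = np.concatenate([x * z, z])        # mu(x, z) concatenated with z
    phi = np.zeros(num_actions * len(local))  # one block per action, rest zero
    start = action_index * len(local)
    phi[start:start + len(local)] = local
    return phi
```

Only the block of the chosen action is non-zero, so $\langle \theta, \phi(x, z, a) \rangle$ reads out that action's own slice of the weight vector.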
3.3 Learning
The goal of the learning phase is to find an optimal policy parameterization $\theta^*$ which maximizes the expected reward, thus minimizing the datum-wise regularized loss defined in (2). As explained in Section 3.2, we cannot exhaustively explore the state space during training, and therefore we use a Monte-Carlo approach to sample example states from the learning space. We use the Approximate Policy Iteration (API) algorithm with rollouts. Sampling state-action pairs according to a previous policy $\pi_{\theta_t}$, API consists in iteratively learning a better policy $\pi_{\theta_{t+1}}$ by way of the Bellman equation. The API-with-rollouts algorithm is composed of three main steps that are iteratively repeated:
(1) A set of states is sampled from the learning space. (2) For each state in the sampled set, the current policy is used to estimate, by rollout, the expected reward of choosing each possible action from that state; we now have a feature vector for each state-action pair in the sampled set and the corresponding estimated expected reward. (3) A new scoring function, and thus a new policy, is learned from these state-action/reward pairs.
After a certain number of iterations, the parameterized policy converges to a final policy which is used for inference.
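One iteration of the learning step can be sketched as a regression of Monte-Carlo return estimates onto state-action features. The ridge-regression fit below is our own choice for the sketch (the paper does not prescribe a particular regressor), and all names are illustrative:

```python
import numpy as np

def api_rollout_iteration(pairs, featurize, rollout_value, dim, ridge=1e-3):
    """One API-with-rollouts step (sketch): fit a new linear scoring
    function to rollout estimates of the expected reward.

    pairs: sampled (state, action) pairs
    featurize(s, a): feature vector of length dim for the pair
    rollout_value(s, a): Monte-Carlo estimate of the total reward of
        taking a in s and then following the current policy
    Returns the new weight vector theta.
    """
    X = np.array([featurize(s, a) for s, a in pairs])
    y = np.array([rollout_value(s, a) for s, a in pairs])
    # ridge-regularized least squares: theta = (X^T X + r I)^{-1} X^T y
    A = X.T @ X + ridge * np.eye(dim)
    return np.linalg.solve(A, X.T @ y)
```

The returned weights define the next policy's scoring function; iterating this step until the policy stabilizes corresponds to the convergence described above.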
4 Preventing Overfitting in the Sequential Model
In Section 3, we explained the process by which, at each step, we either choose a new feature or classify the current datum. This process is at the core of DWSC, but it can suffer from overfitting if the number of features is larger than the number of training examples. In such a case, DWSC would tend to select the most specific features for each training example. In classical regularization models that are not datum-wise, the classifier must use the same set of features for classifying every datum, and thus overly specific features are not chosen because they usually appear in only a few training examples.
We propose a very simple variant of the general model that allows us to avoid overfitting. We still allow DWSC to choose how many features to use before classifying an input $x$, but we constrain it to choose the features in the same order for all inputs. To do so, we constrain the score of the feature selection actions to depend only on the $z$ vector of the state $(x, z)$. An example of the effect of such a constraint is presented in Fig. 2. This constraint is handled in the following manner:
for feature selection actions, the score is computed using only the values of $z$ and $a$; $x$ is ignored. This corresponds to having two different types of state-action feature vectors depending on the type of action: feature selection actions are scored on $z$ alone, while classification actions are scored on the full representation $\psi(x, z)$.
(Fig. 2: for each example, the features selected by the unconstrained model vs. the constrained model.)
Although this constraint forces DWSC to choose the features in the same order, it still automatically learns the best order in which to choose the features, and when to stop adding features and classify. However, it avoids choosing very different feature sets for classifying different inputs (the first features chosen are common to all the inputs being classified) and thus avoids the overfitting problem.
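The constrained variant can be sketched as a scoring function that ignores $x$ for feature selection actions; the weight layout and names below are illustrative assumptions of the sketch:

```python
import numpy as np

def constrained_score(theta_f, theta_c, x, z, action):
    """Scoring for the constrained variant (sketch).

    theta_f[j]: weights of feature action j over z alone (length n), so
        the feature order is shared by all inputs;
    theta_c[y]: weights of classification action y over [x * z, z]
        (length 2n), so the stopping/labeling decision still sees x.
    """
    kind, arg = action
    if kind == 'feature':
        return float(theta_f[arg] @ z)                       # x is ignored
    return float(theta_c[arg] @ np.concatenate([x * z, z]))  # full psi(x, z)
```

Because feature-action scores depend only on $z$, every datum traverses the same feature order; only the decision of when to stop and which label to assign varies with $x$.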
5 Complexity Analysis
As explained in Section 3.3, the learning method is based on Reinforcement Learning with rollouts. Such an approach is expensive in terms of computation because it needs, at each iteration of the algorithm, to simulate trajectories in the decision process, and then to learn the scoring function based on these trajectories. Without giving the details of the computation, the complexity of each iteration grows with the number of states used for rollouts (which in practice is proportional to the number of training examples), quadratically with the number of features, and with the number of possible categories. This implies a learning method which is quadratic w.r.t. the number of features; the proposed approach is not able to deal with problems with thousands of possible features. Breaking this complexity is an active research perspective with some leads.
Inference on an input $x$ consists in sequentially choosing features, and then classifying $x$. At step $t$, one has to perform one linear computation per possible action in order to choose the best one, the possible actions when $t$ features have already been acquired being the $(n - t)$ remaining feature selection actions plus the classification actions. The inference complexity is thus proportional to the mean number of features chosen by the system before classifying, times the number of actions. In fact, due to the shape of the $\phi$ function presented in Section 3.2 and the linear nature of $s_\theta$, the scores of the actions can be efficiently computed incrementally at each step of the process by just adding the contribution of the newly added feature, which reduces each step to a constant-time update per remaining action. Moreover, the constrained model, which results in ordering the features, has a lower complexity because in that case the model does not have to choose between the different remaining features: it only has the choice to classify or to get the next feature w.r.t. the learned order.
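The incremental score computation can be sketched as follows: with a linear scorer over the local vector $[x \odot z, z]$, acquiring feature $j$ changes only two coordinates of that vector, so each cached action score can be updated in constant time (names and weight layout are ours):

```python
import numpy as np

def update_scores(scores, theta_blocks, x, j):
    """Incrementally update cached action scores after acquiring feature j
    (sketch). theta_blocks[a] holds action a's weights over the local
    vector [x * z, z] of length 2n; selecting feature j changes only
    coordinates j and n + j, so each score gets an O(1) correction.
    """
    n = len(x)
    for a, w in enumerate(theta_blocks):
        scores[a] += w[j] * x[j] + w[n + j]   # contribution of the new feature
    return scores
```

Recomputing every score from scratch would cost a full dot product per action at each step; the incremental update avoids this, which is what reduces the inference cost described above.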
While the learning complexity of our model is higher than that of baseline global linear methods, the inference speed is very close for the unconstrained model, and equivalent for the constrained one. In practice, most of the baseline methods choose a subset of variables in a couple of seconds to a couple of minutes, whereas our method takes from a dozen minutes to an hour of training, depending on the number of features and categories. Inference, however, is indeed of the same speed, which is in our opinion the important factor.
Table 1: Description of the datasets (name, number of features, number of examples, number of classes, task); e.g. Svm Guide 3: 21 features, 1,284 examples, 2 classes, binary.
6 Experiments
Experiments were run on 14 different datasets obtained from the LibSVM website (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). Ten of these datasets correspond to a binary classification task, four to a multi-class problem. The datasets are described in Table 1. For each dataset, we randomly sampled different training sets by taking a varying proportion of the examples as training examples, with the remaining examples being kept for testing. We performed experiments with three different models: L1-SVM was used as a baseline linear model with $L_1$ regularization (using LIBLINEAR). LARS was used to obtain the optimal solution of the LASSO problem for all values of the regularization coefficient at once (we use the implementation from the authors of the LARS, available in R). The Datum-Wise Sequential Model (DWSM) was tested in the two versions presented above: (i) DWSM-Un is the original unconstrained model and (ii) DWSM-Con is the constrained model for preventing overfitting.
For the evaluation, we used a classical accuracy measure, i.e. the proportion of correctly classified examples on the test set of each dataset. We perform 3 training/testing splits of a given dataset to obtain averaged figures. The sparsity has been measured as the proportion of features not used by the $L_1$-SVM and the LARS in binary classification, and as the mean proportion of features not used to classify testing examples in DWSM. For multi-class problems, where one LARS/SVM model is learned for each category, the sparsity is the proportion of features that have not been used in any of the models.
For the sequential experiments, the number of rollout states (step 1 of the learning algorithm) was set to 2,000 and the number of policy iterations was fixed to 10. Note that experiments with more rollout states and/or more iterations give similar results. Experiments were made using an alpha-mixture policy to ensure the stability of the learning process. We tested the different models with different values of $\lambda$, which controls the sparsity. Note that even with a null value of $\lambda$, contrary to the baseline models, the DWSM model does not use all of the features for classification.
Table 2: Accuracy of DWSM-Un, DWSM-Con, and $L_1$-SVM on the binary corpora at sparsity levels 0.8, 0.6, and 0.4, using different training sizes. The accuracy has been linearly interpolated from curves like the ones given in Figure 3.
For each corpus and each training size, we have computed sparsity/accuracy curves showing the performance of the different models w.r.t. the sparsity of the solution. Only two representative curves are given in Figure 3. To summarize the performance over all the datasets, we give the accuracy of the different models for three levels of sparsity in Tables 2 and 3. Due to a lack of space, these tables do not present the LARS' performance, which is equivalent to that of the $L_1$-SVM. Note that in order to obtain the accuracy for a given level of sparsity, we have computed a linear interpolation on the different curves obtained for each corpus and each training size. This linear interpolation allows us to compare the baseline sparsity methods, which choose a fixed number of features, with the average number of features chosen by DWSC; it thus compares the average amount of information considered by each classifier. We believe this approach provides a good appreciation of the algorithm's capacities.
Table 3: Accuracy on the multi-class corpora at sparsity levels 0.8, 0.6, and 0.4, using different training sizes.
Table 2 shows that, for a sparsity level of 80%, the DWSM-Un and DWSM-Con models outperform the baseline $L_1$-SVM classifier. This is particularly true for 7 of the 10 datasets, while the results are more ambiguous on the other three: breast, ionosphere and sonar. For a sparsity of 40%, similar results are obtained. Depending on the corpus and the training size, different configurations are observed. Some datasets, such as australian, can be easily classified using only a few features; in that case, our approach gives results similar to the baseline methods (see Figure 3, left). For some other datasets, our method clearly outperforms the baselines (Figure 3, right). On the splice dataset, our model is better than the best (non-sparse) SVM while using less than 20% of the features on average. This is due to the fact that our sequential process, which solves a different classification problem, is more appropriate for some particular datasets, particularly when the distribution of the data is split up amongst distinct subspaces.
In this case, our model is able to choose more appropriate features for each input.
When using small training sets with datasets such as sonar or ionosphere, where overfitting is observed (accuracy decreases as more features are used), DWSM-Con seems to be a better choice than the unconstrained version, and is thus well-suited when the number of training examples is small.
Concerning the multi-class problems, similar effects can be observed (see Table 3). The model seems particularly interesting when the number of categories is high, as in segment and vowel. This is due to the fact that the average sparsity is optimized by the sequential model over the whole multi-class problem, while $L_1$-SVM and LARS, which need to learn one model for each category, perform separate sparsity optimizations for each class.
Figure 4 gives some qualitative results. First, from the left histogram, one can see that some features are used in 100% of the decisions. This illustrates the ability of the model to detect important features that must be used for decision. Note that many of these features are also used by the $L_1$-SVM and LARS models. The sparsity gain in comparison to the baseline model is obtained through features 1 and 9, which are only used in about 20% of decisions. From the right histogram, one can see that the DWSM model mainly classifies using 1, 2, 3 or 10 features, showing that the model is able to adapt its behaviour to the difficulty of classifying a particular input. This is confirmed by the green and violet histograms, which show that for incorrect decisions (i.e. very difficult inputs) the classifier almost always acquires all the features before classifying. These difficult inputs seem to have been identified, but the set of features is not sufficient for a good classification. This behaviour opens appealing research directions concerning the acquisition and creation of new features (see Section 8).
7 Related Work
Feature selection comes in three main flavors: wrapper, filter, or embedded approaches. Wrapper approaches involve searching the feature space for an optimal subset of features that maximizes classifier performance. The feature selection step wraps around the classifier, using the classifier as a black-box evaluator of the selected feature subset. Searching the entire feature space very quickly becomes intractable, and therefore various approaches have been proposed to restrict the search (see [9, 10]). The advantage of wrapper approaches is that the feature subset decision can take into consideration feature inter-dependencies and avoid redundant features; however, the problem of the exponential size of the search space remains. Filter approaches rank the features by some scoring function independent of their effect on the associated classifier. Since the choice of features is not influenced by classifier performance, filter approaches rely purely on the adequacy of their scoring functions. Filtering methods are susceptible to not discriminating redundant features and to missing feature inter-dependencies (since each feature is scored individually); they are, however, easier to compute and more statistically stable relative to changes in the dataset. Embedded approaches include feature selection as part of the learning machine. These include algorithms solving the LASSO problem, and other linear models involving a regularizer based on a sparsity-inducing norm ($L_p$-norms, group LASSO, …). Kernel machines provide a mixture of feature selection and construction as part of the classification problem. Decision trees are also considered embedded approaches, although they are similar to filter approaches in their use of heuristic scores for tree construction. The main critique of embedded approaches is two-fold: they are susceptible to including redundant features, and not all the techniques described are easily applied to multi-class problems. In brief, both filtering and embedded approaches have drawbacks in terms of their ability to select the best subset of features, whereas wrapper methods have their main drawback in the intractability of searching the entire feature space. Furthermore, all existing methods perform feature selection based on the whole training set, the same set of features being used to represent every datum.
Our sequential decision problem defines both feature selection and classification tasks. In this sense, our approach resembles an embedded approach. In practice, however, the final classifier for each single datapoint remains a separate entity, a sort of black-box classifying machine upon which performance is evaluated. Additionally, the learning algorithm is free to navigate over the entire combinatorial feature space. In this sense our approach resembles a wrapper method.
There has been some work using similar formalisms, but with different goals and lacking in experimental results. Sequential decision approaches have been used for cost-sensitive classification with similar models. There have also been applications of Reinforcement Learning to optimize anytime classification. We have previously looked at using Reinforcement Learning for finding a stopping point in feature quantity during text classification.
Finally, in some sense, DWSC has some similarity with decision trees, as each new datapoint being labeled follows a different path in the feature space. However, the underlying mechanism is quite different, both in terms of inference procedure and learning criterion. There has been some work on using RL for generating decision trees, but that approach is still tied to decision-tree construction heuristics and the end product remains a decision tree.
8 Conclusion
In this article we introduced the concept of datum-wise classification, where we learn both a classifier and a sparse representation of the data that is adaptive to each new datum being classified. We took an approach to sparsity that considers the combinatorial space of features, and proposed a sequential algorithm inspired by Reinforcement Learning to solve this problem. We showed that finding an optimal policy for our Reinforcement Learning problem is equivalent to minimizing the $L_0$-regularized loss of our classification problem. Additionally, we showed that our model works naturally on multi-class problems, and is easily extended to avoid overfitting on datasets where the number of features is larger than the number of examples. Experimental results on 14 datasets showed that our approach is indeed able to increase sparsity while maintaining equivalent classification accuracy.
-  R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267–288, 1996.
-  R. Sutton and A. Barto, Reinforcement Learning. MIT Press, 1998.
-  B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least-angle regression,” Annals of statistics, vol. 32, no. 2, pp. 407–499, 2004.
-  M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.
-  S. Har-Peled, D. Roth, and D. Zimak, “Constraint classification: A new approach to multiclass classification,” Algorithmic Learning Theory, pp. 1 – 11, 2002.
-  M. G. Lagoudakis and R. Parr, “Reinforcement learning as classification: Leveraging modern classifiers,” ICML, 2003.
-  R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin, “LIBLINEAR: A library for large linear classification,” JMLR, vol. 9, pp. 1871–1874, 2008.
-  I. Guyon and A. Elisseefi, “An Introduction to Variable and Feature Selection,” Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1157–1182, Oct. 2003.
-  S. Girgin and P. Preux, "Feature discovery in reinforcement learning using genetic programming," in Proc. Euro-GP, ser. Lecture Notes in Artificial Intelligence (LNCS), vol. 4971. Springer, Mar. 2008, pp. 218–229.
-  R. Gaudel and M. Sebag, “Feature Selection as a One-Player Game,” ICML, 2010.
-  Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, “L1/2 regularization,” SCIENCE CHINA Information Sciences, vol. 53, no. 6, pp. 1159–1169, 2010.
-  E. Ertin, “Reinforcement learning and design of nonparametric sequential decision networks,” in Proceedings of SPIE. Spie, 2002, pp. 40–47.
-  S. Ji and L. Carin, “Cost-sensitive feature acquisition and classification,” Pattern Recognition, vol. 40, no. 5, pp. 1474–1485, May 2007.
-  B. Póczos, Y. Abbasi-Yadkori, C. Szepesvári, R. Greiner, and N. Sturtevant, “Learning when to stop thinking and do something!” ICML ’09, pp. 1–8, 2009.
-  G. Dulac-Arnold, L. Denoyer, and P. Gallinari, “Text Classification: A Sequential Reading Approach,” ECIR, pp. 411–423, 2011.
-  M. Preda, “Adaptive building of decision trees by reinforcement learning,” in Proceedings of the 7th WSEAS, 2007, pp. 34–39.