Feature Selection based on the Local Lift Dependence Scale

11/11/2017 ∙ by Diego Marcondes, et al. ∙ Universidade de São Paulo

This paper uses a classical approach to feature selection: minimization of a cost function applied to estimated joint distributions. However, the search space in which this minimization is performed is extended. In the original formulation, the search space is the Boolean lattice of feature sets (BLFS), while in the present formulation it is a collection of Boolean lattices of ordered pairs (features, associated values) (CBLOP), indexed by the elements of the BLFS. In this approach, we may select not only the features that are most related to a variable Y, but also the values of the features that most influence the variable or that are most prone to have a specific value of Y. A local formulation of Shannon's mutual information, namely the Local Lift Dependence Scale, a scale for measuring variable dependence in multiple resolutions, is applied on a CBLOP to select features. The main contribution of this paper is to define and apply this local measure, which allows the analysis of local properties of joint distributions that are neglected by Shannon's classical global measure. The proposed approach is applied to a dataset consisting of student performances on a university entrance exam, as well as on undergraduate courses. The approach is also applied to two datasets of the UCI Machine Learning Repository.


1 Introduction

Let $X$ be an $m$-dimensional feature vector and $Y$ a single variable. Let $\chi$ be a feature vector whose features are also in $X$, and denote by $\mathcal{P}(X)$ the set of all feature vectors whose features are also in $X$. In this scenario, we define the classical approach to feature selection, in which the search space is the Boolean lattice of feature sets (BLFS).

Definition 1

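Given a variable $Y$, a feature vector $X$ and a cost function $C_Y: \mathcal{P}(X) \to \mathbb{R}^{+}$ calculated from the estimated joint distribution of $\chi \in \mathcal{P}(X)$ and $Y$, the classical approach to feature selection consists in finding a subset $\chi \in \mathcal{P}(X)$ of features such that $C_Y(\chi)$ is minimum.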

In light of Definition 1, we note that some families of feature selection algorithms may be considered classical approaches. According to the taxonomy of feature selection, as presented in [6] for example, feature selection algorithms may be divided into three families: filters, wrappers and embedded methods, the last two being classical approaches to feature selection. Indeed, in wrapper methods the feature selection algorithm exists as a wrapper around a learning machine (or induction algorithm), so that a subset of features is chosen by evaluating its performance on the machine [9], while in embedded methods a subset of features is also chosen based on its performance on a learning machine, although the feature selection and the learning machine cannot be separated [12]. Therefore, both wrappers and embedded methods satisfy Definition 1, as the performance on the learning machine may be established by a cost function, so that these methods are special cases of the classical approach to feature selection. For more details about these methods see [8, 9, 7, 4, 6, 30, 12].

The main goal of the classical approach is to select the features that are most related to $Y$ according to a metric defined by the cost function. Although useful in many scenarios, this approach may not be suitable in applications in which it is of interest to select not only the features that are most related to $Y$, but also the feature values that most influence $Y$, or that are most prone to have a specific value $y$ of $Y$. Therefore, it is relevant to extend the search space of the classical approach to a space that also contemplates the range of the features, so that we may select features and subsets of their range. This extended space is a collection of Boolean lattices of ordered pairs (features, associated values) (CBLOP) indexed by the elements of the BLFS. In other words, for each $\chi \in \mathcal{P}(X)$ we have the Boolean lattice that represents the powerset of its range $R_\chi$, denoted by $\mathcal{P}(R_\chi)$, and the CBLOP is the collection of these Boolean lattices, i.e., $\{\mathcal{P}(R_\chi): \chi \in \mathcal{P}(X)\}$. If $X = (X_1, X_2)$ are Boolean features, then its CBLOP is as the one in Figure 1. Note that the circle nodes and solid lines form a BLFS, that around each circle node there is an associated Boolean lattice that represents the powerset of $R_\chi$, for a $\chi \in \mathcal{P}(X)$, and that the whole tree is a CBLOP.

A downside of this extension is that the sample size needed to perform feature selection on the extended space is greater than the one needed on the associated BLFS, which demands more refined optimal and sub-optimal algorithms in order to select the features and subsets of their range. On the other hand, the extended space advances the state of the art in feature selection, as it expands the method to a new variety of applications. As an example of such applications, we may cite market segmentation. Suppose it is of interest to segment a market according to the products that each market segment is most prone to buy. Denote by $Y$ a discrete variable that represents the products sold by a company (so that $P(Y = y)$ is the probability of an individual of the market buying the product $y$ sold by the company) and by $X$ the socio-economic and demographic characteristics of the people that compose the market. In this framework, it is not enough to select the characteristics (features) that are most related to $Y$: we need to select, for each product (value of $Y$), the characteristics $\chi \in \mathcal{P}(X)$ and their values (the profile of the people) that are prone to buy a given product, so that feature selection must be performed on a CBLOP instead of a BLFS.

We call the approach to feature selection in which the search space is a CBLOP multi-resolution, for we may choose the features based on a global cost function calculated for each $\chi \in \mathcal{P}(X)$ (low resolution); or choose the features and a subset of their range based on a local cost function calculated for each $\chi \in \mathcal{P}(X)$ and $W \subset R_\chi$ (medium resolution); or choose the features and a point of their range based on a local cost function calculated for each $\chi \in \mathcal{P}(X)$ and $x \in R_\chi$ (high resolution). Formally, the multi-resolution approach to feature selection may be defined as follows.

Definition 2

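Given a variable $Y$, a feature vector $X$ and cost functions $C_Y(\chi, \cdot): \mathcal{P}(R_\chi) \to \mathbb{R}^{+}$, $\chi \in \mathcal{P}(X)$, calculated from the estimated joint distribution of $\chi$ and $Y$, the multi-resolution approach to feature selection consists in finding a subset $\chi \in \mathcal{P}(X)$ of features and a $W \subset R_\chi$ such that $C_Y(\chi, W)$ is minimum.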

The cost functions considered in this paper are local measures of dependence such that, for each $\chi \in \mathcal{P}(X)$ of length $k$ and subset $W \subset R_\chi$, $C_Y(\chi, W)$ measures the local dependence between $\chi$ and $Y$ restricted to the subset $W$, i.e., for $\chi \in W$. More specifically, our cost functions are based on the Local Lift Dependence Scale, which is a scale for measuring variable dependence in multiple resolutions. In this scale we may measure variable dependence globally and locally. On the one hand, global dependence is measured by a coefficient that summarises it. On the other hand, local dependence is measured for each subset of the feature vector range, again by a coefficient. Therefore, if the cardinality of the feature vector range is $N$, counting the non-empty subsets of the range gives $2^{N} - 1$ dependence coefficients: one global and $2^{N} - 2$ local, each one measuring the influence of the feature vector in $Y$ restricted to a subset of its range. Furthermore, the Local Lift Dependence Scale also provides a propensity measure for each point of the joint range of the feature vector and $Y$. Note that the dependence is indeed measured in multiple resolutions: globally, for each subset of the feature vector range, and pointwise.

Thus, in this paper, we extend the classical approach to feature selection in order to select not only the features, but also their values that are most related to $Y$ in some sense. In order to do so, we extend the search space of the feature selection algorithm from the BLFS to the CBLOP and use cost functions based on the Local Lift Dependence Scale and on classical dependence measures, such as the Mutual Information, Cross Entropy and Kullback-Leibler Divergence. The feature selection algorithms proposed in this paper are applied to a dataset consisting of student performances on a university's entrance exam and on undergraduate courses in order to select the exam's subjects, and the performances on them, that are most related to undergraduate courses, considering student performance on both. The method is also applied to two datasets publicly available at the UCI Machine Learning Repository [13], namely, the Congressional Voting Records and Covertype datasets. We first present the main concepts related to the Local Lift Dependence Scale. Then, we propose feature selection algorithms based on the Local Lift Dependence Scale and apply them to the performances and UCI datasets.

2 Local Lift Dependence Scale

The Local Lift Dependence Scale (LLDS) is a set of tools for assessing the dependence between a random variable $Y$ and a random vector $X$ (also called feature vector). Although it is very simple and consists of well-known mathematical objects, there does not seem to exist any literature that thoroughly defines and studies the properties of the LLDS, even though it is widely used in marketing [3] and data mining [28, Chapter 10], for example. Therefore, we present an unprecedented characterization of the LLDS, despite the fact that much of it is well known and established in the theory.

The LLDS analyses the raw dependence between the variables, as it does not make any assumption about its kind, nor restrict itself to the study of a specific kind of dependence, e.g., linear dependence. Among the LLDS tools there are three measures of dependence, one global and two local, but with different resolutions, that assess variable dependence on multiple levels. The global measure and one of the local measures are based on well-known dependence measures, namely, the Mutual Information and the Kullback-Leibler Divergence. In the following paragraphs we present the main concepts of the LLDS and discuss how they can be applied to the classical and multi-resolution approaches to feature selection. The main concepts are presented for discrete random variables $X$ and $Y$ defined on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, with joint range $R_{X,Y} = R_X \times R_Y$, although, with simple adaptations, the continuous case follows from it.

The Mutual Information (MI), proposed by [24], is a classical dependence quantifier that measures the mass concentration of a joint probability function. The more concentrated the joint probability function is, the more dependent the random variables are and the greater is their MI. In fact, the MI is a numerical index defined as

$$I(X,Y) = \sum_{(x,y) \in R_{X,Y}} f(x,y) \log \frac{f(x,y)}{g(x)h(y)}$$

in which $f(x,y) = P(X = x, Y = y)$, $g(x) = P(X = x)$ and $h(y) = P(Y = y)$ for all $(x,y)$ in the joint range $R_{X,Y}$ of $X$ and $Y$. A useful property of the MI is that it may be expressed as

$$I(X,Y) = H(Y) - H(Y|X) \qquad (1)$$

in which $H(Y)$ is the Entropy of $Y$ and $H(Y|X)$ is the Conditional Entropy (CE) of $Y$ given $X$. The form of the MI in (1) is useful because, if we fix the variable $Y$ and consider the features $(X_1, \dots, X_m)$, we may determine which one of them is the most dependent with $Y$ by observing only the CE of $Y$ given each one of the features, as the feature that maximizes the MI is the one that minimizes the CE. In this paper, we consider the normalized MI that is given by

$$\eta(Y|X) = \frac{I(X,Y)}{H(Y)} = \frac{H(Y) - H(Y|X)}{H(Y)} \qquad (2)$$

if $H(Y) \neq 0$ and $\eta(Y|X) = 0$ if $H(Y) = 0$ (if $H(Y) = 0$ then $I(X,Y) = 0$, as $0 \leq I(X,Y) \leq H(Y)$, so that $X$ and $Y$ are independent and it is intuitive to define $\eta(Y|X) = 0$). We have that $0 \leq \eta(Y|X) \leq 1$, that $\eta(Y|X) = 0$ if, and only if, $X$ and $Y$ are independent and that $\eta(Y|X) = 1$ if, and only if, there exists a function $Q: R_X \to R_Y$ such that $P(Y = Q(X)) = 1$, i.e., $Y$ is a function of $X$. The $\eta$, MI and CE are global and general measures of dependence that summarize in a single index a variety of kinds of dependence that are expressed by mass concentration.
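The global quantities above can be illustrated with a short numerical sketch. It assumes the joint probability function is given as a dictionary mapping pairs $(x, y)$ to probabilities; the function names (`marginals`, `mutual_information`, `entropy`, `eta`) and the two toy distributions are hypothetical and serve only to illustrate the extreme cases of the normalized MI.

```python
from math import log

def marginals(joint):
    """joint maps (x, y) to P(X = x, Y = y); returns the marginals g of X and h of Y."""
    g, h = {}, {}
    for (x, y), p in joint.items():
        g[x] = g.get(x, 0.0) + p
        h[y] = h.get(y, 0.0) + p
    return g, h

def mutual_information(joint):
    """I(X, Y) = sum of f(x, y) * log(f(x, y) / (g(x) * h(y)))."""
    g, h = marginals(joint)
    return sum(p * log(p / (g[x] * h[y])) for (x, y), p in joint.items() if p > 0)

def entropy(dist):
    """H = -sum of p * log(p) over a probability function given as a dict."""
    return -sum(p * log(p) for p in dist.values() if p > 0)

def eta(joint):
    """Normalized MI: I(X, Y) / H(Y), defined as 0 when H(Y) = 0."""
    _, h = marginals(joint)
    h_y = entropy(h)
    return 0.0 if h_y == 0 else mutual_information(joint) / h_y

# Toy (hypothetical) distributions: independence and Y as a function of X.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
functional = {(0, 0): 0.5, (1, 1): 0.5}
print(eta(independent), eta(functional))
```

Under these assumptions, the independent table yields a coefficient of zero and the functional one a coefficient of one, up to floating-point error.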

On the other hand, we may define an LLDS local and general measure of dependence that expands the global dependence measured by the MI into local indexes and that enables local interpretation of the dependence between variables. As the MI is an index that measures the dependence between random variables by measuring the mass concentration incurred in one variable by the observation of another, it may only give evidence of the existence of a dependence, but cannot assert what kind of dependence is being observed. Therefore, it is relevant to break down the MI by region, so that it can be interpreted in a useful manner and the kind of dependence outlined by it may be identified. The Lift Function (LF) is responsible for this breakdown, as it may be expressed as

$$L_{(X,Y)}(x,y) = \frac{f(x,y)}{g(x)h(y)} = \frac{f(y|x)}{h(y)}$$

in which $f(y|x) = P(Y = y \mid X = x)$. When there is no doubt about which variables the LF refers to, it is denoted simply by $L(x,y)$.

The MI is the expectation, with respect to the joint distribution of $(X,Y)$, of the logarithm of the LF, so that the LF presents locally the mass concentration measured by the MI. As the LF may be written as the ratio between the conditional probability of $Y$ given $X$ and the marginal probability of $Y$, the main interest in its behaviour is in determining for which points $L(x,y) > 1$ and for which $L(x,y) < 1$. If $L(x,y) > 1$ then the fact of $X$ being equal to $x$ increases the probability of $Y$ being equal to $y$, as the conditional probability is greater than the marginal one. Therefore, we say that the event $\{X = x\}$ lifts the event $\{Y = y\}$ or that instances with profile $x$ are prone to be of the class $y$. In the same way, if $L(x,y) < 1$, we say that the event $\{X = x\}$ inhibits the event $\{Y = y\}$, for $f(y|x) < h(y)$. If $L(x,y) = 1$ for all $(x,y)$, then the random variables are independent. Note that the LF is symmetric: $\{X = x\}$ lifts $\{Y = y\}$ if, and only if, $\{Y = y\}$ lifts $\{X = x\}$. Therefore, the LF may be interpreted as $X$ lifting $Y$ or $Y$ lifting $X$. From now on, we interpret it as $X$ lifting $Y$, even though it could be the other way around.

An important property of the LF is that it cannot be greater than one, nor less than one, for all points $(x,y)$ (indeed, if $L(x,y) > 1$ for all $(x,y)$, then $f(x,y) > g(x)h(y)$, which implies the absurd $1 = \sum_{(x,y)} f(x,y) > \sum_{(x,y)} g(x)h(y) = 1$; an analogous argument shows that $L(x,y)$ cannot be less than one for all $(x,y)$). Therefore, if there are LF values greater than one, then there must be values less than one, which makes it clear that the values of the LF are dependent and that the lift is a pointwise characteristic of the joint probability function and not a global property of it. Thus, the study of the behaviour of the LF gives the full view of the dependence between the variables, without restricting it to a specific type or making assumptions about it.
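A minimal sketch of the Lift Function, assuming it is the ratio between the joint probability and the product of the marginals, and again representing the joint probability function as a dictionary. The function name `lift` and the toy distribution are hypothetical.

```python
def lift(joint):
    """Returns {(x, y): f(x, y) / (g(x) * h(y))} for a joint probability dict."""
    g, h = {}, {}
    for (x, y), p in joint.items():
        g[x] = g.get(x, 0.0) + p
        h[y] = h.get(y, 0.0) + p
    return {(x, y): p / (g[x] * h[y]) for (x, y), p in joint.items()}

# Toy (hypothetical) joint distribution with positive dependence on the diagonal.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
L = lift(joint)

lifted = [point for point, value in L.items() if value > 1]     # events that are lifted
inhibited = [point for point, value in L.items() if value < 1]  # events that are inhibited
print(L, lifted, inhibited)
```

In this toy table the diagonal points are lifted and the off-diagonal points are inhibited, in line with the property that the lift cannot be above one (or below one) at every point.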

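Although the LF presents a wide picture of the dependence between the variables, it may present it in too high a resolution, making it complex to interpret. Therefore, instead of measuring the dependence for each point in the range $R_{X,Y}$, we may measure it for a window $W \subset R_X$. The dependence between $X$ and $Y$ in the window $W$, i.e., for $X \in W$, may be measured by the $\eta$ coefficient defined as

$$\eta(Y|X \in W) = \frac{E\left[D_{KL}(f(\cdot|X) \,\|\, h(\cdot)) \mid X \in W\right]}{E\left[H(f(\cdot|X), h(\cdot)) \mid X \in W\right]} \qquad (3)$$

in which $D_{KL}(\cdot \,\|\, \cdot)$ is the Kullback-Leibler divergence [10] and $H(\cdot, \cdot)$ is the cross-entropy [5] ($H(f(\cdot|x), h(\cdot))$ means the cross-entropy between the conditional distribution of $Y$ given $X = x$ and the marginal distribution of $Y$). The coefficient (3) compares the conditional probability of $Y$ given $X = x$, $x \in W$, with the marginal probability of $Y$, so that the greater the coefficient, the more distant the conditional probability is from the marginal one and, therefore, the greater is the influence of the event $\{X \in W\}$ in $Y$. Note that, analogously to the MI, we may write ($H(f(\cdot|x))$ means the Entropy of the conditional distribution of $Y$ given $X = x$)

$$\eta(Y|X \in W) = \frac{E\left[H(f(\cdot|X), h(\cdot)) \mid X \in W\right] - E\left[H(f(\cdot|X)) \mid X \in W\right]}{E\left[H(f(\cdot|X), h(\cdot)) \mid X \in W\right]}$$

and we have that $0 \leq \eta(Y|X \in W) \leq 1$, that $\eta(Y|X \in W) = 0$ if, and only if, $h(y) = f(y|x)$ for all $x \in W$ and $y \in R_Y$, and that $\eta(Y|X \in W) = 1$ if, and only if, there exists a function $Q: W \to R_Y$ such that $P(Y = Q(X) \mid X \in W) = 1$. Observe that the $\eta$ coefficient of a window is also a local dependence quantifier, although its resolution is lower than that of the LF if the cardinality of $W$ is greater than one. Also note that the coefficient (3) is an extension of (2) to all subsets (windows) of $R_X$, as $R_X$ is itself a window.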
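The window coefficient can be sketched numerically under an explicit assumption: that it is the ratio between the expected Kullback-Leibler divergence and the expected cross-entropy, both taken over the profiles in the window. This reading follows the description above, but the exact form used here is an assumption, as are the function name `eta_window` and the toy distribution.

```python
from math import log

def eta_window(joint, window):
    """Assumed window coefficient: expected KL divergence over expected cross-entropy.

    joint maps (x, y) to P(X = x, Y = y); window is a set of values of X.
    The normalizing constant P(X in W) cancels in the ratio, so it is omitted.
    """
    g, h = {}, {}
    for (x, y), p in joint.items():
        g[x] = g.get(x, 0.0) + p
        h[y] = h.get(y, 0.0) + p
    num = den = 0.0
    for (x, y), p in joint.items():
        if x in window and p > 0:
            num += p * log(p / (g[x] * h[y]))  # g(x) * f(y|x) * log(f(y|x) / h(y))
            den -= p * log(h[y])               # g(x) * f(y|x) * (-log h(y))
    return num / den if den > 0 else 0.0

# Toy (hypothetical) joint: Y is determined by X when X = 1, but not when X = 0.
joint = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.5}
low = eta_window(joint, {0})       # window where Y is not determined by X
high = eta_window(joint, {1})      # window where Y is a function of X
whole = eta_window(joint, {0, 1})  # whole range: reduces to I(X, Y) / H(Y)
print(low, whole, high)
```

With the whole range as the window, the sketch reduces to the global normalized MI, which is the consistency one would expect between the local and global coefficients.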

The three dependence coefficients presented, when analysed collectively, measure variable dependence in all kinds of resolutions: from the low resolution of the MI, through the middle resolutions of the windows $W \subset R_X$, to the high resolution of the LF. Indeed, the $\eta$ coefficients and the LF define a dependence scale in $R_X$, which we call LLDS, that gives a dependence measure for each subset $W \subset R_X$. This scale may be useful for various purposes and we outline some of them in the following paragraphs.

2.1 Potential applications of the Local Lift Dependence Scale

The LLDS, more specifically the LF, is relevant in frameworks in which we want to choose a set of elements, e.g., people, in order to apply some kind of treatment to them, obtaining some kind of response $Y$, and are interested in maximizing the number of elements with a given response $y$. In this scenario, given the features $X$, the LF provides the set of elements that must be chosen, which is the set whose elements have profile $x$ such that $L(x,y)$ is greatest. Formally, we must choose the elements whose profile is

$$x_{opt}(y) = \arg\max_{x \in R_X} L(x,y)$$

Indeed, if we choose $n$ elements randomly from our population, we expect that $n \times P(Y = y)$ of them will have the desired response. However, if we choose $n$ elements from the population of all elements with profile $x_{opt}(y)$, then we expect that $n \times P(Y = y \mid X = x_{opt}(y))$ of them will have the desired response, which is $[L(x_{opt}(y), y) - 1] \times n \times P(Y = y)$ more elements when compared with sampling from the whole population. Observe that this framework is the exact opposite of the classification problem. In the classification problem, we want to classify an instance given its profile $x$ into a class $y$, which may be, for example, the class $y$ such that $P(Y = y \mid X = x)$ is maximum. On the other hand, in this framework, we are interested in, given a $y$, finding the profile $x$ such that $P(Y = y \mid X = x)$ is maximum. In the applications section we further discuss the differences between this framework and the classification problem, and how the LLDS may be applied to both.
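The sampling argument above can be sketched with a small numerical example. It assumes the joint probability function is known and picks, for a fixed response $y$, the profile with the greatest lift; the function name `best_profile`, the profiles `"a"` and `"b"`, and all numbers are hypothetical.

```python
def best_profile(joint, y):
    """Returns (x_opt, lift at (x_opt, y), P(Y = y)) for a joint probability dict."""
    g, h_y = {}, 0.0
    for (x, response), p in joint.items():
        g[x] = g.get(x, 0.0) + p
        if response == y:
            h_y += p
    lifts = {x: joint.get((x, y), 0.0) / (g[x] * h_y) for x in g}
    x_opt = max(lifts, key=lifts.get)
    return x_opt, lifts[x_opt], h_y

# Toy (hypothetical) market: two profiles, response 1 is the desired one.
joint = {("a", 0): 0.4, ("a", 1): 0.1, ("b", 0): 0.2, ("b", 1): 0.3}
x_opt, top_lift, h_y = best_profile(joint, y=1)

n = 1000
expected_random = n * h_y               # sampling from the whole population
expected_targeted = n * top_lift * h_y  # sampling only from profile x_opt
extra = (top_lift - 1) * n * h_y        # gain obtained by targeting
print(x_opt, expected_random, expected_targeted, extra)
```

In this toy market, targeting the best profile raises the expected number of desired responses from about 400 to about 600 out of 1000 sampled elements.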

Furthermore, the $\eta$ coefficient is relevant in scenarios in which we want to understand the influence of $X$ in $Y$ by region, i.e., for each subset of $R_X$. As an example of such a framework, consider a grayscale image, in which $X = (X_1, X_2)$ represents the pixels of the image and $Y$ is the random variable whose distribution is the distribution of the colors in the picture, i.e., $P(Y = y) = n_y / n$, in which $n_y$ is the number of pixels whose color is $y$ and $n$ is the total number of pixels in the image. If we define the distribution of $Y$ given $X = (x_1, x_2)$ properly for all $(x_1, x_2) \in R_X$, we may calculate $\eta(Y|X \in W)$, $W \subset R_X$, in order to determine the regions that are a representation of the whole picture, i.e., whose color distribution is the same as that of the whole image, and the regions whose color distribution differs from that of the whole image. The $\eta$ coefficient may be useful for identifying textures and recognizing patterns in images.

Lastly, the LLDS may be used for feature selection when we are not only interested in selecting the features $\chi$ that are most related to $Y$, but also want to determine the features $\chi$ whose levels $W \subset R_\chi$ most influence $Y$. In the same manner, we may want to select the features $\chi$ whose level $x \in R_\chi$ maximizes $L(x,y)$, for a given $y$, so that we may sample from the population of elements with profile $x$ in order to maximize the number of elements of the class $y$. Feature selection based on the LLDS is a special case of the classical and multi-resolution approaches to feature selection, as presented next.

3 Feature Selection Algorithms based on the Local Lift Dependence Scale

In this section we present the characteristics of feature selection algorithms based on the LLDS. We first outline the special case of the classical approach to feature selection that is based on the LLDS, and then propose multi-resolution feature selection algorithms that are also based on the LLDS.

3.1 Classical Feature Selection Algorithm

Let $X = (X_1, \dots, X_m)$ and $Y$ be random variables. We call the random variables in $X$ features and note that $\mathcal{P}(X)$, the set of all feature vectors whose features are also in $X$, may be seen as a BLFS, in which each vector represents a subset of features. In this scheme, feature selection is given by the minimization, in the BLFS, of a cost function applied to the estimated joint probability of a feature vector and $Y$. In fact, the subset of features selected by this approach is given by

$$\chi^{*} = \arg\min_{\chi \in \mathcal{P}(X)} C_Y(\chi)$$

in which $C_Y$ is a cost function. The estimated error of a predictor as presented in [18, Chapter 2], for example, is a classical cost function. Another classical cost function is the CE as defined in (1). A pseudo-code for such an algorithm is presented in Algorithm 1. Algorithm 1 is naive, as it performs an exhaustive search on the BLFS, and the underlying problem is known to be NP-hard [1]. However, other algorithms may be applied to find a sub-optimal solution to this problem, such as sequential selection algorithms and floating search methods [15, 29, 20, 27, 26, 16], or the search space may be restricted to a subspace of $\mathcal{P}(X)$. Nevertheless, there are algorithms, such as the branch-and-bound [17] and the u-curve [22, 23, 2], that do not perform an exhaustive search, but ensure that the selected subset of features is optimal.

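Require: $Y$, $X$, cost function $C_Y$
Initialize: $m \leftarrow +\infty$
for $\chi \in \mathcal{P}(X)$ do
   if $C_Y(\chi) < m$ then
      $m \leftarrow C_Y(\chi)$
      $\chi^{*} \leftarrow \chi$
   end if
end for
return $\chi^{*}$
Algorithm 1 Select $\chi \in \mathcal{P}(X)$ that minimizes $C_Y(\chi)$.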
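The exhaustive search of Algorithm 1 may be sketched as follows, under a few assumptions that are not part of the original text: the sample is a list of tuples whose last entry is the value of $Y$, the cost function is the plug-in estimate of the Conditional Entropy, the empty feature set is skipped, and ties are broken in favour of the first (smallest) subset found. The names `conditional_entropy` and `exhaustive_selection` and the toy sample are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log

def conditional_entropy(rows, subset):
    """Plug-in estimate of H(Y | chi); rows are (x_1, ..., x_m, y) tuples."""
    n = len(rows)
    joint = Counter((tuple(r[i] for i in subset), r[-1]) for r in rows)
    marginal = Counter(tuple(r[i] for i in subset) for r in rows)
    return -sum(c / n * log(c / marginal[x]) for (x, _), c in joint.items())

def exhaustive_selection(rows, cost):
    """Walks through every non-empty feature subset and keeps the one of minimum cost."""
    m = len(rows[0]) - 1
    best, best_cost = None, float("inf")
    for k in range(1, m + 1):
        for subset in combinations(range(m), k):
            value = cost(rows, subset)
            if value < best_cost:  # strict inequality keeps the first subset on ties
                best, best_cost = subset, value
    return best, best_cost

# Toy (hypothetical) sample: Y equals the first feature; the second is noise.
rows = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
best, best_cost = exhaustive_selection(rows, conditional_entropy)
print(best, best_cost)
```

The number of subsets visited grows as $2^m - 1$, which is why this sketch is only practical for a handful of features.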

As an example of the classical approach to feature selection, suppose that $X = (X_1, X_2)$, in which $X_1$ and $X_2$ are Boolean features. Then, the search space $\mathcal{P}(X)$ may be represented by a tree, i.e., a BLFS, as the one displayed in Figure 1, considering only the circle nodes and solid lines. Algorithm 1 may be performed by walking through this tree seeking the minimum of $C_Y$.

Figure 1: Example of multi-resolution tree (CBLOP) for feature selection. The circle nodes and solid lines form a BLFS. The rectangular nodes and dashed lines around each circle node form a Boolean lattice. The whole tree is a CBLOP.

3.2 Multi-resolution Feature Selection based on the Local Lift Dependence Scale

Feature selection based on the LLDS may be performed in three distinct resolutions. As a low resolution approach, we may select the features that are most globally related to $Y$, which are given by

$$\chi^{*} = \arg\max_{\chi \in \mathcal{P}(X)} \eta(Y|\chi) \qquad (4)$$

Note that, in this resolution, the feature selection approach is the classical one, with the $\eta$ in (2) defining the cost function, i.e., Algorithm 1 may be applied to determine (4) taking the inverse of $\eta(Y|\chi)$ as $C_Y(\chi)$. The use of the MI as a cost function in classical feature selection algorithms is quite common in the literature (see [19, 25, 11] for example) and is not original to this paper. The search space of (4) may be restricted, sub-optimal algorithms may be applied or the discretization of the continuous features may be performed jointly, so that the subset selected in (4) is not always the subset of all features. In the applications section we show how the continuous features may be discretized jointly.

Increasing the resolution, we may be interested in finding not the features most related to $Y$, but the feature levels that most influence $Y$. In this approach the selected features and their levels are

$$(\chi^{*}, W^{*}) = \arg\max_{\chi \in \mathcal{P}(X),\, W \subset R_\chi} \eta(Y|\chi \in W) \qquad (5)$$

A pseudo-code for this feature selection approach is presented in Algorithm 2. Note that the space in which the exhaustive search is conducted in Algorithm 2, i.e., the CBLOP, is even greater than the one in Algorithm 1. However, optimal algorithms that do not exhaustively search the space, and sub-optimal algorithms, may also be applied in this scenario, saving some computational time. Note that this approach is not suitable for the case in which the features in $X$ are continuous, as $\mathcal{P}(R_\chi)$, $\chi \in \mathcal{P}(X)$, is uncountable, although the continuous features may be discretized, allowing the application of the algorithm. Furthermore, as is further discussed in the applications section, this algorithm is subject to overfitting if the sample size is not relatively large (as are the majority of statistical models and feature selection algorithms), so that it may be necessary to restrict its search space to a subset of the CBLOP.

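Require: $Y$, $X$
Initialize: $m \leftarrow -\infty$
for $\chi \in \mathcal{P}(X)$ do
   for $W \in \mathcal{P}(R_\chi)$ do
      if $\eta(Y|\chi \in W) > m$ then
         $m \leftarrow \eta(Y|\chi \in W)$
         $\chi^{*} \leftarrow \chi$
         $W^{*} \leftarrow W$
      end if
   end for
end for
return $(\chi^{*}, W^{*})$
Algorithm 2 Select $\chi \in \mathcal{P}(X)$ and $W \subset R_\chi$ that maximize $\eta(Y|\chi \in W)$.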
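A minimal sketch of the exhaustive search of Algorithm 2 is given below. It relies on several assumptions: the window coefficient is taken to be the ratio between the expected Kullback-Leibler divergence and the expected cross-entropy over the window, estimated by plugging in sample frequencies; windows are the non-empty subsets of the observed range; and ties keep the first pair found. The names `eta_window` and `multires_selection` and the toy sample are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log

def profile(row, subset):
    """Values of the selected features for one observation."""
    return tuple(row[i] for i in subset)

def eta_window(rows, subset, window):
    """Plug-in estimate of the (assumed) window coefficient for features `subset`."""
    n = len(rows)
    joint = Counter((profile(r, subset), r[-1]) for r in rows)
    g = Counter(profile(r, subset) for r in rows)
    h = Counter(r[-1] for r in rows)
    num = den = 0.0
    for (x, y), c in joint.items():
        if x in window:
            num += c / n * log(c * n / (g[x] * h[y]))  # KL part
            den -= c / n * log(h[y] / n)               # cross-entropy part
    return num / den if den > 0 else 0.0

def multires_selection(rows):
    """Exhaustive search over feature subsets and non-empty windows of their range."""
    m = len(rows[0]) - 1
    best = (None, None, float("-inf"))
    for k in range(1, m + 1):
        for subset in combinations(range(m), k):
            levels = sorted({profile(r, subset) for r in rows})
            for size in range(1, len(levels) + 1):
                for window in combinations(levels, size):
                    value = eta_window(rows, subset, set(window))
                    if value > best[2]:
                        best = (subset, set(window), value)
    return best

# Toy (hypothetical) sample of (x_1, x_2, y) observations.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1), (2, 0, 1), (2, 1, 1)]
chi, W, value = multires_selection(rows)
print(chi, W, value)
```

With such a small sample several windows attain the maximum because the response is determined by the profile inside them, which illustrates the overfitting risk mentioned above when cells contain few observations.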

Finally, as a higher resolution approach, we may fix a $y \in R_Y$ and then look for the feature levels that maximize the LF at the point $y$. Formally, the selected features and levels are given by

$$(\chi^{*}, x^{*}) = \arg\max_{\chi \in \mathcal{P}(X),\, x \in R_\chi} L_{(\chi,Y)}(x,y) \qquad (6)$$

A pseudo-code to perform (6) is presented in Algorithm 3, which is analogous to Algorithm 2. Note that the search space of Algorithm 3 is greater than that of Algorithm 1 and smaller than that of Algorithm 2. Nevertheless, it has the same general characteristics as Algorithm 2: optimal algorithms that do not search the whole space, and sub-optimal algorithms, may be applied; it cannot be applied to continuous features; and it is subject to overfitting.

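Require: $Y$, $X$, fixed $y \in R_Y$
Initialize: $m \leftarrow -\infty$
for $\chi \in \mathcal{P}(X)$ do
   for $x \in R_\chi$ do
      if $L_{(\chi,Y)}(x,y) > m$ then
         $m \leftarrow L_{(\chi,Y)}(x,y)$
         $\chi^{*} \leftarrow \chi$
         $x^{*} \leftarrow x$
      end if
   end for
end for
return $(\chi^{*}, x^{*})$
Algorithm 3 Select $\chi \in \mathcal{P}(X)$ and $x \in R_\chi$ that maximize $L_{(\chi,Y)}(x,y)$ for some fixed $y \in R_Y$.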
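The search of Algorithm 3 may be sketched in the same spirit, assuming the lift is estimated by sample frequencies, the fixed response is observed at least once in the sample, and ties keep the first pair found. The function name `max_lift_selection` and the toy sample are hypothetical.

```python
from collections import Counter
from itertools import combinations

def max_lift_selection(rows, y):
    """Exhaustive search for the feature subset and profile with the greatest lift at y.

    rows are (x_1, ..., x_m, y) tuples; y is assumed to occur in the sample.
    """
    n = len(rows)
    m = len(rows[0]) - 1
    h_y = sum(1 for r in rows if r[-1] == y)
    best = (None, None, float("-inf"))
    for k in range(1, m + 1):
        for subset in combinations(range(m), k):
            g = Counter(tuple(r[i] for i in subset) for r in rows)
            hits = Counter(tuple(r[i] for i in subset) for r in rows if r[-1] == y)
            for x in sorted(g):
                value = hits[x] * n / (g[x] * h_y)  # estimated f(x, y) / (g(x) h(y))
                if value > best[2]:
                    best = (subset, x, value)
    return best

# Toy (hypothetical) sample of (x_1, x_2, y) observations.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1), (2, 0, 1), (2, 1, 1)]
chi, x_star, value = max_lift_selection(rows, y=0)
print(chi, x_star, value)
```

Here the greatest lift is attained at a profile of the full feature vector that was observed only once, a small-scale illustration of why the search space may need to be restricted when the sample is small.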

As an example of a multi-resolution approach to feature selection based on the LLDS, suppose again that $X = (X_1, X_2)$ are Boolean features. Then, for all the proposed resolutions, the selection of the features and their levels, i.e., Algorithms 1, 2 and 3, may be performed by walking through the tree (CBLOP) in Figure 1. Indeed, we may calculate the global $\eta$ at the circle nodes, the $\eta$ on all windows at the rectangular nodes and the LF at the leaves, where we may determine its maximum for a fixed value $y$. Therefore, we call a tree such as the one in Figure 1 a multi-resolution tree for feature selection, where we may apply feature selection algorithms for all the resolutions of the LLDS, i.e., Algorithms 1, 2 and 3.

4 Applications

The multi-resolution approach proposed in the previous sections is now applied to three different datasets. First, we apply it to the performances dataset, that consists of student performances on entrance exams and undergraduate courses. Then, we apply the algorithms to two UCI Machine Learning Repository datasets: the Congressional Voting Records and Covertype datasets [13].

4.1 Performances dataset

A recurrent issue in universities all over the world is the design of their recruitment process, i.e., the manner of selecting their undergraduate students. In Brazilian universities, for example, the recruitment of undergraduate students is based solely on their performance on exams that cover high school subjects, called vestibulares, so that knowing which subjects are most related to the performance on undergraduate courses is a matter of great importance to university admission offices, as it is important to optimize the recruitment process in order to select the students that are most likely to succeed. Therefore, in this scenario, the algorithms presented in the previous sections may be a useful tool for determining the entrance exam subjects, and the performances on them, that are most related to the performance on undergraduate courses, so that students may be selected based on their performance on these subjects.

The recruitment of students to the University of São Paulo is based on an entrance exam that consists of an essay and questions on eight subjects: Mathematics, Physics, Chemistry, Biology, History, Geography, English and Portuguese. The selection of students is entirely based on this exam, although the weights of the subjects differ from one course to another. In the exact sciences courses, such as Mathematics, Statistics, Physics, Computer Science and Engineering, the subjects with greater weights are Portuguese, Mathematics and Physics, as those are the subjects that are qualitatively most related to what is taught in these courses. Although weights are given to each subject in a systematic manner, it is not known which subjects are indeed most related to the performance on undergraduate courses. Therefore, it is of interest to measure the relation between the performance on exam subjects and on undergraduate courses and, in order to do so, we apply the algorithms proposed in the previous sections.

The dataset to be considered consists of 8,353 students that enrolled in 28 courses of the University of São Paulo between 2011 and 2016. The courses are those of its Institute of Mathematics and Statistics, Institute of Physics and Polytechnic School, and are in general Mathematics, Computer Science, Statistics, Physics and Engineering courses. The variable of interest (Y) is the weighted mean grade of the students on the courses they attended in their first year at the university (the weights being the course credits), and is a number between zero and ten. The features, denoted by $X$, are the performances on each one of the eight entrance exam subjects, which are numbers between zero and one, and the performance on the essay, which is a number between zero and one hundred.

In order to apply the proposed algorithms to the data at hand, it is necessary to conveniently discretize the variables and, to do so, we take into account an important characteristic of the data: the scale of the performances. The scale of the performances, both on the entrance exam and on the undergraduate courses, depends on the course and the year. Indeed, the performance on the entrance exam of students of competitive courses is better, as only the students with high performance are able to enrol in these courses. In the same way, the performances differ from one year to another, as the entrance exam is not the same every year and the teachers of the first year courses also change from one year to another, which causes the scale of the grades to change. Therefore, we discretize all variables by tertiles inside each year and course, i.e., we take the tertiles considering only the students of a given course and year. Furthermore, we do not discretize each variable by itself, but rather discretize the variables jointly, by a method based on distance tertiles, as follows.

Suppose that at a step of the algorithm we want to measure the relation between Y and the features $\chi \in \mathcal{P}(X)$. In order to do so, we discretize Y by the tertiles inside each course and year, e.g., a student is in the third tertile if he is among the top one third of the students of his class according to the weighted mean grade, and discretize the performance on $\chi$ jointly, i.e., by discretizing the distance between the performance of each student on these subjects and zero by its tertiles. Indeed, students whose performance is close to zero have low joint performance on the subjects $\chi$, while those whose performance is far from zero have high joint performance on the subjects $\chi$. Therefore, we take the distance between each performance and zero, and then discretize the distance inside each course and year, e.g., a student is in the first tertile if he is among the bottom one third of the students of his class according to the joint performance on the subjects $\chi$. The Mahalanobis distance [14] is used, as it takes into account the variance and covariance of the performance on the subjects $\chi$.
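The joint discretization described above may be sketched for two subjects and a single course and year. The sketch assumes that the covariance matrix is the sample covariance of the group, that the distance is taken from each performance to the zero vector, and that tertiles are assigned by rank; these choices, the function names and the toy scores are hypothetical.

```python
from math import sqrt

def mahalanobis_to_zero(points):
    """Mahalanobis distance between each 2-D point and zero, using the sample covariance."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # assumed non-zero (performances not collinear)
    # x' S^{-1} x written out for the 2x2 case.
    return [sqrt((syy * x * x - 2 * sxy * x * y + sxx * y * y) / det) for x, y in points]

def tertiles(values):
    """Assigns tertile labels 1, 2, 3 by rank within the group."""
    order = sorted(range(len(values)), key=values.__getitem__)
    labels = [0] * len(values)
    for rank, index in enumerate(order):
        labels[index] = 1 + (3 * rank) // len(values)
    return labels

# Toy (hypothetical) Mathematics and Physics scores of six students of one class.
scores = [(0.2, 0.3), (0.4, 0.1), (0.5, 0.6), (0.7, 0.5), (0.9, 0.8), (0.6, 0.9)]
distances = mahalanobis_to_zero(scores)
labels = tertiles(distances)
print(labels)
```

For more than one course or year, the same two functions would be applied separately inside each group, so that each group has its own covariance and its own cut points.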

As an example, suppose that we want to measure the relation between the performances on Mathematics and Physics and the weighted mean grade of students that enrolled in the Statistics undergraduate course in 2011 and 2012. In order to do so, we discretize the weighted mean grade by year and the performance on Mathematics and Physics by the Mahalanobis distance between it and zero, also by year, as is displayed in Figure 2. Observe that each year has its own ellipses that partition the performance on Mathematics and Physics in three and the tertile of a student depends on the ellipses of his year. The process used in Figure 2 is extended to the case in which there are more than two subjects and one course. When there is only one subject, the performance is discretized in the usual manner inside each course and year. The LF between the weighted mean grade and the joint performance on Mathematics and Physics is presented in Table 1. From this table we may search for the maximum lift or calculate the coefficient for its windows. In this example, we have101010. .

Figure 2: Discretization of the joint performance on Mathematics and Physics of Statistics students that enrolled at the University of São Paulo in 2011 and 2012 by the tertiles of the Mahalanobis distance inside each year.
Mathematics            Weighted Mean Grade                       Relative
and Physics            Tertile 1     Tertile 2     Tertile 3     Frequency
Tertile 1              0.975 (9)     1.46 (13)     0.563 (5)     0.34
Tertile 2              1.01 (9)      0.935 (8)     1.05 (9)      0.33
Tertile 3              1.01 (9)      0.584 (5)     1.4 (12)      0.33
Relative Frequency     0.342         0.329         0.329         1
Table 1: The Lift Function between the weighted mean grade, discretized by year, and the joint performance on Mathematics and Physics, discretized by the Mahalanobis distance inside each year, of Statistics students that enrolled at the University of São Paulo in 2011 and 2012. The numbers in parentheses represent the number of students in each category.
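The lift values of Table 1 can be recovered from the student counts shown in parentheses. The sketch below assumes the lift is estimated as the joint relative frequency divided by the product of the marginal relative frequencies, and it also computes the global normalized MI for the same table; the variable names are hypothetical.

```python
from math import log

# Student counts from Table 1: rows are Mathematics and Physics tertiles,
# columns are weighted mean grade tertiles.
counts = [[9, 13, 5], [9, 8, 9], [9, 5, 12]]

n = sum(map(sum, counts))
row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]

# Estimated lift: (count / n) / ((row total / n) * (column total / n)).
lift = [[counts[i][j] * n / (row_totals[i] * col_totals[j]) for j in range(3)]
        for i in range(3)]

# Global coefficient: mutual information divided by the entropy of the grade tertiles.
mi = sum(counts[i][j] / n * log(lift[i][j]) for i in range(3) for j in range(3))
h_y = -sum(c / n * log(c / n) for c in col_totals)
eta = mi / h_y

for row in lift:
    print([round(value, 3) for value in row])
print(eta)
```

Rounded to three significant digits, the printed lifts agree with the entries of Table 1.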

The proposed algorithm is applied to the discretized variables using three cost functions. First, we use the coefficient on the window that represents the whole range of the features in order to determine which subjects (features) are most related to the weighted mean grade, i.e., the features given by (4). Then, we apply the algorithm using as cost function the coefficient for all windows in order to determine which subject performances (features and window) are most related to the weighted mean grade, i.e., the subjects and performances given by (5). Finally, we determine which subjects and performances most lift the third tertile of the weighted mean grade, i.e., the subjects and performances given by (6), taking the third tertile as the value of interest of the weighted mean grade.

The subjects that are most related to the weighted mean grade, according to the proposed discretization process and the coefficient (2), are Mathematics, Physics, Chemistry, Biology and Portuguese. The LF between the weighted mean grade and the joint performance on these subjects is presented in Table 2. These features are the ones that are in general most related to the weighted mean grade, i.e., they are the output of the classical feature selection algorithm that employs the inverse of the global coefficient as the cost function (Algorithm 1). Therefore, the recruitment of students could be optimized by taking into account only these subjects.

| Performance in the selected subjects | Weighted Mean Grade: Tertile 1 | Weighted Mean Grade: Tertile 2 | Weighted Mean Grade: Tertile 3 | Relative Frequency |
| --- | --- | --- | --- | --- |
| Tertile 1 | 1.33 (1,277) | 1.1 (1,018) | 0.566 (533) | 0.34 |
| Tertile 2 | 0.992 (921) | 1.06 (951) | 0.954 (871) | 0.33 |
| Tertile 3 | 0.669 (630) | 0.848 (775) | 1.49 (1,377) | 0.33 |
| Relative Frequency | 0.339 | 0.329 | 0.333 | 1 |

Table 2: The Lift Function between the weighted mean grade and the joint performance on Mathematics, Physics, Chemistry, Biology and Portuguese. The numbers in parentheses represent the quantity of students in each category.

Applying Algorithms 2 and 3, we obtain the same result: the performance, i.e., window, that is most related to the weighted mean grade and that most lifts the third tertile of the weighted mean grade is the third tertile in Mathematics, for which the lift on the third tertile of the weighted mean grade is 1.51. The LF between the weighted mean grade and the performance on Mathematics is presented in Table 3.

| Performance in Mathematics | Weighted Mean Grade: Tertile 1 | Weighted Mean Grade: Tertile 2 | Weighted Mean Grade: Tertile 3 | Relative Frequency |
| --- | --- | --- | --- | --- |
| Tertile 1 | 1.3 (1,398) | 1.06 (1,111) | 0.631 (667) | 0.38 |
| Tertile 2 | 0.935 (843) | 1.11 (972) | 0.956 (847) | 0.32 |
| Tertile 3 | 0.689 (587) | 0.8 (661) | 1.51 (1,267) | 0.30 |
| Relative Frequency | 0.339 | 0.329 | 0.333 | 1 |

Table 3: The Lift Function between the weighted mean grade and the performance on Mathematics. The numbers in parentheses represent the quantity of students in each category.

The output of the algorithms provides relevant information to the admission office of the University. Indeed, it is now known that the subjects that are most related to the performance on the undergraduate courses are Mathematics, Physics, Chemistry, Biology and Portuguese. Furthermore, in order to optimize the number of students that will succeed in the undergraduate courses, the office should select those that have high performance on Mathematics, as it lifts by more than 50% the probability of the student also having a high performance on the undergraduate course, i.e., students with high performance on Mathematics are prone to have high performance on the undergraduate course. Although the subjects that are most related to the performance on the courses are obtained from the classical feature selection algorithm, only the LLDS identifies which performance on the entrance exam is most related to success on the undergraduate course, namely high performance on Mathematics. Therefore, feature selection algorithms based on the LLDS provide more information than the classical feature selection algorithm, as they have a greater resolution and take into account the local relation between the variables.

4.2 Congressional Voting Records dataset

The Congressional Voting Records dataset consists of 435 instances of 16 Boolean features and a Boolean variable that indicates the party of the instance (democrat or republican). The features indicate how the instance voted (yes or no) in 1984 on each of 16 matters, which are displayed in Table 4. Algorithm 3 is applied to this dataset in order to determine which voting profiles are most prone to be that of a republican and that of a democrat.

| ID | Matter (Feature) |
| --- | --- |
| HI | Handicapped infants |
| WP | Water project cost sharing |
| AB | Adoption of the budget resolution |
| PF | Physician fee freeze |
| SA | El Salvador aid |
| RG | Religious groups in schools |
| ST | Anti satellite test ban |
| AN | Aid to Nicaraguan contras |
| MM | MX missile |
| IM | Immigration |
| SC | Synfuels corporation cutback |
| ES | Education spending |
| SR | Superfund right to sue |
| CR | Crime |
| DF | Duty free exports |
| EA | Export administration act South Africa |

Table 4: Features of the Congressional Voting Records dataset.

As the number of instances is relatively small, we perform Algorithm 3 under a restriction that avoids overfitting. Indeed, if we apply the algorithm without the restriction, then the chosen profiles are those in which all the instances are of the same party. If there are only a couple of instances with some profile, and all of them are of the same party, then this profile is chosen as a prone one for the party. However, we do not know if the profile is really prone, i.e., everybody with it is in fact of the same party, or if the fact that everybody with that profile is of the same party is just a sample deviation. In other words, without the restriction, the estimation error of the LF is too great, as some profiles have low frequency in the sample, and the feature selection algorithm overfits.

Therefore, we restrict the search space to the profiles whose relative frequency in the sample is at least a fixed threshold. In other words, for each party we select the profiles that maximize the LF among those satisfying this restriction, in which the probability of a profile is estimated by its relative frequency. The selected profiles, their LF value and the sample size considered are presented in Table 5. At each iteration of the algorithm, only the instances that have no missing data in the features being considered are taken into account when calculating the LF, so that the sample size used at each iteration is not the same.
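The restricted search can be illustrated for one fixed set of features as follows. This is a minimal sketch and not the authors' implementation: the function name, the `min_freq` parameter, the use of `None` for a missing vote and the toy records are all assumptions made for the illustration, and the relative frequency of a profile is computed among the complete cases only.

```python
from collections import Counter


def most_prone_profiles(records, target, min_freq):
    """Rank profiles by their lift on `target`, most lifted first.

    `records` is a list of (profile, label) pairs, a profile being a tuple
    of votes. Records with a missing vote (None) are dropped, and profiles
    whose relative frequency is below `min_freq` are not considered.
    """
    complete = [(p, label) for p, label in records if None not in p]
    n = len(complete)
    profile_counts = Counter(p for p, _ in complete)
    target_counts = Counter(p for p, label in complete if label == target)
    p_target = sum(target_counts.values()) / n
    if p_target == 0:
        raise ValueError("target class absent from the complete cases")
    ranked = []
    for profile, count in profile_counts.items():
        if count / n < min_freq:
            continue  # too rare: its estimated lift is unreliable
        lift = (target_counts[profile] / count) / p_target
        ranked.append((lift, profile))
    ranked.sort(reverse=True)
    return ranked


# Toy sample (hypothetical): the rare profile ('y', 'y') is purely republican.
records = (
    [(("y", "n"), "democrat")] * 4
    + [(("n", "y"), "democrat")]
    + [(("n", "y"), "republican")] * 3
    + [(("y", "y"), "republican")]
    + [(("y", None), "democrat")]  # dropped: missing vote
)
unrestricted = most_prone_profiles(records, "republican", min_freq=0.0)
restricted = most_prone_profiles(records, "republican", min_freq=0.2)
```

Without the restriction, the profile observed only once ranks first, because a single republican instance makes it look perfectly prone. With the restriction, it is discarded and the best supported profile is chosen instead.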

The profiles with maximum LF lift by 94% the probability of democrat and by around 165% the probability of republican. This difference in the lift is due to the fact that there are more democrats than republicans, so that the probability of democrat is greater and, therefore, cannot be lifted as much as the probability of republican can. The profiles in Table 5 present a wide view of the voting profile of democrats and republicans, which allows an understanding of what differentiates a democrat from a republican regarding their vote.
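The reason why a larger marginal probability limits the lift can be made explicit. Writing the lift of a profile x on a party y as the ratio between the conditional and the marginal probability (the symbols L, X and Y are used here only for illustration), and recalling that a conditional probability is at most one, we have:

```latex
L(x, y) = \frac{P(Y = y \mid X = x)}{P(Y = y)} \le \frac{1}{P(Y = y)}
```

Hence the greater the probability of a party, the smaller the greatest lift that any profile can attain for it, and the bound is reached only by profiles whose instances are all of that party.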

| Party | Features | LF | Profile | Sample Size |
| --- | --- | --- | --- | --- |
| democrat | (AB,PF,SA,RG,MM,ES,SR,EA) | 1.94 | (y,n,n,n,y,n,n,y) | 277 |
| | (HI,AB,PF,SA,RG,MM,ES,SR,EA) | 1.94 | (y,y,n,n,n,y,n,n,y) | 275 |
| | (AB,PF,RG,ST,MM,ES,SR,EA) | 1.94 | (y,n,n,y,y,n,n,y) | 279 |
| | (HI,AB,PF,RG,ST,MM,ES,SR,EA) | 1.94 | (y,y,n,n,y,y,n,n,y) | 277 |
| | (AB,PF,SA,RG,ST,MM,ES,SR,EA) | 1.94 | (y,n,n,n,y,y,n,n,y) | 276 |
| | (HI,AB,PF,SA,RG,ST,MM,ES,SR,EA) | 1.94 | (y,y,n,n,n,y,y,n,n,y) | 274 |
| | (AB,PF,SA,RG,MM,ES,SR,CR,EA) | 1.94 | (y,n,n,n,y,n,n,n,y) | 275 |
| | (AB,PF,RG,ST,MM,ES,SR,CR,EA) | 1.94 | (y,n,n,y,y,n,n,n,y) | 276 |
| | (AB,PF,SA,RG,ST,MM,ES,SR,CR,EA) | 1.94 | (y,n,n,n,y,y,n,n,n,y) | 274 |
| | (AB,PF,RG,ST,MM,ES,SR,DF,EA) | 1.94 | (y,n,n,y,y,n,n,y,y) | 269 |
| | (AB,PF,SA,RG,ST,MM,ES,SR,DF,EA) | 1.94 | (y,n,n,n,y,y,n,n,y,y) | 266 |
| republican | (WP,PF,SC,ES,CR) | 2.65 | (n,y,n,y,y) | 342 |
| | (AB,PF,AN,SC,CR,DF) | 2.64 | (n,y,n,n,y,n) | 369 |
| | (PF,AN,IM,ES,CR,DF) | 2.64 | (y,n,y,y,y,n) | 361 |
| | (PF,AN,SC,CR,DF) | 2.64 | (y,n,n,y,n) | 373 |
| | (AB,PF,AN,SC,ES) | 2.63 | (n,y,n,n,y) | 376 |
| | (HI,AB,PF,AN,SC,ES) | 2.63 | (n,n,y,n,n,y) | 373 |
| | (AB,PF,AN,SC,ES,CR) | 2.63 | (n,y,n,n,y,y) | 368 |
| | (HI,AB,PF,AN,SC,ES,CR) | 2.63 | (n,n,y,n,n,y,y) | 365 |
| | (PF,AN,SC,DF) | 2.63 | (y,n,n,n) | 380 |
| | (AB,PF,AN,SC,DF) | 2.63 | (n,y,n,n,n) | 376 |
| | (PF,AN,IM,ES,DF) | 2.63 | (y,n,y,y,n) | 368 |
| | (AB,PF,AN,SC,ES,DF) | 2.63 | (n,y,n,n,y,n) | 360 |
| | (HI,AB,PF,AN,SC,CR,DF) | 2.63 | (n,n,y,n,n,y,n) | 365 |
| | (PF,AN,SC,ES,CR,DF) | 2.63 | (y,n,n,y,y,n) | 356 |
| | (AB,PF,AN,SC,ES,CR,DF) | 2.63 | (n,y,n,n,y,y,n) | 353 |
| | (HI,AB,PF,AN,SC,ES,CR,DF) | 2.63 | (n,n,y,n,n,y,y,n) | 350 |

y = yes; n = no.

Table 5: Selected profiles obtained applying Algorithm 3 to the Congressional Voting Records dataset with the restriction that only the profiles with relative frequency greater than the threshold are considered. The instances with missing data were excluded at each iteration of the algorithm, i.e., the LF is calculated using only the instances that have all the observations on the features being considered.

This application to the Congressional Voting Records dataset sheds light on two interesting properties of the LLDS approach to feature selection in its higher resolution. First, this approach is indeed local, as we are not interested in selecting the features that best classify the representatives according to their party, but rather the voting profiles that are most prone to be that of a democrat or republican. Secondly, the problem treated here is the opposite of a classification problem. Indeed, in the classification problem, we are interested in classifying a representative according to their party, given their voting profile. On the other hand, the problem treated here is the exact opposite: given a party, we want to know which profiles of the representatives are most prone to be of that party. In other words, in the classification problem we want to determine the party given the voting profile, while in the LLDS problem we want to determine the voting profile given the party.

4.3 Covertype dataset

The Covertype dataset consists of 581,012 instances (terrains) of 54 features (10 continuous and 44 discrete) and a variable that indicates the cover type of the terrain (7 types). We apply Algorithms 1, 2 and 3 to select features among the continuous ones, which are displayed in Table 6. The features are discretized in the same way as in the performances dataset: by taking sample quantiles of the Mahalanobis distance between the features and zero. However, we now consider the 20%, 40%, 60% and 80% quantiles as cutting points, i.e., quintiles, instead of tertiles.

| ID | Feature |
| --- | --- |
| E | Elevation |
| A | Aspect |
| S | Slope |
| HH | Horizontal distance to hydrology |
| HR | Horizontal distance to roadways |
| HF | Horizontal distance to fire points |
| H9 | Hillshade 9am |
| HN | Hillshade Noon |
| H3 | Hillshade 3pm |
| VH | Vertical distance to hydrology |

Table 6: Features of the Covertype dataset that are considered in this application.

Applying Algorithm 1 we select the features (E, HH, HF), whose LF is presented in Table 7. We see that being in the first quintile of the selected features lifts classes 3, 4, 5 and 6; being in the second quintile lifts classes 2 and 5; being in the third quintile lifts class 2; being in the fourth quintile lifts class 1; and being in the fifth quintile lifts classes 1 and 7. From Table 7 we may interpret the relation between the selected features and the cover type. For example, we see that terrains with cover types 3, 4, 5 and 6 tend to have low joint values in the selected features, while terrains with cover type 7 tend to have high joint values in them. This example shows how the proposed approach allows us not only to select the features, but also to understand why these features were selected, i.e., what the relation between them and the cover type is, by analysing the local dependence between the variables.

| Quintile of (E, HH, HF) | Cover type 1 | Cover type 2 | Cover type 3 | Cover type 4 | Cover type 5 | Cover type 6 | Cover type 7 | Relative Frequency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Quintile 1 | 0.0766 (3,244) | 0.961 (54,473) | 4.94 (35,344) | 5 (2,747) | 1.78 (3,385) | 4.9 (17,010) | 0 (0) | 0.20 |
| Quintile 2 | 0.444 (18,816) | 1.6 (90,872) | 0.0573 (410) | 0 (0) | 2.98 (5,663) | 0.103 (357) | 0.0205 (84) | 0.20 |
| Quintile 3 | 0.949 (40,195) | 1.33 (75,562) | 0 (0) | 0 (0) | 0.234 (445) | 0 (0) | 0 (0) | 0.20 |
| Quintile 4 | 1.66 (70,427) | 0.8 (45,314) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0.112 (461) | 0.20 |
| Quintile 5 | 1.87 (79,158) | 0.301 (17,080) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 4.87 (19,965) | 0.20 |
| Relative Frequency | 0.365 | 0.488 | 0.0615 | 0.00473 | 0.0163 | 0.0299 | 0.0353 | 1 |

Table 7: The Lift Function between the cover type of the terrain and the features Elevation, Horizontal distance to hydrology and Horizontal distance to fire points discretized by the sample quintiles of the Mahalanobis distance to zero. The numbers in parentheses represent the sample size of each category.

Applying Algorithm 2 to this dataset we obtain the windows displayed in Table 8. We see that the windows that seem to most influence the cover type are the first and fifth quintiles of the features Elevation and Horizontal distance to hydrology. Indeed, all the top ten windows contain those two features, and either their first or fifth quintile. As we can see in Table 7, the influence of the fifth quintile of (E, HH, HF), the top window, is given by the fact that no terrain of types 3, 4, 5 and 6 is in this quintile. Note that, again, our approach allows a better interpretation of the selected features by the analysis of the local dependence between the features and the cover type.

| Features | Window | Coefficient |
| --- | --- | --- |
| (E,HH,HF) | Quintile 5 | 0.38 |
| (E,A,HH,HF) | Quintile 5 | 0.38 |
| (E,HH) | Quintile 5 | 0.37 |
| (E,A,HH) | Quintile 5 | 0.37 |
| (E,HH,VH,HF) | Quintile 5 | 0.37 |
| (E,A,HH,VH,HF) | Quintile 5 | 0.36 |
| (E,HH,HF) | Quintiles 1 & 5 | 0.36 |
| (E,A,HH,VH) | Quintile 5 | 0.36 |
| (E,HH,VH) | Quintile 5 | 0.36 |
| (E,HH) | Quintiles 1 & 5 | 0.36 |

Table 8: Features selected applying Algorithm 2 to the Covertype dataset.

Finally, applying Algorithm 3 we choose the profiles displayed in Table 9 for each cover type. We see, for example, that the profile most prone to be of type 1 is the fifth quintile of (E, HH, HF) and that the profile most prone to be of type 3 is the first quintile of (E, HH, HR, HF). Note that this does not mean that most of the terrains with these profiles are of types 1 and 3, but rather that the probability of a terrain with these profiles being of types 1 and 3, respectively, is 87% and 396% greater than the probability of a terrain for which we do not know the profile. Therefore, we see again the difference between the LLDS approach and the classification problem. In the LLDS approach, given a profile, we are interested in determining the types whose conditional probability given the profile is greater than their marginal probability, while in the classification problem, given a profile, we are interested in determining the type for which the conditional probability given the profile is the greatest.

As an example, consider the joint distribution that generated the LF of Table 7 and the profile Quintile 1. The maximum conditional probability given this profile is that of type 2 (approximately 0.47), while the maximum lift is that of type 4, although its conditional probability is only approximately 0.024. However, the conditional probability of type 4 given the profile, even though small in absolute terms, is relatively great: it is 5 times the marginal probability 0.00473. Therefore, on the one hand, if there is a new terrain whose profile is Quintile 1, we classify it as being of type 2. On the other hand, if we want to sample terrains from a population and are interested in maximizing the number of terrains of type 4, we may sample from the population with profile Quintile 1 instead of the whole population, expecting to sample roughly five times as many terrains of type 4, i.e., four times more.
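The contrast between the two criteria can be checked numerically from the Quintile 1 row of Table 7. The short sketch below uses the counts and the marginal relative frequencies printed in that table; the variable names are illustrative.

```python
cover_types = [1, 2, 3, 4, 5, 6, 7]

# Quintile 1 row of Table 7 (counts) and its last row (marginal frequencies).
quintile1_counts = [3244, 54473, 35344, 2747, 3385, 17010, 0]
marginal = [0.365, 0.488, 0.0615, 0.00473, 0.0163, 0.0299, 0.0353]

n_profile = sum(quintile1_counts)
conditional = [c / n_profile for c in quintile1_counts]
lift = [p / m for p, m in zip(conditional, marginal)]

# Classification picks the type with the greatest conditional probability,
# whereas the LLDS approach picks the type with the greatest lift.
most_probable = cover_types[conditional.index(max(conditional))]
most_lifted = cover_types[lift.index(max(lift))]
```

Here `most_probable` is type 2 and `most_lifted` is type 4, so the two criteria give different answers for the same profile.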

| Cover type | Features | Maximum LF | Profile |
| --- | --- | --- | --- |
| 1 | (E,HH,HF) | 1.87 | Quintile 5 |
| 2 | (E,HH,HR) | 1.63 | Quintile 2 |
| 3 | (E,HH,HR,HF) | 4.96 | Quintile 1 |
| 4 | (E,HH,HF) | 5 | Quintile 1 |
| 5 | (E,HR,HF) | 3.31 | Quintile 2 |
| 6 | (E,HH,HF) | 4.90 | Quintile 1 |
| 7 | (E,HH) | 4.89 | Quintile 5 |

Among other profiles.

Table 9: Profiles selected applying Algorithm 3 to the Covertype dataset for each cover type.

5 Final Remarks

The feature selection algorithms based on the LLDS extend the classical approach to feature selection to a higher resolution one, as they take into account the local dependence between the features and the variable of interest. Indeed, classical feature selection may be performed by walking through a tree in which each node is a vector of features, i.e., a BLFS, while feature selection based on the LLDS is established by walking through an extended tree, i.e., a CBLOP, in which inside each node there is another tree that represents the windows of the variables, as displayed in the example in Figure 1. Therefore, feature selection based on the LLDS increases the reach of feature selection algorithms to a new variety of applications.
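The difference between the two search spaces can be illustrated by enumerating them. The sketch below is a simplification under explicit assumptions: every feature set is discretized jointly into the same levels (as with the quintiles of the Covertype application), a window is any non-empty subset of those levels, and the empty feature set is left out. The function names are hypothetical.

```python
from itertools import combinations


def nonempty_subsets(items):
    """Yield every non-empty subset of `items` as a tuple."""
    items = list(items)
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


def cblop_nodes(features, levels):
    """Yield (feature set, window) pairs: one lattice of windows per feature set."""
    for feature_set in nonempty_subsets(features):
        for window in nonempty_subsets(levels):
            yield feature_set, window


features = ["E", "HH", "HF"]
levels = [1, 2, 3, 4, 5]  # quintiles of the joint discretization

blfs_size = sum(1 for _ in nonempty_subsets(features))
cblop_size = sum(1 for _ in cblop_nodes(features, levels))
```

With three features and five levels, the classical search visits 7 feature sets, while the extended search visits 7 * 31 = 217 pairs, which shows how quickly the extended space grows.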

The LLDS may treat a problem that is the opposite of that of classification, i.e., when we are interested in, given a class, finding the profile from whose population we may sample in order to maximize the number of instances of that class. Indeed, in the classification problem we want to do the exact opposite: classify an instance with known profile into one of the classes of the variable of interest. Therefore, although LLDS tools may also be applied to the classification problem (as they are in the literature), they are of great importance in problems that we may call the reverse engineering of the classification one. Thus, our approach broadens the application of feature selection algorithms to a new set of problems by the extension of their search spaces from BLFSs to CBLOPs.

The algorithm proposed in this paper may be optimized in order to not walk through the entire CBLOP, as its size increases exponentially with the number of features, so that the algorithm may not be computable for a great number of features. Moreover, the algorithms may be subject to overfitting if the sample size is relatively small, so that their search space may need to be restricted. The methods of [6, 9, 12, 8, 7, 4, 30, 15, 29, 20, 27, 26, 16, 17, 22, 23, 2], for example, may be adapted to the multi-resolution algorithms in order to optimize them. Furthermore, the properties of the coefficients and the LF must be studied in a theoretical framework, in order to establish their variances and sample distributions and to develop statistical methods to estimate and test hypotheses about them.

The LLDS adapts classical measures, such as the MI and the Kullback-Leibler Divergence, into coherent dependence coefficients that assess the dependence between random variables in multiple resolutions, presenting a wide view of the dependence between the variables. As it does not make any assumption about the kind of dependence, the LLDS measures the raw dependence between the variables and, therefore, may be relevant for numerous purposes, feature selection being just one of them. We believe that the algorithms proposed in this paper, and the LLDS in general, bring advances to the state of the art in dependence measuring and feature selection, and may be useful in various frameworks.

Supplementary Material

The following are available online at www.mdpi.com/link: an R [21] package called localift that performs the algorithms proposed in this paper; an R object that contains the results of the algorithms used to analyse the performances dataset; and R code that applies the algorithms to the Congressional Voting Records and Covertype datasets.

Acknowledgements

We would like to thank A. C. Hernandes who kindly provided the performances dataset used in the application section. The Covertype dataset is Copyrighted 1998 by Jock A. Blackard and Colorado State University.

References

  • [1] Edoardo Amaldi and Viggo Kann. On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 209(1-2):237–260, 1998.
  • [2] Esmaeil Atashpaz-Gargari, Marcelo S Reis, Ulisses M Braga-Neto, Junior Barrera, and Edward R Dougherty. A fast branch-and-bound algorithm for u-curve feature selection. Pattern Recognition, 73:172–188, 2018.
  • [3] David S Coppock. Why lift? data modeling and mining. Information Management Online, 2002.
  • [4] Sanmay Das. Filters, wrappers and a boosting-based hybrid for feature selection. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 74–81. Morgan Kaufmann Publishers Inc., 2001.
  • [5] Lih-Yuan Deng. The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation, and machine learning. Taylor & Francis, 2006.
  • [6] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of machine learning research, 3(Mar):1157–1182, 2003.
  • [7] Mark A Hall. Correlation-based feature selection for discrete and numeric class machine learning. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 359–366. Morgan Kaufmann Publishers Inc., 2000.
  • [8] George H John, Ron Kohavi, and Karl Pfleger. Irrelevant features and the subset selection problem. In Machine learning: proceedings of the eleventh international conference, pages 121–129, 1994.
  • [9] Ron Kohavi and George H John. Wrappers for feature subset selection. Artificial intelligence, 97(1-2):273–324, 1997.
  • [10] Solomon Kullback and Richard A Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
  • [11] Nojun Kwak and Chong-Ho Choi. Input feature selection by mutual information based on parzen window. IEEE transactions on pattern analysis and machine intelligence, 24(12):1667–1671, 2002.
  • [12] Thomas Lal, Olivier Chapelle, Jason Weston, and André Elisseeff. Embedded methods. Feature extraction, pages 137–165, 2006.
  • [13] M. Lichman. UCI machine learning repository, 2013.
  • [14] Prasanta Chandra Mahalanobis. On the generalized distance in statistics. Proceedings of the National Institute of Sciences (Calcutta), 2:49–55, 1936.
  • [15] Thomas Marill and D Green. On the effectiveness of receptors in recognition systems. IEEE transactions on Information Theory, 9(1):11–17, 1963.
  • [16] Songyot Nakariyakul and David P Casasent. An improvement on floating search algorithms for feature subset selection. Pattern Recognition, 42(9):1932–1940, 2009.
  • [17] Patrenahalli M. Narendra and Keinosuke Fukunaga. A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, 9(C-26):917–922, 1977.
  • [18] Ulisses M. Braga Neto and Edward R. Dougherty. Error Estimation for Pattern Recognition. Wiley, 2015.
  • [19] Hanchuan Peng, Fuhui Long, and Chris Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on pattern analysis and machine intelligence, 27(8):1226–1238, 2005.
  • [20] Pavel Pudil, Jana Novovičová, and Josef Kittler. Floating search methods in feature selection. Pattern recognition letters, 15(11):1119–1125, 1994.
  • [21] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2016.
  • [22] Marcelo Ris, Junior Barrera, and David C Martins. U-curve: A branch-and-bound optimization algorithm for u-shaped cost functions on boolean lattices applied to the feature selection problem. Pattern Recognition, 43(3):557–568, 2010.
  • [23] Marcelo S Ris. Minimization of decomposable in U-shaped curves functions defined on poset chains–algorithms and applications. PhD thesis, Institute of Mathematics and Statistics, University of Sao Paulo, Brazil (in Portuguese), 2012.
  • [24] Claude E Shannon and Warren Weaver. The mathematical theory of communication. Urbana: University of Illinois Press, 29, 1949.
  • [25] Marek Śmieja and Dawid Warszycki. Average information content maximization—a new approach for fingerprint hybridization and reduction. PloS one, 11(1):e0146666, 2016.
  • [26] Petr Somol, Jana Novovičová, and Pavel Pudil. Flexible-hybrid sequential floating search in statistical feature selection. Structural, Syntactic, and Statistical Pattern Recognition, pages 632–639, 2006.
  • [27] Petr Somol, Pavel Pudil, Jana Novovičová, and Pavel Paclík. Adaptive floating search methods in feature selection. Pattern recognition letters, 20(11):1157–1163, 1999.
  • [28] Stéphane Tufféry and Rod Riesco. Data mining and statistics for decision making. Wiley, 2011.
  • [29] A Wayne Whitney. A direct method of nonparametric measurement selection. IEEE Transactions on Computers, 100(9):1100–1103, 1971.
  • [30] Lei Yu and Huan Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proceedings of the 20th international conference on machine learning (ICML-03), pages 856–863, 2003.