A branch-and-bound feature selection algorithm for U-shaped cost functions

by   Marcelo Ris, et al.

This paper presents the formulation of a combinatorial optimization problem with the following characteristics: (i) the search space is the power set of a finite set, structured as a Boolean lattice; (ii) the cost function forms a U-shaped curve when applied to any lattice chain. This formulation applies to feature selection in the context of pattern recognition. The known approaches to this problem are branch-and-bound algorithms and heuristics, which explore the search space only partially. Branch-and-bound algorithms are equivalent to the full search, while heuristics are not. This paper presents a branch-and-bound algorithm that differs from the other known ones by exploiting the lattice structure and the U-shaped chain curves of the search space. The main contribution of this paper is the architecture of this algorithm, which is based on the representation and exploration of the search space by new lattice properties proven here. Several experiments with well-known public data indicate the superiority of the proposed method over SFFS, a popular heuristic that gives good results in very short computational time. In all experiments, the proposed method obtained better or equal results in similar or even smaller computational time.




I Introduction

A combinatorial optimization algorithm chooses the object of minimum cost over a finite collection of objects, called the search space, according to a given cost function. The simplest architecture for this algorithm, called full search, accesses every object of the search space, but it does not work for huge spaces. In this case, what is possible is to access some objects and choose the one of minimum cost, based on the observed measures. Heuristics and branch-and-bound are two families of algorithms of this kind. A heuristic algorithm has no formal guarantee of finding the minimum cost object, while a branch-and-bound algorithm has mathematical properties that guarantee finding it.

Here, we study a combinatorial optimization problem in which the search space is composed of all subsets of a finite set with $n$ points (i.e., a search space with $2^n$ objects), organized as a Boolean lattice, and the cost function has a U-shape in any chain of the search space or, equivalently, in any maximal chain of the search space.

This structure is found in some applied problems such as feature selection in pattern recognition [5, 7] and W-operator window design in mathematical morphology [8]. In these problems, a minimum subset of features that is sufficient to represent the objects should be chosen from a set of $n$ features. In W-operator design, the features are points of a finite rectangle of $\mathbb{Z}^2$ called the window. The U-shaped functions are formed by the error estimation of the designed classifiers or operators, or by some measure, such as the entropy, on the corresponding estimated joint distribution. This is a well-known phenomenon in pattern recognition: for a fixed amount of training data, increasing the number of features considered in the classifier design reduces the classifier error by increasing the separation between classes, until the available data becomes too small to cover the classifier domain and the consequent increase of the estimation error increases the classifier error. Some known approaches to this problem are heuristics. A relatively successful heuristic algorithm is SFFS [11], which gives good results in relatively small computational time.

There is a myriad of branch-and-bound algorithms in the literature that are based on the monotonicity of the cost function [6, 10, 14, 15]. For a detailed review of branch-and-bound algorithms, refer to [13]. If the real joint probability distribution of the patterns and their classes were known, larger dimensionality would imply smaller classification errors. However, in practice, these distributions are unknown and must be estimated. A problem with the adoption of monotonic cost functions is that they do not take into account the estimation errors committed when many features are considered (the “curse of dimensionality”, also known as the “U-curve problem” or “peaking phenomenon”).


This paper presents a branch-and-bound algorithm that differs from the other known ones by exploiting the lattice structure and the U-shaped chain curves of the search space.

Some experiments were performed to compare SFFS to the U-curve approach. Applications such as W-operator window design, genetic network architecture identification and eight UCI repository data sets show encouraging results: the U-curve algorithm beats SFFS (i.e., finds a node with smaller cost than the one found by SFFS) in smaller computational time for 27 out of 38 data sets tested. For all data sets, the U-curve algorithm gives a result equal to or better than that of SFFS, since the former covers the complete search space.

Though the results obtained by applying the method to pattern recognition problems are exciting, the main contribution of this paper is the discovery of some lattice algebra properties that lead to a new data structure for the search space representation, which is particularly adequate for updates after up-down lattice interval cuts (i.e., cuts by couples of intervals [0,X] and [X,W]). Classical tree-based search space representations do not have this property. For example, if a depth-first search tree were adopted to represent the Boolean lattice, only cuts in one direction could be performed.

Following this introduction, Section 2 presents the formalization of the problem studied. Section 3 describes the structure of the branch-and-bound algorithm designed. Section 4 presents the mathematical properties that support the algorithm steps. Section 5 presents some experimental results comparing U-curve to SFFS. Finally, the Conclusion discusses the contributions of this paper and proposes some next steps of this research.

II The Boolean U-curve optimization problem

Let $W$ be a finite set, $\mathcal{P}(W)$ be the collection of all subsets of $W$, $\subseteq$ be the usual inclusion relation on sets, and $|W|$ denote the cardinality of $W$. The search space is composed of $2^{|W|}$ objects organized in a Boolean lattice.

The partially ordered set $(\mathcal{P}(W), \subseteq)$ is a complete Boolean lattice of degree $|W|$ such that: the smallest and largest elements are, respectively, $\emptyset$ and $W$; the sum and product are, respectively, the usual union and intersection on sets; and the complement of a set $X$ in $\mathcal{P}(W)$ is its complement in relation to $W$, denoted by $X^c$.

Subsets of $W$ will be represented by strings of zeros and ones, with $0$ meaning that the point does not belong to the subset and $1$ meaning that it does. For example, if $W = \{w_1, w_2, w_3\}$, the subset $\{w_1, w_3\}$ will be represented by $101$. In an abuse of language, $X = 101$ means that $X$ is the set represented by $101$.
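The string representation can be illustrated with a small Python sketch. The helper names `to_bits` and `from_bits` and the three-point universe used in the usage lines are illustrative assumptions, not part of the original formulation.

```python
def to_bits(subset, universe):
    """Return the 0/1 string of `subset` relative to the ordered list `universe`."""
    return "".join("1" if w in subset else "0" for w in universe)


def from_bits(bits, universe):
    """Return the subset of `universe` encoded by the 0/1 string `bits`."""
    return frozenset(w for w, b in zip(universe, bits) if b == "1")


# Hypothetical three-point universe, used only for illustration.
universe = ["p", "q", "r"]
print(to_bits({"p", "r"}, universe))
print(sorted(from_bits("110", universe)))
```

The position of each character follows the order chosen for the points of the universe, so that order has to be fixed once and used consistently.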

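A chain $\mathcal{A}$ is a collection $\{A_1, A_2, \ldots, A_k\} \subseteq \mathcal{X} \subseteq \mathcal{P}(W)$ such that $A_1 \subseteq A_2 \subseteq \ldots \subseteq A_k$. A chain $\mathcal{M} \subseteq \mathcal{X}$ is maximal in $\mathcal{X}$ if there is no other chain $\mathcal{C} \subseteq \mathcal{X}$ such that $\mathcal{C}$ properly contains $\mathcal{M}$.

Let $c$ be a cost function defined from $\mathcal{P}(W)$ to $\mathbb{R}$. We say that $c$ is decomposable in U-shaped curves if, for every maximal chain $\mathcal{M} \subseteq \mathcal{P}(W)$, the restriction of $c$ to $\mathcal{M}$ is a U-shaped curve, i.e., for every $A, X, B \in \mathcal{M}$, $A \subseteq X \subseteq B \Rightarrow \max(c(A), c(B)) \geq c(X)$.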
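As a minimal sketch of this definition, the function below checks whether a sequence of costs, listed along a chain ordered by inclusion, is U-shaped. It assumes the usual reading of the condition: no element of the chain costs more than both some element below it and some element above it. The name `is_u_shaped` is hypothetical.

```python
def is_u_shaped(costs):
    """Check the U-shape condition on costs listed along a chain.

    costs[i] is the cost of the i-th element of a chain ordered by inclusion.
    The sequence fails the condition when some interior cost is strictly
    greater than a cost before it and a cost after it.
    """
    for j in range(1, len(costs) - 1):
        if costs[j] > min(costs[:j]) and costs[j] > min(costs[j + 1:]):
            return False
    return True


print(is_u_shaped([5, 3, 2, 4, 6]))  # decreasing, then increasing
print(is_u_shaped([5, 3, 4, 2, 6]))  # oscillation around the third element
```

Monotone and constant sequences also pass the check, since they are degenerate U-shapes under this reading.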

Figure 1 shows a complete Boolean lattice $\mathcal{L}$ with a cost function $c$ decomposable in U-shaped curves. In this figure, a maximal chain in a poset $\mathcal{X}$ obtained from $\mathcal{L}$ and its cost function are emphasized. Figure 2 presents the curve of the same cost function restricted to some maximal chains in $\mathcal{L}$ and in $\mathcal{X}$. Note the U-shape of the curves in Figure 2.

Fig. 1: A complete Boolean lattice $\mathcal{L}$ and the cost function $c$ decomposable in U-shaped curves. $\mathcal{X}$ is a poset obtained from $\mathcal{L}$. A maximal chain in $\mathcal{X}$ is emphasized, together with the global minimum element and the local minimum element in the maximal chain.
Fig. 2: The four possible representations of the cost function restricted to some maximal chains in $\mathcal{L}$ (a) and in $\mathcal{X}$ (b-d) of Figure 1.

Our problem is to find the element (or elements) of minimum cost in a Boolean lattice of degree $|W|$. The full search in this space is an exponential problem, since the space is composed of $2^{|W|}$ elements. Thus, for moderately large $|W|$, the full search becomes unfeasible.

III The U-curve algorithm

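The U-shaped format of the restriction of the cost function to any maximal chain is the key to developing a branch-and-bound algorithm, the U-curve algorithm, to deal with the hard combinatorial problem of finding subsets of minimum cost.

Let $A$ and $B$ be elements of the Boolean lattice $\mathcal{L} = (\mathcal{P}(W), \subseteq)$. An interval $[A, B]$ of $\mathcal{L}$ is the subset of $\mathcal{L}$ given by $[A, B] = \{X \in \mathcal{L} : A \subseteq X \subseteq B\}$. The elements $A$ and $B$ are called, respectively, the left and right extremities of $[A, B]$. Intervals are very important for characterizing decompositions in Boolean lattices [2, 4].

Let $R$ be an element of $\mathcal{L}$. In this paper, intervals of the type $[\emptyset, R]$ and $[R, W]$ are called, respectively, lower and upper intervals. The right extremity of a lower interval and the left extremity of an upper interval are called, respectively, lower and upper restrictions. Let $\mathcal{R}_L$ and $\mathcal{R}_U$ denote, respectively, collections of lower and upper restrictions. The search space will be the poset $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$ obtained by eliminating from $\mathcal{L}$ the lower and upper intervals defined by these restrictions, i.e., $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U) = \mathcal{L} - \bigcup \{[\emptyset, R] : R \in \mathcal{R}_L\} - \bigcup \{[R, W] : R \in \mathcal{R}_U\}$. In cases in which only the lower or the upper intervals are eliminated, the resulting search space is denoted, respectively, by $\mathcal{X}(\mathcal{R}_L)$ and $\mathcal{X}(\mathcal{R}_U)$ and given, respectively, by $\mathcal{X}(\mathcal{R}_L) = \mathcal{L} - \bigcup \{[\emptyset, R] : R \in \mathcal{R}_L\}$ and $\mathcal{X}(\mathcal{R}_U) = \mathcal{L} - \bigcup \{[R, W] : R \in \mathcal{R}_U\}$.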
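The membership test for the restricted search space can be sketched in a few lines of Python. Here subsets are encoded as integer bitmasks (bit i set when the i-th point belongs to the subset), which is an implementation choice made for this illustration; the function name `in_search_space` is likewise hypothetical.

```python
def in_search_space(x, lower, upper):
    """Return True if x lies in no lower interval [0, r] and no upper interval [r, W].

    x, and every restriction in `lower` and `upper`, is an int bitmask.
    """
    in_lower_interval = any(x & ~r == 0 for r in lower)  # x is a subset of r
    in_upper_interval = any(r & ~x == 0 for r in upper)  # r is a subset of x
    return not in_lower_interval and not in_upper_interval


# Degree-3 lattice with one lower restriction (001) and one upper restriction (110).
lower, upper = [0b001], [0b110]
remaining = [format(x, "03b") for x in range(8) if in_search_space(x, lower, upper)]
print(remaining)
```

Only the restriction lists are stored; the poset itself is never materialized, which is what makes the representation usable for large lattices.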

The search space is explored by an iterative algorithm that, at each iteration, explores a small subset of $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$, computes a local minimum, updates the list of minimum elements found and extends both restriction sets, eliminating the region just explored. The algorithm starts with three empty lists: minimum elements, lower restrictions and upper restrictions. It is executed until the whole space is explored, i.e., until $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$ becomes empty. The subset eliminated at each iteration is defined from the exploration of a chain, which may be done in the down-up or up-down direction. Algorithm 1 describes this process. The direction selection procedure (line 5) can use a random or an adaptive method. The random method uses a static probability to select the down-up or up-down direction. The adaptive method recalculates the probability of each direction, giving more probability to the down-up direction if most of the local minima found are closer to the bottom of the lattice, and to the up-down direction otherwise.

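1:  $\mathcal{R}_L \leftarrow \emptyset$
2:  $\mathcal{R}_U \leftarrow \emptyset$
3:  Results $\leftarrow \emptyset$
4:  while $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U) \neq \emptyset$ do
5:     direction $\leftarrow$ Select-Direction()
6:     if direction is UP then
7:        Down-Up-Direction($\mathcal{R}_L$, $\mathcal{R}_U$)
8:     else
9:        Up-Down-Direction($\mathcal{R}_L$, $\mathcal{R}_U$)
10:     end if
11:  end while
Algorithm 1 U-curve-algorithm()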

An element $A$ of the poset $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$ is called a minimal element of $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$ if there is no other element $A'$ of $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$ with $A' \subset A$. The poset $\mathcal{X}$ of Figure 1 has three minimal elements. If the down-up direction is chosen, the Down-Up-Direction procedure is performed (Algorithm 2):

  • The Minimal-Element procedure calculates a minimal element $B$ of the poset $\mathcal{X}(\mathcal{R}_L)$. Only the lower restriction set is used to calculate the minimal element $B$. An element $X$ is said to be covered by the lower restriction set $\mathcal{R}_L$ if $X \subseteq R$ for some $R \in \mathcal{R}_L$, and is said to be covered by the upper restriction set $\mathcal{R}_U$ if $R \subseteq X$ for some $R \in \mathcal{R}_U$. When the calculated $B$ is covered by an upper restriction, it is discarded, i.e., the lower restriction set is updated with $B$ and a new iteration begins (lines 1-5).

  • The down-up chain exploration procedure begins with a minimal element and proceeds by random selection of upper adjacent elements from the current poset $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$ until it finds the U-curve condition, i.e., the last element selected ($B$) has cost bigger than the previous one ($M$) (lines 7-11).

  • At this point, the element $M$ is the minimum element of the chain explored, and $A$ and $B$ are, respectively, the lower and upper adjacent elements of $M$ in the chain (i.e., $A \subset M \subset B$); by construction, $c(A) \geq c(M)$ and $c(B) > c(M)$. It can be proved that any element $X$ of $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$ with $X \subseteq A$ has cost bigger than or equal to $c(A)$ and that any element $X$ of $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$ with $B \subseteq X$ has cost bigger than or equal to $c(B)$. By using this property, the lower and upper restrictions can be updated, respectively, by $A$ and $B$ (lines 12-17). Figure 3 shows a schematic representation of the first iteration of the algorithm and the elements contained in the intervals $[\emptyset, A]$ and $[B, W]$.

  • The result list can be updated with $M$ (line 18), i.e., $M$ will be included in the result list if it has cost lower than the elements already saved in the list. The result list can save a pre-defined number of elements with low costs or only elements with the overall minimum cost.

  • In order to prevent visiting the element $M$ more than once, a recursive procedure called the minimum exhausting procedure is performed (line 19).

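1:  $B \leftarrow$ Minimal-Element($\mathcal{R}_L$)
2:  if $B$ is covered by $\mathcal{R}_U$ then
3:     Update-Lower-Restriction($B$, $\mathcal{R}_L$)
4:     return
5:  end if
6:  $M \leftarrow$ null
7:  repeat
8:     $A \leftarrow M$
9:     $M \leftarrow B$
10:     $B \leftarrow$ Select-Upper-Adjacent($M$, $\mathcal{R}_L$, $\mathcal{R}_U$)
11:  until $c(B) > c(M)$ or $B =$ null
12:  if $A \neq$ null then
13:     Update-Lower-Restriction($A$, $\mathcal{R}_L$)
14:  end if
15:  if $B \neq$ null then
16:     Update-Upper-Restriction($B$, $\mathcal{R}_U$)
17:  end if
18:  Update-Results($M$)
19:  Minimum-Exhausting($M$, $\mathcal{R}_L$, $\mathcal{R}_U$)
Algorithm 2 Down-Up-Direction(ElementSet $\mathcal{R}_L$, ElementSet $\mathcal{R}_U$)
Fig. 3: A schematic representation of a step of the algorithm; the highlighted areas represent the elements contained in the lower and upper restrictions.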
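The chain exploration in the down-up direction can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' implementation: subsets are int bitmasks, `allowed` is a caller-supplied predicate standing for membership in the current poset, and the name `walk_up` is hypothetical.

```python
import random


def walk_up(start, n, cost, allowed, rng=random):
    """Walk upward from `start` until the cost increases or no step is possible.

    Returns (lower_adjacent, minimum, upper_adjacent); the first and last
    entries are None when the chain has no such element.
    """
    lower_adj, minimum, current = None, None, start
    while True:
        lower_adj, minimum = minimum, current
        # Upper adjacent elements: add one missing point, stay inside the poset.
        candidates = [minimum | (1 << i) for i in range(n)
                      if not minimum & (1 << i) and allowed(minimum | (1 << i))]
        if not candidates:
            return lower_adj, minimum, None
        current = rng.choice(candidates)
        if cost(current) > cost(minimum):  # U-curve condition reached
            return lower_adj, minimum, current


def popcount(x):
    return bin(x).count("1")


# Toy cost, minimal for subsets with exactly two points (an assumption for the demo).
a, m, b = walk_up(0, 4, lambda x: abs(popcount(x) - 2), lambda x: True)
print(popcount(a), popcount(m), popcount(b))
```

In this sketch the cost of the current minimum is evaluated again at each step; an implementation with an expensive cost function would cache it.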

An element $M$ is called a minimum exhausted element in $\mathcal{L}$ if all its adjacent elements (upper and lower) have cost bigger than it. This definition can be extended to the poset $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$, i.e., all its adjacent elements (upper and lower) in $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$ have cost bigger than it. In Figure 1 we can see three elements that are minimum exhausted in $\mathcal{X}$, as well as an element that is not minimum exhausted in $\mathcal{L}$. In this paper, the term minimum exhausted always refers to a poset $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$.

1:  Push $M$ to $S$
2:  while $S$ is not empty do
3:     $T \leftarrow$ Top($S$)
4:     MinimumExhausted $\leftarrow$ true
5:     for all $A$ adjacent of $T$ in $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$ and $A \notin S$ do
6:        if c($A$) $\leq$ c($T$) then
7:           Push $A$ to $S$
8:           MinimumExhausted $\leftarrow$ false
9:        else
10:           if $A$ is upper adjacent of $T$ then
11:              Update-Upper-Restriction($A$, $\mathcal{R}_U$)
12:           else
13:              Update-Lower-Restriction($A$, $\mathcal{R}_L$)
14:           end if
15:        end if
16:     end for
17:     if MinimumExhausted then
18:        Pop $T$ from $S$
19:        Update-Results($T$)
20:        Update-Lower-Restriction($T$, $\mathcal{R}_L$)
21:        Update-Upper-Restriction($T$, $\mathcal{R}_U$)
22:     end if
23:  end while
24:  return
Algorithm 3 Minimum-Exhausting(Element $M$, ElementSet $\mathcal{R}_L$, ElementSet $\mathcal{R}_U$)

The minimum exhausting procedure (Algorithm 3) is a recursive process that visits all the adjacent elements of a given element $M$ and turns all of them into minimum exhausted elements in the resulting poset $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$. It uses a stack $S$ to perform the recursive process. $S$ is initialized by pushing $M$ to it and the process is performed while $S$ is not empty (lines 2-23). At each iteration, the algorithm processes the top element $T$ of $S$: all the adjacent elements (upper and lower) of $T$ that are in $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$ and not in $S$ are checked. If the cost of an adjacent element $A$ is lower than (or equal to) the cost of $T$, then $A$ is pushed to $S$. If the cost of $A$ is bigger than the cost of $T$, then one of the restriction sets can be updated with $A$: the lower restriction set if $A$ is a lower adjacent of $T$, and the upper restriction set if $A$ is an upper adjacent of $T$ (lines 5-16). If $T$ is a minimum exhausted element in $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$, i.e., there is no adjacent element in $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$ with cost lower than that of $T$, then $T$ is removed from $S$ and the restriction sets and the result list are updated with $T$ (lines 17-22). At the end of this procedure, all the elements processed are minimum exhausted elements in $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$.
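The stack-based process described above can be sketched in Python as follows. This is a simplified illustration under several assumptions: subsets are int bitmasks, the restriction sets are plain lists that are only appended to (no pruning of redundant restrictions), costs are cached in a dictionary so each element is evaluated once, and the names `covered` and `minimum_exhausting` are hypothetical.

```python
def covered(x, lower, upper):
    """True if x lies in some lower interval [0, r] or some upper interval [r, W]."""
    return any(x & ~r == 0 for r in lower) or any(r & ~x == 0 for r in upper)


def minimum_exhausting(m, n, cost, lower, upper):
    """Exhaust the neighbourhood of m; returns the exhausted elements in order."""
    stack, exhausted, cache = [m], [], {}

    def c(x):
        if x not in cache:  # evaluate each element's cost only once
            cache[x] = cost(x)
        return cache[x]

    while stack:
        t = stack[-1]
        is_exhausted = True
        for i in range(n):
            a = t ^ (1 << i)  # flipping one bit gives an adjacent element
            if a in stack or covered(a, lower, upper):
                continue
            if c(a) <= c(t):
                stack.append(a)
                is_exhausted = False
            elif a & ~t:  # a has one extra point: upper adjacent
                upper.append(a)
            else:  # a has one point fewer: lower adjacent
                lower.append(a)
        if is_exhausted:
            stack.pop()
            exhausted.append(t)
            lower.append(t)
            upper.append(t)
    return exhausted


lower, upper = [], []
print(minimum_exhausting(0b011, 3, lambda x: bin(x).count("1"), lower, upper))
```

In the usage line the toy cost is the subset size, so the process started at 011 slides down to the empty set, which is the first element to be exhausted.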

Figure 4 shows a graphical representation of the minimum exhausting process. Figure 4-A shows a chain construction process in the up direction; the chain has its edges emphasized. The orange-colored element has the minimum cost over the chain. The elements in black are those eliminated from the search space by the restrictions obtained from the lower and upper adjacent elements of this local minimum. The stack begins with the local minimum. Figure 4-B shows the first iteration of the minimum exhausting process. The red arrows and red elements indicate the adjacent elements of the top of the stack that have cost lower than (or equal to) it. These elements are pushed to the stack. The adjacent elements of the top with cost bigger than it update the restriction sets, i.e., the lower adjacent element updates the lower restriction set and the upper adjacent element updates the upper restriction set. Figure 4-C shows the second iteration: the adjacent elements with cost lower than (or equal to) the new top element are pushed to the stack, and the other adjacent elements, with bigger cost, update the lower and upper restriction sets accordingly. In Figure 4-D, the top element is a minimum exhausted element (grey color) in the current poset and it is removed from the stack. In Figure 4-E, the elements eliminated by the new lower and upper intervals are turned black. At this point, the new top element is minimum exhausted (grey color) and it is removed from the stack. From Figure 4-F to Figure 4-H, all the elements are removed from the stack and the elements removed by the new restrictions are turned black. Figure 4-H shows all the elements removed by a single minimum exhausting process.

Fig. 4: Representation of the minimum exhausting process.

The procedures to calculate minimal and maximal elements and the procedure to update lower and upper restriction sets will be discussed in the next section.

IV Mathematical foundations

This section introduces mathematical foundations of some modules of the algorithm.

IV-A Minimal and Maximal Construction Procedure

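Each iteration of the algorithm requires the calculation of a minimal element in $\mathcal{X}(\mathcal{R}_L)$ or a maximal element in $\mathcal{X}(\mathcal{R}_U)$. A simple solution for that is presented here. The next theorem is the key to this solution.

Theorem 1. For every $A \in \mathcal{L}$,

$$A \in \mathcal{X}(\mathcal{R}_L) \iff A \cap R^c \neq \emptyset, \ \forall R \in \mathcal{R}_L.$$

(Proof in the Appendix Section.)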

Algorithm 4 implements the minimal construction procedure. It builds a minimal element $A$ of the poset $\mathcal{X}(\mathcal{R}_L)$. The process begins with $A = W$ and $S = W$ and executes a while-loop (lines 3-16) trying to remove components from $A$. At each step, a component $k$ is chosen exclusively from $S$ ($S$ prevents selecting the same component more than once). If the element $A'$ obtained from $A$ by removing the component $k$ is contained in $\mathcal{X}(\mathcal{R}_L)$, then $A$ is updated with $A'$ (lines 7-15).

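1:  $A \leftarrow W$
2:  $S \leftarrow W$
3:  while $S \neq \emptyset$ do
4:     $k \leftarrow$ random index in $\{1, \ldots, |W|\}$ where $S[k] = 1$
5:     $S[k] \leftarrow 0$
6:     $A' \leftarrow A$ with $A'[k] = 0$
7:     RemoveElement $\leftarrow$ true
8:     for all $R$ in $\mathcal{R}_L$ do
9:        if $A' \cap R^c = \emptyset$ then
10:           RemoveElement $\leftarrow$ false
11:        end if
12:     end for
13:     if RemoveElement then
14:        $A \leftarrow A'$
15:     end if
16:  end while
17:  return $A$
Algorithm 4 Minimal-Element(ElementSet $\mathcal{R}_L$)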

The minimal element calculated is equal to $W$ when no proper subset of $W$ remains in $\mathcal{X}(\mathcal{R}_L)$. At this point, the search space is exhausted and the algorithm stops in the next iteration.
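The minimal construction procedure can be sketched in Python as follows, assuming subsets are encoded as int bitmasks. Shuffling the indices once plays the role of the auxiliary set that prevents selecting the same component twice; the function name `minimal_element` and the optional `rng` argument are assumptions of this sketch.

```python
import random


def minimal_element(n, lower, rng=random):
    """Build a minimal element of the poset left after removing the lower intervals.

    n is the lattice degree; `lower` is a list of int bitmasks (lower restrictions).
    """
    a = (1 << n) - 1  # start from the full set W
    indices = list(range(n))
    rng.shuffle(indices)  # each component is tried exactly once, in random order
    for k in indices:
        trial = a & ~(1 << k)  # a with component k removed
        # trial stays in the poset if it is not a subset of any lower restriction
        if all(trial & ~r for r in lower):
            a = trial
    return a


print(format(minimal_element(3, [0b011]), "03b"))
```

With the single restriction 011 in a degree-3 lattice, every removal order leads to the same result, 100, because it is the only minimal element left.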

The next theorem proves the correctness of Algorithm 4.

Theorem 2. The element $A$ of $\mathcal{X}(\mathcal{R}_L)$ returned by the minimal construction process (Algorithm 4) is a minimal element in $\mathcal{X}(\mathcal{R}_L)$.

(Proof in the Appendix Section.)

The process to calculate a maximal element in $\mathcal{X}(\mathcal{R}_U)$ is dual to the one that calculates a minimal element, i.e., it begins with $A = \emptyset$ and, at each step, adds a component to $A$ when the complement of the resulting element $A'$ has non-empty intersection with every element of $\mathcal{R}_U$.

IV-B Lower and Upper Restrictions Update

The restriction sets $\mathcal{R}_L$ and $\mathcal{R}_U$ represent the search space. Thus, they are updated after each new search by the following rule: an element $A$ is added to the lower (or upper) restriction set if all elements of $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U) \cap [\emptyset, A]$ (or $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U) \cap [A, W]$) have costs bigger than or equal to $c(A)$.

The next theorem establishes the U-curve condition, which permits stopping the chain construction process and updating the restriction sets.

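Theorem 3. Let $A \subset M \subset B$ be consecutive elements of the chain constructed by Algorithm 2 (or its dual version), where $M$ has the minimum cost in the chain. Let $c$ be the cost function from $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$ to $\mathbb{R}$, decomposable in U-shaped curves, and $X \in \mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$, then

$$X \subseteq A \Rightarrow c(X) \geq c(A).$$

(Proof in the Appendix Section.)

By a proof similar to that of Theorem 3, it can be proved that all the elements in $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$ contained in $[B, W]$ also have cost bigger than or equal to $c(B)$. Figure 3 shows the chain obtained by the chain construction process and the resulting poset. The highlighted elements always have cost bigger than or equal to that of $A$ or $B$.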

Algorithm 5 describes the update of the lower restriction set $\mathcal{R}_L$ by an element $A$. If $A$ is already covered by $\mathcal{R}_L$, i.e., there exists an element $R$ of $\mathcal{R}_L$ that contains $A$, then the process stops (lines 1-3). Otherwise, all the elements of $\mathcal{R}_L$ contained in $A$ are removed from $\mathcal{R}_L$ and $A$ is added to $\mathcal{R}_L$ (lines 4-9). This procedure may diminish the cardinality of the restriction set, but it does not restore any element to the resulting poset $\mathcal{X}(\mathcal{R}_L)$, since the intervals of the removed restrictions are contained in $[\emptyset, A]$.

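1:  if there exists $R$ from $\mathcal{R}_L$ where $A \subseteq R$ then
2:     return $\mathcal{R}_L$
3:  end if
4:  for all $R$ in $\mathcal{R}_L$ do
5:     if $R \subseteq A$ then
6:        $\mathcal{R}_L \leftarrow \mathcal{R}_L - \{R\}$
7:     end if
8:  end for
9:  $\mathcal{R}_L \leftarrow \mathcal{R}_L \cup \{A\}$
10:  return $\mathcal{R}_L$
Algorithm 5 Update-Lower-Restriction(Element $A$, ElementSet $\mathcal{R}_L$)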

The upper restriction list updating procedure is dual to the lower one, i.e., the process stops if some element of $\mathcal{R}_U$ is contained in $A$, and otherwise the elements of $\mathcal{R}_U$ that contain $A$ are removed before $A$ is added.
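Both update procedures can be sketched in Python with subsets encoded as int bitmasks. The sketch returns a new list instead of modifying the input, which is a design choice of this illustration; the function names are hypothetical.

```python
def update_lower_restriction(a, lower):
    """Add `a` to the lower restrictions, keeping the list free of redundancy."""
    if any(a & ~r == 0 for r in lower):  # a is contained in an existing restriction
        return lower
    kept = [r for r in lower if r & ~a != 0]  # drop restrictions contained in a
    kept.append(a)
    return kept


def update_upper_restriction(a, upper):
    """Dual update: upper restrictions cover the supersets of their elements."""
    if any(r & ~a == 0 for r in upper):  # an existing restriction is contained in a
        return upper
    kept = [r for r in upper if a & ~r != 0]  # drop restrictions that contain a
    kept.append(a)
    return kept


print(update_lower_restriction(0b011, [0b001, 0b100]))
print(update_upper_restriction(0b100, [0b110, 0b011]))
```

Keeping only the maximal lower restrictions and the minimal upper restrictions leaves the represented poset unchanged while keeping the membership tests short.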

IV-C Minimum Exhausting Procedure

The computation of the cost function is, in general, expensive. Thus, it is desirable that each element be visited (and its cost computed) a single time. A way of preventing this reprocessing is to apply the minimum exhausting procedure. This procedure is a recursive function (Algorithm 3). It uses a stack $S$ to process recursively the whole neighborhood of a given element $M$ contained in the poset $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$. At each recursion, it visits the upper and lower adjacent elements of $T$, the top of $S$, that are in $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$ and not in $S$. The adjacent elements with cost bigger than the cost of $T$ satisfy the U-curve condition, so they can update the restriction sets and, consequently, be removed from the search space. The adjacent elements with cost lower than or equal to that of $T$ are pushed to $S$ to be processed in later iterations. Note that elements are not reprocessed during the exhausting procedure, since this procedure checks whether a newly explored element is in a lower or upper interval, or in $S$, before computing its cost. If $T$ is a minimum exhausted element in $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$, then $T$ is removed from $S$. After the whole procedure is finished, all the elements processed are outside the resulting poset $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$, so they will not be reprocessed in the next iterations. The fact that an element cannot be reprocessed during the procedure implies that the cardinality of $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$ is an upper limit for the number of steps of the procedure. In search spaces that are lattices of high degree, this procedure may have to process a huge number of elements and some heuristics may be necessary, for example, stopping the search for adjacent elements smaller than a minimum after a number of unsuccessful trials.

The minimum exhausting procedure gives another interesting property to the U-curve algorithm. If the cost functions on maximal chains are U-shaped curves with oscillations, as illustrated in Figure 5-A, the U-curve algorithm may miss a local minimum element. Note that, in this case, the local minimum element after the oscillation has cost smaller than the one before it. However, this minimum is not lost if there is another chain, with a true U-shaped cost function, containing both local minimum elements. Figure 5-B shows an alternative chain (in red) that reaches the true minimum element of the chain (in black). Note that the first local minimum (in yellow) is contained in both chains. The true minimum, reached by the alternative chain, is obtained exactly by exhausting the first minimum found. Hence, the exhausting procedure makes it possible to relax the class of problems approached by the U-curve algorithm.

Fig. 5: Illustration of error curve oscillation and alternative way.

V Experimental Results

In this section, some results of applying the U-curve algorithm to feature selection are given and compared to SFFS [11]. For this study, several data sets were used: W-operator window design [8], architecture identification in genetic networks, and several data sets from the UCI Machine Learning Repository [1]. In all cases, the value 3 was assigned to the stop-criterion parameter of SFFS. This parameter is usually set so that the algorithm does not stop the first time it reaches the desired dimension. In this way, it performs more feature inclusions and deletions before returning the subset with the desired dimension, alleviating the nesting effect. The value used as default here is the same default value adopted by the original algorithm implementation [11].

All data sets used and the binary program with some documentation can be found at the supplementary material web page (http://www.vision.ime.usp.br/~davidjr/ucurve).

V-A Cost function adopted: penalized mean conditional entropy

Information theory originated from Shannon's works [12] and can be employed in feature selection problems [5]. The Shannon entropy $H$ is a measure of the randomness of a random variable $Y$, given by:

$$H(Y) = -\sum_{y \in Y} P(y) \log P(y),$$

in which $P$ is the probability distribution function and, by convention, $0 \cdot \log 0 = 0$.

The conditional entropy is given by the following equation:

$$H(Y|X = x) = -\sum_{y \in Y} P(y|x) \log P(y|x),$$

in which $X$ is a feature vector and $P(y|x)$ is the conditional probability of $Y = y$ given the observation of an instance $x \in X$. Finally, the mean conditional entropy of $Y$ given all the possible instances $x \in X$ is given by:

$$E[H(Y|X)] = \sum_{x \in X} P(x) H(Y|X = x).$$

Lower values of $H$ yield better feature subspaces (i.e., the lower $H$, the larger is the information gained about $Y$ by observing $X$).

In practice, $H(Y)$ and $H(Y|X)$ are estimated. A way to embed the estimation error, committed by using feature vectors with large dimensions and an insufficient number of samples, is to attribute a high entropy (i.e., to penalize) to the rarely observed instances. The penalization adopted here consists in changing the conditional probability distribution of the instances that present just a unique observation to the uniform distribution (i.e., the highest entropy). This makes sense because, if an instance $x$ has only 1 observation, the value of $Y$ is fully determined (i.e., $H(Y|X = x) = 0$), but the confidence about the real distribution of $P(Y|X = x)$ is very low. Adopting this penalization, the estimation of the mean conditional entropy becomes:

$$\hat{E}[H(Y|X)] = \frac{N}{t} + \sum_{x \in X : \hat{P}(x) > \frac{1}{t}} \hat{P}(x) \hat{H}(Y|X = x),$$

in which $t$ is the number of training samples and $N$ is the number of instances $x$ with $\hat{P}(x) = \frac{1}{t}$ (i.e., just one observation). In this formula, it is assumed that the logarithm base is the number of possible classes $|Y|$, thus normalizing the entropy values to the interval $[0, 1]$. This cost function exhibits U-shaped curves since, for a sufficiently large dimension, the number of instances with a single observation starts to increase, increasing the penalization and, consequently, the cost function value (i.e., the next features included do not give enough information to compensate for the estimation error).
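The penalized estimate described above can be sketched in Python. The sketch assumes the penalty works as the text describes: an instance observed exactly once is assigned the uniform class distribution, whose entropy is 1 when the logarithm base equals the number of classes. The function name `penalized_mce` and its argument layout are assumptions of this illustration, and at least two classes are assumed.

```python
from collections import Counter, defaultdict
from math import log


def penalized_mce(samples, labels, n_classes):
    """Penalized mean conditional entropy of the labels given the feature vectors.

    samples: sequence of feature vectors (any iterables of hashable values).
    labels: class label of each sample. n_classes: logarithm base (>= 2).
    """
    t = len(samples)
    by_instance = defaultdict(Counter)
    for x, y in zip(samples, labels):
        by_instance[tuple(x)][y] += 1
    total = 0.0
    for counts in by_instance.values():
        m = sum(counts.values())
        if m == 1:
            total += 1.0 / t  # penalty: uniform distribution, entropy 1
        else:
            h = -sum((k / m) * log(k / m, n_classes) for k in counts.values())
            total += (m / t) * h
    return total


# Every instance is seen once, so the estimate takes its maximum value.
print(penalized_mce([(0,), (1,), (2,), (3,)], [0, 1, 0, 1], 2))
```

A feature selection cost function would call this on the columns of the chosen feature subset only, so that adding features splits the samples into more, and rarer, instances.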

V-B Data sets description

V-B1 W-operator window design

The W-operator window design problem consists in looking for subsets of a 16-point window for which the designed operator has the lowest estimation error (i.e., the transformed images generated by the operator are as similar as possible to the expected images). The training samples were obtained from the images presented in [8]. The data set is composed of 20 files with 18,432 samples each. There are 16 features assuming binary values and two classes.

V-B2 Biological classification

The biological classification problem studied is the problem of estimating a subset of predictor genes for a specific target gene from a time-course microarray experiment. The data set used for the tests is the one presented in [9]. The data are normalized and quantized in three levels using the same method described in [3]. The subset of predictors is obtained from a set of 27 genes. Thus, there are 27 features assuming three distinct values and three possible classes. The data set is composed of 10 files with 15 samples each.

V-B3 UCI Machine Learning Repository

The UCI Machine Learning Repository data sets considered are: pendigits, votes, ionosphere, dorothea_filtered, dexter_filtered, spambase, sonar and madelon. For all data sets, the feature values were normalized by subtracting their respective means and dividing by their respective standard deviations. After that, all values were binarized (i.e., mapped to 0 if the normalized value is non-positive, and to 1 otherwise). Except for dorothea_filtered and dexter_filtered, all features were taken into account. The dorothea_filtered and dexter_filtered files are post-processed from the dorothea and dexter data sets, respectively. In the dorothea and dexter data sets, most features display a null value for almost every sample. So, dorothea_filtered considers only the features with 100 or more non-null values, while dexter_filtered considers the features with 50 or more non-null values.
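The normalization and binarization step can be sketched in Python for a single feature column. The text does not say whether the population or the sample standard deviation was used; the sketch assumes the population version (`statistics.pstdev`), and the function name `binarize_column` and the handling of constant columns are also assumptions.

```python
from statistics import mean, pstdev


def binarize_column(values):
    """Standardize one feature column and map it to 0/1 by the sign of the result."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # constant column: every normalized value is treated as 0
        return [0] * len(values)
    return [1 if (v - mu) / sigma > 0 else 0 for v in values]


print(binarize_column([1, 2, 3, 10]))
```

Since the standard deviation is positive, the binarized value depends only on whether the original value is above the column mean; the division is kept to mirror the two steps described in the text.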

A description of each data set is presented in the following list:

  • pendigits: composed of 7494 samples, 16 binary features and 10 classes;

  • votes: composed of 435 samples, 16 ternary features and 2 classes;

  • ionosphere: composed of 351 samples, 34 binary features and 2 classes;

  • dorothea_filtered: composed of 800 samples, 38 binary features and 2 classes;

  • dexter_filtered: composed of 300 samples, 48 binary features and 2 classes;

  • spambase: composed of 4601 samples, 57 binary features and 2 classes;

  • sonar: composed of 208 samples, 60 binary features and 2 classes;

  • madelon: composed of 2000 samples, 500 binary features and 2 classes.

V-C Results

The feature selection problem may have cost functions with chains that present oscillations, and there is no theoretical guarantee of the existence of alternative chains to reach the local minima lost because of the oscillations. However, these cases were tested experimentally and, in all observed cases, the minimum exhausting procedure could find the local minimum elements using alternative chains. We examined random curves in all data sets studied. For example, in both the W-operator window design and the biological classifier design, part of the examined curves contain oscillatory parts. For all these oscillatory curves, and also for those found in the UCI data sets, the minimum exhausting procedure reached the local minimum by alternative chains.

The results of the U-curve algorithm are divided in two sets: (i) until it beats the SFFS result (UC); (ii) until the search space is completely processed (UCC). The U-curve algorithm is stochastic and, at each test, it can reach the best result in a different processing time. So, the U-curve algorithm was run repeatedly for each test and the quantitative results presented are means of the values obtained in these runs. The machine used for the tests was an AMD Turion 64 with 2 GB of RAM.

In the following, each of the three experiments performed is summarized by a table, and all these tables have the same structure. The first column presents the winner of the comparison of SFFS with UC. The other columns present the cost in terms of processed nodes and computational time of SFFS, UC and UCC.

Table I shows the results for the W-operator window design experiment. Twenty tests were performed using the available training samples. UC beats SFFS in 8 of the 20 tests and reaches the same result in the remaining ones. In these last cases, both reach the global minimum element. In all cases, UC processes a smaller number of nodes, in a smaller time, than SFFS. The complete search (UCC) frequently needs to process more nodes, taking more time, than SFFS.

Test Winner Computed nodes Time(sec.)
4 UC
5 UC
6 UC
8 UC
9 UC
14 UC
16 UC
18 UC
TABLE I: Comparison between SFFS and U-curve results for the W-operator window design.

Table II shows the results for the biological classifier design experiment. Ten tests were performed using different target genes. In these examples, the complete search space is quite big ( nodes). SFFS reaches the best element, equalling UC, only times. The processing of the whole space (UCC) improved the result of UC in times. UC processed many more nodes than SFFS, but their computational times are very similar. This happens because these experiments involve a small number of samples and, therefore, the computational time spent processing a node is very small. In this case, pre-processing overhead accounts for most of the time consumed.

Test Winner Computed nodes Time(sec.)
2 UC
3 UC
4 UC
5 UC
8 UC
9 UC
10 UC
TABLE II: Comparison between SFFS and U-curve results for the biological classification design.

Table III shows the results of 8 tests using public datasets. For each test, the value in parentheses is the number of features (n) in the data set. For tests with a high number of features, the results for the complete search (UCC) are not available. We can see that UC obtained better results than SFFS in six of the tests and equal results in the two tests with a small number of features. In these two cases, SFFS reaches the best result but UC reaches it faster, processing fewer nodes.

Test Winner Computed nodes Time(sec.)
pendigits (16) EQUAL
votes (16) EQUAL
ionosphere (34) UC NA NA
dorothea_filtered (37) UC NA NA
dexter_filtered (48) UC NA NA
spambase (57) UC NA NA
sonar (60) UC NA NA
madelon (500) UC NA NA
TABLE III: Comparison between SFFS results and U-curve algorithm for the UCI Machine Learning Repository data sets.

These results show that UC is more efficient than SFFS for low-order problems, obtaining the same results with less processing. For high-order problems, UC is more accurate, but in some cases it processes more nodes and takes more time.

VI Conclusion

This paper introduces a new combinatorial problem, the Boolean U-curve optimization problem, and presents a stochastic branch-and-bound solution for it, the U-curve algorithm. This algorithm gives the optimal elements of a cost function decomposable in U-shaped chains, which may even be oscillatory in a given sense. This model makes it possible to describe the feature selection problem in the context of pattern recognition. Thus, the U-curve algorithm constitutes a new tool for approaching feature selection problems.

The U-curve algorithm exploits the particular structures of the domain and of the cost function. The Boolean nature of the domain makes it possible to represent the search space by a collection of upper and lower restrictions. At each iteration, a beginning-of-chain node is computed from the search space restrictions. The chain currently being explored is constructed from this node by choosing upper or lower adjacent nodes. There are usually several options for the beginning of a chain and for an adjacent node, and one of them is taken at random. The structure of the cost function and of the domain makes it possible to cut the search space when a local minimum is found in a chain. After a local minimum is found, all local minimum nodes connected to it are computed by the minimum exhausting procedure, and the corresponding cuts, by up-down intervals, are executed. The adjacency and connectivity relations adopted are those of the Hasse diagram of the search space, a graph in which connectivity is induced by the partial order relation. The minimum exhausting procedure prevents a node from being visited more than once and generalizes the algorithm to cost functions decomposable in some class of U-shaped oscillatory chain functions. The procedures of the U-curve algorithm are supported by formal results.
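The chain walk and the interval cut summarized above can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: it only climbs upward, ignores the restrictions while choosing adjacent nodes, and omits the minimum exhausting procedure. The names `in_search_space` and `climb_chain`, the representation of nodes as `frozenset` objects, and the toy cost function are all assumptions made for illustration.

```python
import random


def in_search_space(x, lower, upper):
    """Return True if node x lies in no interval that has been cut.

    lower: nodes A whose down interval [empty set, A] was removed.
    upper: nodes B whose up interval [B, full set] was removed.
    """
    below_cut = any(x <= a for a in lower)
    above_cut = any(x >= b for b in upper)
    return not (below_cut or above_cut)


def climb_chain(start, features, cost, rng):
    """Walk up one chain, adding a random feature per step.

    Stops at the first strict cost increase and returns the pair
    (minimum found on the chain, first worse node above it).
    The second item is None if the top of the lattice is reached.
    """
    current = frozenset(start)
    while True:
        candidates = sorted(features - current)
        if not candidates:
            return current, None
        nxt = current | {rng.choice(candidates)}
        if cost(nxt) > cost(current):
            return current, nxt
        current = nxt


def toy_cost(x):
    """Hypothetical cost that is U-shaped on every chain (depends only on size)."""
    return (len(x) - 2) ** 2


features = frozenset(range(5))
best, worse = climb_chain(frozenset(), features, toy_cost, random.Random(0))

# Under the U-shape assumption, every node containing `worse` costs at least
# as much as `worse`, so the whole up interval [worse, features] can be cut.
upper = [worse]
lower = []
```

After the cut, `in_search_space(features, lower, upper)` is false while `in_search_space(best, lower, upper)` is still true, so the nodes above `worse` are never evaluated.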

In fact, the U-curve optimization technique constitutes a new framework for studying a family of optimization problems. The representation by restrictions and the cuts by intervals, based on Boolean lattice properties, constitute a new optimization structure for combinatorial problems, with properties not found in conventional tree representations.

The U-curve algorithm was applied to practical problems and compared to SFFS. The experiments involved window operator design, genetic network identification and eight public data sets obtained from the UCI repository. In all experiments, the results of the U-curve algorithm were equal to or better than those obtained from SFFS in precision and, in many cases, even in performance. The results of the U-curve algorithm considered for comparison are the means of several executions on the same input data, since it is a stochastic algorithm whose performance may differ at each run.

The efficiency of the U-curve algorithm depends on the relative position of the local minima in the search space. The algorithm is more efficient when the local minima are near the extremities of the search space. The worst cases are those in which the local minima are near the middle of the lattice.
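One way to see why the position of the minima matters is to count the nodes removed by a single interval cut. The sketch below assumes that a cut removes either a down interval [∅, A] or an up interval [B, S] anchored at a node with k of the n features; the function name `interval_cut_sizes` and the choice n = 20 are illustrative only.

```python
def interval_cut_sizes(n, k):
    """Number of nodes in [empty set, A] and in [B, S] when |A| = |B| = k.

    A down interval contains every subset of A, so it has 2**k nodes.
    An up interval contains every superset of B, so it has 2**(n - k) nodes.
    """
    return 2 ** k, 2 ** (n - k)


n = 20
total_nodes = 2 ** n

# Fraction of the lattice removed by the larger of the two possible cuts
# for a node near the bottom, in the middle, and near the top.
largest_cut_fraction = {
    k: max(interval_cut_sizes(n, k)) / total_nodes for k in (1, 10, 19)
}
```

For n = 20, a cut anchored at a node with 1 or 19 features can remove half of the lattice, whereas a cut anchored at a node with 10 features removes 1024 of the 1,048,576 nodes, which is about 0.1%.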

The results obtained so far are encouraging, but the present version of the U-curve algorithm is not a fast solution for high-dimensional problems with many local minima in the center of the search space lattice. Addressing these problems efficiently within the U-curve optimization approach opens a number of subjects for future research, such as: developing additional cuts for the branch-and-bound formulation; designing and estimating distributions for the random parameters used in the choice of beginning nodes or adjacent paths in the construction of a chain, with the goal of reaching the best nodes earlier; building parallelized versions of the algorithm; and others.


Theorem 1. For every ,


Theorem 2. The element of returned by the minimal construction process (Algorithm 4) is a minimal element in

By looking into the steps of the minimal construction procedure:

  • Lines  7-15 guarantee that at any step of the procedure the resulted is contained in , i. e., it is updated only when the resulted satisfies the condition shown in Theorem 1.

  • Let be the sequence of resulting elements at each step () and be the initial element. As an index is chosen to be removed from (lines 4-6) at each step , it implies that .

  • Proving that the resulting element is minimal in is equivalent to proving that .

  • Let and be the step of the procedure when the index is chosen to be removed from . and imply that , i. e., cannot be removed from at the end of step . This is avoided by the algorithm (lines 8-12), when there exists an element with . As , then and, by Theorem 1, . This implies that is a minimal element in .
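The argument above can be followed on a small sketch of a minimal construction procedure. This is an illustration under stated assumptions, not a transcription of Algorithm 4, and the line numbers cited in the proof do not refer to it. It assumes that the search space is described only by lower restrictions, so that an element is admissible exactly when it is not contained in any lower-restriction element. The names `is_admissible` and `minimal_element` are hypothetical.

```python
import random


def is_admissible(x, lower):
    """x is admissible if it is not a subset of any lower-restriction element."""
    return not any(x <= a for a in lower)


def minimal_element(features, lower, rng):
    """Remove features one at a time, in random order, while staying admissible.

    Returns a minimal admissible element, or None if even the full set
    is covered by a lower restriction.
    """
    current = frozenset(features)
    if not is_admissible(current, lower):
        return None
    order = sorted(features)
    rng.shuffle(order)
    for f in order:
        candidate = current - {f}
        # Keep the removal only if the smaller element is still admissible.
        if is_admissible(candidate, lower):
            current = candidate
    return current


features = frozenset({0, 1, 2, 3})
lower = [frozenset({0, 1}), frozenset({2})]
result = minimal_element(features, lower, random.Random(1))
```

A single pass suffices here because every subset of an inadmissible element is also inadmissible: a feature whose removal was rejected at some step cannot become removable later, when the current element is smaller.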

Theorem 3. Let be the chain constructed by Algorithm 2 (or its dual version). Let be the cost function from to decomposable in U-shaped curves and . It is true that,


Suppose that and . It contradicts the hypothesis that is a function decomposable in U-shaped curves, since , but is either or , contradicting .


The authors are grateful to FAPESP (99/12765-2, 01/09401-0, 04/03967-0 and 05/00587-5), CNPq (300722/98-2, 468 413/00-6, 521097/01-0, 474596/04-4 and 491323/05-0) and CAPES for financial support. This work was partially supported by grant 1 D43 TW07015-01 from the National Institutes of Health, USA. We also thank Helena Brentani for her help with the data for biological analysis and Roberto M. Cesar Jr. for his help with the SFFS comparisons. The data sets used to generate the Table III results were obtained from the UCI Machine Learning Repository [1].


  • [1] A. Asuncion and D.J. Newman. UCI machine learning repository, 2007.
  • [2] G. J. F. Banon and J. Barrera. Minimal representations for translation-invariant set mappings by mathematical morphology. SIAM J. Appl. Math., 51(6):1782–1798, 1991.
  • [3] J. Barrera, R. M. Cesar-Jr, D. C. Martins-Jr, R. Z. N. Vencio, E. F. Merino, M. M. Yamamoto, F. G. Leonardi, C. A. B. Pereira, and H. A. del Portillo. Constructing probabilistic genetic networks of Plasmodium falciparum from dynamical expression signals of the intraerythrocytic development cycle, chapter 2, pages 11–26. Springer, 2006.
  • [4] J. Barrera and G. P. Salas. Set operations on collections of closed intervals and their applications to the automatic programming of morphological machines. Electronic Imaging, 5(3):335–352, 1996.
  • [5] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, volume 1, pages 1–19. Wiley-Interscience, 2nd edition, 2000.
  • [6] A. Frank, D. Geiger, and Z. Yakhini. A distance-based branch and bound feature selection algorithm. In Proceedings of the 19th Annual Conference on Uncertainty in Artificial Intelligence (UAI-03), pages 241–248, San Francisco, CA, 2003. Morgan Kaufmann Publishers.
  • [7] A. K. Jain, R. P. W. Duin, and J. Mao. Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4–37, 2000.
  • [8] D. C. Martins Jr, R. M. Cesar Jr, and J. Barrera. W-operator window design by minimization of mean conditional entropy. Pattern Analysis & Applications, 9:139–153, 2006.
  • [9] C. Lin, A. Ström, V. B. Vega, S. L. Kong, A. L. Yeo, J. S. Thomsen, W. C. Chan, B. Doray, D. K. Bangarusamy, A. Ramasamy, L. A. Vergara, S. Tang, A. Chong, V. B. Bajic, L. D. Miller, J. Gustafsson, and E. T. Liu. Discovery of estrogen receptor target genes and response elements in breast tumor cells. Genome Biology, 5(9):1–18, 2004.
  • [10] S. Nakariyakul and D. P. Casasent. Adaptive branch & bound algorithm for selecting optimal features. Pattern Recognition Letters, 28:1415–1427, 2007.
  • [11] P. Pudil, J. Novovicová, and J. Kittler. Floating search methods in feature selection. Pattern Recognition Letters, 15:1119–1125, 1994.
  • [12] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–656, July, October 1948.
  • [13] P. Somol and P. Pudil. Fast branch & bound algorithms for optimal feature selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(7):900–912, July 2004.
  • [14] Z. Wang, J. Yang, and G. Li. An improved branch & bound algorithm in feature selection. In Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing: 9th International Conference, Lecture Notes in Computer Science, pages 549–556, Chongqing, China, May 2003. Springer Berlin / Heidelberg.
  • [15] S. Yang and P. Shi. Bidirectional automated branch and bound algorithm for feature selection. Journal of Shanghai University, 9(3):244–248, 2005.