Probabilistic learning of boolean functions applied to the binary classification problem with categorical covariates

03/20/2020 ∙ by Paulo Hubert, et al. ∙ 0

In this work we cast the problem of binary classification in terms of estimating a partition on Bernoulli data. When the explanatory variables are all categorical, the problem can be modelled using the language of boolean functions. We offer a probabilistic analysis of the problem, and propose two algorithms for learning boolean functions from binary data.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 14

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Consider a sample

generated by two different Bernoulli distributions with parameters

and , and consider the set as the set of all indices such that

. Assuming that the components of the vector

are conditionally independent given

, the likelihood function is the product of two Binomial distribution functions, and will attain a global maximum at the set

(let’s call this set the on-set of the vector ), with maximum likelihood estimators given by and .

Now consider a design matrix and a function such that , where is the -th row of . Again, if the function is not constrained in any way, the problem is the same and the same trivial solution applies, with function defined only in the set of rows of . In this extreme case, the solution will usually not generalize well, and also will not provide any interesting interpretation (since is just an enumeration based on the on-set of ).

Standard methods for the binary classification problem are concerned with the task of estimating

constraining it in different ways such that this trivial solution (associated with the problem of overfitting) is avoided. This constraints can take many forms; one of the most studied and applied is the use of a linear model with the vector of coefficients constrained to lie in the interior of a ball with a given radius (the LASSO and Ridge regression models). Analysis then proceeds by finding

inside the feasible set such that the distance between the MLE for and is maximized.

In what follows we analyze the case where is itself a binary matrix, that is . In this case, is a boolean function, lying in a discrete and finite space with points. The complete model can be thought of as a table with rows, each row containing one combination of the values of the binary explanatory variables, plus one column that takes either the value or

, representing the probability of the Bernoulli associated with that combination of values. The binary function

assigns to each the value if the corresponding probability is .

One of the main appeals of this setup is the well known fact that any boolean function admits at least one representation as both a disjunction of conjunctions (its disjunctive normal form, DNF)

(1)

and a conjunction of disjunctions (its conjunctive normal form, CNF)

(2)

The conjunction terms in a DNF are called implicants, because . An implicant is said to be prime if no conjunction obtained by eliminating one literal from is itself an implicant. A DNF formed by prime implicants is called a prime DNF.

Now given design matrix and response vector and assuming that no two rows of are the same, one can easily prove that there will always be a boolean function that is a perfect discriminator, i.e., ; to see this, consider a binary vector , and define the on-set of to be the set , that is, the set of positions that take the value . Similarly, define the off-set by the set of positions that take the value . Then a perfect discriminator can be written as

(3)

where is the set of all rows of corresponding to positive values in .

What this function does is to form a disjunction between all rows corresponding to positive values of ; it is actually equivalent to a simple enumeration of all positive individuals (positive individual is any such that ). Of course this function will not generalize well, since it will assign the value for whatever vector that is not a positive row of .

As discussed above, an usual strategy to avoid overfitting and enhance generalization power is to constrain the function . Following Muselli and Quarati (2005), the first restriction we adopt is to monotone boolean functions. Monotone boolean functions have the attractive property that their normal forms representations do not depend on the negation operation; that is, any boolean function of the form

(4)

is monotone, and any monotone boolean function has a representation in the form 4.

This however is not a decisive restriction, since it is always possible to augment the dataset by including the negation of each original variable as a new variable, and any boolean function on the original set can be uniquely mapped to a monotone boolean function on the augmented set. Therefore, it is advisable to restrain the class of boolean functions even more.

In the spirit of the LASSO and Ridge regression models, one can define some norm on the space of (monotone) boolean functions, and constrain to lie inside a ball of a fixed radius, according to this norm. Two obvious norms are the number of terms in the DNF of , and the total number of atomic variables involved in this same normal form.

This paper aims to offer an analysis of this problem taking a probabilistic approach. This means that we incoporate the norms indicated above into kernels of prior distributions, and propose posterior inference over the set of monotone boolean functions.

The biggest challenge in the estimation of these models is the combinatorial nature of all optimization procedures involved. Direct methods suffer from non-polynomial computational complexity; the adoption of a probabilistic setup and simulation algorithms such as MCMC (with polynomial-bounded complexity, see for example Roberts and Rosenthal (2016)) or stochastic search methods are useful resources in these kinds of problems.

1.1 Applications

This model can be applied to the analysis of any binary classification problem with categorical explanatory variables (by the use of a proper encoding to transform each -category variable into binary variables).

Our main motivation, however, comes from the problem of estimating polygenic markers from genomics data. This is a topic that is currently the subject of great research interest (see for instance recent works by Mavaddat et al. (2019); Shang et al. (2019); Yin et al. (2019); Haws et al. (2015)).

In this context, the goal is to find combinations of genetic mutations (usually single nucleotid polymorphisms - SNP) that are jointly associated with a given phenotype (for instance the presence of a disease), while each single mutation taken by itself has little or no marginal effect on the same outcome.

One the most common approaches is to build a polygenic risk score: a predictive estimation of the probability of the phenotype given the genotype. The polygenic part of the score indicates that the estimation model includes interaction terms.

More than the derivation of a polygenic risk score, however, our analysis aims at highlighting specific regions of the genome that together form a polygenic marker. There is to say: besides predictive power, the goal in this work is to allow for direct interpretation of the model’s results in substantive terms, such that biologists and other specialists are able to directly evaluate the biological likelihood of the markers obtained. In this sense, an easily interpretable model is much more useful for the development of treatments than a strictly predictive model with many terms, non-linearities, etc.

This is the reason why we have chosen to adopt the discrete formulation of the problem in terms of boolean function estimation. As discussed in section 1, the DNF form of a boolean function is immediately interpretable as a logic sentence whose terms can be taken as polygenic markers (because they require that the individual is mutated in position and and etc. of the genome).

1.2 Structure of the paper

Section 2 formalizes the probabilistic model. Section 3 introduces the methods for posterior inference for this probabilistic model; section 4 presents experimental results with simulated data, and using UCI’s Mushroom dataset. Section 5 concludes the paper.

2 Probabilistic model

Consider a data set , consisting of binary features measured on individuals, and a single binary outcome for each individual. Given a boolean function , and two parameters , the are conditionally independent Bernoulli variables with probability function given by

If is written in its DNF, and assuming it is monotone, then it is completely defined by a set , , where is the number of conjunction terms, and each is the subset of variables involved in each implicant of (the DNF of) . From now on, we will use and as equivalent formulations of the same mathematical object. We will also call samples with positive (negative) samples, and samples with marked (unmarked) samples.

2.1 Likelihood

The likelihood for this model, given data and assuming conditional independence of the , is

(5)

That is, the likelihood is the product of two Bernoulli likelihoods, one for marked individuals and another for unmarked ones.

After some rearrangement, equation 5 can be written as

(6)

The log-likelihood takes the form

(7)

where , , and . That is,

is the odds for unmarked samples,

is the odds for marked samples, and is the quotient of negative probabilites for marked and unmarked samples. If we define the number of positive and marked individuals, the number of positive and the number of marked individuals, we can finally write:

(8)

The maximum likelihood estimate of probabilities and , given , are given by

(9)

as expected.

2.2 Priors

Describing the boolean function using , the set of the implicants involved in its DNF, we have that

(10)

The dimension of the parametric space in this case is itself a parameter (namely ); inference in this kind of space is a known problem well studied in the literature of Bayesian model selection (see for example Stephens (2000); Carlin and Chib (1995); Green (1995) or Sisson (2005) for a review). The usual approach is to specify a prior for through the factorization

(11)

since if , the cardinality of the set , is different from . The term is a prior over all sets of conjunction terms (i.e., over the set of subsets of with cardinality ), and is a prior on the number of terms. This prior can be used to control the sparsity of the function , penalizing functions with many terms.

The function can be used to encode whatever prior information is available over the form of , specially if this information involves more than one implicant simultaneously (for example, if it is believed that no primitive variable is involved in more than one implicant).

In the simplest of cases, assuming prior independence between all pairs of terms given , we can write

(12)

Now the function is a distribution over the power set of ; again it is possible to factorize this as

(13)

where is the cardinality of the set (i.e., the number of variables involved in implicant of ). The function is a prior over this cardinality, and can also be used to control for sparsity (in this case, the number of terms in each implicant).

This analysis leads to a two-level hierarchical prior

(14)

where it is assumed that and are independent given .

2.2.1 and

The prior model for and can also be used to encode useful information about the problem at hand.

Let us analyze first two degenerate cases: and .

In the first situation there is no actual effect of the marker, and the function loses all relevance. The posterior depends on only, and the problem becomes one of estimating a constant probability in a Binomial model.

When the marker is known a priori to have a decisive111We choose the term decisive and not causal, in order to avoid discussion of possible confounding effects. Such discussion is beyond the scope of this work. effect (in the sense that for unmarked individuals, and for marked ones).

In this case the function becomes a perfect boolean classificator. Part of its truth table is known (the sample ); if is the number of rows of , and assuming no pair of rows are equal, it is possible to determine up to a set of elements (all possible values of on the remaining combinations of ). If of course , is uniquely and immediately determined. Otherwise, techniques for obtaining an optimal representation of exist (see for example the shadow clustering method of Muselli and Quarati (2005)), where optimality is usually a balance between the number of terms and the total number of literals appearing in the representation.

In most practical situations, however, there will be less certainty about the effect size. In the extreme case of no information, a uniform prior in the rectangle can be adopted.

3 Posterior inference

As already noticed above, the space where (the DNF representation of) lies is mixed, i.e., it involves the cartesian product of discrete and continuous spaces. Moreover, the number of terms in the DNF is itself a parameter.

Inference in these kinds of models can be done analytically, by adopting conjugate distributions (Consonni and Veronese (1995)

), or numerically, which is usually done with Reversible Jump Markov Chain algorithms (

Green (1995)).

We start by analyzing the case , i.e., can be represented by a single conjunction.

3.1 Multivariate markers

Suppose now for simplicity that , i.e., the function is a single conjunction and , i.e., in this case the function is completely defined by a subset of . We call this conjunction a multivariate marker.

The prior for can then be factorized as

(15)

Here .

The prior must be chosen to reflect our knowledge over the size (number of primitive variables) of the interaction term.

A value is equivalent to , modelling the case where there is no relation whatsoever between explanatory () variables and the outcome . A value represents the case where a single variable carries all the effect. Values of greater than represent the actual multivariable marker case.

It is possible to choose priors that assign little or no mass to this two points, imposing a priori that there must be a marker and that it must be a multivariable one. It can be useful in situations where marginal effects of single variable markers are not expected.

Of course, whatever is the case it must hold that . Given only this set of conditions ( with probability one, integer values), the most conservative (i.e. maximum entropy) prior would be the uniform on . In most practical cases, however, one expects that will be much smaller than ; the prior can then be adjusted accordingly.

3.1.1 Posterior

The posterior por

can now be obtained through Bayes’ theorem

(16)

We assume that , that is, we assume prior independence between the marker (both its size and contents) and the effect size.

There is however one particular situation that deserves attention, which is the case when there is information available about the relative frequency of positives in the entire population, i.e., information about . According to the above model, the following identity holds

(17)
(18)
(19)

That is, the marginal probability is a convex combination of and , with coefficients and , and in this case prior independence between the marker and the probabilities is lost (conditionally on ).

The posterior then takes the form

(20)

where is a subset of with cardinality .

3.1.2 Posterior inference

In the simple case it is possible to avoid the complications of defining Markov chains that transverse spaces of different dimensions. If the parameter is defined as a collection of indices , then it would live in a space of dimension , and there would be the problem of different dimensions. Another possible parameterization, however, is in terms of a binary vector , where each entry of is if the corresponding index is in , and otherwise. This is a parameterization typical of Bayesian model selection methods such as spike-and-slab regression (Ishwaran and Rao (2005)).

With this parameterization, to obtain samples from model 20 we define a Markov chain with state .

The transition kernel of the chain allows two kinds of moves:

  1. Switch one component of ;

  2. Sample .

Adopting two independent Beta priors for , the conditional posterior on will also be the product of two Betas, and the second step is made into a Gibbs step.

For the first step, a randow walk, Hastings procedure is adopted: a candidate move is obtained by sampling , and switching the corresponding value . If this step is accepted, the value of is adjusted accordingly.

3.2 Multiple multivariate markers

When , i.e., when the DNF of function depends on more than one conjunction term, the problem of posterior inference becomes more involving.

A first approach might be to extend the idea of reparameterizing the space in terms of boolean vectors: since the power set of has cardinality , it is possible to define a transformation from to the integers smaller than . By doing so, the parametric space becomes the space of boolean vectors with dimension or, equivalently, the space of integers from to . The on-bits of would point to the integer representation of each multivariate marker.

For instance, suppose for simplicity that . There are possible conjunction terms involving the three variables. Assuming the usual binary representation of an integer, it is possible to map the empty set to , the set including only the third variable to , the set including the second variable only to and so on. Now the function can be represented by a binary string of length ; suppose . This can be transformed to the set representation of by first forming the set of indices of the conjunction terms, which in this case is given by , and finally by transforming each integer to its -digit binary representation, obtaining , or .

The main issues with this idea are the prohibitive memory requirements if becomes too large, and the computational cost of applying the function , encoded as an integer, to the variables . It is in principle possible, however, to imagine a clever way to implement such an algorithm by resorting to low-level languages working directly on binary values as stored in memory. This picture is illustrative nevertheless, as it gives a good feeling about the size of the space in which lies.

The set representation of writes it as a point where each coordinate represents a conjunction term. In this parameterization, posterior inference will necessarily involve transdimensional methods, i.e., methods that are capable of searching in spaces formed as the union of subspaces with varying dimension.

The Reversible Jump MCMC of Green (Green (1995); Hastie and Green (2012)) is one such method, which has already proven to be effective in many problems. The idea is to build an ergodic Markov Chain on the union space, assuring detailed balance for each move type. The challenging situation is, as expected, to assure the balance between moves between spaces with different dimensions.

Green achieves this balance by proposing to augment both spaces in such a way that the augmented spaces have equal dimensions, and then defining a diffeomorphism between these spaces (for details, please see Hastie and Green (2012) and references therein).

3.3 RJMCMC for boolean function learning

A boolean function whose DNF consists of a single conjunction term does not pose any particular technical difficulties for posterior estimation, as seen in the last sections. When the number of conjunction terms is allowed to vary, however, it is necessary to consider balance between transdimensional moves in the design of the MCMC algorithm.

Conisder a Markov chain with only two types of transdimensional moves: a birth, that increases the space dimension from to , and a death that decreases the space dimension from to .

Suppose then that . There are many different ways to transform this to a lower dimensional version; probably the simplest would be to randomly delete one coordinate of , with the reverse move being given by a random selection from . This move, however, would lead to chains with slow mixing time, for the deletion move would throw away information (since the deleted coordinate is probably located in a region of higher posterior mass).

One could adopt split and merge move types instead: selecting two coordinates of and merging them into one preserve some of the information already accumulated by the chain.

The question is then how to perform the split, and, more importantly, how to define the reverse merging move. Green’s theory (and most applications of RJMCMC in the literature) is originally built to work on the product of continuous spaces such as . In this setting, when moving from to with , say, he proposes augmenting the first space with continuous random variables, and defining a diffeomorphism between and ; the differentiability requirement is needed in order to obtain acceptance probabilites which depend on the Jacobian of the transformation.

The current situation is somewhat different, since the spaces with varying dimensions are discrete. It is possible, however, to adapt the methodology of Green to this problem, as shown below.

3.3.1 RJMCMC for set spaces

Define a Markov chain with state given by , where the subscript indicates that the PDNF of the boolean function is written using conjunction terms. That is, . The transdimensional steps we propose are: adding a new conjunction term (), or removing a conjunction term ().

What is needed next is then a) a way to augment the lower dimensional space, which then becomes ; b) a transformation between the original dimensional space, and the augmented dimensional space.

In the original formulation of RJMCMC, this transformation must have an inverse, and also be a diffeomorphism, in order that the Jacobian can be calculated. This conditions are necessary in the calculation of the acceptance probability for transdimensional moves.

Denoting the full posterior by and considering only the moves between and spaces (either way), the intra-move detailed balance condition can be written as

(21)

for all sets , and where represents the probability of a transition between the states and (here we ommited dependence on and for simplicity of notation). This transition probability is the product of the candidate distribution (which might depend on the current state) and an acceptance probability.

Now define the transformation as , the symmetric difference of sets and . The choice of this transformation is motivated by the fact that the symmetric difference is the only set operation that can be inverted, in the sense that and .

Using this transformation, it is possible to define forward and backward moves as following:

  1. : select two elements of , , and replace them by ;

  2. : select one element of , one element , and replace by .

Consider the case . Since the transformation defines a bijection from into itself, the detailed balance requires equality between the transition probabilities

(22)

Then for each , and each such that , the balance condition becomes

(23)

where is the candidate mass at point , and in this case the acceptance probability

(24)

can be used.

The Markov chain thus defined will be ergodic, and sampling from it will eventually provide samples from the posterior distribution of our model.

The issue, however, is that the symmetric difference might not be a good way to traverse the parameter space, because at each death step we loose information about the intersection of the two selected terms. Also, the acceptance probability and mixing properties of this chain will be far from ideal, since the birth move will select any conjunction term from the entire set , and this will be likely to take the chain to states in regions with lower posterior mass.

These observations, combined with the fact that the state space of this chain is very large, motivates us to propose a different algorithm for posterior optimization in the case of multiple multivariate markers. Instead of defining a Markov chain and aiming at obtaining samples from the full posterior, we propose a stochastic search algorithm to optimize the posterior.

3.4 Stochastic search

Stochastic search algorithms are optimization procedures that take random steps through the set of feasible points. They’re widely adopted in combinatorial problems, where there is no gradient available.

For the optimization of the posterior for the multiple markers case we propose a simulated annealing algorithm, inspired by the MCMC structure from last section.

Simulated annealing is a well known algorithm that links the Metropolis method with combinatorial optimization

Kirkpatrick et al. (1983). Given a representation of a point in the feasible space, a probabilistic dynamic is the defined to allow the algorithm to move between feasible points and explore the configuration space. As in classical Metropolis algorithm, a move to a point with higher posterior (compared to the current point) is always accepted, and a move to a point with lower posterior is randomly accepted with a given probability. In the simulated annealing algorithm, this probability decays over time (i.e. over iterations). In the early stages of the algorithm, then, the configuration space is quickly traversed and explored, and as the acceptance probability falls (cooling) the algorithm will converge to a locally optimal point.

3.4.1 Simulated annealing for boolean function learning

The most important component of a simulated annealing algorithm is the dynamics, i.e., the moves that allow the algorithm to explore the space. For our model of the boolean function estimation problem, there are three groups of moves that must be included: a) changing and ; b) changing one multivariate markers; c) adding a new marker or deleting a current marker.

The key for an efficient algorithm lies in the design of such moves.

For the first group, the (conditionally) conjugate Beta distribution provides an obvious strategy to generate candidate moves. For changing a current marker, the move must consist in choosing one one marker from the current set, and then including (deleting) a variable in (from) the marker. The algorithm we propose will define two moves: one, selecting a marker with probability proportional to its current size, and then randomly removing a variable from the marker. Two, selecting a marker with probability inversely proportional to its current size, and then including a random new variable in the marker.

The third group is the most challenging, as it involves the change of dimension of the current point. For the death move (removing a term from the disjunction) there are one obvious way to proceed, which is to randomly select one marker and thoroughly remove it from the set; another possible approach would be to randomly select a pair of markers, and replace them by ther intersection. This second strategy helps the algorithm to stay in high posterior mass regions, by preserving some of the information condensated in the current point.

Finally, the birth move is the most challenging, since there is no obvious weay to select a good candidate for the new marker; selecting randomly from the entire set of possible markers will be very unefficient, for it will most likely select a marker that is (very) far from the high posterior mass regions.

One possible approach is to select a new marker by taking one row of as the new marker. This strategy guarantees that the algorithm does not wander too much in the configuration space, stopping it for example to include a marker involving a conjunction that is not satisfied by any individual in the dataset.

Our final algorithm, then, consists of move types:

  1. Sample from the conditional posterior (Beta);

  2. Sample from the conditional posterior (Beta);

  3. Select one marker with probability proportional to marker size, flip one bit off;

  4. Select one marker with probability inversely proportional to marker size, flip one bit on;

  5. Include a new marker, taken from the set of affected individuals in the sample;

  6. Select a pair of markers, replace them by their intersection;

  7. Select one marker and remove it from the current function.

Moves are randomly selected at the beginning of each iteration. In this version of the algorithm the selection probabilities remain fixed, but it is possible to develop an adaptive scheme for selecting the move type.

If the posterior is increased, the move is accepted with certainty; otherwise, it will be randomly accepted wtih probability

where is the posterior and is a decreasing sequence of positive numbers. During the initial steps, is used to increase the acceptance probability, even when the proposed state has a much lower posterior value. This is meant to facilitate the exploration of the parameter space in the early stages of the algorithm. As

decreases, moves that take the process to states with lower posterior probability will be more and more unlikely to be accepted. A cooling schedule for

must be adopted; one common choice is to take with the cooling factor.

Since the most difficult moves in this setup are the inclusion of new terms in the current marker, we apply the modification of the acceptance probability for this move only.

4 Experimental results

4.1 Simulated data

To conduct a first test of the proposed algorithms, we simulate binary data in the following way:

  1. Given number of samples , number of variables , number of conjunction terms in , size of each marker , proportion of marked individuals in the sample , and probabilities :

  2. Define ;

  3. Generate from a uniform and independent distribution;

  4. For :

    1. Generate from independent Beta distributions with probabilities given by ;

    2. Calculate ;

    3. If , generate ; otherwise generate .

This procedure emulates random sampling from a population. The final proportion of affected individuals in this sample will depend on the size of the marker and on and .

As a first test, we use a function with a single conjunction term, and take . We ran four parallel chains starting with markers selected randomly from the rows of . After running each chain for iterations, we take the final generated points as the sample from the posterior. The posterior histograms and traceplots for and are shown in figure 1, and the posterior histograms for are shown in figure 2. All runs in this section use a non-informative prior for , and independent priors for both and .

Figure 1: Posterior histograms and traceplots for and

0

Figure 2: Posterior histograms for

The chains converge and correctly attributes high posterior mass to the actual conjunction term in function .

Of course in this toy example both the effect and sample sizes are large. For fainter effects and keeping the sample size constant, the convergence will be slower and the posterior less concentrated.

The sample size has an important effect in this setup, specially because of the random sampling schema, where depending on the size of the marker the marked individuals will appear very rarily. In these cases, it might be best to adopt an informative sampling scheme (for instance sampling affected individuals with higher probability), and adjust the model accordingly.

4.1.1 Multiple markers

Next we simulate a sample in the same conditions as the first one, but now with consisting of two terms, boith with size . Running the MCMC for the single marker case on this data we find that the chains converge to the single marker which happens to be more frequent in the database.

To apply the simulated annealing algorithm we take and . This forces the algorithm to accept a lot of increase dimension moves. Our tests indicate that high values of these parameter are necessary for the algorithm to be able to effectively explore higher dimensions.

To choose an initial point for each algorithm we first choose the number of markers, an then select individuals from the sample, taking their corresponding rows in to be the conjunction terms.

An algorithm with good properties would be insensitive to the inital point. However, given the combinatorial nature of the problem, we suggest to set at high values. In this way the death steps are more effective and the algorithm will have a superior performance.

The hyperparameters involved in the prior for the function

are also important. If running with an entirely non-informative prior for , the algorithm will likely increase the size of (both its number of terms and the size of each term), trying to capture as many affected individuals as it can. In the extreme case, if there is a perfect monotone boolean discriminator for this dataset, the algorithm with non-informative priors will tend to move towards it (since it is the global maximum point of the likelihood).

To verify the effect of the hyperparameters, we run the algorithm on the simulated data using and ; for each value of we run the algorithm from parallel starting points, such that . We repeat these runs three times: one with a completely non-informative prior, one using independent Poisson priors with for the size of each conjunction term and a non-informative prior for (the number of markers), and finally using the Poisson priors for the size and a geometric prior with for . Results appear in table 1. A total number of steps is performed in each trial.

nstart nchains term 1 term 2
1 20 0.0 0.0 0.34 0.73 2.4 45.9 0.2 0.0
1 20 10.0 0.0 0.31 0.67 1.4 16.6 0.2 0.0
1 20 10.0 0.5 0.38 0.68 1.3 10.4 0.1 0.0
10 2 0.0 0.0 0.38 0.74 13.0 516.5 0.0 0.0
10 2 10.0 0.0 0.39 0.64 8.4 272.7 0.0 0.0
10 2 10.0 0.5 0.38 0.58 7.2 212.7 0.0 0.0
20 1 0.0 0.0 0.39 0.59 20.1 826.0 0.0 0.0
20 1 10.0 0.0 0.39 0.53 19.9 803.6 0.0 0.0
20 1 10.0 0.5 0.39 0.49 20.1 806.6 0.0 0.0
Table 1: Simulated annealing on synthetic data: independent datasets, mean of optimal values of and , size of final marker (), total number of atomic terms (), and proportion of runs which detected term 1 and term 2, with

The small number of steps prevented the algorithm from finding the correct function in all cases; however, the effect of the prior hyperparameters is evident. Also it becomes aparent that it is best to start many parallel chains with a single term in the starting point for the function , than starting a single chain with a starting with many terms. To analyze this further, we rerun the simulations, now adopting and throughout, but increasing the number of steps to in each trial. Results appear in table 2.

nstart nchains term 1 term 2
1 1 0.24 0.90 1.1 2.2 0.6 0.5
1 2 0.16 0.90 1.6 3.2 1.0 0.6
1 20 0.10 0.90 2.1 4.7 1.0 1.0
10 1 0.16 0.90 1.6 3.2 0.9 0.7
10 2 0.16 0.90 1.6 3.2 0.7 0.9
10 20 0.12 0.90 1.9 3.8 0.9 1.0
20 1 0.20 0.90 1.4 2.8 0.6 0.8
20 2 0.14 0.90 1.7 3.4 0.9 0.8
20 20 0.10 0.90 2.1 4.5 1.0 1.0
Table 2: Simulated annealing on synthetic data ( independent simulated datasets); average estimated values for and , size of final marker (), total number of atomic terms (), and proportion of runs that correctly included each marker.

The size of the initial point for function is not as important as the number of parallel chains. This is to be expected since the algorithm is built to favour transdimensional steps at the early stages. With a sufficient number of parallel points, out of runs converged to the correct point.

As a last test, we rerun the simulated annealing algorithm with and parallel chains, now varying the sample size of the simulated datasets; we simulate points on the same conditions. The results are in table 3.

term 1 term 2
100 0.23 0.88 1.1 2.2 0.6 0.5
250 0.18 0.90 1.5 3.0 0.7 0.8
500 0.11 0.88 1.9 3.8 0.9 1.0
Table 3: Simulated annealing on synthetic data ( independent simulated datasets); average estimated values for and , size of final marker (), total number of atomic terms (), and proportion of runs that correctly included each marker, for

For a sample size of the algorithm correctly identified the function in out of simulated datasets. For smaller sample sizes, the performance is worse, but even with one of the correct terms in was correctly identified half of the time.

4.2 Mushroom dataset

To test the simulated annealing algorithm in a real dataset, we pick the Mushroom dataset from UCI repository222Available at https://archive.ics.uci.edu/ml/datasets/mushroom. This is a dataset for a binary classification problem with categorical explanatory variables, and our proposed method is well suited to the dataset.

There are individuals with categorical attributes (veil-type

is removed as all individuals have the same value for this variable). After converting the multicategory variables to binary (dummy) variables, the design matrix has

columns.

To select the hyperparameters of our model, we use a cross-validation strategy, splitting the data in training and testing subsets with individuals each (i.e., we use of the data to learn the boolean function, and test the function learned on the other ).

Applying a grid search procedure for the hyperparameters (Poisson priors for the number of atomic variables in each conjunction term with and a geometric prior for the number of conjunction terms with , we find that most combinations of hyperparameters converge to boolean functions with very low generalization error as estimated by the cross-validation procedure. Table 4

shows the average AUC of the boolean function classifier for a

fold repetitions of the sample split. The table also shows the average size of the estimated function and the average number of atomic variables in the estimated function, . Each run of the algorithm takes on average minutes on a desltop with CPUs and of RAM, running Fedora 27.

AUC
2 0.1 6.9 13.4 0.9993
2 0.5 6.9 13.3 0.9992
2 0.9 7.0 13.3 0.9993
5 0.1 6.7 14.9 0.9999
5 0.5 6.5 14.4 0.9991
5 0.9 6.7 15.7 0.9995
10 0.1 6.6 19.9 0.9994
10 0.5 6.3 17.2 0.9988
10 0.9 6.1 16.4 0.9987
30 0.1 6.5 32.1 0.9998
30 0.5 6.5 29.4 0.9999
30 0.9 5.8 24.6 0.9985
Table 4: Cross-validation results for the Mushroom dataset using half of the sample as training set and hal as test set, repeated times.

On this dataset the effect of the geometric prior is weaker than the effect of the Poisson prior for the total number of atomic terms. The best AUC where achieved by and , with AUC of for both.

Table 5 shows the results for each of the repetitions when and .

Run AUC
0 6 14 0.9990
1 7 22 1.0000
2 7 14 1.0000
3 6 15 1.0000
4 7 14 1.0000
5 6 14 1.0000
6 7 15 1.0000
7 7 13 1.0000
8 7 15 1.0000
9 7 13 1.0000
Table 5: Cross-validation results for the Mushroom dataset using half of the sample as training set and hal as test set, repeated times with fixed hyperparameters

In out of runs the algorithm identified a boolean function with . The functions obtained are not the same, which indicates that there are many logical rules that can be used to describe a poisonous mushroom given the features in the dataset.

The logical expressions derived from the boolean function estimated in each case appear in appendix A.

5 Conclusion

The problem of binary classification can be seen as a problem of estimating a partition on Bernoulli data. When the explanatory variables are all categorical, the problem can be cast in the language of boolean function estimation.

In this work we propose a Bayesian algorithm to estimate Bernoulli partitions based on boolean functions. Despite the complexity of the problem, the probabilistic methods and in particular the proposed simulated annealing algorithm show promising results in identifying patterns that have good classification performance and low generalization error. Also this method provides classificators that are immediately interpretable as logical rules, which can be useful in many applications.

The implementation of the algorithms studied in this paper, along with iPython notebooks that can be used to replicate our results are available on http://github.com/paulohubert/babool.

Acknowledgments

This work is made possible by a generous funding from Instituto Paulo Gontijo, a non-profit brazilian organization dedicated to the study of Lateral Amyotrophic Sclerosis. The author is deeply grateful for their support.

Appendix A Logical expressions estimated for the Mushroom dataset

  1. (stalk-root = c AND stalk-surface-below-ring = y) OR (cap-shape = x AND odor = c AND ring-type = p) OR (odor = p AND stalk-root = e) OR (odor = f AND veil-color = w) OR (stalk-surface-below-ring = s AND spore-print-color = r) OR (gill-size = n AND stalk-root = ? AND ring-type = e)

  2. (bruises = f AND stalk-shape = e AND ring-type = n) OR (odor = f AND veil-color = w AND ring-number = o) OR (bruises = t AND gill-size = n AND stalk-shape = e) OR (gill-spacing = c AND stalk-color-above-ring = w AND spore-print-color = r) OR (cap-color = y AND gill-spacing = w AND stalk-color-below-ring = y) OR (odor = c AND gill-size = n AND stalk-surface-above-ring = s) OR (gill-spacing = c AND ring-type = e AND spore-print-color = w AND population = v)

  3. (stalk-surface-above-ring = k AND habitat = d) OR (bruises = t AND gill-size = n AND stalk-shape = e) OR (odor = f AND veil-color = w) OR (stalk-surface-above-ring = s AND spore-print-color = r) OR (stalk-surface-below-ring = y AND ring-type = e AND habitat = l) OR (gill-color = b) OR (odor = c)

  4. (stalk-root = ? AND ring-number = o AND ring-type = e) OR (stalk-surface-below-ring = s AND spore-print-color = r) OR (odor = f AND gill-attachment = f AND veil-color = w) OR (odor = c) OR (bruises = t AND gill-size = n AND stalk-shape = e) OR (bruises = f AND stalk-root = c AND spore-print-color = w)

  5. (stalk-surface-below-ring = s AND spore-print-color = r) OR (gill-spacing = w AND population = c AND habitat = l) OR (odor = c) OR (odor = f AND veil-color = w) OR (gill-spacing = c AND stalk-surface-above-ring = k) OR (gill-color = b) OR (odor = p AND stalk-shape = e AND stalk-surface-below-ring = s)

  6. (bruises = f AND gill-spacing = c AND ring-type = e) OR (stalk-root = c AND spore-print-color = w) OR (bruises = t AND gill-size = n AND stalk-shape = e) OR (odor = f AND veil-color = w) OR (stalk-surface-above-ring = s AND spore-print-color = r) OR (odor = c AND stalk-root = b)

  7. (gill-size = n AND stalk-root = ? AND spore-print-color = w) OR (stalk-shape = e AND stalk-surface-below-ring = s AND habitat = d) OR (gill-spacing = w AND population = c) OR (odor = f) OR (spore-print-color = r) OR (odor = p AND gill-attachment = f AND stalk-color-above-ring = w) OR (odor = m AND stalk-color-below-ring = c)

  8. (gill-spacing = c AND gill-size = n AND spore-print-color = w) OR (bruises = f AND odor = c) OR (odor = m AND spore-print-color = w) OR (spore-print-color = r) OR (odor = p AND stalk-shape = e) OR (odor = f) OR (gill-spacing = w AND population = c)

  9. (gill-size = b AND spore-print-color = h) OR (bruises = f AND stalk-root = b AND stalk-color-below-ring = w) OR (stalk-surface-above-ring = k AND habitat = d) OR (bruises = f AND gill-color = b) OR (gill-spacing = w AND population = c) OR (gill-size = b AND spore-print-color = r) OR (bruises = t AND odor = p)

  10. (odor = p AND gill-attachment = f) OR (odor = c AND stalk-surface-above-ring = s) OR (gill-color = b) OR (odor = f) OR (gill-spacing = w AND population = c) OR (stalk-shape = e AND stalk-surface-above-ring = k AND stalk-surface-below-ring = y AND veil-color = w) OR (spore-print-color = r)

References

  • Carlin and Chib (1995) Carlin, B. P. and Chib, S. (1995). Bayesian model choice via markov chain monte carlo methods. Journal of the Royal Statistical Society. Series B (Methodological), 57(3):473–484.
  • Consonni and Veronese (1995) Consonni, G. and Veronese, P. (1995). A bayesian method for combining results from several binomial experiments. Journal of the American Statistical Association, 90(431):935–944.
  • Green (1995) Green, P. J. (1995). Reversible jump markov chain monte carlo computation and bayesian model determination. Biometrika, 82(4):711–732.
  • Hastie and Green (2012) Hastie, D. I. and Green, P. J. (2012). Model choice using reversible jump markov chain monte carlo. Statistica Neerlandica, 66(3):309–338.
  • Haws et al. (2015) Haws, D. C., Rish, I., Teyssedre, S., He, D., Lozano, A. C., Kambadur, P., Karaman, Z., and Parida, L. (2015). Variable-selection emerges on top in empirical comparison of whole-genome complex-trait prediction methods. PLoS ONE, 10.
  • Ishwaran and Rao (2005) Ishwaran, H. and Rao, J. S. (2005). Spike and slab variable selection: Frequentist and bayesian strategies. Ann. Statist., 33(2):730–773.
  • Kirkpatrick et al. (1983) Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220(4598):671–680.
  • Mavaddat et al. (2019) Mavaddat, N., Michailidou, K., Dennis, J., Lush, M., Fachal, L., Lee, A., Tyrer, J. P., Chen, T.-H., Wang, Q., Bolla, M. K., Yang, X., Adank, M. A., Ahearn, T., Aittomäki, K., Allen, J., Andrulis, I. L., Anton-Culver, H., Antonenkova, N. N., Arndt, V., Aronson, K. J., Auer, P. L., Auvinen, P., Barrdahl, M., Freeman, L. E. B., Beckmann, M. W., Behrens, S., Benitez, J., Bermisheva, M., Bernstein, L., Blomqvist, C., Bogdanova, N. V., Bojesen, S. E., Bonanni, B., Børresen-Dale, A.-L., Brauch, H., Bremer, M., Brenner, H., Brentnall, A., Brock, I. W., Brooks-Wilson, A., Brucker, S. Y., Brüning, T., Burwinkel, B., Campa, D., Carter, B. D., Castelao, J. E., Chanock, S. J., Chlebowski, R., Christiansen, H., Clarke, C. L., Collée, J. M., Cordina-Duverger, E., Cornelissen, S., Couch, F. J., Cox, A., Cross, S. S., Czene, K., Daly, M. B., Devilee, P., Dörk, T., dos Santos-Silva, I., Dumont, M., Durcan, L., Dwek, M., Eccles, D. M., Ekici, A. B., Eliassen, A. H., Ellberg, C., Engel, C., Eriksson, M., Evans, D. G., Fasching, P. A., Figueroa, J., Fletcher, O., Flyger, H., Försti, A., Fritschi, L., Gabrielson, M., Gago-Dominguez, M., Gapstur, S. M., García-Sáenz, J. A., Gaudet, M. M., Georgoulias, V., Giles, G. G., Gilyazova, I. R., Glendon, G., Goldberg, M. S., Goldgar, D. E., González-Neira, A., Alnæs, G. I. G., Grip, M., Gronwald, J., Grundy, A., Guénel, P., Haeberle, L., Hahnen, E., Haiman, C. A., Håkansson, N., Hamann, U., Hankinson, S. E., Harkness, E. F., Hart, S. N., He, W., Hein, A., Heyworth, J., Hillemanns, P., Hollestelle, A., Hooning, M. J., Hoover, R. N., Hopper, J. L., Howell, A., Huang, G., Humphreys, K., Hunter, D. J., Jakimovska, M., Jakubowska, A., Janni, W., John, E. M., Johnson, N., Jones, M. E., Jukkola-Vuorinen, A., Jung, A., Kaaks, R., Kaczmarek, K., Kataja, V., Keeman, R., Kerin, M. J., Khusnutdinova, E., Kiiski, J. I., Knight, J. A., Ko, Y.-D., Kosma, V.-M., Koutros, S., Kristensen, V. N., Krüger, U., Kühl, T., Lambrechts, D., Marchand, L. L., Lee, E., Lejbkowicz, F., Lilyquist, J., Lindblom, A., Lindström, S., Lissowska, J., Lo, W.-Y., Loibl, S., Long, J., Lubiński, J., Lux, M. P., MacInnis, R. J., Maishman, T., Makalic, E., Kostovska, I. M., Mannermaa, A., Manoukian, S., Margolin, S., Martens, J. W., Martinez, M. E., Mavroudis, D., McLean, C., Meindl, A., Menon, U., Middha, P., Miller, N., Moreno, F., Mulligan, A. M., Mulot, C., Muñoz-Garzon, V. M., Neuhausen, S. L., Nevanlinna, H., Neven, P., Newman, W. G., Nielsen, S. F., Nordestgaard, B. G., Norman, A., Offit, K., Olson, J. E., Olsson, H., Orr, N., Pankratz, V. S., Park-Simon, T.-W., Perez, J. I., Pérez-Barrios, C., Peterlongo, P., Peto, J., Pinchev, M., Plaseska-Karanfilska, D., Polley, E. C., Prentice, R., Presneau, N., Prokofyeva, D., Purrington, K., Pylkäs, K., Rack, B., Radice, P., Rau-Murthy, R., Rennert, G., Rennert, H. S., Rhenius, V., Robson, M., Romero, A., Ruddy, K. J., Ruebner, M., Saloustros, E., Sandler, D. P., Sawyer, E. J., Schmidt, D. F., Schmutzler, R. K., Schneeweiss, A., Schoemaker, M. J., Schumacher, F., Schürmann, P., Schwentner, L., Scott, C., Scott, R. J., Seynaeve, C., Shah, M., Sherman, M. E., Shrubsole, M. J., Shu, X.-O., Slager, S., Smeets, A., Sohn, C., Soucy, P., Southey, M. C., Spinelli, J. J., Stegmaier, C., Stone, J., Swerdlow, A. J., Tamimi, R. M., Tapper, W. J., Taylor, J. A., Terry, M. B., Thöne, K., Tollenaar, R. A., Tomlinson, I., Truong, T., Tzardi, M., Ulmer, H.-U., Untch, M., Vachon, C. M., van Veen, E. M., Vijai, J., Weinberg, C. R., Wendt, C., Whittemore, A. S., Wildiers, H., Willett, W., Winqvist, R., Wolk, A., Yang, X. R., Yannoukakos, D., Zhang, Y., Zheng, W., Ziogas, A., Dunning, A. M., Thompson, D. J., Chenevix-Trench, G., Chang-Claude, J., Schmidt, M. K., Hall, P., Milne, R. L., Pharoah, P. D., Antoniou, A. C., Chatterjee, N., Kraft, P., García-Closas, M., Simard, J., and Easton, D. F. (2019). Polygenic risk scores for prediction of breast cancer and breast cancer subtypes. The American Journal of Human Genetics, 104(1):21–34.
  • Muselli and Quarati (2005) Muselli, M. and Quarati, A. (2005). Reconstructing positive boolean functions with shadow clustering. In Proceedings of the 2005 European Conference on Circuit Theory and Design, 2005., volume 3, pages III/377–III/380 vol. 3.
  • Roberts and Rosenthal (2016) Roberts, G. O. and Rosenthal, J. S. (2016). Complexity bounds for markov chain monte carlo algorithms via diffusion limits. Journal of Applied Probability, 53(2):410–420.
  • Shang et al. (2019) Shang, J., Wang, X., Wu, X., Sun, Y., Ding, Q., Liu, J., and Zhang, H. (2019). A review of ant colony optimization based methods for detecting epistatic interactions. IEEE Access, 7:13497–13509.
  • Sisson (2005) Sisson, S. A. (2005). Transdimensional markov chains: A decade of progress and future perspectives. Journal of the American Statistical Association, 100(471):1077–1089.
  • Stephens (2000) Stephens, M. (2000). Bayesian analysis of mixture models with an unknown number of components—an alternative to reversible jump methods. Ann. Statist., 28(1):40–74.
  • Yin et al. (2019) Yin, B., Balvert, M., van der Spek, R. A. A., Dutilh, B. E., Bohté, S., Veldink, J., and Schönhuth, A. (2019).

    Using the structure of genome data in the design of deep neural networks for predicting amyotrophic lateral sclerosis from genotype.

    Bioinformatics, 35(14):i538–i547.