Discovering Reliable Dependencies from Data: Hardness and Improved Algorithms

09/14/2018 · Panagiotis Mandros et al. · Max Planck Society

The reliable fraction of information is an attractive score for quantifying (functional) dependencies in high-dimensional data. In this paper, we systematically explore the algorithmic implications of using this measure for optimization. We show that the problem is NP-hard, which justifies the usage of worst-case exponential-time as well as heuristic search methods. We then substantially improve the practical performance for both optimization styles by deriving a novel admissible bounding function that has an unbounded potential for additional pruning over the previously proposed one. Finally, we empirically investigate the approximation ratio of the greedy algorithm and show that it produces highly competitive results in a fraction of the time needed for complete branch-and-bound style search.


I Introduction

Given a data sample $D_n$ drawn from the joint distribution $p$ of some input variables $\mathcal{I} = \{X_1, \dots, X_d\}$ and an output variable $Y$, it is a fundamental problem in data analysis to find variable subsets $\mathcal{X} \subseteq \mathcal{I}$ that jointly influence or (approximately) determine $Y$. This functional dependency discovery problem, i.e., to find

$\mathcal{X}^{*} \in \arg\max_{\mathcal{X} \subseteq \mathcal{I}} \hat{Q}(\mathcal{X};Y)$ (1)

for some real-valued measure $\hat{Q}$ that assesses the dependence of $Y$ on $\mathcal{X}$, is a classic topic in the database community [1, Ch. 15], but it also has many other applications, including feature selection [2] and knowledge discovery [3]. For instance, finding such dependencies can help identify compact sets of descriptors that capture the underlying structure and actuating mechanisms of complex scientific domains (e.g., [4, 5]).

For categoric input and output variables, the measure $\hat{Q}$ can be chosen to be the fraction of information [6, 7, 8], defined as

$F(\mathcal{X};Y) = \big(H(Y) - H(Y \mid \mathcal{X})\big)/H(Y) = I(\mathcal{X};Y)/H(Y)$,

where $H$ denotes the Shannon entropy. This score represents the relative reduction of uncertainty about $Y$ given $\mathcal{X}$. It takes on values between $0$ and $1$, corresponding to independence and exact functional dependency, respectively.

Estimating the score naively by plugging in empirical probabilities, however, leads to an overestimation of the actual dependence between $\mathcal{X}$ and $Y$, a behavior known as dependency-by-chance [9]. In particular, since the bias increases with the domain size of the variables [10], the naive estimate $\hat{F}$ is unsuitable for dependency discovery, where we have to soundly compare variable sets of varying dimensionality and consequently of widely varying domain sizes (see Fig. 1). In some feature selection approaches (see, e.g., [11]) this problem is mitigated by only considering dependencies of individual variables or pairs. Alternatively, some algorithms from the database literature, e.g., [12, 13], neglect this issue by assuming a closed world, i.e., the unknown data generation process is considered equal to the empirical distribution [7].

Figure 1: Dependency-by-chance. Estimated fraction of information between variables of increasing domain size (4 to 2048) and an independent target (domain size 4) for fixed sample size (1000). The estimated dependency increases for the naive estimator $\hat{F}$, while the corrected-for-chance estimator $\hat{F}_0$ accurately estimates the population value of $0$.
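The effect shown in Fig. 1 can be reproduced with a few lines of code. The following self-contained Python sketch is illustrative only; the helper names and the simulation setup are assumptions, not the paper's implementation.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Empirical Shannon entropy (in bits) of a sequence of labels."""
    n = len(labels)
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / n
    return float(-(p * np.log2(p)).sum())

def fraction_of_information(x, y):
    """Naive empirical fraction of information F(X;Y) = I(X;Y) / H(Y)."""
    hy = entropy(y)
    mi = entropy(x) + hy - entropy(list(zip(x, y)))  # empirical mutual information
    return mi / hy if hy > 0 else 0.0

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 4, size=n)  # target with domain size 4, independent of every X below

for k in [4, 16, 64, 256, 1024, 2048]:
    x = rng.integers(0, k, size=n)  # independent X with domain size k
    print(f"domain size {k:5d}: naive F = {fraction_of_information(x, y):.3f}")
# The naive estimate drifts towards 1 as k grows, although the true dependence is 0.
```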

Both of the approaches above are infeasible in the statistical setting with arbitrarily sized variable sets that we are interested in. Instead, here, the fraction of information can be corrected by subtracting its estimated expected value under the hypothesis of independence. This gives rise to the reliable fraction of information [14, 15], defined as

$\hat{F}_0(\mathcal{X};Y) = \hat{F}(\mathcal{X};Y) - \hat{\mathbb{E}}_0\big[\hat{F}(\mathcal{X};Y)\big]$ (2)

where $\hat{\mathbb{E}}_0[\hat{F}(\mathcal{X};Y)]$ is the expected value of $\hat{F}(\mathcal{X};Y)$ under the permutation model [16, p. 214], i.e., under the operation of permuting the empirical $Y$ values with a random permutation $\sigma$. This estimator can be computed efficiently (see [17] and appendix). Moreover, the maximization problem (Eq. (1)) can be solved effectively by a simple branch-and-bound scheme: the maximally attainable $\hat{F}_0$ for supersets of some partial solution $\mathcal{X}$ can be bounded by the function $1 - \hat{\mathbb{E}}_0[\hat{F}(\mathcal{X};Y)]$, which follows from the monotonicity of the correction term [14].

This, however, is a rather simplistic bounding function that leaves room for substantial improvements. Moreover, it is unclear whether one has to rely on exponential-time worst-case branch-and-bound algorithms in the first place. Finally, the option of heuristic optimization has not yet been explored.

To this end, this paper provides the following contributions:

  1. We show that the problem of maximizing the reliable fraction of information is NP-hard. This justifies the usage of worst-case exponential-time algorithms as well as heuristic search methods (Sec. III).

  2. Motivated by this insight, we then greatly improve the practical performance for both of these optimization styles by deriving a novel admissible bounding function. This function is not only tighter than the previously proposed one; in particular, the supremum of the gap between the two bounds, and thus the potential for additional pruning in search, is unbounded (Sec. IV).

  3. Finally, we report extensive empirical results evaluating the proposed bounding function and the various algorithmic strategies. In particular, we consider the approximation ratio of the greedy algorithm and show that it in fact produces highly competitive results in a fraction of the time needed for complete branch-and-bound style search, motivating further investigation of this fact (Sec. V).

We round off with a concluding discussion (Sec. VI). Before presenting the main contributions, we recall reliable functional dependency discovery and prove some basic results (Sec. II).

II Reliable Dependency Discovery

Let us denote by $[n]$ the set of positive integers up to $n$. The symbols $\log$ and $\ln$ refer to the logarithms of base $2$ and $e$, respectively. We assume a set of discrete random variables $\mathcal{I} = \{X_1, \dots, X_d\}$ is given along with an empirical sample $D_n$ of size $n$ of their joint distribution. For a variable $X$ we denote its domain, whose elements are called categories (or distinct values), by $\mathrm{dom}(X)$, which we abbreviate whenever clear from the context. We identify a random variable with the labeling it induces on the data sample, i.e., the map $X \colon [n] \to \mathrm{dom}(X)$ that assigns to each sample index its observed value. Moreover, for a set of labelings $\mathcal{X} = \{X_1, \dots, X_k\}$ over $[n]$, we define the corresponding vector-valued labeling by $\mathcal{X}(i) = (X_1(i), \dots, X_k(i))$. With $X_{|S}$ for a subset $S \subseteq [n]$, we denote the map $X$ restricted to the domain $S$.

We define $c_X(x)$ to be the empirical counts of $x \in \mathrm{dom}(X)$, i.e., $c_X(x) = |\{i \in [n] \colon X(i) = x\}|$. We further denote with $\hat{p}_X(x) = c_X(x)/n$, where $x \in \mathrm{dom}(X)$, the empirical distribution of $X$. Given another random variable $Y$, $\hat{p}_{Y|X}$ is the empirical conditional distribution of $Y$ given $X$, with $\hat{p}_{Y|X}(y \mid x) = c_{\{X,Y\}}(x,y)/c_X(x)$ for $x$ with $c_X(x) > 0$. However, we use $\hat{p}(x)$ and $\hat{p}(y \mid x)$, respectively, whenever clear from the context. These empirical probabilities give rise to the empirical conditional entropy $\hat{H}(Y \mid X)$, the empirical mutual information $\hat{I}(X;Y)$, and the empirical fraction of information $\hat{F}(X;Y) = \hat{I}(X;Y)/\hat{H}(Y)$.
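For illustration, treating a set of variables as its vector-valued labeling amounts to zipping the columns, after which all empirical quantities can be computed exactly as for a single variable. This reuses the hypothetical helpers from the earlier sketch.

```python
def joint_labeling(columns):
    """Vector-valued labeling of a set of variables: one tuple per sample index."""
    return list(zip(*columns))

x1 = ['a', 'a', 'b', 'b', 'a']
x2 = [0, 1, 0, 1, 1]
y  = ['u', 'v', 'u', 'v', 'v']
# Empirical F of the set {X1, X2} against Y; here X2 alone determines Y, so it prints 1.0.
print(fraction_of_information(joint_labeling([x1, x2]), y))
```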

Recall that the reliable fraction of information is defined as the empirical fraction of information minus its expected value under the permutation model, where $\hat{\mathbb{E}}_0[\hat{F}(X;Y)] = \frac{1}{n!} \sum_{\sigma \in S_n} \hat{F}(X; Y \circ \sigma)$. Here, $S_n$ denotes the symmetric group of $[n]$, i.e., the set of bijections from $[n]$ to $[n]$, and $Y \circ \sigma$ denotes the composition of the map $Y$ with the permutation $\sigma$, i.e., $(Y \circ \sigma)(i) = Y(\sigma(i))$. We abbreviate the correction term as $\hat{b}_0(X;Y)$ and the unnormalized version, based on mutual information, as $\hat{m}_0(X;Y)$.
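The correction term can be computed exactly and efficiently (see [17] and the appendix). Purely as an illustration, the following sketch approximates the expectation under the permutation model by Monte Carlo sampling of permutations, reusing the fraction-of-information helper from the earlier snippet; the function name and the number of permutations are arbitrary choices, not the paper's implementation.

```python
def reliable_fraction_of_information(x, y, n_permutations=200, rng=None):
    """Reliable fraction of information (Eq. (2)), with the permutation-model
    expectation approximated by Monte Carlo instead of the exact computation."""
    rng = rng or np.random.default_rng(0)
    y = np.asarray(y)
    naive = fraction_of_information(x, y)
    correction = np.mean([
        fraction_of_information(x, y[rng.permutation(len(y))])
        for _ in range(n_permutations)
    ])
    return naive - correction
```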

II-A Specializations and Labeling Homomorphisms

Since we identified sets of random variables with their corresponding sample-index-to-value map, they are subject to the following general relations of maps with common domains.

Definition 1.

Let $f$ and $g$ be maps defined on a common domain $S$. We say that $f$ is equivalent to $g$, denoted as $f \equiv g$, if for all $i, j \in S$ it holds that $f(i) = f(j)$ if and only if $g(i) = g(j)$. We say that $f$ is a specialization of $g$, denoted as $f \succeq g$, if for all $i, j \in S$ with $f(i) = f(j)$ it holds that $g(i) = g(j)$.

A special case of specializations is given by the subset relation of variable sets, e.g., if $\mathcal{Z} \subseteq \mathcal{X}$ then $\mathcal{X} \succeq \mathcal{Z}$. The specialization relation implies some important properties for empirical probabilities and information-theoretic quantities.
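As a small illustration of Definition 1 (an informal sketch with hypothetical helper names, not code from the paper), specialization can be checked by testing whether the coarser labeling is constant on every block induced by the finer one.

```python
def is_specialization(f, g):
    """True iff labeling f is a specialization of g:
    f(i) == f(j) implies g(i) == g(j) for all sample indices i, j."""
    block_value = {}
    for fi, gi in zip(f, g):
        if fi in block_value and block_value[fi] != gi:
            return False
        block_value[fi] = gi
    return True

def is_equivalent(f, g):
    """True iff f and g induce the same partition of the sample indices."""
    return is_specialization(f, g) and is_specialization(g, f)

# The joint labeling of a superset of variables specializes that of any subset:
x1 = ['a', 'a', 'b', 'b']
x2 = ['u', 'v', 'u', 'v']
assert is_specialization(list(zip(x1, x2)), x1)
assert is_specialization(list(zip(x1, x2)), x2)
```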

Proposition 1.

Given variables $X$, $Z$, and $Y$, with $X \succeq Z$, the following statements hold:

  a) there is a projection $\pi \colon \mathrm{dom}(X) \to \mathrm{dom}(Z)$, s.t. for all $i \in [n]$, it holds that $Z(i) = \pi(X(i))$,

  b) $\hat{H}(X) \ge \hat{H}(Z)$,

  c) $\hat{H}(Y \mid X) \le \hat{H}(Y \mid Z)$,

  d) $\hat{I}(X;Y) \ge \hat{I}(Z;Y)$.

Proof.

Let us denote with and the and distributions respectively. Statement a) follows from the definition. For b), we define for , and similarly for . We show that for all , . The statement then follows from the definition of . We have

where the inequality follows from the monotonicity of the function (and the fact that is positive for all ).
c) Let us first recall the log-sum inequality [18, p. 31]: for non-negative numbers $a_1, \dots, a_k$ and $b_1, \dots, b_k$,

$\sum_{i=1}^{k} a_i \log\frac{a_i}{b_i} \;\ge\; \Big(\sum_{i=1}^{k} a_i\Big) \log\frac{\sum_{i=1}^{k} a_i}{\sum_{i=1}^{k} b_i}$ (3)

with equality if and only if $a_i/b_i$ is constant. We have:

d) We have $\hat{I}(X;Y) = \hat{H}(Y) - \hat{H}(Y \mid X) \ge \hat{H}(Y) - \hat{H}(Y \mid Z) = \hat{I}(Z;Y)$, following from (c). ∎

In order to analyze monotonicity properties of the permutation model, the following additional definition is useful.

Definition 2.

We call a labeling $X$ homomorphic to a labeling $Z$ (w.r.t. the target variable $Y$), denoted as $X \succeq_Y Z$, if there exists $\sigma \in S_n$ with $Y \circ \sigma = Y$ such that $X \circ \sigma \succeq Z$.

See Tab. I for examples of both introduced relations. Importantly, the inequality of mutual information for specializations (Prop. 1) carries over to homomorphic variables and in turn to their correction terms.

Proposition 2.

Given variables $X$, $Z$, and $Y$, with $X \succeq_Y Z$, the following statements hold:

  a) $\hat{I}(X;Y) \ge \hat{I}(Z;Y)$,

  b) $\hat{m}_0(X;Y) \ge \hat{m}_0(Z;Y)$, and consequently $\hat{b}_0(X;Y) \ge \hat{b}_0(Z;Y)$.

Proof.

Let $\sigma$ be a permutation for which $Y \circ \sigma = Y$ and $X \circ \sigma \succeq Z$. Property a) follows from

where the inequality holds from Prop. 1. For b), note that for every permutation, it holds from Prop. 1 that the corresponding mutual information terms satisfy the same inequality. Hence

a a a b a
a b b a b
b c b b b
b c c c b
Table I: Specialization and homomorphism examples. We have , , , , . Note that as there is no that satisfies specialization w.r.t.  and 
1:function Opus()
2:     if  is empty then
3:         return
4:     else
5:         
6:         
7:         
8:         
9:         
10:         
11:         
12:         return      
13:
Algorithm 1 Opus: Given a set of input variables $\mathcal{I}$, function $f$, bounding function $\bar{f}$, and approximation factor $\alpha$, the algorithm returns a subset $\mathcal{X}^* \subseteq \mathcal{I}$ satisfying $f(\mathcal{X}^*) \ge \alpha \max_{\mathcal{X} \subseteq \mathcal{I}} f(\mathcal{X})$

II-B Search Algorithms

Effective algorithms for maximizing the reliable fraction of information over all subsets $\mathcal{X} \subseteq \mathcal{I}$ are enabled by the concept of bounding functions. A function $\bar{f}$ is called an admissible bounding function for an optimization function $f$ if for all candidate solutions $\mathcal{X}$ it holds that $\bar{f}(\mathcal{X}) \ge f(\mathcal{Z})$ for all $\mathcal{Z}$ with $\mathcal{X} \subseteq \mathcal{Z} \subseteq \mathcal{I}$. Such functions allow us to prune all supersets of $\mathcal{X}$ whenever $\bar{f}(\mathcal{X})$ does not exceed the value of the current best solution found during the optimization process.

Branch-and-bound, as the name suggests, combines this concept with a branching scheme that completely (and non-redundantly) enumerates the search space $2^{\mathcal{I}}$. Here, we consider optimized pruning for unordered search (Opus), an advanced variant of branch-and-bound that effectively propagates pruning information to siblings in the search tree [19]. Algorithm 1 shows the details of this approach.

In addition to keeping track of the best solution seen so far, the algorithm maintains a priority queue of pairs $(\mathcal{X}, \mathcal{A})$, where $\mathcal{X}$ is a candidate solution and $\mathcal{A} \subseteq \mathcal{I}$ constitutes the variables that can still be used to augment $\mathcal{X}$, e.g., to $\mathcal{X} \cup \{A\}$ for an $A \in \mathcal{A}$. The top element is the one with the smallest cardinality and the highest potential (a combination of breadth-first and best-first order). Starting with the empty candidate, all variables as possible augmentations, and a desired approximation guarantee $\alpha$, in every iteration Opus creates all refinements of the top element of the queue and updates the best solution accordingly (lines 5-7). Next, the refinements are pruned using the bounding function and $\alpha$ (line 8). Following that, the pruned list is sorted according to decreasing potential (a heuristic that propagates the most augmentation elements to the least promising refinements [19]), the possible augmentation elements are non-redundantly propagated to the refinements of the top element, and finally the priority queue is updated with the new candidates (lines 9-11).
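The following Python sketch mirrors the structure described above: a priority queue of candidate/augmentation pairs, pruning with an admissible bound and the guarantee alpha, and non-redundant propagation of augmentation elements to siblings. It is a simplified reading of Opus, not the authors' implementation; the function and parameter names are placeholders.

```python
import heapq
from itertools import count

def opus(variables, f, f_bound, alpha=1.0):
    """Branch-and-bound maximization of f over variable subsets in the spirit of
    Opus: refine the top queue element, prune with the admissible bound f_bound
    and the guarantee alpha, and propagate augmentation elements to siblings."""
    best_set, best_val = frozenset(), f(frozenset())
    tie = count()  # tie-breaker so the heap never compares sets
    # queue order: smallest cardinality first, then highest potential
    queue = [(0, -f_bound(frozenset()), next(tie), frozenset(), tuple(variables))]
    while queue:
        _, _, _, xs, aug = heapq.heappop(queue)
        # lines 5-7: create all refinements of the top element and update the best
        refinements = [(xs | {a}, a) for a in aug]
        for cand, _ in refinements:
            val = f(cand)
            if val > best_val:
                best_set, best_val = cand, val
        # line 8: keep only refinements whose bound can still beat the current best
        kept = [(c, a) for c, a in refinements if alpha * f_bound(c) > best_val]
        # lines 9-11: sort by decreasing potential; each refinement may later be
        # augmented only with the adding-elements of its more promising siblings,
        # so the least promising refinement receives the most augmentation elements
        kept.sort(key=lambda ca: f_bound(ca[0]), reverse=True)
        for i, (cand, _) in enumerate(kept):
            new_aug = tuple(a for _, a in kept[:i])
            if new_aug:
                heapq.heappush(queue, (len(cand), -f_bound(cand), next(tie), cand, new_aug))
    return best_set
```

With f set to an estimator such as the reliable fraction of information and f_bound to an admissible bound for it, alpha = 1 yields an exact search, while alpha < 1 trades completeness for speed under the stated guarantee.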

A commonly used alternative to complete branch-and-bound search for the optimization of dependency measures is the standard greedy algorithm (see [11, 20]). This algorithm only refines the best candidate in a given iteration. Moreover, bounding functions can be incorporated as an early termination criterion. For the reliable fraction of information in particular, there is potential to prune many of the higher levels of the search space as it favors solutions that are small in cardinality [14]. The algorithm is presented in Algorithm 2.

The algorithm keeps track of the best solution seen so far, as well as the best candidate for refinement. Starting with the empty set for both, the algorithm in each iteration checks whether the candidate can be refined further, i.e., whether there are variables left to add, and whether it still has potential according to the bounding function (the early termination criterion). If not, the algorithm terminates, returning the best solution (lines 2-3). Otherwise, the candidate is refined in all possible ways, and the best refinement is selected as the new candidate and used to update the best solution (lines 5-7).
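A matching sketch of the greedy variant described above, again illustrative only and with assumed helper names:

```python
def greedy(variables, f, f_bound):
    """Standard greedy refinement with the bounding function as an early
    termination test: stop when no variables remain to add or the current
    candidate has no potential to improve on the best solution seen."""
    best_set, best_val = frozenset(), f(frozenset())
    current, remaining = frozenset(), set(variables)
    while remaining and f_bound(current) > best_val:
        # refine the current candidate in all possible ways, keep the best refinement
        current = max((current | {a} for a in remaining), key=f)
        remaining = set(variables) - current
        val = f(current)
        if val > best_val:
            best_set, best_val = current, val
    return best_set
```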

Concerning the approximation ratio of the greedy algorithm, there exists a large amount of research focused on submodular and/or monotone functions, e.g., [21, 22, 23]. However, we note that $\hat{F}_0$ is neither submodular nor monotone, and hence these results are not directly applicable. To demonstrate the quality of the results empirically, we perform an evaluation in Sec. V-B. We discuss this topic further in Sec. VI.

1:function Greedy()
2:     if  is empty or  then
3:         return
4:     else
5:         
6:         
7:         
8:         return      
9:
Algorithm 2 Greedy: Given a set of input variables $\mathcal{I}$, function $f$, and bounding function $\bar{f}$, the algorithm returns a subset greedily approximating $\max_{\mathcal{X} \subseteq \mathcal{I}} f(\mathcal{X})$

III Hardness of Optimization

In this section, we show that the problem of maximizing $\hat{F}_0$ is NP-hard by providing a reduction from the well-known NP-hard minimum set cover problem: given a finite universe $U$ and a collection of subsets $\mathcal{S} = \{S_1, \dots, S_k\}$ of $U$ with $\bigcup \mathcal{S} = U$, find a set cover, i.e., a sub-collection $\mathcal{C} \subseteq \mathcal{S}$ with $\bigcup \mathcal{C} = U$, that is of minimal cardinality.

The reduction consists of two parts. First, we construct a base transformation that maps a set cover instance to a dataset such that set covers correspond to attribute sets with an empirical fraction of information score of $1$ and bias correction terms that vary monotonically with their cardinality. In a second step, we then calibrate the terms such that, when considering the corrected score, they cannot change the order between attribute sets with different empirical scores but only act as a tie-breaker between attribute sets of equal empirical score. This is achieved by copying the dataset a suitable number of times such that the correction terms are sufficiently small but the overall transformation is still of polynomial size.

Figure 2: Base transformation example. Left: a set cover instance. Right: the resulting dataset under the base transformation (bold indicates the set cover).

The base transformation is defined as follows. The dataset contains descriptive attributes corresponding to the sets of the set cover instance, and a target variable Y. The sample size is with a logical partition of the sample into the three regions , , and . The target attribute assigns to sample points one of three values corresponding to the three regions, i.e., with

and the descriptive attributes assign up to distinct values depending on the set of universe elements covered by set , i.e., with

See Fig. 2 for an illustration. This transformation establishes a one-to-one correspondence of partial set covers and variable sets , which we denote as . Let us denote as . The first part of the construction ( and ) couples the amount of uncovered elements to the conditional entropy of given through . The second part () links the size of to the number of distinct values on .

We can note the following central properties.

Lemma 3.

Let be the transformation of a set cover instance and be two partial set covers. Then the following holds.

  1. If then ; in particular, is a set cover, i.e., , if and only if ,

  2. If is a set cover and is not a set cover then .

  3. If and are both set covers then if and only if .

Proof.

Statement 1 follows from the definition of .

To show 2, since and thus are monotone in , it is sufficient to consider the case where . In this case we have

and, moreover, as required

For 3 observe that for a variable set corresponding to a set cover , we have for all that . Thus, for a variable set corresponding to another set cover . Moreover, we trivially have . Finally, let denote the indices belonging to where and take on values different from . Note that all values in these sets are unique. Furthermore, if then and in turn . This means we can find a permutation such that for all it holds that with and for (that is permutes all indices of non- values of in to indices of non- values of ). For such a permutation it holds that and . Therefore, as required. ∎

Now, although set covers correspond to variable sets with the maximal empirical fraction of information value of $1$, due to the correction term it can happen that a variable set corresponding to a partial set cover attains a higher reliable fraction of information than one corresponding to a set cover. To prevent this we make use of the following upper bound on the expected mutual information under the permutation model.

Proposition 4 ([15], Thm. 7).

For a sample of size of the joint distribution of variables and having and distinct values, respectively, we have

This result implies that we can arbitrarily shrink the correction terms if we increase the sample size but leave the number of distinct values constant. Thus, we define the extended transformation by simply copying the dataset a number of times. With this definition we show our main result.

Theorem 5.

Given a sample of the joint distribution of input variables $\mathcal{I}$ and a target variable $Y$, the problem of maximizing $\hat{F}_0(\mathcal{X};Y)$ over all subsets $\mathcal{X} \subseteq \mathcal{I}$ is NP-hard.

Proof.

We show that there is a number such that w.r.t. transformation we have that for all set covers and their corresponding variable sets , . Since all properties of Lemma 3 transfer from to , this implies that for all variable sets corresponding to non-set-covers , it holds that

where the strict inequality follows from Lemma 3, statements 1 and 2. Thus, any variable set with maximum $\hat{F}_0$ corresponds to a set cover.

Moreover, we know that it must be a minimum set cover as required, because for a smaller set cover, by Lemma 3, statement 3, and Prop. 2, the corresponding variable set would attain a larger value; therefore, a non-minimum cover would not maximize $\hat{F}_0$.

To find the number that defines the final transformation , let and be a set cover of . Since has at most distinct values in and exactly , from Prop. 4 and the monotonicity of we know that

where the last inequality follows from . Thus, for we have as required. The proof is concluded by noting that the final transformation is of size (where ), which is polynomial in the size of the set cover instance. ∎
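The mechanism behind the extended transformation, namely that replicating the sample shrinks the permutation-model correction while the empirical fraction of information stays fixed (the consequence of Prop. 4 used above), can be checked numerically with the Monte Carlo helpers and the rng sketched in the earlier sections. This is an illustration only, not the exact bound used in the proof.

```python
x = rng.integers(0, 8, size=50)   # 8 distinct values
y = rng.integers(0, 3, size=50)   # 3 distinct values, independent of x
for r in [1, 2, 4, 8, 16]:
    xr, yr = np.tile(x, r), np.tile(y, r)
    correction = fraction_of_information(xr, yr) - reliable_fraction_of_information(xr, yr)
    print(f"r = {r:2d}: naive F = {fraction_of_information(xr, yr):.3f}, correction = {correction:.3f}")
# The naive score is invariant under copying, while the correction term shrinks
# towards 0 as the sample grows with the domain sizes held fixed.
```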

IV Refined Bounding Function

The NP-hardness established in the previous section excludes (unless P=NP) the existence of a polynomial time algorithm for maximizing the reliable fraction of information, leaving therefore exact but exponential search and heuristics as the two options. For both, and particularly the former, reducing the search space can lead to more effective algorithms. For this purpose, we derive in this section a novel bounding function for to be used for pruning.

Figure 3: Evaluating for branch-and-bound optimization. Relative nodes explored difference (left) and relative runtime difference (right) between methods and . Positive (negative) numbers indicate that () is proportionally “better”. The datasets are sorted in decreasing number of attributes.

Recall that an admissible bounding function for effective search is an upper bound on the optimization function value of all supersets of a candidate solution $\mathcal{X}$. That is, it must hold that $\bar{f}(\mathcal{X}) \ge \hat{F}_0(\mathcal{Z};Y)$ for all $\mathcal{Z}$ with $\mathcal{X} \subseteq \mathcal{Z} \subseteq \mathcal{I}$. At the same time, in order to yield optimal pruning, the bound should be as tight as possible. Thus, the ideal bounding function is

$\bar{f}_{\mathrm{opt}}(\mathcal{X}) = \max\{\hat{F}_0(\mathcal{Z};Y) \colon \mathcal{X} \subseteq \mathcal{Z} \subseteq \mathcal{I}\}$ (4)

Computing this function is of course equivalent to the original optimization problem and hence NP-hard. We can, however, relax the maximum over all supersets to the maximum over all specializations of $\mathcal{X}$. That is, we define a bounding function $\bar{f}$ through

$\bar{f}(\mathcal{X}) = \max\{\hat{F}_0(X';Y) \colon X' \succeq \mathcal{X}\}$ (5)
$\ge \bar{f}_{\mathrm{opt}}(\mathcal{X})$ (6)
While this definition obviously constitutes an admissible bounding function, it is unclear how it can be efficiently evaluated. Let us denote by $\mathcal{X} \cdot Y$ the operation of joining a labeling $\mathcal{X}$ with the target attribute $Y$, i.e., $(\mathcal{X} \cdot Y)(i) = (\mathcal{X}(i), Y(i))$ (see Tab. II for an example). This definition gives rise to a simple constructive form for computing $\bar{f}$.

Theorem 6.

The function $\bar{f}$ can be efficiently computed as $\bar{f}(\mathcal{X}) = \hat{F}_0(\mathcal{X} \cdot Y; Y)$.

Proof.

We start by showing that the join operation causes a positive gain in $\hat{F}_0$, i.e., for an arbitrary labeling $X$ it holds that $\hat{F}_0(X \cdot Y; Y) \ge \hat{F}_0(X; Y)$.

Let us define . It is then sufficient to show . We have

since , and from Prop. 1, for every , .

To conclude, let be an arbitrary specialization of . We have by definition of and , that . Moreover, . Thus

as required. Here, the first inequality follows from Prop. 1, the second from the positive gain of the join operation.

For the complexity recall that can be computed in time (see appendix). The complexity follows from . ∎
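Read constructively, Theorem 6 says that the refined bound for a candidate $\mathcal{X}$ is obtained by evaluating the reliable fraction of information of the joined labeling $\mathcal{X} \cdot Y$, whose empirical fraction of information is $1$ by construction, so that only the correction term of the join matters. A sketch using the Monte Carlo helpers from the earlier sections (the paper computes the correction exactly; the names are assumptions):

```python
def refined_bound(x_columns, y, **kwargs):
    """Admissible bound for supersets of the candidate given by x_columns:
    the reliable fraction of information of the joined labeling X.Y."""
    joined = list(zip(*x_columns, y))  # the labeling X.Y over the sample indices
    return reliable_fraction_of_information(joined, y, **kwargs)
```

Compared with the previous bound, which subtracts the correction term of $\mathcal{X}$ alone, subtracting the correction term of the finer labeling $\mathcal{X} \cdot Y$ can only make the bound tighter.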

Note that the join operation $\mathcal{X} \cdot Y$ does not have to be computed explicitly because its non-zero marginal counts can simply be obtained as the non-zero counts of the joint contingency table of $\mathcal{X}$ and $Y$ (which has to be computed anyway for