Given a data sample
drawn from the joint distributionof some input variables and an output variable , it is a fundamental problem in data analysis to find variable subsets that jointly influence or (approximately) determine . This functional dependency discovery problem, i.e., to find
for some real-valued measure that assesses the dependence of on , is a classic topic in the database community [1, Ch. 15]
, but also has many other applications including feature selection and knowledge discovery . For instance, finding such dependencies can help identify compact sets of descriptors that capture the underlying structure and actuating mechanisms of complex scientific domains (e.g., [4, 5]).
where denotes the Shannon entropy. This score represents the relative reduction of uncertainty about given . It takes on values between and corresponding to independence and exact functional dependency, respectively.
Estimating the score naively with empirical probabilities , however, leads to an overestimation of the actual dependence between and , a behavior known as dependency-by-chance . In particular, since the bias is increasing with the domain size of variables , it is unsuitable for dependence discovery where we have to soundly compare different variable sets of varying dimensionality and consequently of widely varying domain sizes (see Fig. 1). In some feature selection approaches (see, e.g., ) this problem is mitigated by only considering dependencies of individual variables or pairs. Alternatively, some algorithms from the database literature, e.g., [12, 13], neglect this issue by assuming a closed-world, i.e., the unknown data generation process is considered equal to the empirical .
Both of these approaches are infeasible in the statistical setting with arbitrary sized variable sets that we are interested in. Instead, here, the fraction of information can be corrected by subtracting its estimated expected value under the hypothesis of independence. This gives rise to the reliable fraction of information [14, 15] defined as
where is the expected value of under the permutation model [16, p. 214], i.e., under the operation of permuting the empirical values with a random permutation . This estimator can be computed efficiently in time for with domain size (see  and appendix). Moreover, the maximization problem (Eq. (1)) can be solved effectively by a simple branch-and-bound scheme: the maximally attainable for supersets of some partial solution can be bounded by the function , which follows from the monotonicity of .
This, however, is a rather simplistic bounding function that leaves room for substantial improvements. Moreover, it is unclear whether one has to rely on exponential-time worst-case branch-and-bound algorithms in the first place. Finally, the option of heuristic optimization has not yet been explored.
To this end, this paper provides the following contributions:
We show that the problem of maximizing the reliable fraction of information is NP-hard. This justifies the usage of worst-case exponential-time algorithms as well as heuristic search methods (Sec. III).
Motivated by this insight, we then greatly improve the practical performance for both of these optimization styles by deriving a novel admissible bounding function . This function is not only tighter than the previously proposed but in particular we have that the supremum of —and thus the potential for additional pruning in search—is unbounded (Sec. IV).
Finally, we report extensive empirical results evaluating the proposed bounding function and the various algorithmic strategies. In particular, we consider the approximation ratio of the greedy algorithm and show that in fact, it produces highly competitive results in a fraction of time needed for complete branch-and-bound style search—motivating further investigation of this fact (Sec. V).
Ii Reliable Dependency Discovery
Let us denote by the set of positive integers up to . The symbols and refer to the logarithms of base and
, respectively. We assume a set of discrete random variablesis given along with an empirical sample of their joint distribution. For a variable we denote its domain, called categories (or distinct values), by but we also write instead of whenever clear from the context. We identify a random variable with the labeling it induces on the data sample, i.e., . Moreover, for a set of labelings over
, we define the corresponding vector-valued labeling by. With for a subset , we denote the map restricted to domain .
We define to be the empirical counts of , i.e., . We further denote with , where , the empirical distribution of . Given another random variable , is the empirical conditional distribution of given , with for . However, we use and respectively whenever clear from the context. These empirical probabilities give rise to the empirical conditional entropy , the empirical mutual information , and the empirical fraction of information .
Recall that the reliable fraction of information is defined as the empirical fraction of information minus its expected value under the permutation model where . Here, denotes the symmetric group of , i.e., the set of bijections from to , and denotes the composition of a map with the permutation , i.e., . We abbreviate the correction term as and the unnormalized version as .
Ii-a Specializations and Labeling Homomorphisms
Since we identified sets of random variables with their corresponding sample-index-to-value map, they are subject to the following general relations of maps with common domains.
Let and be maps defined on a common domain . We say that is equivalent to , denoted as , if for all it holds that if and only if . We say that is a specialization of , denoted as , if for all with it holds that .
A special case of specializations is given by the subset relation of variable sets, e.g., if then . The specialization relation implies some important properties for empirical probabilities and information-theoretic quantities.
Given variables , and , with , the following statements hold:
there is a projection , s.t. for all , it holds that ,
Let us denote with and the and distributions respectively. Statement a) follows from the definition. For b), we define for , and similarly for . We show that for all , . The statement then follows from the definition of . We have
where the inequality follows from the monotonicity of the function (and the fact that is positive for all ).
c) Let us first recall the log-sum inequality [18, p. 31]: for non-negative numbers and ,
with equality if and only if constant. We have:
d) We have following from (c). ∎
In order to analyze monotonicity properties of the permutation model, the following additional definition is useful.
We call a labeling homomorphic to a labeling (w.r.t. the target variable ), denoted as , if there exists with such that .
See Tab. I for examples of both introduced relations. Importantly, the inequality of mutual information for specializations (Prop. 14 carries over to homomorphic variables and in turn to their correction terms.
Given variables , and , with , the following statements hold:
Ii-B Search Algorithms
Effective algorithms for maximizing the reliable fraction of information over all subsets are enabled by the concept of bounding functions. A function is called an admissible bounding function for an optimization function if for all candidate solutions , it holds that for all with . Such functions allow to prune all supersets of whenever for the current best solution found during the optimization process.
Branch-and-bound, as the name suggests, combines this concept with a branching scheme that completely (and non-redundantly) enumerates the search space . Here, we consider optimized pruning for unordered search (Opus), an advanced variant of branch-and-bound that effectively propagates pruning information to siblings in the search tree . Algorithm 1 shows the details of this approach.
In addition to keeping track of the best solution seen so far, the algorithm maintains a priority queue of pairs , where is a candidate solution and constitutes the variables that can still be used to augment , e.g., for a . The top element is the one with the smallest cardinality and the highest potential (a combination of breadth-first and best-first order). Starting with , , and a desired approximation guarantee , in every iteration Opus creates all refinements of the top element of and updates accordingly (lines 5-7). Next the refinements are pruned using and (line 8). Following, the pruned list is sorted according to decreasing potential (this is a heuristic that propagates the most augmentation elements to the least promising refinements ), the possible augmentation elements are non-redundantly propagated to the refinements of the top element, and finally the priority queue is updated with the new candidates (lines 9-11).
A commonly used alternative to complete branch-and-bound search for the optimization of dependency measures is the standard greedy algorithm (see [11, 20]). This algorithm only refines the best candidate in a given iteration. Moreover, bounding functions can be incorporated as an early termination criterion. For the reliable fraction of information in particular, there is potential to prune many of the higher levels of the search space as it favors solutions that are small in cardinality . The algorithm is presented in Algorithm 2.
The algorithm keeps track of the best solution seen, as well as the best candidate for refinement . Starting with and , the algorithm in each iteration checks whether can be refined further, i.e., if is not empty, or if has potential (the early termination criterion). If not, the algorithm terminates returning (lines 2-3). Otherwise is refined to all possible refinements, and the best one is selected as a candidate to update (lines 5-7).
Concerning the approximation ratio of the greedy algorithm, there exists a large amount of research focused on submodular and/or monotone functions, e.g., [21, 22, 23]. However, we note that is neither submodular nor monotone, and hence these results are not directly applicable. To demonstrate empirically the quality of the results, we perform an evaluation in Sec. V-B. We discuss further on this topic in Sec. VI.
Iii Hardness of optimization
In this section, we show that the problem of maximizing is NP-hard by providing a reduction from the well-known NP-hard minimum set cover problem: given a finite universe and collection of subsets , find a set cover, i.e., a sub-collection with , that is of minimal cardinality.
The reduction consists of two parts. First, we construct a base transformation that maps a set cover instance to a dataset such that set covers correspond to attribute sets with an empirical fraction of information score of and bias correction terms that are a monotonically decreasing function of their cardinality. In a second step, we then calibrate the terms such that, when considering the corrected score , they cannot change the order between attribute sets with different values but only act as a tie-breaker between attribute sets of equal value. This is achieved by copying the dataset a suitable number of times such that the correction terms are sufficiently small but the overall transformation, denoted , is still of polynomial size.
The base transformation is defined as follows. The dataset contains descriptive attributes corresponding to the sets of the set cover instance, and a target variable Y. The sample size is with a logical partition of the sample into the three regions , , and . The target attribute assigns to sample points one of three values corresponding to the three regions, i.e., with
|and the descriptive attributes assign up to distinct values dependending on the set of universe elements covered by set , i.e., with|
See Fig. 2 for an illustration. This transformation establishes a one-to-one correspondence of partial set covers and variable sets , which we denote as . Let us denote as . The first part of the construction ( and ) couples the amount of uncovered elements to the conditional entropy of given through . The second part () links the size of to the number of distinct values on .
We can note the following central properties.
Let be the transformation of a set cover instance and be two partial set covers. Then the following holds.
If then ; in particular, is a set cover, i.e., , if and only if ,
If is a set cover and is not a set cover then .
If and are both set covers then if and only if .
Statement 1 follows from the definition of .
To show 2, since and thus are monotone in , it is sufficient to consider the case where . In this case we have
and, moreover, as required
For 3 observe that for a variable set corresponding to a set cover , we have for all that . Thus, for a variable set corresponding to another set cover . Moreover, we trivially have . Finally, let denote the indices belonging to where and take on values different from . Note that all values in these sets are unique. Furthermore, if then and in turn . This means we can find a permutation such that for all it holds that with and for (that is permutes all indices of non- values of in to indices of non- values of ). For such a permutation it holds that and . Therefore, as required. ∎
Now, although set covers correspond to variable sets with the maximal empirical fraction of information value of , due to the correction term, it can happen that for a variable set corresponding to a partial set cover. To prevent this we make use of the following upper bound of the expected mutual information under the permutation model.
Proposition 4 (, Thm. 7).
For a sample of size of the joint distribution of variables and having and distinct values, respectively, we have
This result implies that we can arbitrarily shrink the correction terms if we increase the sample size but leave the number of distinct values constant. Thus, we define the extended transformation through simply copying a number of times, i.e., by defining for . With this definition we show our main result.
Given a sample of the joint distribution of variables and , the problem of maximizing over all subsets of is NP-hard.
We show that there is a number such that w.r.t. transformation we have that for all set covers and their corresponding variable sets , . Since all properties of Lemma 3 transfer from to , this implies that for all variable sets corresponding to non-set-covers , it holds that
To find the number that defines the final transformation , let and be a set cover of . Since has at most distinct values in and exactly , from Prop. 4 and the monotonicity of we know that
where the last inequality follows from . Thus, for we have as required. The proof is concluded by noting that the final transformation is of size (where ), which is polynomial in the size of the set cover instance. ∎
Iv Refined bounding function
The NP-hardness established in the previous section excludes (unless P=NP) the existence of a polynomial time algorithm for maximizing the reliable fraction of information, leaving therefore exact but exponential search and heuristics as the two options. For both, and particularly the former, reducing the search space can lead to more effective algorithms. For this purpose, we derive in this section a novel bounding function for to be used for pruning.
Recall that an admissible bounding function for effective search is an upper bound to the optimization function value of all supersets of a candidate solution . That is, it must hold that for all with . At the same time, in order to yield optimal pruning, the bound should be as tight as possible. Thus, the ideal function is
Computing this function is of course equivalent to the original optimization problem and hence NP-hard. We can, however, relax the maximum over all supersets to the maximum over all specializations of . That is, we define a bounding function through
While this definition obviously constitutes an admissible bounding function, it is unclear how it can be efficiently evaluated. Let us denote by the operation of joining a labeling with the target attribute , i.e., (see Tab. II for an example). This definition gives rise to a simple constructive form for computing .
The function can be efficiently computed as in time .
We start by showing that the operation causes a positive gain in , i.e., for an arbitrary labeling it holds that .
To conclude, let be an arbitrary specialization of . We have by definition of and , that . Moreover, . Thus
For the complexity recall that can be computed in time (see appendix). The complexity follows from . ∎
Note that the operation does not have to be computed explicitly because the non-zero marginal counts for
can simply be obtained as the non-zero counts of the joint contingency table ofand (which has to be computed anyway for