The Famine of Forte: Few Search Problems Greatly Favor Your Algorithm

09/28/2016
by George D. Montañez, et al.
Carnegie Mellon University

Casting machine learning as a type of search, we demonstrate that the proportion of problems that are favorable for a fixed algorithm is strictly bounded, such that no single algorithm can perform well over a large fraction of them. Our results explain why we must either continue to develop new learning methods year after year or move towards highly parameterized models that are both flexible and sensitive to their hyperparameters. We further give an upper bound on the expected performance for a search algorithm as a function of the mutual information between the target and the information resource (e.g., training dataset), proving the importance of certain types of dependence for machine learning. Lastly, we show that the expected per-query probability of success for an algorithm is mathematically equivalent to a single-query probability of success under a distribution (called a search strategy), and prove that the proportion of favorable strategies is also strictly bounded. Thus, whether one holds fixed the search algorithm and considers all possible problems or one fixes the search problem and looks at all possible search strategies, favorable matches are exceedingly rare. The forte (strength) of any algorithm is quantifiably restricted.


I Introduction

BIGCITY, United States.

In a fictional world not unlike our own, a sophisticated criminal organization plots to attack an unspecified landmark within the city. Due to the complexity of the attack and methods of infiltration, the group is forced to construct a plan relying on the coordinated actions of several interdependent agents, of which the failure of any one would cause the collapse of the entire plot. As a member of the city’s security team, you must allocate finite resources to protect the many important locations within the city. Although you know the attack is imminent, your sources have not indicated which location, of the hundreds possible, will be hit; your lack of manpower forces you to make assumptions about target likelihoods. You know you can foil the plot if you stop even one enemy agent. Because of this, you seek to maximize the odds of capturing an agent by placing vigilant security forces at the strategic locations throughout the city. Allocating more security to a given location increases surveillance there, raising the probability a conspirator will be found if operating nearby. Unsure of your decisions, you allocate based on your best information, but continue to second-guess yourself.

With this picture in mind, we can analyze the scenario through the lens of algorithmic search. Algorithmic search methods, whether employed in fictional security situations or by researchers in the lab, all share common elements. They have a search space of elements (possible locations) which contains items of high value at unknown locations within that space. These high-value elements are targets of the search, which are sought by a process of inspecting elements of the search space to see if they contain any elements of the target set. (We refer to the inspection of search space elements as sampling from the search space, and each sampling action as a query.) In our fictional scenario, the enemy activity locations constitute the target set, and surveillance at city locations corresponds to a search process (i.e., attempting to locate and arrest enemy agents). The search process can be deterministic or contain some degree of randomization, such as choosing which elements to inspect using a weighted distribution over the search space. The search process is a success if an element of the target set is located during the course of inspecting elements of the space. The history of the search refers to the collection of elements already sampled and any accompanying information (e.g., the intelligence data gathered thus far by the security forces).

There are many ways to search such a space. Research on algorithmic search is replete with proposed search methods that attempt to increase the expected likelihood of search success, across many different problem domains. However, a direct result of the No Free Lunch theorems is that no search method is universally optimal across all search problems [1, 2, 3, 4, 5, 6, 7, 8, 9], since all algorithmic search methods have performance equivalent to uniform random sampling on the search space when uniformly averaged across any closed-under-permutation set of problem functions [4]. Thus, search (and learning) algorithms must trade weaker performance on some problems for improved performance on others [1, 10]. Given that there exists no universally superior search method, we typically seek to develop methods that perform well over a large range of important problems. An important question naturally arises:

Is there a limit to the number of problems for which a search algorithm can perform well?

For search problems in general, the answer is yes: the proportion of search problems for which an algorithm outperforms uniform random sampling is strictly bounded in relation to the degree of performance improvement. Previous results have relied on the fact that most search problems have many target elements, and thus are not difficult for random sampling [11, 12]. Since the typical search problem is not difficult for random sampling, it becomes hard to find a problem for which an algorithm can significantly and reliably outperform random sampling. While theoretically interesting, a more relevant situation occurs when we consider difficult search problems, namely those having relatively small target sets. We denote such problems as target-sparse problems or, following [11], target-sparse functions. Note that this use of sparseness differs from that typical in the machine learning literature, since it refers to sparseness of the target space, not sparseness in the feature space. One focus of this paper is to prove results that hold even for target-sparse functions. As mentioned, Montañez [11] and English [12] have shown that uniform sampling does well on the majority of functions since they are not target-sparse, thus making problems for which an algorithm greatly outperforms random chance rare. If we restrict ourselves to only target-sparse functions, perhaps it would become easier to find relatively favorable problems, since we would already know uniform sampling does poorly. The bar would already be low, so we would have to do less to surpass it, perhaps leading to a greater proportion of favorable problems within that set. We show why this intuition is incorrect, along with proving several other key results.

II Contributions

First, we demonstrate that favorable search problems must necessarily be rare. Our work departs from No Free Lunch results (namely, that the mean performance across sets of problems is fixed for all algorithms) to show that the proportion of favorable problems is strictly bounded in relation to the inherent problem difficulty and the degree of improvement sought (i.e., not just the mean performance is bounded). Our results continue to hold for sets of objective functions that are not closed-under-permutation, in contrast to traditional No Free Lunch theorems. Furthermore, the negative results presented here do not depend on any distributional assumptions on the space of possible problems, so the proportion of favorable problems is small regardless of which distribution holds over them in the real world. This directly answers critiques aimed at No Free Lunch results arguing against a uniform distribution on problems in the real world (cf. [13]), since given any distribution over possible problems, there are still relatively few favorable problems within the set one is taking the distribution over.

As a corollary, we prove that the information cost of finding any favorable search problem is bounded below by the number of bits of "advantage" gained by the algorithm on such problems. We do this by using an active information transform to measure performance improvement in bits [14], proving a conservation of information result [6, 14, 11] showing that the amount of information required to locate a search problem giving $b$ bits of expected active information is at least $b$ bits. Thus, to get an expected net gain of information, the true distribution over search problems must be biased towards favorable problems for a given algorithm. This places a floor on the minimal information cost of finding favorable problems, somewhat reminiscent of the entertainingly satirical work on "data set selection" [15].

Another major contribution of this paper is to bound the expected per-query probability of success in terms of available information resources. Namely, we relate the degree of dependence (measured in mutual information) between target sets and external information resources, such as objective functions, noisy measurements (noisy in the sense of inaccurate or biased, not in the sense of nondeterministic), or sets of training data, to the maximum improvement in search performance. We prove that for a fixed target sparseness $p$ and an algorithm with induced single-query probability of success $q$,

$$q \le \frac{I(T;F) + D(P_T \,\|\, \mathcal{U}_T) + 1}{I_{\Omega}} \qquad (1)$$

where $I(T;F)$ is the mutual information between the target set $T$ (as a random variable) and the external information resource $F$, $D(P_T \,\|\, \mathcal{U}_T)$ is the Kullback-Leibler divergence between the marginal distribution on $T$ and the uniform distribution on target sets, and $I_{\Omega} = -\log p$ is the baseline information cost for the search problem due to sparseness. This simple expression takes into account the degree of dependence, target sparseness, target function uncertainty, and the contribution of random luck. It is surprising that such well-known quantities appear in the course of simply trying to upper bound the probability of success.

We then establish the equivalence between the expected per-query probability of success for an algorithm and the probability of a successful single-query search under some distribution, which we call a strategy. Each algorithm maps to a strategy, and we prove an upper-bound on the proportion of favorable strategies for a fixed problem. Thus, matching a search problem to a fixed algorithm or a search algorithm to a fixed problem are both provably difficult tasks, and the set of favorable items remains vanishingly small in both cases.

Lastly, we apply the results to several problem domains, some toy and some actual, showing how these results lead to new insights in different research areas.

III Search Framework

We begin by formalizing our problem setting, search method abstraction and other necessary concepts.

III-A The Search Problem

Fig. 1: Two example target sets. In one case (left), knowing the location of the black target elements fully determines the location of the red target element. In the second case (right), the locations of the black target elements reveal nothing concerning the location of the red element. Problems are represented as a binary vector (top row) and a two-dimensional search space (bottom row).

We abstract the search problem as follows. The search space, denoted $\Omega$, contains the elements to be examined. We limit ourselves to finite, discrete search spaces, which entails no loss of generality when considering search spaces fully representable on physical computer hardware within a finite time. Let the target set $T$ be a nonempty subset of the search space. The set $T$ can be represented using a binary vector the size of $\Omega$, where an indexed position evaluates to 1 whenever the corresponding element is in $T$ and 0 otherwise. Thus, each $T$ corresponds to exactly one binary vector of length $|\Omega|$, and vice versa. We refer to this one-to-one mapped binary vector as a target function and use the terms target function and target set interchangeably, depending on context. These target sets/functions will help us define our space of possible search problems, as we will see shortly.
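As a concrete illustration of this encoding, the following is a minimal Python sketch with an invented 16-element search space; the variable names are ours, not notation from the paper.

    # A target set T over a small search space Omega, and its equivalent
    # binary "target function" representation.
    omega = list(range(16))                   # search space elements 0..15
    target_set = {3, 7, 11}                   # hypothetical target subset T

    # Binary vector of length |Omega|: position i is 1 iff element i is in T.
    target_function = [1 if x in target_set else 0 for x in omega]

    # The mapping is one-to-one: the set can be recovered from the vector.
    recovered = {x for x, bit in zip(omega, target_function) if bit == 1}
    assert recovered == target_set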

Figure 1 shows two example target sets, in binary vector and generic target set form. The set on the left has strong (potentially exploitable) structure governing the placement of target elements, whereas the example on the right is more or less random. Thus, knowing the location of some target elements may or may not be able to help one find additional elements, and in general there may be any degree of correlation between the location of target elements already uncovered and those yet to be discovered.

Typically, elements from the search space are evaluated according to some external information resource, such as an objective function $f$. We abstract this resource as simply a finite-length bit string, which could represent an objective function, a set of training data (in supervised learning settings), or anything else. The resource can exist in coded form, and we make no assumption about the shape of this resource or its encoding. Our only requirement is that it can be used as an oracle, given an element (possibly the null element) from $\Omega$. In other words, we require the existence of two methods, one for initialization (given a null element) and one for evaluating queried points in $\Omega$. We assume both methods are fully specified by the problem domain and external information resource. With a slight abuse of notation, we define an information resource evaluation $F(\omega)$ as the result of applying an extraction function to the information resource $F$ and an element $\omega \in \Omega \cup \{\emptyset\}$. Therefore, $F(\emptyset)$ represents the method used to extract initial information for the search algorithm (absent of any query), and $F(\omega)$ represents the evaluation of point $\omega$ under resource $F$. The size of the information resource becomes important for datasets in machine learning, since it determines the maximum amount of mutual information available between $T$ and $F$.

A search problem is defined as a 3-tuple, $(\Omega, T, F)$, consisting of a search space, a target subset of the space, and an external information resource, respectively. Since the target locations are hidden, any information gained by the search concerning the target locations is mediated through the external information resource alone. Thus, the space of possible search problems includes many deceptive search problems, where the external resource provides misleading information about target locations, similar to when security forces are given false intelligence data, and many noisy problems, similar to when imprecise intelligence gathering techniques are used. In the fully general case, there can be any relationship between $T$ and $F$. Because we consider any and all degrees of dependence between external information resources and target locations, this effectively creates independence when considering the set of problems as a whole, allowing the first main result of this paper to follow as a consequence.

However, in many natural settings, target locations and external information resources are tightly coupled. For example, we typically threshold objective function values to designate the target elements as those that meet or exceed some minimum. Doing so enforces dependence between target locations and objective functions, where the former is fully determined by the latter, once the threshold is known. This dependence causes direct correlation between the objective function (which is observable) and the target locations (which are not directly observable). We will demonstrate such correlation is exploitable, affecting the upper bound on the expected probability of success.

III-B The Search Algorithm

Fig. 2: Black-box search algorithm. At time $i$ the algorithm computes a probability distribution $P_i$ over the search space $\Omega$, using information from the history, and a new point $\omega_i$ is drawn according to $P_i$. The point is evaluated using the external information resource $F$. The tuple $(\omega_i, F(\omega_i))$ is then added to the history at position $i$. Note, indices on elements do not correspond to time steps in this diagram, but to sampled locations.

An algorithmic search is any process that chooses elements of a search space to examine, given some history of elements already examined and information resource evaluations. The history consists of two parts: a query trace and a resource evaluation trace. A query trace is a record of points queried, indexed by time. A resource evaluation trace is a record of partial information extracted from $F$, also indexed by time. The information resource evaluations can be used to build elaborate predictive models (as is the case of Bayesian optimization methods), or ignored completely (as is done with uniform random sampling). The algorithm's internal state is updated at each time step (according to the rules of the algorithm), as new query points are evaluated against the external information resource. The search is thus an iterative process, with the history at time $i$ represented as $h_i$. The problem domain defines how much initial information from $F$ is given by $F(\emptyset)$, as well as how much (and what) information is extracted from $F$ at each query.

We allow for both deterministic and randomized algorithms, since deterministic algorithms are equivalent to randomized algorithms with degenerate probability functions (i.e., they place all probability mass on a single point). Furthermore, any population-based method can also be represented in this framework, by holding the sampling distribution fixed while selecting the $n$ elements of a population, then considering the resulting history (possibly limiting the horizon of the history to only the previous $n$ steps, creating a Markovian dependence on the previous population) and updating the probability distribution over the search space for the next $n$ steps.

Abstracting away the details of how such a distribution is chosen, we can treat the search algorithm as a black box that somehow chooses elements of a search space to evaluate. A search is successful if an element in the target set is located during the search. Algorithm 1 outlines the steps followed by the black-box search algorithm and Figure 2 visually demonstrates the process.

1:  Initialize $h_0 \leftarrow (\emptyset, F(\emptyset))$.
2:  for all $i = 1, \ldots, n$ do
3:     Using history $h_{0:i-1}$, compute $P_i$, the distribution over $\Omega$.
4:     Sample element $\omega_i$ according to $P_i$.
5:     Set $h_i \leftarrow (\omega_i, F(\omega_i))$.
6:  end for
7:  if an element of $T$ is contained in any tuple of $h$ then
8:     Return success.
9:  else
10:     Return failure.
11:  end if
Algorithm 1 Black-box Search Algorithm
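For concreteness, here is a minimal Python sketch of this black-box loop, with uniform random sampling standing in as the strategy; the oracle and hook names (resource_eval, make_distribution) are illustrative choices of ours, not part of the formal framework.

    import random

    def black_box_search(omega, target_set, resource_eval, make_distribution,
                         n_queries, seed=0):
        """Run the black-box loop of Algorithm 1 and report success."""
        rng = random.Random(seed)
        # h_0: initial information extracted from the resource, no query yet.
        history = [(None, resource_eval(None))]
        for _ in range(n_queries):
            probs = make_distribution(history)                   # compute P_i from the history
            point = rng.choices(omega, weights=probs, k=1)[0]    # sample omega_i ~ P_i
            history.append((point, resource_eval(point)))        # record (omega_i, F(omega_i))
        queried = {pt for pt, _ in history if pt is not None}
        return bool(queried & target_set)                        # success iff a target element was queried

    # Usage: a toy objective oracle and a uniform (random-sampling) strategy.
    omega = list(range(100))
    target_set = {42}
    oracle = lambda x: "initial info" if x is None else -abs(x - 42)
    uniform_strategy = lambda history: [1.0] * len(omega)
    print(black_box_search(omega, target_set, oracle, uniform_strategy, n_queries=10))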

Our framework is sufficiently general to apply to supervised learning, genetic algorithms, and sequential model-based hyperparameter optimization, yet is specific enough to allow for precise quantitative statements. Although faced with an inherent balancing act whenever formulating a new framework, we emphasize greater generality to allow for the largest number of domain specific applications.

IV Measuring Performance

Since general search algorithms may vary the total number of sampling steps performed, we measure performance using the expected per-query probability of success,

$$q(T, F) = \mathbb{E}_{\tilde{P}, H}\!\left[\frac{1}{|\tilde{P}|} \sum_{i=1}^{|\tilde{P}|} P_i(\omega \in T) \,\middle|\, F\right] \qquad (2)$$

where $\tilde{P}$ is the sequence of probability distributions for sampling elements (with one distribution $P_i$ for each time step $i$), $H$ is the search history, and the $T$ and $F$ make the dependence on the search problem explicit. Because this is an expectation over sequences of history tuples and probability distributions, past information forms the basis for each $P_i$. $|\tilde{P}|$ denotes the length of the sequence $\tilde{P}$, which equals the number of queries taken. The expectation is taken over all sources of randomness, which includes randomness over possible search histories and any randomness in constructing the various $P_i$ from the history (if such a construction is not entirely deterministic). Taking the expectation over all sources of randomness is equivalent to the probability of success for samples drawn from an appropriately averaged distribution (see Lemma 1). Because we are sampling from a fixed (after conditioning and expectation) probability distribution, the expected per-query probability of success for the algorithm is equivalent to the induced amount of probability mass allocated to target elements. Revisiting our fictional security scenario, the probability of capturing an enemy agent is proportional to the amount of security at his location. In the same way, the expected per-query probability of success is equal to the amount of induced probability mass at locations where target elements are placed. Thus, each averaged distribution demarcates an equivalence class of search algorithms mapping to it; we refer to these equivalence classes as search strategies.

We use uniform random sampling with replacement as our baseline search algorithm, which is a simple, always-available strategy, and we define $p$ as the per-query probability of success for that method. In a natural way, $p$ is a measure of the intrinsic difficulty of a search problem [14], absent any side-information. The ratio $p/q(T,F)$ quantifies the improvement achieved by a search algorithm over simple random sampling. Like an error quantity, when this ratio is small (i.e., less than one) the performance of the algorithm is better than uniform random sampling, but when it is larger than one, performance is worse. We will often write $q(T,F)$ simply as $q$ when the target set and information resource are clear from context.
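A small worked example (our numbers, purely illustrative) of these quantities:

    omega_size = 1000
    k = 3                                   # number of target elements
    p = k / omega_size                      # baseline: uniform random sampling, p = 0.003

    # Suppose the algorithm's averaged "strategy" distribution places
    # 2% of its mass on each of the three target elements.
    q = 3 * 0.02                            # expected per-query probability of success, q = 0.06

    ratio = p / q                           # 0.05 < 1, so the algorithm beats uniform sampling
    print(f"p = {p}, q = {q}, p/q = {ratio}")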

V Main Results

We state our main results here with brief discussion, and with proofs given in the Appendix.

V-A Famine of Forte

Theorem 1.

(Famine of Forte) Define

$$\tau_k = \{T : T \subseteq \Omega, |T| = k\}$$

and let $B_m$ denote any set of binary strings, such that the strings are of length $m$ or less. Let

$$R = \{(T, F) : T \in \tau_k, F \in B_m\} \quad \text{and} \quad R_{q_{\min}} = \{(T, F) : T \in \tau_k, F \in B_m, q(T, F) \ge q_{\min}\},$$

where $q(T, F)$ is the expected per-query probability of success for algorithm $\mathcal{A}$ on problem $(\Omega, T, F)$. Then for any $q_{\min} \in (0, 1]$,

$$\frac{|R_{q_{\min}}|}{|R|} \le \frac{p}{q_{\min}}$$

and

$$\lim_{m \to \infty} \frac{|R_{q_{\min}}|}{|R|} \le \frac{p}{q_{\min}},$$

where $p = k/|\Omega|$.

We see that for small $p$ (problems with sparse target sets) favorable search problems are rare if we desire a strong probability of success. The larger $q_{\min}$, the smaller the proportion. In many real-world settings, we are given a difficult search problem (with minuscule $p$) and we hope that our algorithm has a reasonable chance of achieving success within a limited number of queries. According to this result, the proportion of problems fulfilling such criteria is also minuscule. Only if we greatly relax the minimum performance demanded, so that $q_{\min}$ approaches the scale of $p$, can we hope to easily stumble upon such accommodating search problems. If not, insight and/or auxiliary information (e.g., convexity constraints, smoothness assumptions, domain knowledge, etc.) are required to find problems for which an algorithm excels.
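To get a feel for the bound, here is a short numeric sketch (illustrative values of our choosing) of the $p/q_{\min}$ ceiling on the proportion of favorable problems:

    def favorable_problem_bound(k, omega_size, q_min):
        """Theorem 1 ceiling on the proportion of problems with q(T, F) >= q_min."""
        p = k / omega_size
        return min(1.0, p / q_min)

    # A single target element in a space of a million, asking for just a 1%
    # expected per-query chance of success: at most 0.01% of problems qualify.
    print(favorable_problem_bound(k=1, omega_size=10**6, q_min=0.01))    # 0.0001

    # Relaxing q_min down to the scale of p makes the bound vacuous (1.0).
    print(favorable_problem_bound(k=1, omega_size=10**6, q_min=1e-6))    # 1.0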

This result has many applications, as will be shown in Section VI.

V-B Conservation of Information

Corollary 1.

(Conservation of Active Information of Expectations) Define $I_{q(T,F)} = -\log_2 \frac{p}{q(T,F)}$ and let

$$R_b = \{(T, F) : T \in \tau_k, F \in B_m, I_{q(T,F)} \ge b\}.$$

Then for any $b > 0$,

$$\frac{|R_b|}{|R|} \le 2^{-b}.$$

The active information transform quantifies improvement in a search over a baseline search method [14], measuring gains in search performance in terms of information (bits). This allows one to precisely quantify the proportion of favorable functions in relation to the number of bits of improvement desired. Active information provides a nice geometric interpretation of improved search performance, namely, that the improved search method is equivalent to a uniform random sampler on a reduced search space. Similarly, a degraded-performance search is equivalent to a uniform random sampler (with replacement) on a larger search space.

Applying this transform, we see that finding a search problem for which an algorithm effectively reduces the search space by $b$ bits requires at least $b$ bits of information, so information is conserved in this context. Assuming you have no domain knowledge to guide the process of finding a search problem for which your search algorithm excels, you are unlikely to stumble upon one by chance; indeed, such problems are exponentially rare in the amount of improvement sought.
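A quick numerical translation of this conservation statement (illustrative values): each additional bit of sought improvement halves the ceiling on the proportion of favorable problems.

    for b in (1, 5, 10, 20):
        print(f"{b:2d} bits of improvement -> at most 2^-{b} = {2.0 ** -b:.1e} of problems")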

V-C Famine of Favorable Strategies

Theorem 2.

(Famine of Favorable Strategies) For any fixed search problem $(\Omega, T, F)$, set of probability mass functions $\mathcal{P} = \{P : P \text{ is a distribution over } \Omega\}$, and fixed threshold $q_{\min} \in (0, 1]$,

$$\frac{\mu(G_{q_{\min}})}{\mu(\mathcal{P})} \le \frac{p}{q_{\min}},$$

where $G_{q_{\min}} = \{P : P \in \mathcal{P}, \sum_{\omega \in T} P(\omega) \ge q_{\min}\}$ and $\mu$ is Lebesgue measure. Furthermore, the proportion of possible search strategies giving at least $b$ bits of active information of expectations is no greater than $2^{-b}$.
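The strategy-side bound can be checked by simulation: draw strategies uniformly from the probability simplex (a symmetric Dirichlet, as in the proof of Lemma 3) and count how often they place at least $q_{\min}$ mass on a fixed target set. A minimal NumPy sketch with illustrative sizes:

    import numpy as np

    rng = np.random.default_rng(0)
    omega_size, k, q_min = 50, 2, 0.3
    target_idx = np.arange(k)          # which elements are targets (irrelevant by symmetry)

    # Uniform draws from the |Omega|-dimensional simplex: Dirichlet(1, ..., 1).
    strategies = rng.dirichlet(np.ones(omega_size), size=200_000)
    mass_on_target = strategies[:, target_idx].sum(axis=1)

    empirical = (mass_on_target >= q_min).mean()
    bound = (k / omega_size) / q_min   # p / q_min from Theorem 2
    print(f"empirical proportion {empirical:.5f} <= bound {bound:.4f}")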

Thus, not only are favorable problems rare, so are favorable strategies. Whether you hold fixed the algorithm and try to match a problem to it, or hold fixed the problem and try to match a strategy to it, both are provably difficult. Because matching problems to algorithms is hard, seemingly serendipitous agreement between the two calls for further explanation, serving as evidence against blind matching by independent mechanisms (especially for very sparse targets embedded in very large spaces). More importantly, this result places hard quantitative constraints (similar to minimax bounds in statistical learning theory) on information costs for automated machine learning, which attempts to match learning algorithms to learning problems [16, 17].

V-D Success Under Dependence

Theorem 3.

(Success Under Dependence) Define

$$q = \mathbb{E}_{T,F}\left[q(T, F)\right]$$

and note that this is the expected per-query probability of success under the joint distribution on $T$ and $F$. Then,

$$q \le \frac{I(T; F) + D(P_T \,\|\, \mathcal{U}_T) + 1}{I_{\Omega}},$$

where $I_{\Omega} = -\log \frac{k}{|\Omega|}$, $D(P_T \,\|\, \mathcal{U}_T)$ is the Kullback-Leibler divergence between the marginal distribution on $T$ and the uniform distribution on $\tau_k$, and $I(T; F)$ is the mutual information. Alternatively, we can write

$$q \le \frac{H(\mathcal{U}_T) - H(T \mid F) + 1}{I_{\Omega}},$$

where $H(\mathcal{U}_T) = \log |\tau_k|$.

Thus the bound on expected probability of success improves monotonically with the amount of dependence between target sets and information resources. We quantify the dependence using mutual information under any fixed joint distribution on $T$ and $F$. We see that $I_{\Omega}$ measures the relative target sparseness of the search problem, and can be interpreted as the information cost of locating a target element in the absence of side-information. $D(P_T \,\|\, \mathcal{U}_T)$ is naturally interpreted as the predictability of the target sets, since large values imply the probable occurrence of only a small number of possible target sets. The mutual information $I(T; F)$ is the amount of exploitable information the external resource contains regarding $T$; lowering the mutual information lowers the maximum expected probability of success for the algorithm. Lastly, the 1 in the numerator upper bounds the contribution of pure randomness, as explained in the Appendix. Thus, this expression constrains the relative contributions of predictability, problem difficulty, side-information, and randomness for a successful search, providing a satisfyingly simple and interpretable upper bound on the probability of successful search.
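The sketch below evaluates this upper bound for assumed (illustrative) values of the mutual information and KL divergence, with all information quantities in bits:

    import math

    def success_bound(mi_bits, kl_bits, k, omega_size):
        """Theorem 3 ceiling: (I(T;F) + D(P_T || U_T) + 1) / I_Omega."""
        i_omega = -math.log2(k / omega_size)       # baseline information cost
        return min(1.0, (mi_bits + kl_bits + 1) / i_omega)

    # No exploitable dependence and a uniform target marginal: only the +1
    # "luck" term remains, so success is capped near 1/I_Omega.
    print(success_bound(0.0, 0.0, k=1, omega_size=2**20))    # 1/20 = 0.05
    # Strong dependence raises the ceiling substantially.
    print(success_bound(15.0, 0.0, k=1, omega_size=2**20))   # 16/20 = 0.80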

VI Examples

VI-A Binary Classification

It has been suggested that machine learning represents a type of search through parameter or concept space [18]. Supporting this view, we can represent binary classification problems within our framework as follows:

  • $\mathcal{A}$ - classification algorithm, such as an SVM.

  • $\Omega$ - space of possible concepts over an instance space.

  • $T$ - set of all hypotheses with less than 10% classification error on the test set, for example.

  • $F$ - set of training examples.

    • $F(\emptyset)$ - full set of training data.

    • $F(c)$ - loss on training data for concept $c$.

  • $(\Omega, T, F)$ - binary classification learning task.

The space of possible binary concepts over the instance space serves as the search space, with the true concept being an element of that space. The target set consists of all concepts in that space that (1) are consistent with the training data (which we will assume all are), and (2) differ from the truth in at most 10% of positions on the held-out generalization dataset. Let us assume the marginal distribution on $T$ is uniform, which isn't necessary but simplifies the calculation. The external information resource $F$ is the set of training examples. The algorithm uses the training examples (given by $F(\emptyset)$) to produce a distribution over the space of concepts; for deterministic algorithms, this is a degenerate distribution on exactly one element. A single query is then taken (i.e., a concept is output), and we assess the probability of success for that single query. By Theorem 3, the expected chance of outputting a concept with at least 90% generalization accuracy is thus no greater than $(I(T;F) + D(P_T \,\|\, \mathcal{U}_T) + 1)/I_{\Omega}$. The denominator is the information cost of specifying at least one element of the target set and the numerator represents the information resources available for doing so. When the mutual information meets (or exceeds) that cost, success can be ensured for any algorithm perfectly mining the available mutual information. When noise reduces the mutual information below the information cost, the expected probability of success becomes strictly bounded in proportion to that ratio.
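As a concrete (entirely hypothetical) instance of this calculation: suppose there are 2^100 possible concepts, 2^60 of which meet the accuracy criterion, so the information cost is 40 bits; the ceiling then scales with however many bits of mutual information the training data carries about the target set.

    import math

    concept_space_bits = 100                    # |Omega| = 2^100 concepts (assumed)
    target_bits = 60                            # |T| = 2^60 acceptable concepts (assumed)
    i_omega = concept_space_bits - target_bits  # -log2(|T| / |Omega|) = 40 bits

    def generalization_success_bound(mi_bits, kl_bits=0.0):
        # Theorem 3 bound; uniform marginal on targets assumed, so D = 0.
        return min(1.0, (mi_bits + kl_bits + 1) / i_omega)

    for mi in (0, 10, 39):
        print(f"I(T;F) = {mi:2d} bits -> success bounded by {generalization_success_bound(mi):.3f}")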

VI-B General Learning Problems

Vapnik presents a generalization of learning that applies to classification, regression, and density estimation [19], which we can translate into our framework. Following Vapnik, let a probability measure $P(z)$ be defined on a space $Z$, and consider the parameterized set of functions $Q(z, \alpha)$, $\alpha \in \Lambda$. The goal is to minimize the risk $R(\alpha) = \int Q(z, \alpha)\, dP(z)$ for $\alpha \in \Lambda$, when $P(z)$ is unknown but an i.i.d. sample $z_1, \ldots, z_\ell$ is given. Let $R_{\mathrm{emp}}(\alpha) = \frac{1}{\ell} \sum_{i=1}^{\ell} Q(z_i, \alpha)$ be the empirical risk.

To reduce this general problem to a search problem within our framework, assume $\Lambda$ is finite, choose an acceptable risk threshold $\epsilon$, and let

  • $\Omega = \Lambda$;

  • $T = \{\alpha : R(\alpha) \le \epsilon\}$;

  • $F$ = the i.i.d. sample $z_1, \ldots, z_\ell$;

  • $F(\emptyset)$ = the full sample; and

  • $F(\alpha) = R_{\mathrm{emp}}(\alpha)$.

Thus, any finite problem representable in Vapnik’s statistical learning framework is also directly representable within our search framework.

VI-C Hyperparameter Optimization

Given that sequential hyperparameter optimization is a literal search through a space of hyperparameter configurations, our results are directly applicable. The search space $\Omega$ consists of all the possible hyperparameter configurations (appropriately discretized in the case of numerical hyperparameters). The target set is determined by the particular learning algorithm the configurations are applied to, the performance metric used, and the level of performance desired. Let a set of points be sampled from the space, and let the information gained from that sample become the external information resource $F$. Given that resource, we have the following theorem:

Theorem 4.

Given a search algorithm $\mathcal{A}$, a finite discrete hyperparameter configuration space $\Omega$, a set of points sampled from that search space, and an information resource $F$ that is a function of the sampled set, let $\tau_k = \{T : T \subseteq \Omega, |T| = k\}$, $R = \{(T, F) : T \in \tau_k\}$, and $R_{q_{\min}} = \{(T, F) : T \in \tau_k, q(T, F) \ge q_{\min}\}$, where $q(T, F)$ is the expected per-query probability of success for algorithm $\mathcal{A}$ under $T$ and $F$. Then,

$$\frac{|R_{q_{\min}}|}{|R|} \le \frac{p}{q_{\min}},$$

where $p = k/|\Omega|$.

The proof follows directly from Theorem 1.

The proportion of possible hyperparameter target sets giving an expected probability of success of $q_{\min}$ or more is minuscule whenever $p \ll q_{\min}$. If we have no additional information beyond that gained from the sampled points, we have no justifiable basis for expecting a successful search. Thus, we must make some assumptions concerning the relationship of the points sampled to the remaining points in $\Omega$. We can do so by either assuming structure on the search space, such that spatial information becomes informative, or by making an assumption on the process by which the points were sampled, so that the sample is representative of the space in quantifiable ways. These assumptions allow $F$ to become informative of the target set $T$, leading to exploitable dependence under Theorem 3. Thus we see the need for inductive bias in hyperparameter optimization (to expand a term used by Mitchell [20] for classification), which hints at a strategy for creating more effective hyperparameter optimization algorithms (i.e., through exploitation of spatial structure).

VI-D One-Size-Fits-All Fitness Functions

Theorem 1 gives us an upper bound on the proportion of favorable search problems, but what happens when we have a single, fixed information resource, such as a single fitness function? A natural question to ask is: for how many target locations can such a fitness function be useful? More precisely, for a given search algorithm, for what proportion of search space locations can the fitness function raise the expected probability of success, assuming a target element happens to be located at one of those spots?

Applying Theorem 1 with $k = 1$, we find that a fitness function can significantly raise the probability of locating target elements placed on, at most, $1/q_{\min}$ search space elements. We see this as follows. Since $k = 1$ and the fitness function is fixed, each search problem maps to exactly one element of $\Omega$, giving $|\Omega|$ possible search problems. The number of $q_{\min}$-favorable search problems is upper-bounded by

$$|\Omega| \cdot \frac{p}{q_{\min}} = |\Omega| \cdot \frac{1}{|\Omega|\, q_{\min}} = \frac{1}{q_{\min}}.$$

Because this expression is independent of the size of the search space, the number of elements for which a fitness function can strongly raise the probability of success remains fixed even as the size of the search space increases. Thus, for very large search spaces the proportion of favored locations effectively vanishes. There can exist no single fitness function that is strongly favorable for many elements simultaneously, and thus no "one-size-fits-all" fitness function.
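Numerically (with an illustrative $q_{\min}$), the count of locations a single fixed fitness function can make $q_{\min}$-favorable does not grow with the search space:

    def max_favored_locations(omega_size, q_min, k=1):
        # |Omega| candidate problems, each target a single element; Theorem 1
        # caps the favorable fraction at p / q_min, so the count is k / q_min.
        p = k / omega_size
        return omega_size * (p / q_min)

    for n in (10**3, 10**6, 10**9):
        print(n, max_favored_locations(n, q_min=0.5))    # 2.0 in every case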

VI-E Proliferation of Learning Algorithms

Our results also help make sense of recent trends in machine learning. They explain why new algorithms continue to be developed, year after year, conference after conference, despite decades of research in machine learning and optimization. Given that algorithms can only perform well on a narrow subset of problems, we must either continue to devise novel algorithms for new problem domains or else move towards flexible algorithms that can modify their behavior solely through parameterization. The latter effectively behave like new strategies for new hyperparameter configurations. The explosive rise of flexible, hyperparameter-sensitive algorithms like deep learning methods and vision architectures shows a definite trend towards the latter, with their hyperparameter sensitivity being well known [21, 22]. Furthermore, because flexible algorithms are highly sensitive to their hyperparameters (by definition), this explains the concurrent rise in automated hyperparameter optimization methods [23, 24, 16, 25].

VI-F Landmark Security

Returning to our initial toy example, we can now fill in a few details. Our external information resource is the pertinent intelligence data, mined through surveillance. We begin with background knowledge (represented in $F(\emptyset)$), used to make the primary security force placements. Team members on the ground communicate updates back to central headquarters, such as suspicious activity, which are the evaluations $F(\omega)$ used to update the internal information state. Each resource allocated is a query, and manpower constraints limit the number of queries available. Doing more with fewer officers is better, so the hope is to maximize the per-officer probability of stopping the attack.

Our results tell us a few things. First, a fixed strategy can only work well in a limited number of situations. There is little or no hope of a problem being a good match for your strategy if the problem arises independently of it (Theorems 1 and 2). So reliable intelligence becomes key. The better correlated the intelligence reports are with the actual plot, the better a strategy can perform (Theorem 3). However, even for a fixed search problem with reliable external information resource there is no guarantee of success, if the strategy is chosen poorly; the proportion of good strategies for a fixed problem is no better than the proportion of good problems for a fixed algorithm (Theorem 2). Thus, domain knowledge is crucial in choosing either. Without side-information to guide the match between search strategy and search problem, the expected probability of success is dismal in target-sparse situations.

VII Conclusion

A colleague once remarked that the results presented here bring to mind Lake Wobegon, "where all the children are above average." In a world where every published algorithm is "above average," conservation of information reminds us that this cannot be. Improved performance over any subset of problems necessarily implies degraded performance over the remaining problems [10], and all methods have performance equivalent to random sampling when uniformly averaged over any closed-under-permutation set of problems [1, 2, 4].

But there are many ways to achieve an average. An algorithm might perform well over a large number of problems, only to perform poorly on a small set of remaining problems. A different algorithm might perform close to average on all problems, with slight variations around the mean. How large can the subset of problems with improved performance be? We show that the maximum proportion of search problems for which an algorithm can attain $q_{\min}$-favorable performance is bounded above by $p/q_{\min}$. Thus, an algorithm can only perform well over a narrow subset of possible problems. Not only is there no free lunch, but there is also a famine of favorable problems.

If finding a good search problem for a fixed algorithm is hard, then so is finding a good search algorithm for a fixed problem (Theorem 2). Thus, the matching of problems to algorithms is provably difficult, regardless of which is fixed and which varies.

Our results paint a more optimistic picture once we restrict ourselves to those search problems for which the external information resource is strongly informative of the target set, as is often assumed to be the case. For those problems, the expected per-query probability of success is upper bounded by a function involving the mutual information between target sets and external information resources (like training datasets and objective functions). The lower the mutual information, the lower the chance of success, but the bound improves as the dependence is strengthened.

The search framework we propose is general enough to find application in many problem areas, such as machine learning, evolutionary search, and hyperparameter optimization. The results are not just of theoretical importance, but help explain real-world phenomena, such as the need for exploitable dependence in machine learning and the empirical difficulty of automated learning [17]. Our results help us understand the growing popularity of deep learning methods and unavoidable interest in automated hyperparameter tuning methods. Extending the framework to continuous settings and other problem areas (such as active learning) is the focus of ongoing research.

Acknowledgement

I would like to thank Akshay Krishnamurthy, Junier Oliva, Ben Cowley and Willie Neiswanger for their discussions regarding Lemma 2. I am indebted to Geoff Gordon for help proving Lemma 2, and to Cosma Shalizi for providing many good insights, challenges and ideas concerning this manuscript.

References

  • [1] D. Wolpert and W. Macready, “No free lunch theorems for optimization,” IEEE Transactions on Evolutionary Computation, vol. 1, no. 1, pp. 67–82, April 1997.
  • [2] J. Culberson, “On the futility of blind search: An algorithmic view of ‘no free lunch’,” Evolutionary Computation, vol. 6, no. 2, pp. 109–127, 1998.
  • [3] T. English, “Evaluation of evolutionary and genetic optimizers: No free lunch,” in Evolutionary Programming V: Proceedings of the Fifth Annual Conference on Evolutionary Programming, 1996, pp. 163–169.
  • [4] C. Schumacher, M. Vose, and L. Whitley, “The no free lunch and problem description length,” in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), 2001, pp. 565–570.
  • [5] D. Whitley, “Functions as permutations: regarding no free lunch, walsh analysis and summary statistics,” in Parallel Problem Solving from Nature PPSN VI.   Springer, 2000, pp. 169–178.
  • [6] T. English, “No more lunch: Analysis of sequential search,” in Evolutionary Computation, 2004. CEC2004. Congress on, vol. 1.   IEEE, 2004, pp. 227–234.
  • [7] S. Droste, T. Jansen, and I. Wegener, “Optimization with randomized search heuristics–the (A)NFL theorem, realistic scenarios, and difficult functions,” Theoretical Computer Science, vol. 287, no. 1, pp. 131–144, 2002.
  • [8] W. Dembski and R. Marks II, “The search for a search: Measuring the information cost of higher level search,” Journal of Advanced Computational Intelligence and Intelligent Informatics, vol. 14, no. 5, pp. 475–486, 2010.
  • [9] W. Dembski, W. Ewert, and R. Marks II, “A general theory of information cost incurred by successful search,” in Biological Information.   World Scientific, 2013, ch. 3, pp. 26–63.
  • [10] C. Schaffer, “A conservation law for generalization performance,” in Proceedings of the Eleventh International Machine Learning Conference, W. W. Cohen and H. Hirsch, Eds.   Rutgers University, New Brunswick, NJ, 1994, pp. 259–265.
  • [11] G. D. Montañez, “Bounding the number of favorable functions in stochastic search,” in Evolutionary Computation (CEC), 2013 IEEE Congress on, June 2013, pp. 3019–3026.
  • [12] T. English, “Optimization is easy and learning is hard in the typical function,” in Evolutionary Computation, 2000. Proceedings of the 2000 Congress on, vol. 2.   IEEE, 2000, pp. 924–931.
  • [13] R. B. Rao, D. Gordon, and W. Spears, “For every generalization action, is there really an equal and opposite reaction? analysis of the conservation law for generalization performance,” Urbana, vol. 51, p. 61801.
  • [14] W. Dembski and R. Marks II, “Conservation of information in search: Measuring the cost of success,” Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, vol. 39, no. 5, pp. 1051 –1061, sept. 2009.
  • [15] D. LaLoudouana and M. B. Tarare, “Data set selection,” Journal of Machine Learning Gossip, vol. 1, pp. 11–19, 2003.
  • [16] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Auto-weka: Combined selection and hyperparameter optimization of classification algorithms,” in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2013, pp. 847–855.
  • [17] I. Guyon, K. Bennett, G. Cawley, H. J. Escalante, S. Escalera, T. K. Ho, N. Macià, B. Ray, M. Saeed, A. Statnikov, and E. Viegas, “Design of the 2015 ChaLearn AutoML challenge,” in 2015 International Joint Conference on Neural Networks (IJCNN), July 2015, pp. 1–8.
  • [18] T. M. Mitchell, “Generalization as search,” Artificial intelligence, vol. 18, no. 2, pp. 203–226, 1982.
  • [19] V. N. Vapnik, “An overview of statistical learning theory,” IEEE transactions on neural networks, vol. 10, no. 5, pp. 988–999, 1999.
  • [20] T. M. Mitchell, “The need for biases in learning generalizations,” Rutgers University, Tech. Rep., 1980.
  • [21] J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian optimization of machine learning algorithms,” in Advances in neural information processing systems, 2012, pp. 2951–2959.
  • [22] J. Bergstra, D. Yamins, and D. Cox, “Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures,” in Proc. 30th International Conference on Machine Learning (ICML-13), 2013.
  • [23] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas, “Taking the human out of the loop: A review of bayesian optimization,” Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2016.
  • [24] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 281–305, 2012.
  • [25] Z. Wang, F. Hutter, M. Zoghi, D. Matheson, and N. de Feitas, “Bayesian optimization in a billion dimensions via random embeddings,” Journal of Artificial Intelligence Research (JAIR), vol. 55, pp. 361–387, February 2016.
  • [26] R. Fano and D. Hawkins, “Transmission of information: A statistical theory of communications,” American Journal of Physics, vol. 29, no. 11, pp. 793–794, 1961.

VIII Appendix: Proofs

Lemma 1.

(Expected Per Query Performance From Expected Distribution) Let be a target set, the expected per-query probability of success for an algorithm and be the conditional joint measure induced by that algorithm over finite sequences of probability distributions and search histories, conditioned on external information resource . Denote a probability distribution sequence by and a search history by . Let denote a uniform distribution on elements of and define . Then,

where is a probability distribution on the search space.

Proof.

Begin by expanding the definition of , being the average probability mass on element under sequence :

We note that is a proper probability distribution, since

  1. , being the integral of a nonnegative function;

  2. , as

  3. sums to one, because

Finally,

Lemma 2.

(Maximum Number of Satisfying Vectors) Given an integer , a set of all -length -hot binary vectors, a set of discrete -dimensional simplex vectors, and a fixed scalar threshold , then for any fixed ,

where denotes the vector dot product between and .

Proof.

For , the bound holds trivially. For , let be a random quantity that takes values uniformly in the set . Then, for any fixed ,

Let $\mathbf{1}$ denote the all-ones vector. Under a uniform distribution on the random quantity, and because the fixed simplex vector does not change with respect to it, we have

since must sum to .

Noting that , we use Markov’s inequality to get

Theorem 1.

(Famine of Forte) Define

$$\tau_k = \{T : T \subseteq \Omega, |T| = k\}$$

and let $B_m$ denote any set of binary strings, such that the strings are of length $m$ or less. Let

$$R = \{(T, F) : T \in \tau_k, F \in B_m\} \quad \text{and} \quad R_{q_{\min}} = \{(T, F) : T \in \tau_k, F \in B_m, q(T, F) \ge q_{\min}\},$$

where $q(T, F)$ is the expected per-query probability of success for algorithm $\mathcal{A}$ on problem $(\Omega, T, F)$. Then for any $q_{\min} \in (0, 1]$,

$$\frac{|R_{q_{\min}}|}{|R|} \le \frac{p}{q_{\min}}$$

and

$$\lim_{m \to \infty} \frac{|R_{q_{\min}}|}{|R|} \le \frac{p}{q_{\min}},$$

where $p = k/|\Omega|$.

Proof.

We begin by defining the set of all $|\Omega|$-length target functions with exactly $k$ ones, namely, $\tau_k$. For each of these, we have $|B_m|$ external information resources. The total number of search problems is therefore

$$|R| = |\tau_k|\,|B_m|. \qquad (3)$$

We seek to bound the proportion of possible search problems for which $q(T, F) \ge q_{\min}$, for any threshold $q_{\min} \in (0, 1]$. Thus,

(4)
(5)

where denotes the arg sup of the expression. Therefore,

where the first equality follows from Lemma 1, means the target function evaluated at is one, and represents the -length probability vector defined by . By Lemma 2, we have

(6)

proving the result for finite external information resources.

To extend to infinite external information resources, let and define

(7)
(8)

We have shown that for each . Thus,

Next, we use the monotone convergence theorem to show the limit exists. First,

(9)

By construction, the successive $B_m$ are nested with increasing $m$, so the sequence of suprema (and hence the numerator) is increasing, though not necessarily strictly increasing. The denominator does not depend on $m$, so the ratio is an increasing sequence. Because it is also bounded above, the limit exists by monotone convergence. Thus,

Lastly,

Corollary 1.

(Conservation of Active Information) Define active information of expectations as $I_{q(T,F)} = -\log_2 \frac{p}{q(T,F)}$, where $p$ is the per-query probability of success for uniform random sampling and $q(T,F)$ is the expected per-query probability of success for an alternative search algorithm. Define $\tau_k = \{T : T \subseteq \Omega, |T| = k\}$ and let $B_m$ denote any set of binary strings, such that the strings are of length $m$ or less. Let

$$R_b = \{(T, F) : T \in \tau_k, F \in B_m, I_{q(T,F)} \ge b\}.$$

Then for any $b > 0$,

$$\frac{|R_b|}{|R|} \le 2^{-b}.$$

Proof.

The proof follows from the definition of active information of expectations and Theorem 1. Note,

$$I_{q(T,F)} \ge b \qquad (10)$$

implies

$$q(T, F) \ge p\, 2^{b}. \qquad (11)$$

Since $I_{q(T,F)} \ge b$ implies $q(T,F) \ge p\,2^{b}$, the set of problems for which $I_{q(T,F)} \ge b$ can be no bigger than the set for which $q(T,F) \ge p\,2^{b}$. By Theorem 1, the proportion of problems for which $q(T,F)$ is at least $p\,2^{b}$ is no greater than $p/(p\,2^{b}) = 2^{-b}$. Thus,

$$\frac{|R_b|}{|R|} \le 2^{-b}. \qquad (12)$$

Corollary 2.

(Conservation of Expected ) Define

where is the per-query probability of success for uniform random sampling and is the per-query probability of success for an alternative search algorithm. Define

and let denote any set of binary strings, such that the strings are of length or less. Let

Then for any

Proof.

By Jensen’s inequality and the concavity of in , we have

The result follows by invoking Corollary 1. ∎

Lemma 3.

(Maximum Proportion of Satisfying Strategies) Given an integer , a set of all -length -hot binary vectors, a set of discrete -dimensional simplex vectors, and a fixed scalar threshold , then

where and is Lebesgue measure.

Proof.

Similar results have been proved by others with regard to No Free Lunch theorems [1, 14, 4, 9, 3]. Our result concerns the maximum proportion of sufficiently good strategies (not the mean performance of strategies over all problems, as in the NFL case) and is a simplification over previous search-for-a-search results.

For , the bound holds trivially. For , we first notice that the term can be viewed as a uniform density over the region of the simplex , so that the integral becomes an expectation with respect to this distribution, where is drawn uniformly from . Thus, for any ,

where the final line follows from Markov's inequality. Since the symmetric Dirichlet distribution with parameter $\alpha = 1$ gives the uniform distribution over the simplex, we get

where $\mathbf{1}$ denotes the all-ones vector. We have

Theorem 2.

(Famine of Favorable Strategies) For any fixed search problem $(\Omega, T, F)$, set of probability mass functions $\mathcal{P} = \{P : P \text{ is a distribution over } \Omega\}$, and fixed threshold $q_{\min} \in (0, 1]$,

$$\frac{\mu(G_{q_{\min}})}{\mu(\mathcal{P})} \le \frac{p}{q_{\min}},$$

where $G_{q_{\min}} = \{P : P \in \mathcal{P}, \sum_{\omega \in T} P(\omega) \ge q_{\min}\}$ and $\mu$ is Lebesgue measure. Furthermore, the proportion of possible search strategies giving at least $b$ bits of active information of expectations is no greater than $2^{-b}$.

Proof.

Applying Lemma 3, with , , , , and , yields the first result, while following the same steps as Corollary 1 gives the second (noting that by Lemma 1 each strategy is equivalent to a corresponding ). ∎

Theorem 3.

(Probability of Success Under Dependence) Define $\tau_k = \{T : T \subseteq \Omega, |T| = k\}$ and let $B_m$ denote any set of binary strings, such that the strings are of length $m$ or less. Define $q$ as the expected per-query probability of success under the joint distribution on $T$ and $F$ for any fixed algorithm $\mathcal{A}$, so that $q = \mathbb{E}_{T,F}[q(T, F)]$. Then,

$$q \le \frac{I(T; F) + D(P_T \,\|\, \mathcal{U}_T) + 1}{I_{\Omega}},$$

where $I_{\Omega} = -\log \frac{k}{|\Omega|}$, $D(P_T \,\|\, \mathcal{U}_T)$ is the Kullback-Leibler divergence between the marginal distribution on $T$ and the uniform distribution on $\tau_k$, and $I(T; F)$ is the mutual information. Alternatively, we can write

$$q \le \frac{H(\mathcal{U}_T) - H(T \mid F) + 1}{I_{\Omega}},$$

where $H(\mathcal{U}_T) = \log |\tau_k|$.

Proof.

This proof loosely follows that of Fano’s Inequality [26], being a reversed generalization of it, so in keeping with the traditional notation we let for the remainder of this proof. Let