The Futility of Bias-Free Learning and Search

07/13/2019 ∙ by George D. Montanez, et al. ∙ Harvey Mudd College

Building on the view of machine learning as search, we demonstrate the necessity of bias in learning, quantifying the role of bias (measured relative to a collection of possible datasets, or more generally, information resources) in increasing the probability of success. For a given degree of bias towards a fixed target, we show that the proportion of favorable information resources is strictly bounded from above. Furthermore, we demonstrate that bias is a conserved quantity, such that no algorithm can be favorably biased towards many distinct targets simultaneously. Thus bias encodes trade-offs. The probability of success for a task can also be measured geometrically, as the angle of agreement between what holds for the actual task and what is assumed by the algorithm, represented in its bias. Lastly, finding a favorably biasing distribution over a fixed set of information resources is provably difficult, unless the set of resources itself is already favorable with respect to the given task and algorithm.


1 Introduction

Imagine you are on a routine grocery shopping trip and plan to buy some bananas. You know that the store carries both good and bad bananas, which you must search through. There are multiple ways you can go about your search. One way is to randomly pick any ten bananas available on the shelf, which can be regarded as a form of unbiased search. Alternatively, you could introduce some bias into your search by picking only those bananas that are neither underripe nor overripe. Based on your past experiences eating bananas, these bananas are more likely to taste good. The proportion of good bananas retrieved in your biased search is greater than in an unbiased search, because you used your prior knowledge about tasty bananas. This common routine shows how bias enables us to conduct more successful searches based on prior knowledge of the search target.
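The banana scenario can be simulated directly. The shelf contents, ripeness categories, and taste probabilities below are invented purely for illustration:

```python
import random

random.seed(0)

# Hypothetical shelf: ripeness categories and taste probabilities are made up.
shelf = []
for _ in range(1000):
    ripeness = random.choice(["underripe", "just-right", "overripe"])
    p_good = 0.8 if ripeness == "just-right" else 0.2
    shelf.append({"ripeness": ripeness, "good": random.random() < p_good})

def proportion_good(bananas, n=200):
    """Proportion of good bananas among n randomly picked ones."""
    picked = random.sample(bananas, n)
    return sum(b["good"] for b in picked) / n

unbiased_search = proportion_good(shelf)  # pick from the whole shelf
biased_search = proportion_good([b for b in shelf if b["ripeness"] == "just-right"])
```

With prior knowledge encoded as a ripeness filter, the biased search reliably retrieves a higher proportion of good bananas than the unbiased one.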

Viewing these decision-making processes through the lens of machine learning, we analyze how algorithms tackle learning problems under the influence of bias. Would we be better off without bias in machine learning algorithms? Our goal in this paper is to formally characterize the direct relationship between the performance of machine learning algorithms and their underlying biases. Without bias, machine learning algorithms do not perform better than uniform random sampling, on average. Yet the extent to which an algorithm is biased toward some target is the extent to which it is biased against all remaining targets. As a consequence, no algorithm can be biased towards all targets. Bias therefore represents the trade-offs an algorithm makes in how it responds to data.

We approach this problem by analyzing the performance of search algorithms within the algorithmic search framework introduced by Montañez [5]. This framework applies to common machine learning tasks such as classification, regression, clustering, optimization, reinforcement learning, and the general machine learning problems considered in Vapnik's learning framework [6]. We derive results characterizing the role of bias in successful search, extending Famine of Forte results [5] for a fixed search target and varying information resources. Our results for bias-free search then directly apply to bias-free learning, showing the extent to which bias is necessary for successful learning and quantifying how difficult it is to find a distribution with favorable bias for a particular target.

2 Related Work

Schaffer’s seminal work [10] showed that generalization performance for classification problems is a conserved quantity, such that favorable performance on a particular subset of problems will always be offset and balanced by poor performance over the remaining problems. Similarly, we show that bias is also a conserved quantity for any set of information resources. While Schaffer studied the performance of a single algorithm over different learning classes, Wolpert and Macready’s “No Free Lunch Theorems for Optimization” [12] established that all optimization algorithms have the same performance when uniformly averaged over all possible cost functions. They also provided a geometric intuition for this result by defining an inner product which measures the alignment between an algorithm and a given prior over problems. This shows that no algorithm can be simultaneously aligned with all possible priors. In the context of the search framework, we define the geometric divergence as a measure of alignment between a search algorithm and a target in order to bound the proportion of favorable search problems.

While No Free Lunch Theorems are widely recognized as landmark ideas in machine learning, McDermott claims that No Free Lunch results are often misinterpreted and are practically insignificant for many real-world problems [3]. This is because algorithms are commonly tailored to a specific subset of problems in the real world, but No Free Lunch requires that we consider the set of all problems that are closed under permutation. These arguments towards the impracticality of No Free Lunch results are less relevant to our work here, since we evaluate the proportion of successful problems instead of considering the mean performance over the set of all problems. As such, our results are also applicable to sets of problems that are not closed under permutation, as a generalization of No Free Lunch results.

In “The Famine of Forte: Few Search Problems Greatly Favor Your Algorithm”, Montañez [5] reduces machine learning problems to search problems and develops a rigorous search framework to generalize No Free Lunch ideas. He strictly bounds the proportion of problems that are favorable for a fixed algorithm and shows that no single algorithm can perform well over a large fraction of search problems. Extending these results to fixed search targets, we show that there are also strict bounds on the proportion of favorable information resources, and that the bound relaxes with the introduction of bias.

Our notion of bias developed here relates to ideas introduced by Mitchell [4]. According to Mitchell, a completely unbiased classification algorithm cannot generalize beyond training data. He argued that the ability of a learning algorithm to generalize depends on incorporating biases, which means making assumptions beyond strict consistency with training data. These biases may include prior knowledge of the domain, preferences for simplicity, and awareness of the algorithm’s real-world application. We strengthen Mitchell’s argument with a mathematical justification for the need for bias in improving learning performance.

Gülçehre and Bengio empirically support Mitchell's ideas by investigating the nature of training barriers affecting the generalization performance of black-box machine learning algorithms [2]. Using the Structured Multi-Layer Perceptron (SMLP) neural network architecture, they showed that an SMLP pre-trained with hints based on prior knowledge of the task generalizes more efficiently than an SMLP pre-trained with random initializers. Furthermore, Ulyanov et al. explore the success of deep convolutional networks applied to image generation and restoration [11]. By applying untrained convolutional networks to image reconstruction with competitive success to trained ones, they show that the impressive performance of these networks is not due to learning alone. They highlight the importance of inductive bias, which is built into the structure of these generator networks, in achieving this high level of success. In a similar vein, Runarsson and Yao establish that bias is an essential component in constrained evolutionary optimization search problems [9]. They show experimentally that carefully selecting an appropriate constraint-handling method and applying a biasing penalty function enhance the probability of locating feasible solutions for evolutionary algorithms. Inspired by the results obtained from these experimental studies, we formulate a theoretical validation of the role of bias in generalization performance for learning problems.

3 The Search Framework

3.1 The Search Problem

We formulate machine learning problems as search problems using the algorithmic search framework [5]. Within the framework, a search problem is represented as a 3-tuple $(\Omega, T, F)$. The finite search space from which we can sample is $\Omega$. The subset of elements in the search space that we are searching for is the target set $T$. A target function that represents $T$ is an $|\Omega|$-length vector with entries having value 1 when the corresponding elements of $\Omega$ are in the target set and 0 otherwise. The external information resource $F$ is a binary string that provides initialization information for the search and evaluates points in $\Omega$, acting as an oracle that guides the search process.
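As a minimal sketch, the 3-tuple can be written out for a toy problem; the specific space, target set, and oracle below are hypothetical choices, not taken from the paper:

```python
# Toy search problem (search space, target set, external information resource).
omega = list(range(8))                  # finite search space Omega
target_set = {2, 5}                     # subset T we are searching for
t = [1 if w in target_set else 0 for w in omega]  # |Omega|-length target function

def F(w):
    """Hypothetical oracle: evaluates points in Omega, guiding the search."""
    return 1.0 if w in target_set else 0.0

k = sum(t)   # t is a k-hot vector with k = |T|
```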

3.2 The Search Algorithm

Given a search problem, a history of elements already examined, and information resource evaluations, an algorithmic search is a process that decides how to query elements of $\Omega$. As the search algorithm samples, it adds the record of points queried and information resource evaluations, indexed by time, to the search history. If the algorithm queries an element $\omega \in T$ at least once during the course of its search, we say that the search is successful. Figure 1 visualizes the search algorithm.
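The query loop can be sketched as follows. The particular oracle and the exploit-nearby-best strategy are illustrative assumptions, not the framework's prescription:

```python
import random

random.seed(1)
omega = list(range(10))
target_set = {7}
F = lambda w: -abs(w - 7)   # hypothetical oracle rewarding closeness to the target

def run_search(num_queries=5):
    """Black-box search: query a point, record (point, evaluation) in the
    history, and let the history influence the next query."""
    history, success = [], False
    for _ in range(num_queries):
        if history:
            # Exploit: sample near the best point seen so far.
            best_point, _ = max(history, key=lambda rec: rec[1])
            w = min(max(best_point + random.choice([-1, 0, 1]), 0), 9)
        else:
            w = random.choice(omega)      # initial query
        history.append((w, F(w)))         # (point, evaluation), indexed by time
        success = success or (w in target_set)
    return history, success

history, success = run_search()
```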

Figure 1: As a black-box optimization algorithm samples from $\Omega$, it produces an associated probability distribution based on the search history. When a sample corresponding to a location $\omega$ in $\Omega$ is evaluated using the external information resource $F$, the tuple $(\omega, F(\omega))$ is added to the search history.

3.3 Measuring Performance

Within this search framework, we measure a learning algorithm's performance by examining the expected per-query probability of success. This measure is more appropriate than an algorithm's total probability of success, since the number of sampling steps may vary depending on the algorithm used. Furthermore, the per-query probability of success naturally accounts for sampling procedures that may involve repeatedly sampling the same points in the search space, as is the case for genetic algorithms [1, 8]. Thus, this measure effectively handles search algorithms that balance exploration and exploitation.

The expected per-query probability of success is defined as

$$q(t, F) = \mathbb{E}_{\tilde{P}, H}\left[\frac{1}{|\tilde{P}|} \sum_{P_i \in \tilde{P}} P_i(\omega \in t) \,\middle|\, F\right],$$

where $\tilde{P}$ is a sequence of probability distributions over the search space (where each timestep $i$ produces a distribution $P_i$), $t$ is the target, $F$ is the information resource, and $H$ is the search history. The number of queries during a search is equal to the length of the probability distribution sequence, $|\tilde{P}|$.
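A sketch of this computation for a fixed distribution sequence (the sequence itself is made up):

```python
def per_query_success(dist_sequence, target_idx):
    """Average, over the sequence of distributions P_i, of the probability
    mass each P_i places on the target set."""
    return sum(sum(P[i] for i in target_idx) for P in dist_sequence) / len(dist_sequence)

# Hypothetical sequence over |Omega| = 4; the algorithm concentrates on index 2.
P_seq = [
    [0.25, 0.25, 0.25, 0.25],
    [0.10, 0.10, 0.70, 0.10],
    [0.05, 0.05, 0.85, 0.05],
]
q = per_query_success(P_seq, target_idx={2})   # (0.25 + 0.70 + 0.85) / 3
```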

4 Main Results

We present and explain our main results in this section. Note that full proofs for the following results can be found in the Appendix. We proceed by defining our measures of bias and target divergence, then show conservation results of bias and give bounds on the probability of successful search and the proportion of favorable search problems given a fixed target.

Definition 1

(Bias between a distribution over information resources and a fixed target) Let $\mathcal{D}$ be a distribution over a space of information resources $\mathcal{F}$ and let $F \sim \mathcal{D}$. For a given $\mathcal{D}$ and a fixed $k$-hot target function $t$,

$$\text{Bias}(\mathcal{D}, t) = \mathbb{E}_{\mathcal{D}}\left[t^\top \overline{P}_F\right] - \frac{k}{|\Omega|} = t^\top \mathbb{E}_{\mathcal{D}}\left[\overline{P}_F\right] - \frac{k}{|\Omega|},$$

where $\overline{P}_F$ is the vector representation of the averaged probability distribution (conditioned on $F$) induced on $\Omega$ during the course of the search, which can be shown to imply $q(t, F) = t^\top \overline{P}_F$.

Definition 2

(Bias between a finite set of information resources and a fixed target) Let $\mathcal{U}[\mathcal{B}]$ denote a uniform distribution over a finite set of information resources $\mathcal{B}$. For a random quantity $F \sim \mathcal{U}[\mathcal{B}]$, the averaged $|\Omega|$-length simplex vector $\overline{P}_F$, and a fixed $k$-hot target function $t$,

$$\text{Bias}(\mathcal{B}, t) = \text{Bias}(\mathcal{U}[\mathcal{B}], t) = \frac{1}{|\mathcal{B}|} \sum_{F \in \mathcal{B}} t^\top \overline{P}_F - \frac{k}{|\Omega|}.$$

We define bias as the difference between the average performance of a search algorithm on a fixed target over a set of information resources and the baseline search performance for the case of uniform random sampling. Definition 1 is a generalized form of Definition 2, characterizing the alignment between a target function and a distribution over information resources instead of a fixed set.
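Definition 2 can be checked numerically on made-up vectors (all numbers below are hypothetical):

```python
def bias(t, p_bars):
    """Bias(B, t): average alignment t . P_bar_F over the set of resources,
    minus the uniform-random-sampling baseline k / |Omega|."""
    n, k = len(t), sum(t)
    p_avg = [sum(p[i] for p in p_bars) / len(p_bars) for i in range(n)]
    return sum(ti * pi for ti, pi in zip(t, p_avg)) - k / n

t = [1, 0, 0, 0]                      # k = 1, |Omega| = 4, baseline p = 0.25
favorable = [[0.7, 0.1, 0.1, 0.1]]    # resource concentrating mass on the target
unfavorable = [[0.0, 0.4, 0.3, 0.3]]  # resource avoiding the target
```

A favorable set yields positive bias (0.45 here), an unfavorable one negative bias (-0.25).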

Definition 3

(Target Divergence) The measure of similarity between a fixed target function $t$ and the expected value of the averaged $|\Omega|$-length simplex vector $\overline{P}_F$, where $F \sim \mathcal{D}$, is defined as

$$\theta = \arccos\left(\frac{t^\top \mathbb{E}_{\mathcal{D}}[\overline{P}_F]}{\|t\|\, \|\mathbb{E}_{\mathcal{D}}[\overline{P}_F]\|}\right).$$

Similar to Wolpert and Macready's geometric interpretation of the No Free Lunch theorems in [12], we can evaluate how far a target function deviates from the averaged probability simplex vector for a given search problem. In this paper, we use cosine similarity to measure the level of similarity between $t$ and $\mathbb{E}_{\mathcal{D}}[\overline{P}_F]$. Geometrically, the target divergence is the angle $\theta$ between the target vector and the averaged $|\Omega|$-length simplex vector. Figure 2 depicts the target divergence for various levels of alignment between $t$ and $\mathbb{E}_{\mathcal{D}}[\overline{P}_F]$.

(a) While all of the probability mass in $\overline{P}_F$ lies on the target set $t$, the target divergence is greater than $0°$ because $\overline{P}_F$ is not uniform over the target set.

(b) Since none of the non-zero probability mass in $\overline{P}_F$ aligns with elements of the target set $t$, the target divergence is maximized at $90°$.

(c) Since $\overline{P}_F$ places all of its probability mass uniformly on the target set, the target divergence is minimized at $0°$.

Figure 2: These examples visualize the target divergence for various combinations of target functions and simplex vectors. Figure 2(b) demonstrates minimum alignment, while Figure 2(c) demonstrates maximum alignment.
Theorem 4.1 (Improbability of Favorable Information Resources)

Let $\mathcal{D}$ be a distribution over a set of information resources $\mathcal{F}$, let $F$ be a random variable such that $F \sim \mathcal{D}$, let $t \subseteq \Omega$ be an arbitrary fixed $k$-sized target set with corresponding target function $t$, and let $q(t, F)$ be the expected per-query probability of success for algorithm $\mathcal{A}$ on search problem $(\Omega, t, F)$. Then, for any $q_{\min} \in (0, 1]$,

$$\Pr(q(t, F) \geq q_{\min}) \leq \frac{p + \text{Bias}(\mathcal{D}, t)}{q_{\min}},$$

where $p = k/|\Omega|$.

Since the size of the target set $t$ is usually small relative to the size of the search space $\Omega$, $p$ is also usually small. Following the above results, we see that the probability that a search problem with an information resource drawn from $\mathcal{D}$ is favorable is bounded by a low value. This bound tightens as we increase our minimum threshold of success, $q_{\min}$. Notably, our bound relaxes with the introduction of bias.
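Because this is a Markov-type bound, it can be checked empirically: the observed proportion of favorable resources never exceeds $(p + \text{Bias})/q_{\min}$. The resource pool below is randomly generated, not taken from the paper:

```python
import random

random.seed(2)
n, k = 16, 2                 # |Omega| = 16, target size k = 2 (hypothetical)
p = k / n
target_idx = {0, 1}

def random_simplex(n):
    """A random distribution over Omega (one hypothetical information resource)."""
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

resources = [random_simplex(n) for _ in range(5000)]
q_vals = [sum(P[i] for i in target_idx) for P in resources]  # q(t, F) = t . P_bar_F
bias_est = sum(q_vals) / len(q_vals) - p

q_min = 0.3
proportion = sum(q >= q_min for q in q_vals) / len(q_vals)
bound = (p + bias_est) / q_min
```

Whatever the pool looks like, the proportion of resources achieving at least `q_min` cannot exceed the bound.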

Corollary 1 (Probability of Success Under Bias-Free Search)

When $\text{Bias}(\mathcal{D}, t) = 0$,

$$\Pr(q(t, F) \geq q_{\min}) \leq \frac{p}{q_{\min}}.$$

Directly following Theorem 4.1, if the algorithm does not induce bias on $t$ given a distribution over a set of information resources, the probability of successful search with a favorable information resource cannot be any higher than that of uniform random sampling divided by the minimum performance that we specify.

Corollary 2 (Geometric Divergence)

$$\Pr(q(t, F) \geq q_{\min}) \leq \frac{\sqrt{k} \cos\theta}{q_{\min}},$$

where $\theta$ is the target divergence between $t$ and $\mathbb{E}_{\mathcal{D}}[\overline{P}_F]$.

This result shows that greater geometric alignment between the target vector and the expected distribution over the search space loosens the upper bound on the probability of successful search. Connecting this to our other results, the geometric alignment can be viewed as another interpretation of the bias the algorithm places on the target set.

Theorem 4.2 (Conservation of Bias)

Let $\mathcal{D}$ be a distribution over a set of information resources and let $\tau_k$ be the set of all $|\Omega|$-length $k$-hot vectors. Then for any fixed algorithm $\mathcal{A}$,

$$\sum_{t \in \tau_k} \text{Bias}(\mathcal{D}, t) = 0.$$

Since bias is a conserved quantity, an algorithm that is biased towards any particular target is equally biased against other targets, as is the case in Schaffer’s conservation law for generalization performance [10]. This conservation property holds regardless of the algorithm or the distribution over information resources. Positive dependence between targets and information resources is the grounds for all successful machine learning [6], and this conservation result is another manifestation of this general property of learning.
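Conservation can be verified exhaustively for a small case; the induced distribution below is arbitrary:

```python
from itertools import combinations

n, k = 6, 2
p_bar = [0.4, 0.2, 0.15, 0.1, 0.1, 0.05]   # hypothetical E_D[P_bar_F], sums to 1

total_bias = 0.0
for idx in combinations(range(n), k):       # every k-hot target over |Omega| = 6
    t_dot_p = sum(p_bar[i] for i in idx)
    total_bias += t_dot_p - k / n           # Bias(D, t) for this target
```

Whatever distribution the algorithm induces, the biases across all C(6, 2) = 15 targets sum to zero: favor toward some targets is exactly offset by disfavor toward the rest.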

Theorem 4.3 (Famine of Favorable Information Resources)

Let $\mathcal{B}$ be a finite set of information resources and let $t \subseteq \Omega$ be an arbitrary fixed $k$-size target set with corresponding target function $t$. Define

$$\mathcal{B}_{q_{\min}} = \{F \mid F \in \mathcal{B},\, q(t, F) \geq q_{\min}\},$$

where $q(t, F)$ is the expected per-query probability of success for algorithm $\mathcal{A}$ on search problem $(\Omega, t, F)$ and $q_{\min} \in (0, 1]$ represents the minimally acceptable per-query probability of success. Then,

$$\frac{|\mathcal{B}_{q_{\min}}|}{|\mathcal{B}|} \leq \frac{p + \text{Bias}(\mathcal{B}, t)}{q_{\min}},$$

where $p = k/|\Omega|$.

This theorem shows us that unless our set of information resources is biased towards our target, only a small proportion of information resources will yield a high probability of search success. In most practical cases, $p$ is small enough that uniform random sampling is not considered a plausible strategy, since we typically have small targets embedded in large search spaces. Thus the bound is typically very constraining. The set of information resources will be overwhelmingly unhelpful unless we restrict the given information resources to be positively biased towards the specified target.

Corollary 3 (Proportion of Successful Problems Under Bias-Free Search)

When $\text{Bias}(\mathcal{B}, t) = 0$,

$$\frac{|\mathcal{B}_{q_{\min}}|}{|\mathcal{B}|} \leq \frac{p}{q_{\min}}.$$

Directly following Theorem 4.3, if the algorithm does not induce bias on $t$ given a set of information resources, the proportion of successful search problems cannot be any higher than the single-query success probability of uniform random sampling divided by the minimum specified performance.

Theorem 4.4 (Futility of Bias-Free Search)

For any fixed algorithm $\mathcal{A}$, fixed target $t \subseteq \Omega$ with corresponding target function $t$, and distribution over information resources $\mathcal{D}$, if $\text{Bias}(\mathcal{D}, t) = 0$, then

$$\Pr(\omega \in t; \mathcal{A}) = p,$$

where $\Pr(\omega \in t; \mathcal{A})$ represents the per-query probability of successfully sampling an element of $t$ using $\mathcal{A}$, marginalized over information resources $F \sim \mathcal{D}$, and $p$ is the single-query probability of success under uniform random sampling.

This result shows that without bias, an algorithm can perform no better than uniform random sampling. This is a generalization of Mitchell’s idea of the futility of removing biases for binary classification [4] and Montañez’s formal proof for the need for bias for multi-class classification [6]. This result shows that bias is necessary for any machine learning or search problem to have better than random chance performance.

Theorem 4.5 (Famine of Applicable Targets)

Let $\mathcal{D}$ be a distribution over a finite set of information resources. Define

$$\tau_k = \{t \mid t \subseteq \Omega,\, |t| = k\} \quad \text{and} \quad \tau_{b_{\min}} = \{t \mid t \in \tau_k,\, \text{Bias}(\mathcal{D}, t) \geq b_{\min}\},$$

where $t$ is the target function corresponding to the target set $t$. Then,

$$\frac{|\tau_{b_{\min}}|}{|\tau_k|} \leq \frac{p}{p + b_{\min}},$$

where $p = k/|\Omega|$.

This theorem shows that the proportion of target sets for which our algorithm is highly biased is small, given that $k$ is small relative to $|\Omega|$. A high value of $\text{Bias}(\mathcal{D}, t)$ implies that the algorithm, given $\mathcal{D}$, places a large amount of probability mass on $t$ and a small amount of mass on other target functions. Consequently, our algorithm is acceptably biased toward fewer target sets as we increase our minimum threshold of bias, $b_{\min}$.

Theorem 4.6 (Famine of Favorable Biasing Distributions)

Given a fixed target function $t$, a finite set of information resources $\mathcal{B}$, and the set $\mathcal{P}$ of all discrete $|\mathcal{B}|$-dimensional simplex vectors,

$$\frac{\mu(\mathcal{G}_{t, b_{\min}})}{\mu(\mathcal{P})} \leq \frac{p + \text{Bias}(\mathcal{B}, t)}{p + b_{\min}},$$

where $\mathcal{G}_{t, b_{\min}} = \{\mathcal{D} \mid \mathcal{D} \in \mathcal{P},\, \text{Bias}(\mathcal{D}, t) \geq b_{\min}\}$ and $\mu$ is Lebesgue measure.

We see that the proportion of distributions over $\mathcal{B}$ for which our algorithm is acceptably biased towards a fixed target function decreases as we increase our minimum acceptable level of bias, $b_{\min}$. Additionally, the greater the amount of bias induced by our algorithm given a set of information resources on a fixed target, the higher the probability of identifying a suitable distribution that achieves successful search. However, unless the set $\mathcal{B}$ is already filled with favorable elements, finding a minimally favorable distribution over that set is difficult.
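This bound, too, can be probed numerically. The per-resource success values below are invented; distributions over the set are drawn uniformly from the simplex (via normalized exponential variates):

```python
import random

random.seed(3)
p = 0.1                                  # hypothetical baseline k / |Omega|
q_vals = [0.05, 0.12, 0.30, 0.08]        # hypothetical t . P_bar_F per resource in B
bias_B = sum(q_vals) / len(q_vals) - p   # Bias(B, t) under the uniform distribution

def uniform_simplex(n):
    """Uniform sample from the n-dimensional probability simplex."""
    w = [random.expovariate(1.0) for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

b_min = 0.05
draws = [uniform_simplex(len(q_vals)) for _ in range(20000)]
biases = [sum(d * q for d, q in zip(D, q_vals)) - p for D in draws]
proportion = sum(b >= b_min for b in biases) / len(draws)
bound = (p + bias_B) / (p + b_min)
```

Only when the set itself is favorable (`bias_B` large) does the bound leave much room for favorably biasing distributions.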

Theorem 4.7 (Bias Over Distributions)

Given a finite set of information resources $\mathcal{B}$, a fixed target function $t$, and the set $\mathcal{P}$ of discrete $|\mathcal{B}|$-dimensional simplex vectors,

$$\int_{\mathcal{P}} \text{Bias}(\mathcal{D}, t)\, d\mathcal{D} = \mu(\mathcal{P})\, \text{Bias}(\mathcal{B}, t),$$

where $\mu$ is the uniform (Lebesgue) measure of the set $\mathcal{P}$. For an unbiased set $\mathcal{B}$,

$$\int_{\mathcal{P}} \text{Bias}(\mathcal{D}, t)\, d\mathcal{D} = 0.$$

This theorem states that the total bias on a fixed target function over all possible distributions is proportional to the bias induced by the algorithm given $\mathcal{B}$. When there is no bias over a set of information resources, the total bias over all distributions sums to $0$. It follows that any distribution over $\mathcal{B}$ for which the algorithm places positive bias on $t$ is offset by one or more distributions for which the algorithm places negative bias on $t$.

Corollary 4 (Conservation of Bias Over Distributions)

Let $\tau_k$ be the set of all $|\Omega|$-length $k$-hot vectors. Then,

$$\sum_{t \in \tau_k} \int_{\mathcal{P}} \text{Bias}(\mathcal{D}, t)\, d\mathcal{D} = 0.$$

This result extends our conservation results, showing that the total bias over all distributions and all -size target sets sums to zero, even when beginning with a set of information resources that is favorably biased towards a particular target.

5 Examples

5.1 Genetic Algorithms

Genetic algorithms are optimization methods inspired by evolutionary biology [8]. We can represent genetic algorithms in our search framework as follows:

  • $\mathcal{A}$ - a genetic algorithm, with standard variation (mutation, crossover, etc.) operators.

  • $\Omega$ - space of possible configurations (genotypes).

  • $T$ - set of all configurations which perform well on some task.

  • $F$ - a fitness function which can evaluate a configuration's fitness.

  • $(\Omega, T, F)$ - genetic algorithm task.

Given any genetic algorithm that is unbiased towards a particular small target when averaged over a set of fitness functions (as in No Free Lunch scenarios), the proportion of highly favorable fitness functions in that set must also be small, which we state as a corollary following directly from Corollary 3.

Corollary 5 (Famine of Favorable Fitness Functions)

For any fixed target $t$ and fixed genetic algorithm unbiased relative to a finite set of fitness functions $\mathcal{B}$, the proportion of fitness functions in $\mathcal{B}$ with expected per-query probability of success at least $q_{\min}$ is no greater than $p/q_{\min}$.

5.2 Binary Classification

We can cast binary classification as a search problem, as follows [5]:

  • $\mathcal{A}$ - classification algorithm, such as an SVM or neural network.

  • $\Omega$ - space of possible binary labelings over an instance space.

  • $T$ - set of all hypotheses with less than 10% classification error.

  • $F$ - set of training examples, where $F(\emptyset)$ is the full set of training data and $F(c)$ is the loss on training data for hypothesis $c$.

  • $(\Omega, T, F)$ - binary classification learning task.

In our example, let . Assume the size of our target set is , the set of training examples is drawn from a distribution , and that the minimum performance we want to achieve is . Then, by Corollary 1, if our algorithm (relative to ) does not place any bias on the target set,

Thus, the probability that we will have selected a dataset that results in at least our desired level of performance is upper bounded by . Notice that if we raised the minimum threshold, then the probability would decrease—favorable datasets would become more unlikely.

To perform better than uniform random sampling, we would need to introduce bias into the algorithm. For example, predetermined information or assumptions about the target set could be used to determine which hypotheses are more plausible. The principle of Occam’s razor [7] is often used, which is the assumption that the elements in the target set are likely the “simpler” elements, by some definition of simplicity. Relating this to our formal definition of bias, if we introduce correct assumptions into the algorithm, then the expected alignment of the target set and the induced probability distribution over the search space increases accordingly.
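The arithmetic of Corollary 1 is easy to reproduce; the sizes and threshold below are hypothetical stand-ins, not the paper's own values:

```python
# Hypothetical sizes: 2**20 possible labelings, 2**10 acceptable hypotheses.
omega_size = 2 ** 20
k = 2 ** 10
q_min = 0.5

p = k / omega_size     # single-query success of uniform random sampling
bound = p / q_min      # Corollary 1 bound when the algorithm is unbiased
stricter = p / 0.9     # raising q_min makes favorable datasets rarer
```

Here the chance of drawing a dataset achieving at least 50% per-query success is at most $p/q_{\min} \approx 0.2\%$, and it shrinks further as $q_{\min}$ grows.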

6 Conclusion

We build on the algorithmic search framework and extend Famine of Forte results to search problems with fixed targets and varying information resources. Our notion of bias quantifies the extent to which an algorithm is predisposed to a particular fixed target. We show that bias towards any target necessarily implies bias against the other remaining targets, underscoring the fact that no universally applicable form of bias can exist. Furthermore, one cannot perform better than uniform random sampling without introducing a predisposition in the algorithm towards a desired target—unbiased algorithms are useless. Few information resources can be greatly favorable towards any fixed target, unless the algorithm is already predisposed to the target no matter the information resource given. Thus, in machine learning as elsewhere, biases are needed for better than chance performance. Biases must also be correct, since the effectiveness of any bias depends on how well it aligns with the given target actually being sought.

References

  • [1] Goldberg, D.: Genetic algorithms in search optimization and machine learning. Addison-Wesley Longman Publishing Company (1999)
  • [2] Gülçehre, Ç., Bengio, Y.: Knowledge matters: Importance of prior information for optimization. Journal of Machine Learning Research 17(8), 1–32 (2016)
  • [3] McDermott, J.: When and why metaheuristics researchers can ignore “no free lunch” theorems. Metaheuristics (Mar 2019). https://doi.org/10.1007/s42257-019-00002-6
  • [4] Mitchell, T.M.: The need for biases in learning generalizations. In: Rutgers University: CBM-TR-117 (1980)
  • [5] Montanez, G.D.: The famine of forte: Few search problems greatly favor your algorithm. In: 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC). pp. 477–482. IEEE (2017)
  • [6] Montanez, G.D.: Why machine learning works. In: Dissertation. pp. 52–59. Carnegie Mellon University (2017)
  • [7] Rasmussen, C.E., Ghahramani, Z.: Occam’s razor. In: Proceedings of the 13th International Conference on Neural Information Processing Systems. pp. 276–282. NIPS’00, MIT Press, Cambridge, MA, USA (2000)
  • [8] Reeves, C., Rowe, J.E.: Genetic algorithms: principles and perspectives: a guide to GA theory, vol. 20. Springer Science & Business Media (2002)
  • [9] Runarsson, T., Yao, X.: Search biases in constrained evolutionary optimization. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 35(2), 233–243 (2005). https://doi.org/10.1109/TSMCC.2004.841906
  • [10] Schaffer, C.: A conservation law for generalization performance. In: Machine Learning Proceedings 1994, pp. 259–265. Elsevier (1994)
  • [11] Ulyanov, D., Vedaldi, A., Lempitsky, V.: Deep image prior. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9446–9454 (2018)

  • [12] Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. Trans. Evol. Comp 1(1), 67–82 (Apr 1997). https://doi.org/10.1109/4235.585893

7 Appendix: Proofs

Lemma 1 (Expected Per Query Performance From Expected Distribution)

This lemma was proven by Montañez [5] and is drawn directly from that work. Let $t$ be a target set, $q(t, F)$ be the expected per-query probability of success for an algorithm, and $\mu$ be the conditional joint measure induced by that algorithm over finite sequences of probability distributions and search histories, conditioned on external information resource $F$. Denote a probability distribution sequence by $\tilde{P}$ and a search history by $h$. Let $\mathcal{U}[\tilde{P}]$ denote a uniform distribution on elements of $\tilde{P}$ and define $\overline{P}_F = \mathbb{E}_{\tilde{P}, H}\left[\mathbb{E}_{P \sim \mathcal{U}[\tilde{P}]}[P] \mid F\right]$. Then,

$$q(t, F) = t^\top \overline{P}_F.$$

Lemma 2 (Expectation of Simplex Vectors is Simplex)

Let $\mathcal{D}$ be a distribution over a set $\mathcal{B}$ that places probability mass $p_F$ on $F \in \mathcal{B}$, and let $\{\overline{P}_F\}_{F \in \mathcal{B}}$ be a set of $|\Omega|$-length simplex vectors, where each $\overline{P}_F$ corresponds to an $F \in \mathcal{B}$. Then, $\mathbb{E}_{\mathcal{D}}[\overline{P}_F]$ is a simplex vector.

Proof

By definition of expectation,

$$\mathbb{E}_{\mathcal{D}}[\overline{P}_F] = \sum_{F \in \mathcal{B}} p_F\, \overline{P}_F.$$

Note that each probability $p_F$ is non-negative and each $\overline{P}_F$ is simplex, so the sum has no negative values and thus the expectation has no negative values. To show that the expectation is also a simplex vector, we sum over its components:

$$\sum_{i=1}^{|\Omega|} \left(\sum_{F \in \mathcal{B}} p_F\, \overline{P}_F\right)_i = \sum_{F \in \mathcal{B}} p_F \sum_{i=1}^{|\Omega|} (\overline{P}_F)_i = \sum_{F \in \mathcal{B}} p_F = 1,$$

where the penultimate equality follows from the fact that each $\overline{P}_F$ is a simplex vector, so its components must sum to $1$, and the final equality from the fact that $\mathcal{D}$ is a probability distribution on $\mathcal{B}$. Since the expectation is non-negative and its components sum to $1$, the expected value of a set of simplex vectors is a simplex vector.

Lemma 3 (Equivalence of Bias)

Given a fixed target function $t$, a finite set of information resources $\mathcal{B}$, and the set $\mathcal{P}$ of all discrete $|\mathcal{B}|$-dimensional simplex vectors,

$$\mathbb{E}_{\mathcal{D} \sim \mathcal{U}[\mathcal{P}]}[\text{Bias}(\mathcal{D}, t)] = \text{Bias}(\mathcal{B}, t),$$

where $\mathcal{U}[\mathcal{P}]$ denotes the uniform distribution over $\mathcal{P}$.

Proof

Let $\mathbb{D} \sim \mathcal{U}[\mathcal{P}]$. Then,

$$\mathbb{E}_{\mathbb{D}}[\text{Bias}(\mathbb{D}, t)] = \mathbb{E}_{\mathbb{D}}\left[t^\top \sum_{F \in \mathcal{B}} \mathbb{D}(F)\, \overline{P}_F\right] - \frac{k}{|\Omega|} = t^\top \sum_{F \in \mathcal{B}} \mathbb{E}_{\mathbb{D}}[\mathbb{D}(F)]\, \overline{P}_F - \frac{k}{|\Omega|}.$$

The quantity $\mathbb{E}_{\mathbb{D}}[\mathbb{D}(F)]$ is a uniform expectation on the amount of mass that the random variable $\mathbb{D}$ places on resource $F$. Since $\mathcal{P}$ contains all possible distributions over $\mathcal{B}$, under uniform expectation the same amount of probability mass gets placed on each information resource. So, $\mathbb{E}_{\mathbb{D}}[\mathbb{D}(F)] = c$ for any $F \in \mathcal{B}$. Since the probability mass on any two information resources is equivalent and the total probability mass must sum to one by Lemma 2, we have $c = 1/|\mathcal{B}|$. Continuing,

$$\mathbb{E}_{\mathbb{D}}[\text{Bias}(\mathbb{D}, t)] = \frac{1}{|\mathcal{B}|} \sum_{F \in \mathcal{B}} t^\top \overline{P}_F - \frac{k}{|\Omega|} = \text{Bias}(\mathcal{B}, t).$$

See 4.1

Proof

We seek to bound the probability of achieving a successful search on target function $t$ with information resource $F \sim \mathcal{D}$. By Lemma 1, it follows that

$$\mathbb{E}_{\mathcal{D}}[q(t, F)] = \mathbb{E}_{\mathcal{D}}[t^\top \overline{P}_F] = \sum_{\omega : t(\omega) = 1} \mathbb{E}_{\mathcal{D}}[\overline{P}_F](\omega) = \text{Bias}(\mathcal{D}, t) + \frac{k}{|\Omega|},$$

where $t(\omega) = 1$ means the target function evaluated at $\omega$ is one, and $\overline{P}_F$ represents the $|\Omega|$-length probability vector defined by the averaged distribution induced on $\Omega$. Applying Markov's Inequality,

$$\Pr(q(t, F) \geq q_{\min}) \leq \frac{\mathbb{E}_{\mathcal{D}}[q(t, F)]}{q_{\min}} = \frac{p + \text{Bias}(\mathcal{D}, t)}{q_{\min}}.$$

See 1

Proof

This result follows directly from Theorem 4.1.

See 4.2

Proof

Summing over all $t \in \tau_k$ and applying Definition 1,

$$\sum_{t \in \tau_k} \text{Bias}(\mathcal{D}, t) = \sum_{t \in \tau_k} \left(t^\top \mathbb{E}_{\mathcal{D}}[\overline{P}_F] - \frac{k}{|\Omega|}\right) = \sum_{t \in \tau_k} t^\top \mathbb{E}_{\mathcal{D}}[\overline{P}_F] - \binom{|\Omega|}{k} \frac{k}{|\Omega|}.$$

Each element of $\Omega$ appears in exactly $\binom{|\Omega|-1}{k-1}$ of the $k$-hot vectors in $\tau_k$, and $\mathbb{E}_{\mathcal{D}}[\overline{P}_F]$ is a simplex vector by Lemma 2, so

$$\sum_{t \in \tau_k} t^\top \mathbb{E}_{\mathcal{D}}[\overline{P}_F] = \binom{|\Omega|-1}{k-1} \sum_{\omega \in \Omega} \mathbb{E}_{\mathcal{D}}[\overline{P}_F](\omega) = \binom{|\Omega|-1}{k-1} = \binom{|\Omega|}{k} \frac{k}{|\Omega|}.$$

The two terms cancel, giving a total bias of $0$.

See 2

Proof

Applying Theorem 4.1,

$$\Pr(q(t, F) \geq q_{\min}) \leq \frac{p + \text{Bias}(\mathcal{D}, t)}{q_{\min}} = \frac{t^\top \mathbb{E}_{\mathcal{D}}[\overline{P}_F]}{q_{\min}}.$$

By the definition of target divergence,

$$t^\top \mathbb{E}_{\mathcal{D}}[\overline{P}_F] = \|t\|\, \|\mathbb{E}_{\mathcal{D}}[\overline{P}_F]\| \cos\theta = \sqrt{k}\, \|\mathbb{E}_{\mathcal{D}}[\overline{P}_F]\| \cos\theta.$$

By Lemma 2, $\mathbb{E}_{\mathcal{D}}[\overline{P}_F]$ is a simplex vector, so its components sum to $1$. Thus, $\|\mathbb{E}_{\mathcal{D}}[\overline{P}_F]\| \leq 1$. So,

$$\Pr(q(t, F) \geq q_{\min}) \leq \frac{\sqrt{k} \cos\theta}{q_{\min}}.$$

See 4.3

Proof

We seek to bound the proportion of successful search problems for which $q(t, F) \geq q_{\min}$, for any threshold $q_{\min} \in (0, 1]$. Let $F \sim \mathcal{U}[\mathcal{B}]$. Then,

$$\frac{|\mathcal{B}_{q_{\min}}|}{|\mathcal{B}|} = \Pr(q(t, F) \geq q_{\min}).$$

Let $t(\omega) = 1$ mean the target function evaluated at $\omega$ is one. Then, by applying Lemma 1,

$$\mathbb{E}[q(t, F)] = \frac{1}{|\mathcal{B}|} \sum_{F \in \mathcal{B}} t^\top \overline{P}_F.$$

Applying Markov's Inequality and the definition of $\text{Bias}(\mathcal{B}, t)$,

$$\frac{|\mathcal{B}_{q_{\min}}|}{|\mathcal{B}|} \leq \frac{\mathbb{E}[q(t, F)]}{q_{\min}} = \frac{p + \text{Bias}(\mathcal{B}, t)}{q_{\min}}.$$

See 3

Proof

This result follows directly from Theorem 4.3.

See 4.4

Proof

Let $\mathcal{F}$ be the space of possible information resources. Then, marginalizing over $F \sim \mathcal{D}$,

$$\Pr(\omega \in t; \mathcal{A}) = \mathbb{E}_{\mathcal{D}}[q(t, F)].$$

Since we are considering the per-query probability of success for algorithm $\mathcal{A}$ on $t$ using information resource $F$, we have

$$q(t, F) = t^\top \overline{P}_F$$

by Lemma 1. Also note that $\mathbb{E}_{\mathcal{D}}[t^\top \overline{P}_F] = p$ by the fact that $\text{Bias}(\mathcal{D}, t) = 0$. Making these substitutions, we obtain

$$\Pr(\omega \in t; \mathcal{A}) = \mathbb{E}_{\mathcal{D}}[t^\top \overline{P}_F] = p.$$

See 4.5

Proof

First, note that the size of $\tau_k$ is equivalent to the number of $k$-sized subsets of a $|\Omega|$-size set, $\binom{|\Omega|}{k}$. The size of $\tau_{b_{\min}}$ is the number of target sets in $\tau_k$ for which $\text{Bias}(\mathcal{D}, t) \geq b_{\min}$. Let $t \sim \mathcal{U}[\tau_k]$ and $\overline{P} = \mathbb{E}_{\mathcal{D}}[\overline{P}_F]$. Then,

$$\mathbb{E}_t[t^\top \overline{P}] = \frac{k}{|\Omega|} = p.$$

Applying Markov's Inequality,

$$\Pr\left(t^\top \overline{P} \geq b_{\min} + p\right) \leq \frac{\mathbb{E}_t[t^\top \overline{P}]}{b_{\min} + p} = \frac{p}{p + b_{\min}}.$$

Thus,

$$\frac{|\tau_{b_{\min}}|}{|\tau_k|} = \Pr(\text{Bias}(\mathcal{D}, t) \geq b_{\min}) = \Pr\left(t^\top \overline{P} \geq b_{\min} + p\right) \leq \frac{p}{p + b_{\min}}.$$

See 4.6

Proof

Let $\mathcal{D} \sim \mathcal{U}[\mathcal{P}]$. Then,

$$\frac{\mu(\mathcal{G}_{t, b_{\min}})}{\mu(\mathcal{P})} = \Pr(\text{Bias}(\mathcal{D}, t) \geq b_{\min}) = \Pr(\text{Bias}(\mathcal{D}, t) + p \geq b_{\min} + p).$$

Applying Markov's inequality and Lemma 3,

$$\frac{\mu(\mathcal{G}_{t, b_{\min}})}{\mu(\mathcal{P})} \leq \frac{\mathbb{E}[\text{Bias}(\mathcal{D}, t)] + p}{b_{\min} + p} = \frac{\text{Bias}(\mathcal{B}, t) + p}{b_{\min} + p}.$$

See 4.7

Proof

By Lemma 3,

$$\int_{\mathcal{P}} \text{Bias}(\mathcal{D}, t)\, d\mathcal{D} = \mu(\mathcal{P})\, \mathbb{E}_{\mathcal{D} \sim \mathcal{U}[\mathcal{P}]}[\text{Bias}(\mathcal{D}, t)] = \mu(\mathcal{P})\, \text{Bias}(\mathcal{B}, t).$$

For an unbiased set $\mathcal{B}$, $\text{Bias}(\mathcal{B}, t) = 0$, so the integral equals $0$.

See 4

Proof

By Theorem 4.2,

$$\sum_{t \in \tau_k} \int_{\mathcal{P}} \text{Bias}(\mathcal{D}, t)\, d\mathcal{D} = \int_{\mathcal{P}} \sum_{t \in \tau_k} \text{Bias}(\mathcal{D}, t)\, d\mathcal{D} = \int_{\mathcal{P}} 0\, d\mathcal{D} = 0.$$